Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Adding observations to a dataset with .csv files and a loop

    Hello Stata users,

    I am using Stata 17. I want to add observations and variables in a dataset A.

    I have the following dataset A
    Code:
    Date    Year    urbanname  urbancode etc..
    
    2010m1  2010     Beijing  1029 etc.
    2010m2  2010     Beijing  1029 ...
    2010m3  2010     Beijing  1029 ...
    ....
    2015m1  2015     Paris    1030 etc
    For different cities and different time periods. I would like to add observations of other cities (that are not in the initial dataset) which I have in different .csv files (datasets B,C, etc..). For each city that I want to add I have the following variables

    1) the dates
    2) the urbanname
    3) the urbancode
    4) other variables which I do not yet have in my initial dataset but that I want to add

    What would be your advice on how to proceed ? I thought of doing it with R but the dataset A does not open well in RStudio and the variable Date is not well imported.
    Ideally, i'd like to use a loop as I have 17 .csv files, each containing the information of a city.

    Thank you very much for your help,

    Benoît
    Last edited by Benoit Decoco; 05 Jul 2022, 08:07.

  • #2
    Please use dataex to present your data.

    Comment


    • #3
      Dear Jared,

      Unfortunately I have more than 1M observations I tried to use dataex but the result seems less clear than what I put in the first post.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float(makeid year) strL urbanname long urbancode
        5 1854 "Bielefeld" 1
        1 1854 "Bielefeld" 1
        4 1854 "Bielefeld" 1
        6 1854 "Bielefeld" 1
        7 1854 "Bielefeld" 1
        3 1854 "Bielefeld" 1
        2 1854 "Bielefeld" 1
       17 1855 "Bielefeld" 1
        8 1855 "Bielefeld" 1
       12 1855 "Bielefeld" 1
       15 1855 "Bielefeld" 1
       13 1855 "Bielefeld" 1
       19 1855 "Bielefeld" 1
       11 1855 "Bielefeld" 1
       14 1855 "Bielefeld" 1
       16 1855 "Bielefeld" 1
        9 1855 "Bielefeld" 1
       18 1855 "Bielefeld" 1
       10 1855 "Bielefeld" 1
       21 1856 "Bielefeld" 1
       20 1856 "Bielefeld" 1
       22 1856 "Bielefeld" 1
       30 1856 "Bielefeld" 1
       25 1856 "Bielefeld" 1
       23 1856 "Bielefeld" 1
       27 1856 "Bielefeld" 1
       29 1856 "Bielefeld" 1
       24 1856 "Bielefeld" 1
       28 1856 "Bielefeld" 1
       26 1856 "Bielefeld" 1
       31 1856 "Bielefeld" 1
       32 1857 "Bielefeld" 1
       39 1857 "Bielefeld" 1
       42 1857 "Bielefeld" 1
       34 1857 "Bielefeld" 1
       43 1857 "Bielefeld" 1
       41 1857 "Bielefeld" 1
       37 1857 "Bielefeld" 1
       38 1857 "Bielefeld" 1
       33 1857 "Bielefeld" 1
       36 1857 "Bielefeld" 1
       35 1857 "Bielefeld" 1
       40 1857 "Bielefeld" 1
       53 1858 "Bielefeld" 1
       45 1858 "Bielefeld" 1
       44 1858 "Bielefeld" 1
       55 1858 "Bielefeld" 1
       50 1858 "Bielefeld" 1
       46 1858 "Bielefeld" 1
       52 1858 "Bielefeld" 1
       47 1858 "Bielefeld" 1
       48 1858 "Bielefeld" 1
       49 1858 "Bielefeld" 1
       51 1858 "Bielefeld" 1
       54 1858 "Bielefeld" 1
       61 1859 "Bielefeld" 1
       66 1859 "Bielefeld" 1
       58 1859 "Bielefeld" 1
       56 1859 "Bielefeld" 1
       65 1859 "Bielefeld" 1
       62 1859 "Bielefeld" 1
       59 1859 "Bielefeld" 1
       63 1859 "Bielefeld" 1
       60 1859 "Bielefeld" 1
       64 1859 "Bielefeld" 1
       67 1859 "Bielefeld" 1
       57 1859 "Bielefeld" 1
       70 1860 "Bielefeld" 1
       77 1860 "Bielefeld" 1
       74 1860 "Bielefeld" 1
       72 1860 "Bielefeld" 1
       79 1860 "Bielefeld" 1
       71 1860 "Bielefeld" 1
       75 1860 "Bielefeld" 1
       73 1860 "Bielefeld" 1
       78 1860 "Bielefeld" 1
       68 1860 "Bielefeld" 1
       69 1860 "Bielefeld" 1
       76 1860 "Bielefeld" 1
       91 1861 "Bielefeld" 1
       89 1861 "Bielefeld" 1
       88 1861 "Bielefeld" 1
       82 1861 "Bielefeld" 1
       84 1861 "Bielefeld" 1
       86 1861 "Bielefeld" 1
       80 1861 "Bielefeld" 1
       85 1861 "Bielefeld" 1
       83 1861 "Bielefeld" 1
       87 1861 "Bielefeld" 1
       90 1861 "Bielefeld" 1
       81 1861 "Bielefeld" 1
       94 1862 "Bielefeld" 1
      102 1862 "Bielefeld" 1
       99 1862 "Bielefeld" 1
       93 1862 "Bielefeld" 1
      100 1862 "Bielefeld" 1
       95 1862 "Bielefeld" 1
       92 1862 "Bielefeld" 1
       97 1862 "Bielefeld" 1
       98 1862 "Bielefeld" 1
      end

      Comment


      • #4
        I don't understand. Do these csv files already have the variables you've described?

        Comment


        • #5
          I edited my initial post to make it clearer.
          I have an initial dataset A which is described in the initial post. I have other datasets (B,C, ...) , each one describing a city that is not in dataset A. The datasets B,C,... have 3 variables (or columns) in common with the dataset A : date, urbancode, urbanname. So I want to add the other variables to dataset A.

          My goal is to merge the rows of datasets B,C,... with the dataset A.


          This is the dataex description of one of the datasets (B,C,...) which is on Jakarta.


          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str12 date str8 urbanname int urbancode float pm25
          "15-août"     "Jarkarta" 1017         .
          "15-décembre" "Jarkarta" 1017  84.16666
          "15-juillet"   "Jarkarta" 1017         .
          "15-octobre"   "Jarkarta" 1017         .
          "15-septembre" "Jarkarta" 1017         .
          "16-août"     "Jarkarta" 1017 117.98387
          "16-avril"     "Jarkarta" 1017 105.28333
          "16-décembre" "Jarkarta" 1017 68.935486
          "16-février"  "Jarkarta" 1017  89.82758
          "16-janvier"   "Jarkarta" 1017 102.48276
          "16-juillet"   "Jarkarta" 1017 118.93549
          "16-juin"      "Jarkarta" 1017 128.51666
          "16-mai"       "Jarkarta" 1017 123.35484
          "16-mars"      "Jarkarta" 1017 103.32258
          "16-novembre"  "Jarkarta" 1017 118.73333
          "16-octobre"   "Jarkarta" 1017  111.3387
          "16-septembre" "Jarkarta" 1017 108.01667
          "17-août"     "Jarkarta" 1017  98.37615
          "17-avril"     "Jarkarta" 1017  78.38889
          "17-décembre" "Jarkarta" 1017      51.1
          "17-février"  "Jarkarta" 1017  78.10053
          "17-janvier"   "Jarkarta" 1017     68.25
          "17-juillet"   "Jarkarta" 1017 100.51613
          "17-juin"      "Jarkarta" 1017  84.63333
          "17-mai"       "Jarkarta" 1017 109.27586
          "17-mars"      "Jarkarta" 1017  78.10789
          "17-novembre"  "Jarkarta" 1017  67.02778
          "17-octobre"   "Jarkarta" 1017   86.3023
          "17-septembre" "Jarkarta" 1017    102.75
          end
          I hope this is clearer.

          Comment


          • #6
            See, this is why we ask for example data. Now that I have a good visual representation of what the issue is, I can understand what's going on much better. I think I know the issue, but my question for you now, is why is your data monthly in the first and third dataexes, but not in the second one?


            Either way, you don't want to "merge" the rows. What you wanna do, is append them to your first dataset. Consider the full worked example
            Code:
            clear
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input float(time gdp) long region
            132 .064902 22
            133 .065123 22
            134 .067379 22
            135 .069164 22
            136 .069452 22
            137 .070135 22
            138 .069298 22
            139 .076049 22
            140 .071104 22
            141 .069214 22
            142 .063417 22
            143 .048752 22
            144 .054664 22
            145 .059415 22
            146 .059393 22
            147 .070151 22
            148 .063613 22
            149 .060897 22
            150 .064957 22
            151 .065703 22
            152 .057225 22
            153 .049896 22
            154 .039404 22
            155 .032541 22
            156 .046425 22
            157 .065049 22
            158 .047784 22
            159 .063896 22
            end
            format %tq time
            
            
            tempfile Taiwan
            
            sa `Taiwan', replace
            
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input float(time gdp) long region
            132  .062 9
            133  .059 9
            134  .058 9
            135  .062 9
            136  .079 9
            137  .068 9
            138  .046 9
            139  .052 9
            140  .037 9
            141  .029 9
            142  .012 9
            143  .015 9
            144  .025 9
            145  .036 9
            146  .047 9
            147  .059 9
            148  .058 9
            149  .072 9
            150  .061 9
            151  .014 9
            152 -.032 9
            153 -.061 9
            154 -.081 9
            155 -.065 9
            156 -.029 9
            157  .005 9
            158  .039 9
            159  .083 9
            end
            format %tq time
            
            
            tempfile HongKong
            
            sa `HongKong', replace
            
            clear
            
            foreach x in Taiwan HongKong {
            
            ap using ``x''    
                
                
            }
            
            label values region region
            label def region 9 "HongKong" 22 "Taiwan", modify
            
            br
            i save datasets as tempfiles (same variables names, all that jazz) and then I append them on top of one another.

            Comment


            • #7
              Thank you Jared !

              The second dataex is also monthly data but it is written in french because of a R function I used, I am going to modify it to match the format of the first dataex.

              I still have a little question : there are variables in the .csv files that do not exist in dataset A. Won't this be a problem when I append the .csv files to dataset A ?
              In your example, imagine there is an extra variable in the Hongkong dataset that does not exist in the Taiwan dataset. How should I do ?

              Thank you again for your help !
              Last edited by Benoit Decoco; 05 Jul 2022, 08:55.

              Comment


              • #8
                Code:
                clear
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input float(time gdp) long region
                132 .064902 22
                133 .065123 22
                134 .067379 22
                135 .069164 22
                136 .069452 22
                137 .070135 22
                138 .069298 22
                139 .076049 22
                140 .071104 22
                141 .069214 22
                142 .063417 22
                143 .048752 22
                144 .054664 22
                145 .059415 22
                146 .059393 22
                147 .070151 22
                148 .063613 22
                149 .060897 22
                150 .064957 22
                151 .065703 22
                152 .057225 22
                153 .049896 22
                154 .039404 22
                155 .032541 22
                156 .046425 22
                157 .065049 22
                158 .047784 22
                159 .063896 22
                end
                format %tq time
                
                
                tempfile Taiwan
                
                sa `Taiwan', replace
                
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input float(time gdp) long region
                132  .062 9
                133  .059 9
                134  .058 9
                135  .062 9
                136  .079 9
                137  .068 9
                138  .046 9
                139  .052 9
                140  .037 9
                141  .029 9
                142  .012 9
                143  .015 9
                144  .025 9
                145  .036 9
                146  .047 9
                147  .059 9
                148  .058 9
                149  .072 9
                150  .061 9
                151  .014 9
                152 -.032 9
                153 -.061 9
                154 -.081 9
                155 -.065 9
                156 -.029 9
                157  .005 9
                158  .039 9
                159  .083 9
                end
                format %tq time
                
                g taino =1
                
                
                tempfile HongKong
                
                sa `HongKong', replace
                
                clear
                
                foreach x in Taiwan HongKong {
                
                ap using ``x''    
                    
                    
                }
                
                label values region region
                label def region 9 "HongKong" 22 "Taiwan", modify
                
                br
                Nothing bad will happen, you'll just have lots of missing data. My advice to you is to not use R. Only use R for analysis, there are few things Stata can't do that R can.

                Comment


                • #9
                  Cross-posted at https://stackoverflow.com/questions/...files-in-stata

                  Please note our policy on cross-posting, which is that you should tell us about it. https://www.statalist.org/forums/help#crossposting

                  Comment

                  Working...
                  X