Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help merging excel files in stata with 3 variables

    I have 3 excel files with poverty data at the county level for the years 1980,1990, and 2000. I have tried using the merge function to no avail. The counties correspond by state, countyname and code, but I can't seem to figure out how to properly merge them. Here is a screenshot of what they look like once I have processed them as STATA files. I am trying keep the identifying variables and only add the unique columns side-by-side.
    Click image for larger version

Name:	Screenshot (191).png
Views:	1
Size:	175.5 KB
ID:	1616284 Click image for larger version

Name:	Screenshot (190).png
Views:	1
Size:	170.7 KB
ID:	1616285




  • #2
    1. please read the FAQ since your pictures are quite unreadable on my machine

    2. you need to import the data before you do anything else; see
    Code:
    help import excel
    3. then provide data samples using -dataex- (read the FAQ)

    Comment


    • #3
      Thank you for the clarification. I have already imported the data from excel. The columns I want to add correspond to the same counties but with poverty data for different years. I just want to add the counties match to each other with an additional column.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str3 State str347 CountyName long(Codes Persons1990 Belowpovertylevel1990) double Percentbelow1990
      ""   "United States, total"    . 241977859 31742864 13.118086146881728
      ""   ""                        .         .        .                  .
      ""   "Alabama, total"          .   3945798   723614 18.338850594987377
      ""   ""                        .         .        .                  .
      "AL" "Autauga County"       1001     34009     5324               15.7
      "AL" "Baldwin County"       1003     97227    13947               14.3
      "AL" "Barbour County"       1005     24785     6244               25.2
      "AL" "Bibb County"          1007     16304     3454               21.2
      "AL" "Blount County"        1009     38924     5938               15.3
      "AL" "Bullock County"       1011     10340     3776               36.5
      "AL" "Butler County"        1013     21648     6815               31.5
      "AL" "Calhoun County"       1015    110981    17385               15.7
      "AL" "Chambers County"      1017     36424     6849               18.8
      "AL" "Cherokee County"      1019     19464     3424               17.6
      "AL" "Chilton County"       1021     32113     5495               17.1
      "AL" "Choctaw County"       1023     15929     4809               30.2
      "AL" "Clarke County"        1025     26896     6960               25.9
      "AL" "Clay County"          1027     13025     2272               17.4
      "AL" "Cleburne County"      1029     12653     1936               15.3
      "AL" "Coffee County"        1031     39780     6180               15.5
      end

      Comment


      • #4
        Based on your data example and screenshots, your data are problematic because it is an import from Excel which contains summary data (US and state totals at the top) and the county-level data below. Good practice with statistical software is to keep like with like -- typically this means keeping raw data in the dataset, or sometimes you may want summary-level data in its own dataset, but mixing the two types together will usually cause you pain down the road.

        When datasets are combined, merge and append have very specific meanings. Merge is used when the data need to be combined horizontally in someway, most often linked by some shared characteristic(s), such as a persons name or unique identifier.code. When data simply need to be combined vertically, this is called appending. When appending data, usually we place one dataset right after the next, keeping data in common variables aligned and keeping or discarding variables that are not in common among them. In your case, -append-ing data is what you seek.

        Here is a sketch of a solution that you can generalize with your own data after fixing it up. Before you can -append- the datasets, you need to create a new variable that identifies the data year. This will show the steps needed to merge 2 datasets, but you can in princiciple combine many more than 2. Depending on the structure of your Excel files and the cleaning steps needed, you could automate this process, but those details are lacking and would otherwise obscure the core question about -append-.

        Code:
        tempfile data1 data2
        
        frame create Data1
        frame change Data1
        .... import excel ...
        gen int year = 2000
        save `data1'
        
        frame create Data2
        frame change Data2
        .... import excel ...
        gen int year = 2010
        save `data2'
        
        frame create Data
        frame change Data
        append using `data1' `data2'

        Comment


        • #5
          As a user of this sort of data, I would make just a few changes to Leonardo's example code. In particular, I agree that append is the appropriate command, rather than merge or joinby.
          Code:
          tempfile data1 data2
          
          frame create Data1
          frame change Data1
          .... import excel ...
          // keep only the county data
          drop if missing(Codes)
          gen int year = 2000
          // remove the year suffix on the variable names
          rename (*2000) (*)
          save `data1'
          
          frame create Data2
          frame change Data2
          .... import excel ...
          // keep only the county data
          drop if missing(Codes)
          gen int year = 2010
          // remove the year suffix on the variable names
          rename (*2010) (*)
          save `data2'
          
          frame create Data
          frame change Data
          append using `data1' `data2'
          After this, there will be two observations for each county (that appears in both years, some will not) and the variable Persons will have the values of Persons2000 in 2000 and Persons2010 in 2010.

          Comment

          Working...
          X