  • Merge several datasets using a "linking file"

    Dear Stata Forum,

    I need to merge several datasets that together will form an (unbalanced) panel, and to do so I have several "linking files".

    On the one hand, I have 10 cross-sectional waves with about 20,000 households each, and only about half of the households continue in the survey in the next wave (so a household is tracked for up to 2 years). The structure of the waves looks like this:
    ID YEAR X1 X2 X3
    1 2006 x1 ... ...
    2 2006 ... ... ...
    3 2006 ... ... ...
    4 2006 ... ... ...
    and the 2007 wave:
    ID YEAR X1 X2 X3
    1 2007 x1 ... ...
    2 2007 ... ... ...
    3 2007 ... ... ...
    4 2007 ... ... ...

    Then, I have 9 linking files that connect observations across consecutive waves. For instance, the 2006-2007 linking file provides the identification number in the 2007 wave and the corresponding identification number in the previous (2006) wave. This is what it looks like:

    ID_2007 participate_in_next_wave ID_2006
    1 0 .
    2 1 3
    3 1 2
    4 0 .
    Can anyone give some hints or directions on how to do this as efficiently as possible?



  • #2
    So assuming that this linking file contains a whole series of ID variables, ID2006, ID2007, ID2008,...,ID2015 for all ten years of your data, I would do something like this:

    Code:
    // USE THE LINKING FILE TO CREATE A UNIQUE ID FOR EACH PERSON
    // THAT WILL APPLY IN ALL YEARS
    use linking_file, clear
    gen long unique_id = _n
    tempfile links
    save `links'
    
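    // MERGE THE UNIQUE ID INTO THE YEARLY FILES
    forvalues y = 2006/2015 {
        // KEEP ONLY LINKS WITH A NON-MISSING ID FOR THIS YEAR, SO THAT
        // ID`y' UNIQUELY IDENTIFIES OBSERVATIONS IN THE USING FILE
        // (-merge 1:1- FAILS IF SEVERAL LINK ROWS HAVE A MISSING ID`y')
        use `links', clear
        drop if missing(ID`y')
        keep ID`y' unique_id
        tempfile links`y'
        save `links`y''
        use wave`y', clear
        rename ID ID`y'
        merge 1:1 ID`y' using `links`y'', keep(master match) keepusing(unique_id)
        save linkable_wave_`y', replace
    }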
    This will leave you with ten files each of which contains the same data as the original ten files, but with each observation identified by a unique ID that is the same for the same person in whichever files he/she appears in. The next step is to clean those files individually: even the most professionally curated survey data usually contains errors and inconsistencies. It is usually easiest to clean those problems up in the individual files before you try to put them together. Once that is done, you can then -append- all the cleaned files together.
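    As a rough sketch of that final -append- step, something like the following might work. The two small datasets here are made-up stand-ins for two cleaned yearly files (the values and the tempfile names w2006 and w2007 are purely illustrative), so the block can be run on its own; it assumes every observation already has a non-missing unique_id.

```stata
// Toy stand-in for a cleaned 2006 file (hypothetical values)
clear
input long unique_id int YEAR double X1
1 2006 10
2 2006 20
3 2006 30
end
tempfile w2006
save `w2006'

// Toy stand-in for a cleaned 2007 file: households 2 and 3 continue,
// household 4 is new in this wave
clear
input long unique_id int YEAR double X1
2 2007 21
3 2007 31
4 2007 41
end
tempfile w2007
save `w2007'

// Stack the waves into one long file
use `w2006', clear
append using `w2007'

// Verify that each household appears at most once per year
isid unique_id YEAR

// Declare the panel structure and look at participation patterns
xtset unique_id YEAR
xtdescribe
```

    With the real files, the stacking part could presumably be a loop over the years (-clear-, then -append using linkable_wave_`y'- inside -forvalues y = 2006/2015-). Any observations left with a missing unique_id after the merge (those that were master-only) would need to be resolved first, since -isid- will complain about missing values unless told otherwise.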
