Egen - multiple datasets

Ahmed A Arif

Join Date: Dec 2018
Posts: 17

Egen - multiple datasets

20 Jan 2019, 14:25

I am using NHIS data from 1986-2009. I want to find a simpler way to generate ID variable for datasets labeled personsx_year.dta (year 1986-2009). The id variable - publicid - for years 1986-94 is generated using seven variabes , for years 1995-96 using three, years 1997-2003 using four.,,

I'm using the following the syntax.

Code:

*1986-1994 NHIS
* 1986
use personsx_1986, clear
egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
save personsx_1986, replace
clear

*1987
use personsx_1987, clear
egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
save personsx_1987, replace
clear

*1988
use personsx_1988, clear
egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
save personsx_1988, replace
clear

*1989
use personsx_1989, clear
egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
save personsx_1989, replace
clear

*1990
use personsx_1990, clear
egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
save personsx_1990, replace
clear

*1991
use personsx_1991, clear
egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
save personsx_1991, replace
clear

*1992
use personsx_1992, clear
egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
save personsx_1992, replace
clear

*1993
use personsx_1993, clear
egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
save personsx_1993, replace
clear

*1994
use personsx_1994, clear
egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
save personsx_1994, replace
clear


*1995-1996 NHIS
*1995
use personsx_1995, clear
egen publicid=concat(year hhid pnum)
save personsx_1995, replace
clear

*1996
use personsx_1996, clear
egen publicid=concat(year hhid pnum)
save personsx_1996, replace
clear

*1997-2003 NHIS
*1997
use personsx_1997, clear
egen publicid = concat (srvy_yr hhx fmx px)
save personsx_1997, replace
clear

Once I create the ID variable I want to append them using the syntax shared with me earlier.

Code:

forvalues y = 1986/2009 {
    append using personsx_`y
    replace year = `y' if missing(year)
}

Thanks

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30177

20 Jan 2019, 15:02

I think this is the simplest way:

Code:

clear
tempfile building
save `building', emptyok

forvalues y = 1986/2009 {
    use personsx_`y', clear
    if `y' <= 1994 {
        egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
    }
    else {
        egen publicid = concat (srvy_yr hhx fmx px)
    }
    gen year = `y'
    append using `building'
    save `"`building'"', replace
}

Note: Not tested; beware of errors, especially typos.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30177

20 Jan 2019, 15:06

Just realized there is a problem with the code in #2 due to the variable year, which apparently already exists in the 1986-2009 data sets, but not in the later ones. So revising the code to account for this:

Code:

clear
tempfile building
save `building', emptyok

forvalues y = 1986/2009 {
    use personsx_`y', clear
    if `y' <= 1994 {
        egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
    }
    else {
        egen publicid = concat (srvy_yr hhx fmx px)
    }
    capture confirm var year, exact
    if c(rc) == 0 {
        replace year = `y' if missing(year)
    }
    else {
        gen year = `y'
    }
    append using `building'
    save `"`building'"', replace
}

Comment

Ahmed A Arif

Join Date: Dec 2018

Posts: 17
#4

20 Jan 2019, 16:28

Thanks, Clyde.
Comment
Ahmed A Arif

Join Date: Dec 2018

Posts: 17
#5

20 Jan 2019, 16:42

Starting 1997 NHIS added a separate data file for sampled adults- samadult. I renamed these files as samadult_1997, samadult_1998 .....samadult_2009. Is there a way to incorporate the merge of personsx and samadult within the code in #3? Thanks
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30177
#6

20 Jan 2019, 17:06

I don't understand the question. Do you mean that you want to -merge- samadult_year with personsx_year before appending? If so, what variables constitute the merge key? And is it a 1:1 merge or something else?

Assuming the merge key is the same set of variables you use to create the publicid variable and that it's a 1:1 merge, then you could do this:

Code:

clear tempfile building save `building', emptyok forvalues y = 1986/2009 { use personsx_`y', clear if `y' <= 1994 { egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum) } else { egen publicid = concat (srvy_yr hhx fmx px) } if year >= 1997 { merge 1:1 srvy_yr hhx fmx px using samadult_`y' } capture confirm var year, exact if c(rc) == 0 { replace year = `y' if missing(year) } else { gen year = `y' } append using `building' save `"`building'"', replace }

Added: I would be very reluctant to do this in my own work. You have already identified many ways in which the data sets from the different years differ. I suspect that these differences are just the tip of an iceberg. It is likely that there are lots of other differences, including differences in the way particular variables are coded or different names for corresponding variables, that change over these 25 years. Using code like this to automate these merges and appends is an invitation to problems. If you are lucky, some of the incompatibilities will cause Stata to break, finding, for example, a variable that is a string in one file but numeric in another. If you are unlucky, everything will appear to go swimingly, and your resulting combined data set will look OK at first glance, but will prove to be quite wrong when subjected to serious scrutiny.

In my experience, attempting a single gross process to combine data sets like this ends badly. My approach is, rather, to explore each data set separately a bit, identifying the variables that I will be working with, their storage types, their encoding, etc. Then I set about deciding on a data design that I want for my working data set. I then write a file that converts each data set to that common data design: it includes things like renaming and recoding variables, merging where necessary, and then saving them under a new name (never destroy the original data). If the 25 years of files actually boil down to only three or four different setups that have to be converted to your common data design, then you can do that in three or four different do-files. Sometimes though, the worst case scenario comes true and you have to do this separately for each year's worth of data. But only after all of this work conforming each data set to the desired common data design is done do I then glue the pieces together.

Last edited by Clyde Schechter; 20 Jan 2019, 17:13.
Comment
Ahmed A Arif

Join Date: Dec 2018

Posts: 17
#7

20 Jan 2019, 17:37

Thanks for the advice, Clyde. I've been extremely cautious as I merge and append NHIS public use and mortality data sets. I agree with you, I would avoid doing it in a single step. BTW the merge is one to one in datasets post 1997. Thanks

Last edited by Ahmed A Arif; 20 Jan 2019, 17:56.
Comment

Announcement

Egen - multiple datasets

Comment

Comment

Comment

Comment

Comment

Comment