Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Egen - multiple datasets

    I am using NHIS data from 1986-2009. I want to find a simpler way to generate ID variable for datasets labeled personsx_year.dta (year 1986-2009). The id variable - publicid - for years 1986-94 is generated using seven variabes , for years 1995-96 using three, years 1997-2003 using four.,,

    I'm using the following the syntax.

    Code:
    *1986-1994 NHIS
    * 1986
    use personsx_1986, clear
    egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
    save personsx_1986, replace
    clear
    
    *1987
    use personsx_1987, clear
    egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
    save personsx_1987, replace
    clear
    
    *1988
    use personsx_1988, clear
    egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
    save personsx_1988, replace
    clear
    
    *1989
    use personsx_1989, clear
    egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
    save personsx_1989, replace
    clear
    
    *1990
    use personsx_1990, clear
    egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
    save personsx_1990, replace
    clear
    
    *1991
    use personsx_1991, clear
    egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
    save personsx_1991, replace
    clear
    
    *1992
    use personsx_1992, clear
    egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
    save personsx_1992, replace
    clear
    
    *1993
    use personsx_1993, clear
    egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
    save personsx_1993, replace
    clear
    
    *1994
    use personsx_1994, clear
    egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
    save personsx_1994, replace
    clear
    
    
    *1995-1996 NHIS
    *1995
    use personsx_1995, clear
    egen publicid=concat(year hhid pnum)
    save personsx_1995, replace
    clear
    
    *1996
    use personsx_1996, clear
    egen publicid=concat(year hhid pnum)
    save personsx_1996, replace
    clear
    
    *1997-2003 NHIS
    *1997
    use personsx_1997, clear
    egen publicid = concat (srvy_yr hhx fmx px)
    save personsx_1997, replace
    clear
    Once I create the ID variable I want to append them using the syntax shared with me earlier.

    Code:
    forvalues y = 1986/2009 {
        append using personsx_`y
        replace year = `y' if missing(year)
    }
    Thanks

  • #2
    I think this is the simplest way:

    Code:
    clear
    tempfile building
    save `building', emptyok
    
    forvalues y = 1986/2009 {
        use personsx_`y', clear
        if `y' <= 1994 {
            egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
        }
        else {
            egen publicid = concat (srvy_yr hhx fmx px)
        }
        gen year = `y'
        append using `building'
        save `"`building'"', replace
    }
    Note: Not tested; beware of errors, especially typos.

    Comment


    • #3
      Just realized there is a problem with the code in #2 due to the variable year, which apparently already exists in the 1986-2009 data sets, but not in the later ones. So revising the code to account for this:

      Code:
      clear
      tempfile building
      save `building', emptyok
      
      forvalues y = 1986/2009 {
          use personsx_`y', clear
          if `y' <= 1994 {
              egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
          }
          else {
              egen publicid = concat (srvy_yr hhx fmx px)
          }
          capture confirm var year, exact
          if c(rc) == 0 {
              replace year = `y' if missing(year)
          }
          else {
              gen year = `y'
          }
          append using `building'
          save `"`building'"', replace
      }

      Comment


      • #4
        Thanks, Clyde.

        Comment


        • #5
          Starting 1997 NHIS added a separate data file for sampled adults- samadult. I renamed these files as samadult_1997, samadult_1998 .....samadult_2009. Is there a way to incorporate the merge of personsx and samadult within the code in #3? Thanks

          Comment


          • #6
            I don't understand the question. Do you mean that you want to -merge- samadult_year with personsx_year before appending? If so, what variables constitute the merge key? And is it a 1:1 merge or something else?

            Assuming the merge key is the same set of variables you use to create the publicid variable and that it's a 1:1 merge, then you could do this:

            Code:
            clear
            tempfile building
            save `building', emptyok
            
            forvalues y = 1986/2009 {
                use personsx_`y', clear
                if `y' <= 1994 {
                    egen publicid = concat(year quarter psunumr weekcen segnum hhnum pnum)
                }
                else {
                    egen publicid = concat (srvy_yr hhx fmx px)
                }
               if year >= 1997 {
                    merge 1:1 srvy_yr hhx fmx px using samadult_`y'
                }
                capture confirm var year, exact
                if c(rc) == 0 {
                    replace year = `y' if missing(year)
                }
                else {
                    gen year = `y'
                }
                append using `building'
                save `"`building'"', replace
            }
            Added: I would be very reluctant to do this in my own work. You have already identified many ways in which the data sets from the different years differ. I suspect that these differences are just the tip of an iceberg. It is likely that there are lots of other differences, including differences in the way particular variables are coded or different names for corresponding variables, that change over these 25 years. Using code like this to automate these merges and appends is an invitation to problems. If you are lucky, some of the incompatibilities will cause Stata to break, finding, for example, a variable that is a string in one file but numeric in another. If you are unlucky, everything will appear to go swimingly, and your resulting combined data set will look OK at first glance, but will prove to be quite wrong when subjected to serious scrutiny.

            In my experience, attempting a single gross process to combine data sets like this ends badly. My approach is, rather, to explore each data set separately a bit, identifying the variables that I will be working with, their storage types, their encoding, etc. Then I set about deciding on a data design that I want for my working data set. I then write a file that converts each data set to that common data design: it includes things like renaming and recoding variables, merging where necessary, and then saving them under a new name (never destroy the original data). If the 25 years of files actually boil down to only three or four different setups that have to be converted to your common data design, then you can do that in three or four different do-files. Sometimes though, the worst case scenario comes true and you have to do this separately for each year's worth of data. But only after all of this work conforming each data set to the desired common data design is done do I then glue the pieces together.
            Last edited by Clyde Schechter; 20 Jan 2019, 17:13.

            Comment


            • #7
              Thanks for the advice, Clyde. I've been extremely cautious as I merge and append NHIS public use and mortality data sets. I agree with you, I would avoid doing it in a single step. BTW the merge is one to one in datasets post 1997. Thanks
              Last edited by Ahmed A Arif; 20 Jan 2019, 17:56.

              Comment

              Working...
              X