  • manipulating dataset: linking family and individual records

    Hi,

    I am new to Stata and was wondering if someone could help me with the following question:

    I imported a census sample file using insheet, and now have a variable (v1) equal to one if the observation is at the family-level, and equal to zero if the observation is at the personal-level. So the data looks something like this (where . denotes missing information for a given variable):
    v1 v2 v3 v4 v5
    1 1 2 . .
    0 3 4 5 6
    0 7 8 9 10
    1 11 12 . .
    0 13 14 15 16

    Hence, family-level observations contain 2 variables (v2, v3), while personal-level observations contain 4 variables (v2, v3, v4, v5). Note that v2 and v3 for family-level records (v1==1) are not the same as v2 and v3 for personal level records (v1==0).

    In the example above, there are 2 families: one with two individuals, and another one with one individual. I do not know how to link family-level characteristics to personal-level characteristics, but I want the dataset to look something like this:
    v1 v2 v3 v4 v5 v6 v7
    1 1 2 . . . .
    0 3 4 5 6 1 2
    0 7 8 9 10 1 2
    1 11 12 . . . .
    0 13 14 15 16 11 12
    That is, personal records now have a v6 that corresponds to v2 for family records, and a v7 that corresponds to v3 for family records.

    I do not have id variables for families or individuals; the only way to tell that a person belongs to a given family is that each row of family information is followed by the personal-level rows of each of its members. Any ideas how to do this in Stata?

  • #2
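    You need to create a household id variable. You can do this with the sum() function, which returns a running sum: because v1 is 1 on family rows and 0 on person rows, the sum goes up by one at each family row and stays constant across that family's members. Then you can separate individual and household variables into different data sets and merge them back together. The code will look like this: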

    Code:
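    * Running count of family rows: every row gets its family's number
    gen hhid = sum(v1)
    
    save temp_data, replace
    
    * Household observations: keep only the id and the renamed family variables
    keep if v1==1
    rename v2 v6
    rename v3 v7
    keep hhid v6 v7
    * replace lets the code be re-run without a "file already exists" error
    save household.dta, replace
    
    * Individual observations
    use temp_data, clear
    keep if v1==0
    * Make a person id variable: not strictly necessary
    bysort hhid: gen pid = _n
    
    * Attach the family variables to every person in the household
    merge m:1 hhid using household.dta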
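
    The merge above leaves only the person-level rows. If the family rows should stay in the dataset, as in the layout requested in the question, a merge-free variant is possible. This is only a sketch, assuming the data are still in their original file order; the variable name obs is hypothetical and added just to preserve that order.

    ```stata
    clear
    input v1 v2 v3 v4 v5
    1  1  2  .  .
    0  3  4  5  6
    0  7  8  9 10
    1 11 12  .  .
    0 13 14 15 16
    end

    * Remember the original row order (obs is a hypothetical helper variable)
    gen long obs = _n

    * Running count of family rows gives each row its household number
    gen long hhid = sum(v1)

    * Within each household the family row comes first in file order,
    * so v2[1] and v3[1] are the family-level values
    bysort hhid (obs): gen v6 = v2[1] if v1 == 0
    bysort hhid (obs): gen v7 = v3[1] if v1 == 0

    list v1-v7, sepby(hhid) noobs
    ```

    Sorting on obs within hhid keeps the rows in their original sequence, which matters here because the family row is identified purely by its position.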


    • #3
      One quick suggestion about Mike Barker's code. Since this data comes from the Census, the number of families involved may be large enough that a -float-, the default data type with -gen-, may not have the precision needed to hold distinct IDs for each household. It would be safer to change the first line to: -gen long hhid = sum(v1)-. The -long- data type can safely hold distinct IDs for a much greater number of families than exist in the US.

      In general, when generating IDs in this way, it is safer to use a -long-. And there is no downside to specifying this since -long- and -float- use the same amount of storage. The advantage of -long- is that all its bits except the sign bit add precision, whereas -float- achieves its larger range of values by using some of those bits for the exponent, at the expense of precision.
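
      To make the precision point concrete: a -float- has a 24-bit significand, so it can represent every integer exactly only up to 16,777,216 (2^24). The sketch below is an illustration on made-up values, not census data, and the variable names x, id_float and id_long are hypothetical.

      ```stata
      clear
      set obs 3

      * Three consecutive integers starting at 2^24, held exactly in a double
      gen double x = 16777215 + _n

      * Store the same values as float and as long
      gen float id_float = x
      gen long  id_long  = x

      format x id_float id_long %12.0f

      * 16777217 is not representable as a float and is stored as 16777216,
      * so id_float has a duplicate while id_long keeps three distinct values
      list, noobs
      ```

      A -long- holds integers exactly up to a little over 2.1 billion, which is why it is the safer choice for generated IDs.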


      • #4
        I agree, that is a good improvement.
        And thanks for the tip on float vs. long. I never knew the difference between those two data types.


        • #5
          Dear Mark and Clyde,

          Thank you very much for your help. This worked well.
