manipulating dataset: linking family and individual records

lili91

Join Date: Sep 2014

Posts: 11
#1

27 Sep 2014, 10:22

Hi,

I am new to Stata and was wondering if someone could help me with the following question:

I imported a census sample file using insheet, and now have a variable (v1) equal to one if the observation is at the family-level, and equal to zero if the observation is at the personal-level. So the data looks something like this (where . denotes missing information for a given variable):

v1  v2  v3  v4  v5
 1   1   2   .   .
 0   3   4   5   6
 0   7   8   9  10
 1  11  12   .   .
 0  13  14  15  16

Hence, family-level observations contain 2 variables (v2, v3), while personal-level observations contain 4 variables (v2, v3, v4, v5). Note that v2 and v3 for family-level records (v1==1) are not the same as v2 and v3 for personal level records (v1==0).

In the example above, there are 2 families: one with two individuals, and another one with one individual. I do not know how to link family-level characteristics to personal-level characteristics, but I want the dataset to look something like this:

v1  v2  v3  v4  v5  v6  v7
 1   1   2   .   .   .   .
 0   3   4   5   6   1   2
 0   7   8   9  10   1   2
 1  11  12   .   .   .   .
 0  13  14  15  16  11  12

That is, personal records now have a v6 that corresponds to v2 for family records, and a v7 that corresponds to v3 for family records.

I do not have ID variables for families or individuals. The only way I can tell that a person belongs to a given family is that each row of family information is followed by the personal-level rows of each of its members. Any ideas how to do this in Stata?
Mike Barker

Join Date: Apr 2014

Posts: 37
#2

27 Sep 2014, 11:32

You need to create a household id variable. You can do this with the sum() function, which returns a running sum: because v1 is 1 on every family row and 0 on every person row, the sum goes up by one each time a new family starts. Then you can separate the individual and household variables into different data sets and merge them back together. The code will look like this:

Code:

gen hhid = sum(v1)
save temp_data, replace

* Household observations
keep if v1==1
rename v2 v6
rename v3 v7
save household.dta

* Individual observations
use temp_data, clear
keep if v1==0

* Make a person id variable: not strictly necessary
bysort hhid: gen pid = _n

merge m:1 hhid using household.dta
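For comparison, here is a minimal sketch of a different route that skips the save/merge step and copies the family values down within each household. It is not the approach posted above, only an illustration on the five example rows from the first post. The helper variable row is a name made up for this sketch, and it assumes the file is still in its original order, with each family row directly above its members.

```stata
* Rebuild the five example rows from the first post
clear
input v1 v2 v3 v4 v5
1  1  2  .  .
0  3  4  5  6
0  7  8  9 10
1 11 12  .  .
0 13 14 15 16
end

* Running sum of v1: increases by one at every family row
gen long hhid = sum(v1)

* Hypothetical helper: remember the original row order
gen long row = _n

* Within each household the family row is first (lowest row number),
* so v2[1] and v3[1] are that family's values
bysort hhid (row): gen v6 = v2[1] if v1 == 0
by hhid: gen v7 = v3[1] if v1 == 0

list v1 v2 v3 v4 v5 v6 v7, noobs
```

The family rows stay in the data with v6 and v7 missing, which matches the layout requested in the first post. If the family rows are not wanted afterwards, -drop if v1 == 1- would remove them.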
Clyde Schechter

Join Date: Apr 2014

Posts: 30057
#3

27 Sep 2014, 12:46

One quick suggestion about Mike Barker's code. Since this data comes from the Census, the number of families involved may be large enough that a -float-, the default data type with -gen-, may not have the precision needed to hold distinct IDs for each household. It would be safer to change the first line to: -gen long hhid = sum(v1)-. The -long- data type can safely hold distinct IDs for a much greater number of families than exist in the US.

In general, when generating IDs in this way, it is safer to use a -long-. And there is no downside to specifying this since -long- and -float- use the same amount of storage. The advantage of -long- is that all its bits except the sign bit add precision, whereas -float- achieves its larger range of values by using some of those bits for the exponent, at the expense of precision.
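To make the precision point concrete, here is a small sketch. A -float- has a 24-bit significand, so it holds every integer exactly only up to 2^24 = 16,777,216. The variable names id_float and id_long are made up for this illustration.

```stata
* 2^24 is still exact in a float; the next integer is not representable
display %12.0f float(16777216)
display %12.0f float(16777217)

* Two consecutive ids around that limit, stored both ways
clear
set obs 2
gen float id_float = 16777215 + _n
gen long  id_long  = 16777215 + _n
format id_float id_long %12.0f
list, noobs
```

Both rows end up with the same id_float, because 16,777,217 is rounded to 16,777,216 when it is stored, while id_long keeps the two values distinct. With hhid stored as a -float-, two neighbouring households past that point could therefore share an id.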
Mike Barker

Join Date: Apr 2014

Posts: 37
#4

27 Sep 2014, 13:10

I agree, that is a good improvement.
And thanks for the tip on float vs. long. I never knew the difference between those two data types.
lili91

Join Date: Sep 2014

Posts: 11
#5

28 Sep 2014, 01:20

Dear Mike and Clyde,

Thank you very much for your help. This worked well.
