Identifying original and split-off households

Kevin Marian

Join Date: Jan 2020
Posts: 14

Identifying original and split-off households

27 Feb 2020, 16:08

Hi Statalist,

I am trying to merge 3 different data sets that follow the same households over time.
Some households were interviewed in all 3 rounds, others in just 2.

My problem is: the household ID changed slightly between the first two rounds to incorporate split-off households.
A sample of the data can be found below where hhid_y1 represents the first year, hhid_y2 represents the second year and hhid_y3 represents the third year.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str16(hhid_y1 hhid_y2) str8 hhid_y3
"53020330040006" "5302033004000601" "3720-001"
"53020330040006" "5302033004000601" "3720-001"
"53020330040006" "5302033004000601" "3720-001"
"53020330040006" "5302033004000603" "-"       
"53020330040006" "5302033004000603" "-"       
"53020330040006" "5302033004000604" "3722-001"
"53020330040006" "5302033004000604" "3722-001"
"55020010050048" "5502001005004801" "3892-001"
"55020010050048" "5502001005004801" "3892-001"
"55020010050048" "5502001005004801" "3892-001"
"55020010050048" "5502001005004801" "3892-001"
"55020010050048" "5502001005004803" "3893-001"
"55020010050048" "5502001005004803" "3893-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000203" "3834-001"
"55010040320002" "5501004032000203" "3834-001"
"01010140020171" "0101014002017101" "0001-001"
"01010140020171" "0101014002017101" "0001-001"
"01010140020171" "0101014002017101" "0001-001"
"01010140020171" "0101014002017101" "0001-001"
"01010140020171" "0101014002017101" "0001-001"
"01010140020409" "0101014002040901" "0005-001"
"01010140020409" "0101014002040901" "0005-001"
"01010140020409" "0101014002040901" "0005-001"
"01010140020409" "0101014002040901" "0005-001"
"01010140020409" "0101014002040901" "0005-001"
"01010140020409" "0101014002040901" "0005-001"
"01040173040034" "0104017304003401" "0049-001"
"01040173040034" "0104017304003401" "0049-001"
"01040173040034" "0104017304003401" "0049-001"
"01040173040034" "0104017304003401" "0049-001"
"01040173040034" "0104017304003401" "0049-001"
"01040173040034" "0104017304003401" "0049-001"
"01040173040034" "0104017304003406" "0050-001"       
"01040173040034" "0104017304003407" "0051-001"
"01040173040034" "0104017304003407" "0051-001"
"01040173040034" "0104017304003407" "0051-001"
"01040173040034" "0104017304003407" "0051-001"

end

If you take a look at the blue observations, you can see that there was only 1 household in the first year (53020330040006) and that in the second year of the survey, this household had split into a total of 3 households. In the third year, only 2 of the total 3 households were interviewed hence the missing hhid_y3 variable for household 5302033004000603 (hhid_y2 variable).

I would like to construct and use a balanced panel for my analysis.

In order to do this, I need to create a variable that helps me identify the split off households from the original households.
In the documentation of the survey, they mention that the last 2 digits of split off households (using the hhid_y2 variable) are greater than the last two digits of the original households.

For example,

Year 1: HHID = "01040173040034"
Year 2: HHID = "0104017304003401" , "0104017304003406" and "0104017304003407"

06 and 07 are both greater than 01 hence these 2 households are split-off households.

In the majority of the cases, having 01 as the last 2 digits in the hhid_y2 variable signals an original household but I suspect that there may be cases where an original household has e.g. 02 as the last 2 digits and a split off household may have 05.

There are also households that remained the same in all 3 rounds (they did not split up at all)

Question:

Is it possible to create a variable that will tell me exactly which households are original households and which are split-off households using the hhid_y2 variable?

I was thinking something along the lines of:

bysort y2_hhid: egen new_var if last 2 digits are the "smallest"

Sorry for the long post, I hope this makes sense.
I'll appreciate any feeback on how I can solve this.

Thank you

Tags: households, merge, string, survey

Clyde Schechter

Join Date: Apr 2014
Posts: 30117

27 Feb 2020, 16:35

Try this:

Code:

assert hhid_y1 == substr(hhid_y2, 1, length(hhid_y1))
by hhid_y1 (hhid_y2), sort: gen wanted = sum(hhid_y2 != hhid_y2[_n-1])
replace wanted = (wanted == 1)

Added: On further reflection, the following one-liner is simpler and produces the same result:

Code:

by hhid_y1 (hhid_y2), sort: gen wanted = (hhid_y2 == hhid_y2[1])

Comment

Kevin Marian

Join Date: Jan 2020

Posts: 14
#3

27 Feb 2020, 17:13

Clyde, thank you so much for your quick and helpful reply!
I have tried both codes and they both work.

Now, to keep only one observation per original household I ran the following code:

Code:

keep if wanted == 1 //(to keep only the original hhs) sort hhid_y2 quietly by hhid_y2: gen dup = cond(_N==1, 0, _n) keep if dup ==1

This particular data set only included their IDs so I did not lose any other valuable information, if anyone is wondering.

Thank you once again! I appreciate it.
Comment

Announcement

Identifying original and split-off households

Comment

Comment