Hi Statalist,
I am trying to merge 3 different data sets that follow the same households over time.
Some households were interviewed in all 3 rounds, others in just 2.
My problem is: the household ID changed slightly between the first two rounds to incorporate split-off households.
A sample of the data can be found below where hhid_y1 represents the first year, hhid_y2 represents the second year and hhid_y3 represents the third year.
If you take a look at the blue observations, you can see that there was only 1 household in the first year (53020330040006) and that in the second year of the survey, this household had split into a total of 3 households. In the third year, only 2 of the total 3 households were interviewed hence the missing hhid_y3 variable for household 5302033004000603 (hhid_y2 variable).
I would like to construct and use a balanced panel for my analysis.
In order to do this, I need to create a variable that helps me identify the split off households from the original households.
In the documentation of the survey, they mention that the last 2 digits of split off households (using the hhid_y2 variable) are greater than the last two digits of the original households.
For example,
Year 1: HHID = "01040173040034"
Year 2: HHID = "0104017304003401" , "0104017304003406" and "0104017304003407"
06 and 07 are both greater than 01 hence these 2 households are split-off households.
In the majority of the cases, having 01 as the last 2 digits in the hhid_y2 variable signals an original household but I suspect that there may be cases where an original household has e.g. 02 as the last 2 digits and a split off household may have 05.
There are also households that remained the same in all 3 rounds (they did not split up at all)
Question:
Is it possible to create a variable that will tell me exactly which households are original households and which are split-off households using the hhid_y2 variable?
I was thinking something along the lines of:
bysort y2_hhid: egen new_var if last 2 digits are the "smallest"
Sorry for the long post, I hope this makes sense.
I'll appreciate any feeback on how I can solve this.
Thank you
I am trying to merge 3 different data sets that follow the same households over time.
Some households were interviewed in all 3 rounds, others in just 2.
My problem is: the household ID changed slightly between the first two rounds to incorporate split-off households.
A sample of the data can be found below where hhid_y1 represents the first year, hhid_y2 represents the second year and hhid_y3 represents the third year.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str16(hhid_y1 hhid_y2) str8 hhid_y3
"53020330040006" "5302033004000601" "3720-001"
"53020330040006" "5302033004000601" "3720-001"
"53020330040006" "5302033004000601" "3720-001"
"53020330040006" "5302033004000603" "-"
"53020330040006" "5302033004000603" "-"
"53020330040006" "5302033004000604" "3722-001"
"53020330040006" "5302033004000604" "3722-001"
"55020010050048" "5502001005004801" "3892-001"
"55020010050048" "5502001005004801" "3892-001"
"55020010050048" "5502001005004801" "3892-001"
"55020010050048" "5502001005004801" "3892-001"
"55020010050048" "5502001005004803" "3893-001"
"55020010050048" "5502001005004803" "3893-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000201" "3833-001"
"55010040320002" "5501004032000203" "3834-001"
"55010040320002" "5501004032000203" "3834-001"
"01010140020171" "0101014002017101" "0001-001"
"01010140020171" "0101014002017101" "0001-001"
"01010140020171" "0101014002017101" "0001-001"
"01010140020171" "0101014002017101" "0001-001"
"01010140020171" "0101014002017101" "0001-001"
"01010140020409" "0101014002040901" "0005-001"
"01010140020409" "0101014002040901" "0005-001"
"01010140020409" "0101014002040901" "0005-001"
"01010140020409" "0101014002040901" "0005-001"
"01010140020409" "0101014002040901" "0005-001"
"01010140020409" "0101014002040901" "0005-001"
"01040173040034" "0104017304003401" "0049-001"
"01040173040034" "0104017304003401" "0049-001"
"01040173040034" "0104017304003401" "0049-001"
"01040173040034" "0104017304003401" "0049-001"
"01040173040034" "0104017304003401" "0049-001"
"01040173040034" "0104017304003401" "0049-001"
"01040173040034" "0104017304003406" "0050-001"
"01040173040034" "0104017304003407" "0051-001"
"01040173040034" "0104017304003407" "0051-001"
"01040173040034" "0104017304003407" "0051-001"
"01040173040034" "0104017304003407" "0051-001"
end
I would like to construct and use a balanced panel for my analysis.
In order to do this, I need to create a variable that helps me identify the split off households from the original households.
In the documentation of the survey, they mention that the last 2 digits of split off households (using the hhid_y2 variable) are greater than the last two digits of the original households.
For example,
Year 1: HHID = "01040173040034"
Year 2: HHID = "0104017304003401" , "0104017304003406" and "0104017304003407"
06 and 07 are both greater than 01 hence these 2 households are split-off households.
In the majority of the cases, having 01 as the last 2 digits in the hhid_y2 variable signals an original household but I suspect that there may be cases where an original household has e.g. 02 as the last 2 digits and a split off household may have 05.
There are also households that remained the same in all 3 rounds (they did not split up at all)
Question:
Is it possible to create a variable that will tell me exactly which households are original households and which are split-off households using the hhid_y2 variable?
I was thinking something along the lines of:
bysort y2_hhid: egen new_var if last 2 digits are the "smallest"
Sorry for the long post, I hope this makes sense.
I'll appreciate any feeback on how I can solve this.
Thank you
Comment