Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifying original and split-off households

    Hi Statalist,

    I am trying to merge 3 different data sets that follow the same households over time.
    Some households were interviewed in all 3 rounds, others in just 2.

    My problem is: the household ID changed slightly between the first two rounds to incorporate split-off households.
    A sample of the data can be found below where hhid_y1 represents the first year, hhid_y2 represents the second year and hhid_y3 represents the third year.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str16(hhid_y1 hhid_y2) str8 hhid_y3
    "53020330040006" "5302033004000601" "3720-001"
    "53020330040006" "5302033004000601" "3720-001"
    "53020330040006" "5302033004000601" "3720-001"
    "53020330040006" "5302033004000603" "-"       
    "53020330040006" "5302033004000603" "-"       
    "53020330040006" "5302033004000604" "3722-001"
    "53020330040006" "5302033004000604" "3722-001"
    "55020010050048" "5502001005004801" "3892-001"
    "55020010050048" "5502001005004801" "3892-001"
    "55020010050048" "5502001005004801" "3892-001"
    "55020010050048" "5502001005004801" "3892-001"
    "55020010050048" "5502001005004803" "3893-001"
    "55020010050048" "5502001005004803" "3893-001"
    "55010040320002" "5501004032000201" "3833-001"
    "55010040320002" "5501004032000201" "3833-001"
    "55010040320002" "5501004032000201" "3833-001"
    "55010040320002" "5501004032000201" "3833-001"
    "55010040320002" "5501004032000201" "3833-001"
    "55010040320002" "5501004032000201" "3833-001"
    "55010040320002" "5501004032000201" "3833-001"
    "55010040320002" "5501004032000201" "3833-001"
    "55010040320002" "5501004032000201" "3833-001"
    "55010040320002" "5501004032000203" "3834-001"
    "55010040320002" "5501004032000203" "3834-001"
    "01010140020171" "0101014002017101" "0001-001"
    "01010140020171" "0101014002017101" "0001-001"
    "01010140020171" "0101014002017101" "0001-001"
    "01010140020171" "0101014002017101" "0001-001"
    "01010140020171" "0101014002017101" "0001-001"
    "01010140020409" "0101014002040901" "0005-001"
    "01010140020409" "0101014002040901" "0005-001"
    "01010140020409" "0101014002040901" "0005-001"
    "01010140020409" "0101014002040901" "0005-001"
    "01010140020409" "0101014002040901" "0005-001"
    "01010140020409" "0101014002040901" "0005-001"
    "01040173040034" "0104017304003401" "0049-001"
    "01040173040034" "0104017304003401" "0049-001"
    "01040173040034" "0104017304003401" "0049-001"
    "01040173040034" "0104017304003401" "0049-001"
    "01040173040034" "0104017304003401" "0049-001"
    "01040173040034" "0104017304003401" "0049-001"
    "01040173040034" "0104017304003406" "0050-001"       
    "01040173040034" "0104017304003407" "0051-001"
    "01040173040034" "0104017304003407" "0051-001"
    "01040173040034" "0104017304003407" "0051-001"
    "01040173040034" "0104017304003407" "0051-001"
    
    end
    If you take a look at the blue observations, you can see that there was only 1 household in the first year (53020330040006) and that in the second year of the survey, this household had split into a total of 3 households. In the third year, only 2 of the total 3 households were interviewed hence the missing hhid_y3 variable for household 5302033004000603 (hhid_y2 variable).

    I would like to construct and use a balanced panel for my analysis.

    In order to do this, I need to create a variable that helps me identify the split off households from the original households.
    In the documentation of the survey, they mention that the last 2 digits of split off households (using the hhid_y2 variable) are greater than the last two digits of the original households.

    For example,

    Year 1: HHID = "01040173040034"
    Year 2: HHID = "0104017304003401" , "0104017304003406" and "0104017304003407"

    06 and 07 are both greater than 01 hence these 2 households are split-off households.

    In the majority of the cases, having 01 as the last 2 digits in the hhid_y2 variable signals an original household but I suspect that there may be cases where an original household has e.g. 02 as the last 2 digits and a split off household may have 05.

    There are also households that remained the same in all 3 rounds (they did not split up at all)

    Question:

    Is it possible to create a variable that will tell me exactly which households are original households and which are split-off households using the hhid_y2 variable?

    I was thinking something along the lines of:

    bysort y2_hhid: egen new_var if last 2 digits are the "smallest"

    Sorry for the long post, I hope this makes sense.
    I'll appreciate any feeback on how I can solve this.

    Thank you

  • #2
    Try this:
    Code:
    assert hhid_y1 == substr(hhid_y2, 1, length(hhid_y1))
    by hhid_y1 (hhid_y2), sort: gen wanted = sum(hhid_y2 != hhid_y2[_n-1])
    replace wanted = (wanted == 1)
    Added: On further reflection, the following one-liner is simpler and produces the same result:
    Code:
    by hhid_y1 (hhid_y2), sort: gen wanted = (hhid_y2 == hhid_y2[1])

    Comment


    • #3
      Clyde, thank you so much for your quick and helpful reply!
      I have tried both codes and they both work.

      Now, to keep only one observation per original household I ran the following code:

      Code:
       keep if wanted == 1 //(to keep only the original hhs)
      sort hhid_y2
      quietly by hhid_y2: gen dup = cond(_N==1, 0, _n)
      keep if dup ==1
      This particular data set only included their IDs so I did not lose any other valuable information, if anyone is wondering.

      Thank you once again! I appreciate it.

      Comment

      Working...
      X