Dear Statalisters,
I would like some help with selectively removing duplicates.
I have read some other posts, but my case is a bit different.
There are two identifiers: a_hidp (household identifier number in wave 1) and pidp (personal identifier number). Each person has their own unique pidp, but people may share the same a_hidp if they live with someone else (presumably their parents).
I am only interested in people aged 16-18 and their household income.
My initial code is as follows:
sort a_hidp
quietly by a_hidp: gen dup = cond(_N==1, 0, _n)
sort a_hidp
I got the following outcome (please see the screenshot below).
However, the result seems odd. I would expect rows 1-4 to have dup = 0, not dup = 1, because they have no duplicates. Rows 5-6, by contrast, share exactly the same a_hidp (so they live together) and have dup = 1 and dup = 2, which makes sense.
Please tell me if you know the reason for this.
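A minimal sketch on made-up data of what this construction would be expected to produce. The a_hidp and pidp values in the input block and the variable name dup_tag are invented for illustration; only the gen dup line comes from the code above. It also runs the built-in duplicates tag command as a cross-check.

```stata
* Toy data: household 10 has one member, household 20 has two.
clear
input long a_hidp long pidp
10 1001
20 2001
20 2002
end

* Same construction as above: 0 for a household that appears once,
* and a running count 1, 2, ... within a household that appears several times.
sort a_hidp
quietly by a_hidp: gen dup = cond(_N==1, 0, _n)

* Cross-check: dup_tag is 0 when a_hidp is unique and
* (number of observations sharing that a_hidp) - 1 otherwise.
duplicates tag a_hidp, generate(dup_tag)

list a_hidp pidp dup dup_tag, sepby(a_hidp)
```

If the real data show dup = 1 for a household that looks unique in the listing, comparing dup with dup_tag for those observations is one way to see whether Stata really treats that a_hidp as unique.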
After this I have no idea what to do.
As mentioned, I am interested in households that include at least one person aged 16-18.
In other words, there are household A, household B and household C.
Household A includes person A (aged 16), person B (aged 39) and person C (aged 43).
Household B includes person D (aged 24) and person E (aged 35).
Household C includes person F (aged 17) and person G (aged 18).
I want to keep only Households A and C and drop Household B.
Let us look at the screenshot below. I want to keep rows 5-6 because they live together and the person in row 5 is 16 years old. I want to keep both rows because I need to look at their household income.
Likewise, I want to keep rows 13-15 because the household (68028563) has a person aged 17 and I am interested in their household income. I assume row 14 is a girl and rows 13 and 15 are her parents.
On the other hand, I do not want to keep rows 17-18, even though they have dup = 1 and dup = 2, which means they live together (the same a_hidp). This is because nobody in their household is aged 16-18, so I want to drop these rows.
Is there a way to do this? There are 30,000 observations, so I would need code that applies this rule to all households at once.
Maybe I should not have used the "dup" code at all. What do you reckon?
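A minimal sketch of one possible way to express the selection rule described above, using the Household A, B and C example as toy data. The variable names age and has_teen and all values in the input block are assumptions for illustration; the real age variable in the dataset will have a different name. The idea is to flag each household that has at least one member aged 16-18 and then keep every member of the flagged households, without relying on dup.

```stata
* Toy data mirroring Households A (1), B (2) and C (3).
* "age" is a hypothetical variable name; substitute the real one.
clear
input long a_hidp long pidp byte age
1 101 16
1 102 39
1 103 43
2 201 24
2 202 35
3 301 17
3 302 18
end

* inrange() is 1 for ages 16-18 and 0 otherwise (including missing age),
* so the household maximum is 1 if anyone in the household qualifies.
bysort a_hidp: egen byte has_teen = max(inrange(age, 16, 18))

* Keep all members of qualifying households and drop the rest.
keep if has_teen == 1

list a_hidp pidp age, sepby(a_hidp)
```

On the toy data this keeps all members of households 1 and 3 and drops household 2. Because the flag is constant within a_hidp, the adults in a kept household stay in the data, so household income remains available for them.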