Duplicates Removals

Guest
#1

27 Jul 2018, 11:42

Dear Statalisters,

I would like some help with selectively removing duplicates.
I have read some other posts, but my case is a bit different.

My data have a_hidp (household identifier number in wave 1) and pidp (personal identifier number). Each person has a unique pidp, but people who live with someone else (presumably their parents) share the same a_hidp.
I am only interested in those aged 16-18 and also their household income.

My initial code looks like this:

sort a_hidp
quietly by a_hidp: gen dup = cond(_N==1, 0, _n)
sort a_hidp

I got the following outcome (please see the screenshot below).
However, it looks odd to me. I would expect rows 1~4 to have dup=0, not dup=1, as they do not have any duplicates. Rows 5~6, by contrast, have exactly the same a_hidp (so they live together) and have dup=1 and dup=2, which makes sense.
Please tell me if you know the reason for this.
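
For reference, here is a minimal sketch of what the dup construction does on a small made-up dataset. The a_hidp and pidp values are invented for illustration and are not taken from the screenshot. A household that appears once gets dup=0, and a household that appears more than once gets 1, 2, ... within the group, so dup=1 on a row would normally mean that at least one other row carries the same a_hidp.

```stata
clear
* Hypothetical toy data: household 20 has two members, 10 and 30 have one each
input long a_hidp long pidp
10 1
20 2
20 3
30 4
end

sort a_hidp
* _N is the number of rows in the by-group, _n the row's position within it
quietly by a_hidp: gen dup = cond(_N==1, 0, _n)

* Households 10 and 30 get dup=0; the two rows of household 20 get 1 and 2
list, sepby(a_hidp)
```

If dup=1 turns up on a row that looks unique, comparing it with the group size from `by a_hidp: gen hh_size = _N` may help to narrow down where the other row is.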

After this I have no idea what to do.
As I mentioned, I am interested in households that include at least one person aged 16~18.
For example, suppose there are three households: A, B and C.
Household A includes person A (aged 16), person B (aged 39) and person C (aged 43).
Household B includes person D (aged 24) and person E (aged 35).
Household C includes person F (aged 17) and person G (aged 18).

I want to keep Household A and C only and drop Household B.

Let us look at the screenshot below. I want to keep rows 5~6: they live together and row 5 is 16 years old. I need both rows because I want to look at their household income.

Likewise, I want to keep rows 13~15, as the household (68028563) includes a person aged 17 and I am interested in their household income. I assume row 14 is a girl and rows 13 & 15 are her parents.

On the other hand, I do not want to keep rows 17~18, even though they have dup=1 and dup=2, which means they live together (the same a_hidp). This is because nobody in their household is aged 16~18. Therefore, I want to drop these rows.

Is there a way to do this? There are 30,000 observations, so I need code that applies this rule to every household.
Maybe I should not have used "dup" code. What do you reckon?
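
The rule described above (keep every member of any household that has at least one person aged 16~18) could be sketched with a household-level flag instead of dup. This is only a sketch under assumptions: the age variable is assumed to be called age, and the a_hidp and pidp values below are made up to mirror households A, B and C from the example.

```stata
clear
* Hypothetical toy data: a_hidp 1 = household A, 2 = household B, 3 = household C
input long a_hidp long pidp byte age
1 101 16
1 102 39
1 103 43
2 201 24
2 202 35
3 301 17
3 302 18
end

* inrange() is 1 for a person aged 16-18 and 0 otherwise (also 0 if age is missing);
* taking the maximum within a_hidp gives every member of the household the same flag
bysort a_hidp: egen byte has_teen = max(inrange(age, 16, 18))

* Keep whole households with at least one 16-18-year-old; household B is dropped
keep if has_teen
list, sepby(a_hidp)
```

Because the flag is constant within a household, the adults in households A and C stay in the data, so household income can still be examined for those households.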
Tags: duplicates, households, longitudinal
Bruce Weaver

Join Date: May 2014

Posts: 1132
#2

27 Jul 2018, 12:00

Hello Guest. You say you are only interested in people aged 16-18. So why not just do something like this?

Code:

keep if inrange(age,16,18)

Or if you don't want to delete records from the working dataset, but want to use that subset for a particular analysis, use an if qualifier. E.g.,

Code:

summarize income if inrange(age,16,18)

Why do you need something more complicated than that? What am I missing?

p.s. - Please review the FAQ, especially item 12.

Last edited by sladmin; 08 Apr 2019, 09:13. Reason: anonymize original poster

--
Bruce Weaver
Version: Stata/MP 19.5 (Windows)
Guest
#3

28 Jul 2018, 16:05

Thank you for your reply, Bruce.