  • Duplicates Removals

    Dear Statalisters,

    I would like some help with selectively removing duplicates.
    I have read some other posts, but my case is a bit different.

    My data contain a_hidp (the household identifier in wave 1) and pidp (the personal identifier). Each person has a unique pidp, but people who live together (presumably with their parents) share the same a_hidp.
    I am only interested in people aged 16-18 and their household income.

    My initial code looks like this:

    sort a_hidp
    quietly by a_hidp: gen dup = cond(_N==1, 0, _n)
    sort a_hidp

    I got the following outcome (please see the screenshot below).
    However, it looks odd to me. I would expect rows 1-4 to have dup=0 rather than dup=1, because they have no duplicates. Rows 5-6, by contrast, have exactly the same a_hidp (so they live together) and have dup=1 and dup=2, which makes sense.
    Please tell me if you know the reason for this.

    After this, I have no idea what to do.
    As I mentioned, I am interested in households that include at least one person aged 16-18.
    For example, suppose there are three households: A, B and C.
    Household A includes person A (aged 16), person B (aged 39) and person C (aged 43).
    Household B includes person D (aged 24) and person E (aged 35).
    Household C includes person F (aged 17) and person G (aged 18).

    I want to keep Households A and C only and drop Household B.



    Let us look at the screenshot below. I want to keep rows 5-6 because they live together and row 5 is 16 years old. I need both rows 5 and 6 in order to look at their household income.

    Likewise, I want to keep rows 13-15 because the household (68028563) includes a person aged 17 and I am interested in their household income. I assume row 14 is a girl and rows 13 and 15 are her parents.

    On the other hand, I do not want to keep rows 17-18, even though they have dup=1 and dup=2, which means they live together (the same a_hidp). Nobody in their household is aged 16-18, so I want to drop these rows.

    Is there a way to do this? There are 30,000 observations, so I need code that applies this rule to all the households.
    Maybe I should not have used the "dup" code. What do you reckon?



    [Screenshot: 1.jpg]

  • #2
    Hello Guest. You say you are only interested in people aged 16-18. So why not just do something like this?

    Code:
    keep if inrange(age,16,18)
    Or if you don't want to delete records from the working dataset, but want to use that subset for a particular analysis, use an if qualifier. E.g.,

    Code:
    summarize income if inrange(age,16,18)
    Why do you need something more complicated than that? What am I missing?
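
    If the goal is instead to keep every member of any household that contains at least one person aged 16-18 (so that household income can be examined), a minimal sketch might look like the following. It assumes the age variable is called age, as in the code above, and has_teen is a hypothetical name for the new flag; neither name is confirmed by the original post.

    ```stata
    * Sketch only: assumes variables a_hidp and age exist in the dataset.
    * has_teen is a hypothetical flag name introduced for illustration.
    * Within each household, take the maximum of a 0/1 indicator for age 16-18,
    * so has_teen is 1 for every member if at least one member is aged 16-18.
    bysort a_hidp: egen byte has_teen = max(inrange(age, 16, 18))

    * Keep whole households that include at least one person aged 16-18.
    keep if has_teen
    ```

    Because inrange() returns 0 when age is missing, people with missing age do not count toward the flag, but they are still kept if someone else in their household qualifies. As with the example above, an if has_teen qualifier can be used instead of keep if the other records should stay in the working dataset.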

    p.s. - Please review the FAQ, especially item 12.
    Last edited by sladmin; 08 Apr 2019, 09:13. Reason: anonymize original poster
    --
    Bruce Weaver
    Email: [email protected]
    Web: http://sites.google.com/a/lakeheadu.ca/bweaver/
    Version: Stata/MP 18.0 (Windows)



    • #3
      Thank you for your reply, Bruce.
