
  • Identify and delete duplicate observations from two different files

    I have two data sets: one at the household (HH) level and one at the individual level. The HH and individual files share one ID variable, UID.
    The HH file has more than 55,000 observations and the individual file has 120,000 observations.

    There are many duplicate observations in both files. I can identify and delete the duplicates in the HH file, but I also have to identify the corresponding duplicates in the individual file.
    Can you suggest how to identify and delete the corresponding duplicate observations in the individual file? Thank you.

    Below is a sample of the data. The observations shown in bold are duplicates and need to be identified and deleted.
    HH File                     | Individual File
    UID  V1  V2  V3  V4  V5     | UID  V1  V2  V3  V4  V5
    1    10  20  30  40  50     | 1    10  20  30  40  50
    2    11  21  31  41  51     | 1    9   8   7   6   5
    3    12  22  32  42  52     | 1    6   5   4   3   2
    4    13  23  33  43  53     | 2    11  21  31  41  51
    5    14  24  34  44  54     | 2    3   2   1   3   2
    6    15  25  35  45  55     | 3    12  22  32  42  52
    7    16  26  36  46  56     | 3    7   6   5   4   3
    8    17  27  37  47  57     | 3    6   5   4   3   2
    1    10  20  30  40  50     | 4    13  23  33  43  53
    3    12  22  32  42  52     | 4    5   4   3   2   1
    2    11  21  31  41  51     | 4    4   3   2   1   4


  • #2
    Chandrashekhar:
    welcome to this forum.
    I'd start by -append-ing the two datasets and then -sort-ing them by -UID-.
    Kind regards,
    Carlo
    (Stata 19.0)
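
    A minimal sketch of the -append- and -sort- suggestion in #2. The file names hh.dta and individual.dta and the marker variable source are assumptions for illustration; only UID and V1-V5 come from #1.

    ```stata
    * Hypothetical file names: hh.dta and individual.dta
    use hh, clear
    generate byte source = 1                 // 1 = row came from the HH file
    append using individual
    replace source = 2 if missing(source)    // 2 = row came from the individual file
    sort UID source
    * Show household and individual rows side by side within each UID
    list UID source V1-V5, sepby(UID)
    ```

    Keeping a source marker makes it possible to split the stacked data back into the two files later.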


    • #3
      I am reluctant to advise until I understand more. In #1, UID appears to be a household identifier in both files, but observation 1 in the HH file corresponds to observation 1 in the individual file, with the same variables in both, while observations 2 and 3 in the individual file don't occur in the HH file.

      Also, in an individual file you can't tell what is duplicate or not without an individual identifier. E.g. twins might have the same age, gender, and so forth.

      This puzzlement could arise because #1 is just based on invented values and names, but it's why Carlo Lazzaro is thinking of append when usually such problems call for merge.

      I would say this is unclear without a more realistic example. It doesn't have to be real data, just realistic.
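
      For reference, a hedged sketch of what a -merge- based approach might look like once the data structure is clear. It assumes hypothetical file names hh.dta and individual.dta, and assumes that a household counts as duplicated when all of UID and V1-V5 repeat in the HH file. The names hh_dup and hhflag are invented for illustration.

      ```stata
      * Hypothetical file names; duplicate = same values on UID and V1-V5
      use hh, clear
      duplicates tag UID V1-V5, generate(hh_dup)   // 0 = no copies, >0 = number of extra copies
      duplicates drop UID V1-V5, force             // keep one copy of each household
      isid UID                                     // stops with an error if UID still repeats
      keep UID hh_dup
      tempfile hhflag
      save `hhflag'

      * Carry the household-level flag over to every individual row
      use individual, clear
      merge m:1 UID using `hhflag'
      tabulate _merge hh_dup, missing
      ```

      This only flags individuals who belong to a household that was duplicated in the HH file. It does not decide which individual rows to delete; without an individual identifier that remains a judgment call.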


      • #4
        I agree with #3 in needing more information to understand your issue.

        You may also want to confirm that you do in fact have actual duplicates. Is the household ID supposed to uniquely identify the household, or is there another variable that jointly identifies unique households? As one example of a standard well-known dataset where this can happen, the IHDS-2 (Indian Human Development Survey) has a separate household "split" ID variable that helps separate two households that were together in round 1 but split by the time of round 2.
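
        One way to run that check, as a sketch only. The file name hh.dta and the variable names n_extra and splitid are hypothetical; splitid stands in for whatever second identifier the survey documentation names.

        ```stata
        * Hypothetical file name: hh.dta
        use hh, clear
        duplicates report UID           // how often each UID value occurs
        duplicates report UID V1-V5     // how often entire rows repeat

        * Inspect households whose UID occurs more than once
        duplicates tag UID, generate(n_extra)
        sort UID
        list if n_extra > 0, sepby(UID)

        * If a second identifier exists (hypothetical name splitid),
        * isid exits with an error unless the pair is unique:
        * isid UID splitid
        ```

        If the two reports disagree, some repeated UID values are not exact copies and may be distinct households rather than duplicates.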
