Identify and delete duplicate observations from two different files

Chandrashekhar Chandra

Join Date: Jan 2023
Posts: 1

17 Jan 2023, 02:51

I have two datasets: a household (HH) file and an individual file. The HH and individual files have one unique ID.
There are more than 55,000 observations in the HH file, and the individual file has 120,000 observations.

There are many duplicate observations in both the HH and individual files. I can identify the duplicates in the HH file and delete them. The problem is that I also have to identify the corresponding duplicates in the individual file.
Can you suggest how to identify and delete the corresponding duplicate observations from the individual file? Thank you.

Given below is a sample of the data. The bold observations are duplicates and need to be identified and deleted.

HH File						Individual File
UID	V1	V2	V3	V4	V5	UID	V1	V2	V3	V4	V5
1	10	20	30	40	50	1	10	20	30	40	50
2	11	21	31	41	51	1	9	8	7	6	5
3	12	22	32	42	52	1	6	5	4	3	2
4	13	23	33	43	53	2	11	21	31	41	51
5	14	24	34	44	54	2	3	2	1	3	2
6	15	25	35	45	55	3	12	22	32	42	52
7	16	26	36	46	56	3	7	6	5	4	3
8	17	27	37	47	57	3	6	5	4	3	2
1	10	20	30	40	50	4	13	23	33	43	53
3	12	22	32	42	52	4	5	4	3	2	1
2	11	21	31	41	51	4	4	3	2	1	4

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17706
#2

17 Jan 2023, 03:51

Chandrashekhar:
Welcome to this forum.
I'd start by -append-ing the two datasets and then -sort-ing them on -UID-.
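
A minimal sketch of what this step might look like, using a few rows copied from the sample in #1. The tempfile name hh and the marker variable source are hypothetical names chosen for illustration; with real data you would -use- and -append- your own .dta files instead of typing values with -input-.

```stata
* Toy household file: three rows from the sample in #1, one of them repeated
clear
input UID V1 V2 V3 V4 V5
1 10 20 30 40 50
2 11 21 31 41 51
1 10 20 30 40 50
end
tempfile hh
save `hh'

* Toy individual file: three rows from the sample in #1
clear
input UID V1 V2 V3 V4 V5
1 10 20 30 40 50
1  9  8  7  6  5
2 11 21 31 41 51
end

* Stack the household rows under the individual rows;
* source is 0 for the data in memory and 1 for the appended file
append using `hh', generate(source)
label define source 0 "individual" 1 "HH"
label values source source

* Sort so that records sharing a UID sit next to each other
sort UID source
list, sepby(UID)
```

Keeping a source marker means the origin of each row is still known after the two files are stacked, which matters here because both files use the same variable names.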

Kind regards,
Carlo
(Stata 19.0)
Nick Cox

Join Date: Mar 2014

Posts: 35668
#3

17 Jan 2023, 04:03

I am reluctant to advise until I understand more. In #1 UID appears to be a household identifier in both cases, but observation 1 in the HH file corresponds to observation 1 in the individual file and the same variables occur in both, while observations 2 and 3 don't occur in the HH file.

Also, in an individual file you can't tell what is duplicate or not without an individual identifier. E.g. twins might have the same age, gender, and so forth.

This puzzlement could arise because #1 is just based on invented values and names, but it's why Carlo Lazzaro is thinking of append when usually such problems call for merge.

I would say this is unclear without a more realistic example. It doesn't have to be real data, just realistic.
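
Purely as an illustration of the merge route mentioned above, here is a hedged sketch on toy values taken from #1. It assumes that UID identifies households, that the repeats in the HH file are exact copies on every variable, and that the household variables can be renamed with a hypothetical hh_ prefix so they do not clash with the individual ones. None of this is confirmed by #1, and it does not decide which individual records, if any, are duplicates.

```stata
* Toy individual file: UID is assumed to be the household identifier
clear
input UID V1 V2 V3 V4 V5
1 10 20 30 40 50
1  9  8  7  6  5
2 11 21 31 41 51
2  3  2  1  3  2
end
tempfile individual
save `individual'

* Toy household file with one exact repeat (UID 1)
clear
input UID V1 V2 V3 V4 V5
1 10 20 30 40 50
2 11 21 31 41 51
1 10 20 30 40 50
end

* Drop observations that are identical on every variable
duplicates drop

* Stop with an error if UID still does not identify households uniquely
isid UID

* Both files use the names V1-V5, so rename the household versions first
rename (V1 V2 V3 V4 V5) (hh_V1 hh_V2 hh_V3 hh_V4 hh_V5)

* One household record matched to many individual records
merge 1:m UID using `individual'
tabulate _merge
```

If -isid- fails after -duplicates drop-, some households share a UID but differ on other variables, and that needs resolving before any merge.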
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1385
#4

17 Jan 2023, 04:29

I agree with #3 in needing more information to understand your issue.

You may also want to confirm that you do in fact have actual duplicates. Is the household ID supposed to uniquely identify the household, or is there another variable that jointly identifies unique households? As one example of a standard well-known dataset where this can happen, the IHDS-2 (Indian Human Development Survey) has a separate household "split" ID variable that helps separate two households that were together in round 1 but split by the time of round 2.
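
One way to run such a check, sketched on toy values: the first four rows come from the sample in #1, while the last row is invented here (V5 changed to 99) to show a UID that repeats with different contents. The variable names dup_uid and dup_all are hypothetical.

```stata
clear
input UID V1 V2 V3 V4 V5
1 10 20 30 40 50
2 11 21 31 41 51
3 12 22 32 42 52
1 10 20 30 40 50
3 12 22 32 42 99
end

* How many UID values occur more than once?
duplicates report UID

* dup_uid: number of other observations sharing this UID
duplicates tag UID, generate(dup_uid)

* dup_all: number of other observations identical on UID and V1-V5
duplicates tag UID V1-V5, generate(dup_all)

* Same UID but different contents: not safe to delete as a duplicate
list if dup_uid > 0 & dup_all == 0, sepby(UID)
```

Observations listed by the last command share a household ID but are not exact copies, which is the situation where a second identifying variable, such as a split ID, may be needed.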