question about dealing with duplicates

Sunny Xiao

Join Date: Feb 2023

Posts: 2
#1


06 Aug 2023, 20:00

In my dataset, I have the following variables: year cusip conm dlc dvt lt oancf revt teq xpr xsga xstf.
When I run duplicates report cusip year, I find about 5000 observations that have 2 copies.
Within each pair of copies, different variables are missing (an example is below).

year cusip conm dlc dvt lt oancf revt teq xpr xsga xstf
1998 G9618E107 WHITE MTNS INS GROUP LTD 748.5 13.1 2534.2 704.3 3.8 130.2
1998 G9618E107 WHITE MTNS INS GROUP LTD 806.2 13.1 2534.2 28.8 624.8

I want to merge the copies to minimize the number of missing values for each obs, and delete duplicates. How can I do it?
Thank you
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#2

07 Aug 2023, 09:28

No, no, no! Your problem is much more serious than that. You have conflicting non-missing values as well. In your own example, one of the observations has xsga = 704.3 and the other has xsga = 28.8. One has xstf = 3.8 and the other has xstf = 624.8. One has revt = 748.5 and the other has revt = 806.2. You need to resolve those conflicts. That may entail figuring out which (if either) of the conflicting values is the correct one. Or it might mean combining the conflicting values, such as by averaging them or selecting the larger, or in some other way. But the missing values are the least of your worries here.
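To gauge how widespread the conflicts are, something along these lines may help. It is only a sketch: it assumes that cusip and year together should identify a single observation, that the nine accounting variables are numeric, and that conm does not differ between copies. The names dup, any_conflict, lo, and hi are made up for illustration and are not in your data.

```stata
* Sketch only: assumes cusip-year should identify one observation and that
* the nine accounting variables below are numeric.
local vars dlc dvt lt oancf revt teq xpr xsga xstf

* dup > 0 marks observations that share cusip and year with another one
duplicates tag cusip year, generate(dup)

* any_conflict = 1 when copies disagree on a non-missing value of any variable
generate byte any_conflict = 0
foreach v of local vars {
    * egen's min() and max() ignore missing values within each group
    bysort cusip year: egen double lo = min(`v')
    bysort cusip year: egen double hi = max(`v')
    replace any_conflict = 1 if lo != hi
    drop lo hi
}

* How many duplicated observations sit in conflicting groups?
tabulate any_conflict if dup > 0

* Inspect the conflicting copies side by side
sort cusip year
list year cusip `vars' if any_conflict, sepby(cusip year)

* Only once no conflicts remain is it reasonable to fill gaps by taking the
* first non-missing value per group; the assert stops the run otherwise.
* conm is assumed to be identical across copies (not checked here).
assert any_conflict == 0
collapse (firstnm) conm `vars', by(cusip year)
```

The assert is deliberate: as long as any group still has conflicting values, the do-file stops before collapse can silently pick one of them. How to resolve those groups (trusting one source, averaging, taking the larger value) is a judgment about the data that code cannot make for you.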
Sunny Xiao

Join Date: Feb 2023

Posts: 2
#3

07 Aug 2023, 18:14

Thanks. I also noticed the problem after I posted. I am trying to understand my data better. Thanks again.