
  • question about dealing with duplicates

    My dataset has the variables: year cusip conm dlc dvt lt oancf revt teq xpr xsga xstf.
    When I run duplicates report cusip year,
    I find about 5000 obs that have 2 copies.
    For those copies, the two observations differ in which variables are missing (an example is below).

    year cusip conm dlc dvt lt oancf revt teq xpr xsga xstf
    1998 G9618E107 WHITE MTNS INS GROUP LTD 748.5 13.1 2534.2 704.3 3.8 130.2
    1998 G9618E107 WHITE MTNS INS GROUP LTD 806.2 13.1 2534.2 28.8 624.8

    I want to merge the copies to minimize the number of missing values for each obs, and delete duplicates. How can I do it?
    Thank you

  • #2
    No, no, no! Your problem is much more serious than that. You have conflicting non-missing values as well. In your own example one of the observations has xsga = 704.3 and the other has it as 28.8. One observation has xstf == 3.8, and the other 624.8. One has revt = 748.5 and the other shows it as 806.2. You need to resolve those conflicts. That may entail somehow figuring out which (if either) of the conflicting values is the correct one. Or it might mean combining the conflicting values, such as by averaging, or selecting the larger, or some other way. But the missing values are the least of your worries here.
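
    One way to see how widespread the conflicts are, before deciding how to resolve them, is sketched below. It assumes the nine financial variables are numeric and that cusip and year are meant to identify an observation; the names lo_*, hi_*, conflict_*, any_conflict and dup are hypothetical helpers created only for illustration. Within each cusip-year group, egen's min() and max() ignore missing values, so a variable is flagged only when its non-missing values disagree.

    ```stata
    * Sketch: flag variables whose non-missing values disagree within cusip-year.
    * lo_*, hi_*, conflict_*, any_conflict and dup are hypothetical helper names.
    foreach v of varlist dlc dvt lt oancf revt teq xpr xsga xstf {
        * min() and max() skip missing values within each group
        bysort cusip year: egen double lo_`v' = min(`v')
        bysort cusip year: egen double hi_`v' = max(`v')
        * 1 if the group holds at least two different non-missing values
        gen byte conflict_`v' = lo_`v' != hi_`v' & !missing(lo_`v')
    }

    * 1 if any of the nine variables conflicts in this cusip-year group
    egen byte any_conflict = rowmax(conflict_*)

    * dup = number of surplus copies of each cusip-year (0 if unique)
    duplicates tag cusip year, generate(dup)

    * Inspect the duplicated groups that contain genuine conflicts
    list cusip year revt xsga xstf if dup & any_conflict, sepby(cusip year)
    ```

    Groups with dup > 0 and any_conflict == 0 differ only in their missing values, so those are the only ones where filling in the gaps from the other copy is unambiguous. The flagged groups still need a decision about which value, if either, is correct.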


    • #3
      Thanks. I also noticed the problem after I posted. I am trying to get to know my data better. Thanks again.
