Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Delete duplicates with more missing values

    Hello all,

    I have a large dataset with a considerable amount of duplicates (impossible to assess by hand). I am working with mortgage data from the National Archive(HMDA data 2000).

    I have duplicate observations by respondent id, with one of the duplicate observations summarising the data shown in the other duplicates. I would like to keep the duplicate with fewer missing values.

    I cannot drop the observations with missing values because as the blue row shows, some observations are unique but have missing values.

    The other problem is that I cannot tell in which column the value is missing (someone else asked something similar before but for him, the missing value was in only one column for all observations).

    Click image for larger version

Name:	hmda.png
Views:	1
Size:	20.4 KB
ID:	1663536



  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(respondent_id number_applications volume_applications number_loans_sold volume_loans_sold number_loans_originated volume_loans_originated)
     1 3344 414085 41 23354 1471 203098
     1 3344 414085 41 23354    .      .
     1 3344 414085  .     . 1471 203098
    28    1     50  .     .    1     50
    end
    
    ds respondent_id, not
    foreach v in `r(varlist)' {
    bysort respondent_id (`v'): replace `v' = `v'[1]
    }
    
    bysort respondent_id: keep if _n==1

    Comment


    • #3
      Thanks a lot for the fast reply. Unfortunately, it didn't work, but I managed to do it with replace if missing.

      Comment

      Working...
      X