Delete duplicates with more missing values

Liz Gallego

Join Date: May 2022

Posts: 3
#1

Delete duplicates with more missing values

08 May 2022, 10:47

Hello all,

I have a large dataset with a considerable amount of duplicates (impossible to assess by hand). I am working with mortgage data from the National Archive(HMDA data 2000).

I have duplicate observations by respondent id, with one of the duplicate observations summarising the data shown in the other duplicates. I would like to keep the duplicate with fewer missing values.

I cannot drop the observations with missing values because as the blue row shows, some observations are unique but have missing values.

The other problem is that I cannot tell in which column the value is missing (someone else asked something similar before but for him, the missing value was in only one column for all observations).
Tags: None

Øyvind Snilsberg

Join Date: Oct 2021
Posts: 591

08 May 2022, 11:33

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(respondent_id number_applications volume_applications number_loans_sold volume_loans_sold number_loans_originated volume_loans_originated)
 1 3344 414085 41 23354 1471 203098
 1 3344 414085 41 23354    .      .
 1 3344 414085  .     . 1471 203098
28    1     50  .     .    1     50
end

ds respondent_id, not
foreach v in `r(varlist)' {
bysort respondent_id (`v'): replace `v' = `v'[1]
}

bysort respondent_id: keep if _n==1

Comment

Liz Gallego

Join Date: May 2022

Posts: 3
#3

09 May 2022, 05:49

Thanks a lot for the fast reply. Unfortunately, it didn't work, but I managed to do it with replace if missing.
Comment

Announcement

Delete duplicates with more missing values

Comment

Comment