Dropping duplicate observations with more missing values

Ralph Klijn

Join Date: Apr 2018

Posts: 41
#1

Dropping duplicate observations with more missing values

16 Apr 2018, 11:24

Hello all,

I have a large panel data set with a considerable amount of duplicates (impossible to assess by hand). The reason for this duplicates to exist is not clear to me: I have downloaded the data straight from Compustat / WRDS.

What I get is that I have duplicate observations by year and id, with one of the duplicate observations missing values where the other does not:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float year long id double(at ceq csho) 1991 3514 714.709 76.064 . 1991 3514 714.709 76.064 2.4 1992 3514 824.676 92.352 . 1992 3514 824.676 92.352 2.394 1993 3514 1010.104 116.991 . 1993 3514 1010.104 116.991 2.708 1994 3514 1038.882 125.177 . 1994 3514 1038.882 125.177 6.628 end label values id id label def id 3514 "WSBC", modify

Is there a way I can have Stata drop the observations with more missing values?

Thank you in advance.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#2

16 Apr 2018, 11:43

This is a dangerous approach. If the data set is as large as you say, then you cannot be sure that there aren't instances where a pair of observations with the same year and id contain non-missing but disagreeing values on some variables. Also, you may have a situation where variable X is missing from one observation and variable Y from the other: there is no reason not to retain whichever observation is non missing. So I would first verify that there are no clashes between non-missing values of any variable, and then, I would retain the whichever value is non-missing from each variable.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float year long id double(at ceq csho) 1991 3514 714.709 76.064 . 1991 3514 714.709 76.064 2.4 1992 3514 824.676 92.352 . 1992 3514 824.676 92.352 2.394 1993 3514 1010.104 116.991 . 1993 3514 1010.104 116.991 2.708 1994 3514 1038.882 125.177 . 1994 3514 1038.882 125.177 6.628 end label values id id label def id 3514 "WSBC", modify // VERIFY THAT ALL VARIABLES AGREE WHEN NON MISSING ds year id, not local vbles `r(varlist)' foreach v of varlist `vbles' { by year id (`v'), sort: assert inlist(`v', `v'[1], .) } // RETAIN NON-MISSING VALUE FOR EACH VARIABLE collapse (firstnm) `vbles', by(year id)

Experience shows that even well-curated data sets from usually reliable sources frequently contain glitches that have to be fixed before you can analyze the data. In fact, it is a good practice to approach each new data set assuming it contains glitches and problems. Be pro-active and look for them when you first import the data: don't wait for them to trip you up farther down the line after you have already invested time and effort into work that then has to be re-done with corrected data!
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35677

16 Apr 2018, 11:46

This works for your example:

Code:

. bysort id year (csho) : replace csho = csho[1] if missing(csho)          
(4 real changes made)

. duplicates drop id year csho , force

Duplicates in terms of id year csho

(4 observations deleted)

.
. list

     +------------------------------------------+
     | year     id         at       ceq    csho |
     |------------------------------------------|
  1. | 1991   3514    714.709    76.064     2.4 |
  2. | 1992   3514    824.676    92.352   2.394 |
  3. | 1993   3514   1010.104   116.991   2.708 |
  4. | 1994   3514   1038.882   125.177   6.628 |
     +------------------------------------------+

Comment

bibha dhungel

Join Date: Oct 2020

Posts: 35
#4

02 Feb 2022, 20:53

Just for the records:
I did the following:
In the above example, I used the following codes:
First of all, install rmiss2

Code:

net describe dm31, from(http://www.stata.com/stb/stb26) net install dm31 egen nmis=rmiss2(year id at ceq csho) bysort id year: egen maxx = max(nmis) bysort id year: drop if nmis == maxx
Comment

Announcement

Dropping duplicate observations with more missing values

Comment

Comment

Comment