Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dropping duplicate observations with more missing values

    Hello all,

    I have a large panel data set with a considerable amount of duplicates (impossible to assess by hand). The reason for this duplicates to exist is not clear to me: I have downloaded the data straight from Compustat / WRDS.

    What I get is that I have duplicate observations by year and id, with one of the duplicate observations missing values where the other does not:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float year long id double(at ceq csho)
    1991 3514  714.709  76.064     .
    1991 3514  714.709  76.064   2.4
    1992 3514  824.676  92.352     .
    1992 3514  824.676  92.352 2.394
    1993 3514 1010.104 116.991     .
    1993 3514 1010.104 116.991 2.708
    1994 3514 1038.882 125.177     .
    1994 3514 1038.882 125.177 6.628
    end
    label values id id
    label def id 3514 "WSBC", modify
    Is there a way I can have Stata drop the observations with more missing values?

    Thank you in advance.

  • #2
    This is a dangerous approach. If the data set is as large as you say, then you cannot be sure that there aren't instances where a pair of observations with the same year and id contain non-missing but disagreeing values on some variables. Also, you may have a situation where variable X is missing from one observation and variable Y from the other: there is no reason not to retain whichever observation is non missing. So I would first verify that there are no clashes between non-missing values of any variable, and then, I would retain the whichever value is non-missing from each variable.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float year long id double(at ceq csho)
    1991 3514  714.709  76.064     .
    1991 3514  714.709  76.064   2.4
    1992 3514  824.676  92.352     .
    1992 3514  824.676  92.352 2.394
    1993 3514 1010.104 116.991     .
    1993 3514 1010.104 116.991 2.708
    1994 3514 1038.882 125.177     .
    1994 3514 1038.882 125.177 6.628
    end
    label values id id
    label def id 3514 "WSBC", modify
    
    //    VERIFY THAT ALL VARIABLES AGREE WHEN NON MISSING
    ds year id, not
    local vbles `r(varlist)'
    foreach v of varlist `vbles' {
        by year id (`v'), sort: assert inlist(`v', `v'[1], .)
    }
    
    
    // RETAIN NON-MISSING VALUE FOR EACH VARIABLE
    collapse (firstnm) `vbles', by(year id)
    Experience shows that even well-curated data sets from usually reliable sources frequently contain glitches that have to be fixed before you can analyze the data. In fact, it is a good practice to approach each new data set assuming it contains glitches and problems. Be pro-active and look for them when you first import the data: don't wait for them to trip you up farther down the line after you have already invested time and effort into work that then has to be re-done with corrected data!

    Comment


    • #3
      This works for your example:


      Code:
      . bysort id year (csho) : replace csho = csho[1] if missing(csho)          
      (4 real changes made)
      
      . duplicates drop id year csho , force
      
      Duplicates in terms of id year csho
      
      (4 observations deleted)
      
      .
      . list
      
           +------------------------------------------+
           | year     id         at       ceq    csho |
           |------------------------------------------|
        1. | 1991   3514    714.709    76.064     2.4 |
        2. | 1992   3514    824.676    92.352   2.394 |
        3. | 1993   3514   1010.104   116.991   2.708 |
        4. | 1994   3514   1038.882   125.177   6.628 |
           +------------------------------------------+

      Comment


      • #4
        Just for the records:
        I did the following:
        In the above example, I used the following codes:
        First of all, install rmiss2

        Code:
        net describe dm31, from(http://www.stata.com/stb/stb26)
        net install dm31
        egen nmis=rmiss2(year id at ceq csho)
        bysort id year: egen maxx = max(nmis)
        bysort id year: drop if nmis == maxx

        Comment

        Working...
        X