Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Filling Missing Values Based on Existing Observations

    My dataset has many companies over a span of 7 years. I'm trying to fill in missings for "var" with the values in "newvar." I need to apply 2 main rules to fill in the missing observations for my dataset:
    1. If there are existing values in the years before and after the missing, replace missing with the average of existing values closest in time to the missing year. In the below example, for years 2002 and 2003, I want to replace the missing with the average of 2001 and 2004 (hence, 45 in the newvar).
    2. If there are existing values only in years after the missing, or only in years before the missing, replace the missing year with the existing value most recent in time. In the example, year 2000 is replaced with the value in 2001 and year 2006 is replaced with the value in 2005.


    Code:
    ID  year  var  newvar
    A   2000  .    40
    A   2001  40   40
    A   2002  .    45
    A   2003  .    45
    A   2004  50   50
    A   2005  50   50
    A   2006  .    50
    I'm having trouble with how to implement the first rule. Any help would be very appreciated, thanks!

  • #2
    mipolate (SSC) provides some machinery here, and I quite like it, but it's immensely more instructive just to think this through from first principles. The recipe can be put in words that then suggest equivalent Stata code.

    In any panel: copy non-missing values forward in time to estimate missing values; also do the same but backward in time; then average the two estimates when they both exist; otherwise use the single estimate available.

    Code:
    clear
    input str1 ID  year  var  newvar
    A   2000  .    40
    A   2001  40   40
    A   2002  .    45
    A   2003  .    45
    A   2004  50   50
    A   2005  50   50
    A   2006  .    50
    end
    
    gen forw = var
    gen back = var
    bysort ID (year) : replace forw = forw[_n-1] if missing(forw)
    gen negyear = -year
    bysort ID (negyear): replace back = back[_n-1] if missing(back)
    drop negyear
    sort ID year
    
    gen wanted = cond(missing(forw, back), max(forw, back), (forw + back)/2)
    
    list, sep(0)
    
         +-------------------------------------------------+
         | ID   year   var   newvar   forw   back   wanted |
         |-------------------------------------------------|
      1. |  A   2000     .       40      .     40       40 |
      2. |  A   2001    40       40     40     40       40 |
      3. |  A   2002     .       45     40     50       45 |
      4. |  A   2003     .       45     40     50       45 |
      5. |  A   2004    50       50     50     50       50 |
      6. |  A   2005    50       50     50     50       50 |
      7. |  A   2006     .       50     50      .       50 |
         +-------------------------------------------------+
    The key ideas are all in https://www.stata.com/support/faqs/d...issing-values/

    ​​​​​​​(Please remember to look in the Stata FAQs as our own Statalist FAQ does advise.)

    Comment

    Working...
    X