Filling Missing Values Based on Existing Observations

Grace Liu

Join Date: May 2020

Posts: 1
#1

Filling Missing Values Based on Existing Observations

27 May 2020, 10:05

My dataset has many companies over a span of 7 years. I'm trying to fill in missings for "var" with the values in "newvar." I need to apply 2 main rules to fill in the missing observations for my dataset:
1. If there are existing values in the years before and after the missing, replace missing with the average of existing values closest in time to the missing year. In the below example, for years 2002 and 2003, I want to replace the missing with the average of 2001 and 2004 (hence, 45 in the newvar).
2. If there are existing values only in years after the missing, or only in years before the missing, replace the missing year with the existing value most recent in time. In the example, year 2000 is replaced with the value in 2001 and year 2006 is replaced with the value in 2005.

Code:

ID year var newvar A 2000 . 40 A 2001 40 40 A 2002 . 45 A 2003 . 45 A 2004 50 50 A 2005 50 50 A 2006 . 50

I'm having trouble with how to implement the first rule. Any help would be very appreciated, thanks!
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35688

27 May 2020, 10:21

mipolate (SSC) provides some machinery here, and I quite like it, but it's immensely more instructive just to think this through from first principles. The recipe can be put in words that then suggest equivalent Stata code.

In any panel: copy non-missing values forward in time to estimate missing values; also do the same but backward in time; then average the two estimates when they both exist; otherwise use the single estimate available.

Code:

clear
input str1 ID  year  var  newvar
A   2000  .    40
A   2001  40   40
A   2002  .    45
A   2003  .    45
A   2004  50   50
A   2005  50   50
A   2006  .    50
end

gen forw = var
gen back = var
bysort ID (year) : replace forw = forw[_n-1] if missing(forw)
gen negyear = -year
bysort ID (negyear): replace back = back[_n-1] if missing(back)
drop negyear
sort ID year

gen wanted = cond(missing(forw, back), max(forw, back), (forw + back)/2)

list, sep(0)

     +-------------------------------------------------+
     | ID   year   var   newvar   forw   back   wanted |
     |-------------------------------------------------|
  1. |  A   2000     .       40      .     40       40 |
  2. |  A   2001    40       40     40     40       40 |
  3. |  A   2002     .       45     40     50       45 |
  4. |  A   2003     .       45     40     50       45 |
  5. |  A   2004    50       50     50     50       50 |
  6. |  A   2005    50       50     50     50       50 |
  7. |  A   2006     .       50     50      .       50 |
     +-------------------------------------------------+

The key ideas are all in https://www.stata.com/support/faqs/d...issing-values/

(Please remember to look in the Stata FAQs as our own Statalist FAQ does advise.)

Announcement

Filling Missing Values Based on Existing Observations

Comment