How to repeat observations in a given time period?

Luis Cano

Join Date: Oct 2016

Posts: 2
#1

How to repeat observations in a given time period?

28 Oct 2016, 12:43

Hi listers!

I'm working on a database for my undergraduate thesis, but I'm facing a problem with missing values.

For some variables I do not have observations for the whole time serie, I would like to know if there is a command or a way to keep an observation constant until there is another observation different to a missing value, and keep it doing the same operation for all the observations.

Right now my database looks like this:

Country Year X
Mexico 2000 2
Mexico 2001 .
Mexico 2002 5
Mexico 2003 .
Mexico 2004 6
Mexico 2005 .
Mexico 2006 .
Mexico 2007 4
Mexico 2008 .

I would like it to look like this one:

Country Year X
Mexico 2000 2
Mexico 2001 2
Mexico 2002 5
Mexico 2003 5
Mexico 2004 6
Mexico 2005 6
Mexico 2006 6
Mexico 2007 4
Mexico 2008 4

Thank you!
Tags: None
Chris Larkin

Join Date: Apr 2016

Posts: 296
#2

28 Oct 2016, 13:00

Hi Luis,

See here your data:

Code:

input str6 Country Year X Mexico 2000 2 Mexico 2001 . Mexico 2002 5 Mexico 2003 . Mexico 2004 6 Mexico 2005 . Mexico 2006 . Mexico 2007 4 Mexico 2008 . end

See here code that does what you ask for:

Code:

replace X = X[_n-1] if X[_n]==.

I don't know what the X variable represents (as a side note, might I suggest you name your variables more meaningfully), but it might be more appropriate to impute the missing values as the average of the preceding and successive non-missing values. For example:

Code:

replace X = (X[_n-1]+X[_n+1])/2 if X[_n]==.

Best,
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30153
#3

28 Oct 2016, 14:58

Chris Larkin's first code will perform as promised. His second will not produce a value when there is a run of missing values: it only works when each missing value is surrounded by non-missing values before and after it.

Chris Larkin's other point is worth re-emphasizing: the best method of imputing missing values (if there is any good method at all!) depends on the meaning of the data. Last observation carried forward, as here, is useful for some kinds of data but not for others.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35759
#4

28 Oct 2016, 15:03

See also http://www.statalist.org/forums/foru...-interpolation

Copying the last known value is exactly what you want sometimes, but in general it is a poor interpolation method.
Comment
Chris Larkin

Join Date: Apr 2016

Posts: 296
#5

28 Oct 2016, 15:22

Thanks for spotting that Clyde. It works ok with Luis's data, but it is plausible he has only shared one part of his dataset (quite likely really) and that there are missing observations surrounded by other non-missing observations. I was trying to figure out a way of writing a forvalues loop that looped through observations (less than desirable, I know) and replaced the missing observation(s) with the average of the two non-missing observations either side of it/them. I thought an if else statement and an exit from the loop when the replace is fulfilled would work but i haven't been able to get it up and running. Hopefully my first suggestion will be helpful for Luis though!
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30153

28 Oct 2016, 15:37

Chris Larkin Here's how I would do what you proposed (illustrated with a toy data set):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float x
  -.4868504
 -1.4760038
-.010911492
          .
   .5155788
          .
          .
 -.58777297
          .
          .
  -.9155009
  -.2230017
   .4945496
 -.26231077
   .3661239
end

//    MARK SORT ORDER
gen obs_no = _n

gen previous = x
replace previous = previous[_n-1] if missing(previous)

gsort -obs_no
gen following = x
replace following = following[_n-1] if missing(following)

sort obs_no
gen x_imputed = x
replace x_imputed = 0.5*(previous + following) if missing(x)

Comment

Chris Larkin

Join Date: Apr 2016

Posts: 296
#7

28 Oct 2016, 15:43

Nice! Very original solution
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35759
#8

29 Oct 2016, 02:17

I have to say that I don't see that this method has a better rationale than linear interpolation, or indeed polynomial or spline-based interpolation.

You're filling gaps with the average of the last known and the next known values. To make this graphical, add this command to Clyde's code:

Code:

scatter x_imputed obs if missing(x), ms(+) || scatter x obs, ms(Oh) scheme(s1color) ytitle("x, observed or imputed")
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30153
#9

29 Oct 2016, 09:19

Just to be clear, I wrote the code in #6 not because I think that it's a good way of imputing missing values, but because the code offering to do the same thing in #2 was incorrect. I even would have let that pass, but the question of how to spread values upward arises from time to time on the forum, and the code in #6 shows how that is done. As I emphasized in #3 there are many approaches to imputation. This, I think, is not because there are so many good ones, so much as because none is truly satisfactory. Missing data remains, despite decades of research by some of the best minds in statistics, an intractable problem. In my own work, I generally use several approaches to missing data in every project and try to draw the best conclusions I can from them, given the limitations they all have.

And I have to agree with Nick that it is hard for me to imagine a situation where average of preceding and following would make more sense than linear interpolation.
Comment
Luis Cano

Join Date: Oct 2016

Posts: 2
#10

31 Oct 2016, 12:08

Thank you Chris Larkin and Clyde Schechter! Chris' first code solved the problem I had with the missing values.

Thank you!
Comment

Announcement

How to repeat observations in a given time period?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment