Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to repeat observations in a given time period?

    Hi listers!

    I'm working on a database for my undergraduate thesis, but I'm facing a problem with missing values.

    For some variables I do not have observations for the whole time serie, I would like to know if there is a command or a way to keep an observation constant until there is another observation different to a missing value, and keep it doing the same operation for all the observations.

    Right now my database looks like this:

    Country Year X
    Mexico 2000 2
    Mexico 2001 .
    Mexico 2002 5
    Mexico 2003 .
    Mexico 2004 6
    Mexico 2005 .
    Mexico 2006 .
    Mexico 2007 4
    Mexico 2008 .

    I would like it to look like this one:

    Country Year X
    Mexico 2000 2
    Mexico 2001 2
    Mexico 2002 5
    Mexico 2003 5
    Mexico 2004 6
    Mexico 2005 6
    Mexico 2006 6
    Mexico 2007 4
    Mexico 2008 4

    Thank you!

  • #2
    Hi Luis,

    See here your data:

    Code:
    input str6 Country Year X 
    Mexico 2000 2 
    Mexico 2001 . 
    Mexico 2002 5
    Mexico 2003 .
    Mexico 2004 6
    Mexico 2005 .
    Mexico 2006 .
    Mexico 2007 4
    Mexico 2008 .
    end
    See here code that does what you ask for:

    Code:
    replace X = X[_n-1] if X[_n]==.
    I don't know what the X variable represents (as a side note, might I suggest you name your variables more meaningfully), but it might be more appropriate to impute the missing values as the average of the preceding and successive non-missing values. For example:

    Code:
    replace X = (X[_n-1]+X[_n+1])/2 if X[_n]==.
    Best,

    Comment


    • #3
      Chris Larkin's first code will perform as promised. His second will not produce a value when there is a run of missing values: it only works when each missing value is surrounded by non-missing values before and after it.

      Chris Larkin's other point is worth re-emphasizing: the best method of imputing missing values (if there is any good method at all!) depends on the meaning of the data. Last observation carried forward, as here, is useful for some kinds of data but not for others.

      Comment


      • #4
        See also http://www.statalist.org/forums/foru...-interpolation

        Copying the last known value is exactly what you want sometimes, but in general it is a poor interpolation method.

        Comment


        • #5
          Thanks for spotting that Clyde. It works ok with Luis's data, but it is plausible he has only shared one part of his dataset (quite likely really) and that there are missing observations surrounded by other non-missing observations. I was trying to figure out a way of writing a forvalues loop that looped through observations (less than desirable, I know) and replaced the missing observation(s) with the average of the two non-missing observations either side of it/them. I thought an if else statement and an exit from the loop when the replace is fulfilled would work but i haven't been able to get it up and running. Hopefully my first suggestion will be helpful for Luis though!

          Comment


          • #6
            Chris Larkin Here's how I would do what you proposed (illustrated with a toy data set):

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input float x
              -.4868504
             -1.4760038
            -.010911492
                      .
               .5155788
                      .
                      .
             -.58777297
                      .
                      .
              -.9155009
              -.2230017
               .4945496
             -.26231077
               .3661239
            end
            
            //    MARK SORT ORDER
            gen obs_no = _n
            
            gen previous = x
            replace previous = previous[_n-1] if missing(previous)
            
            gsort -obs_no
            gen following = x
            replace following = following[_n-1] if missing(following)
            
            sort obs_no
            gen x_imputed = x
            replace x_imputed = 0.5*(previous + following) if missing(x)

            Comment


            • #7
              Nice! Very original solution

              Comment


              • #8
                I have to say that I don't see that this method has a better rationale than linear interpolation, or indeed polynomial or spline-based interpolation.

                You're filling gaps with the average of the last known and the next known values. To make this graphical, add this command to Clyde's code:

                Code:
                scatter x_imputed obs if missing(x), ms(+) || scatter x obs, ms(Oh) scheme(s1color) ytitle("x, observed or imputed")
                Click image for larger version

Name:	clyde.png
Views:	1
Size:	8.7 KB
ID:	1362215

                Comment


                • #9
                  Just to be clear, I wrote the code in #6 not because I think that it's a good way of imputing missing values, but because the code offering to do the same thing in #2 was incorrect. I even would have let that pass, but the question of how to spread values upward arises from time to time on the forum, and the code in #6 shows how that is done. As I emphasized in #3 there are many approaches to imputation. This, I think, is not because there are so many good ones, so much as because none is truly satisfactory. Missing data remains, despite decades of research by some of the best minds in statistics, an intractable problem. In my own work, I generally use several approaches to missing data in every project and try to draw the best conclusions I can from them, given the limitations they all have.

                  And I have to agree with Nick that it is hard for me to imagine a situation where average of preceding and following would make more sense than linear interpolation.

                  Comment


                  • #10
                    Thank you Chris Larkin and Clyde Schechter! Chris' first code solved the problem I had with the missing values.

                    Thank you!

                    Comment

                    Working...
                    X