  • better to interpolate or to replace missing values?

    Hi all!
    Another basic question. When data are available only every x years (e.g. indexes released every 5 years) and I need to assign values to the missing years (e.g. the four years between 1990 and 1995), is it more correct to replace the missing values with the next available value (that is, to assign the 1995 value to 1991-1994) or to interpolate them? In this case I have data from 1980 to 2000 (at 5-year intervals) and from 2000 to 2004 (annually).
    Looking at the definition of the index in question, I found that it is not computed as the average value over the five years (in that case, I would probably have preferred to replace the missing values).
    Thanks!

  • #2
    There is no generic answer to this question. And it is really more a question about the scientific content of your research questions and the scientific meaning of the measurement in question than it is a statistical one. If the nature of the construct your variable measures is such that it can reasonably be expected to change linearly over short time intervals, then interpolation is the most sensible approach. If the nature of the construct it measures is such that it stays constant over moderate intervals and then changes abruptly, replacing with the nearest available value might make sense. If the missing values can be reasonably considered to be missing at random, a multiple imputation approach might make the most sense.

    My advice is to check with your professional colleagues in your discipline as to what makes the most sense for this particular variable.


    • #3
      I agree 100% with Clyde. That said, do you have rich sources of information for the intervening years? If you do, then multiple imputation or FIML become attractive. Again, as he said, check with people in your discipline. But depending on the model and depending on how much additional information you have available, simple linear interpolation or using the nearest available value may not be the way to go. You *might* be able to do better.


      • #4
        ps. This is assuming we're interpolating/imputing an independent variable. If it's your dependent variable, then given a good imputation model, nothing is gained and nothing is lost. But if your imputation model is not very good, imputing can actually cause harm [citations needed, but I'm too lazy/tired to dig them up other than pointing you in a potentially useful direction]. You *cannot* gain anything by imputing your dependent variable, other than NYT-level stuff.
        Last edited by ben earnhart; 09 Sep 2014, 22:43.


        • #5
          Simona:
          Your choice between interpolation and multiple imputation might also depend on the other variables (if any) in your dataset, as they can help you understand whether the index changes follow a linear trend or jump across the years. In addition to what others have already recommended, I would search for previous papers on the same topic and compare my research strategy with the reported methodology.

          Kind regards,
          Carlo
          (Stata 19.0)


          • #6
            I'd echo the tone of strong caution uniting people who've answered. But I am a little confused on what the terms of the discussion are precisely. Simona [please tell us your other name; see FAQ Advice] didn't ask about imputation, although it is natural that people might want to mention it too.

            To my mind,

            1. carrying next known value backwards

            2. carrying last known value forwards

            3. linear interpolation (joining known values with linear segments)

            are just three varieties of interpolation, here defined as some deterministic rule for replacing missing values. It's clearly not true that interpolation just means linear interpolation. Other varieties include but are not restricted to nearest neighbour, cubic, cubic spline, etc.
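
            As a rough illustration of the three rules listed above, here is a minimal Python sketch. It is not Stata code and not anything the posters supplied: the function names and the example values (10.0 in 1990, 20.0 in 1995) are made up for the illustration, and missing values are represented by None.

```python
# Hypothetical helper names; values are invented purely for illustration.

def carry_backward(values):
    """Rule 1: fill each gap with the next known value."""
    out = list(values)
    nxt = None
    for i in range(len(out) - 1, -1, -1):
        if out[i] is None:
            out[i] = nxt
        else:
            nxt = out[i]
    return out


def carry_forward(values):
    """Rule 2: fill each gap with the last known value."""
    out = list(values)
    last = None
    for i, v in enumerate(out):
        if v is None:
            out[i] = last
        else:
            last = v
    return out


def linear_interpolate(years, values):
    """Rule 3: join consecutive known values with straight-line segments."""
    known = [(y, v) for y, v in zip(years, values) if v is not None]
    out = list(values)
    for (y0, v0), (y1, v1) in zip(known, known[1:]):
        for i, y in enumerate(years):
            if y0 < y < y1 and out[i] is None:
                out[i] = v0 + (v1 - v0) * (y - y0) / (y1 - y0)
    return out


years = list(range(1990, 1996))
index = [10.0, None, None, None, None, 20.0]  # made-up index values

backward = carry_backward(index)
forward = carry_forward(index)
linear = linear_interpolate(years, index)
```

            With these made-up inputs, the backward rule assigns 20.0 to 1991-1994, the forward rule assigns 10.0, and the linear rule gives 12.0, 14.0, 16.0 and 18.0. All three are deterministic, which is the sense of "interpolation" used here.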

            If some variety of interpolation is chosen, it is an interesting question what that means for degrees of freedom. If I interpolate say a 5 yearly series to 1 yearly and feed that to a Stata modelling command, that command neither knows nor cares where the data came from. But I don't really have 5 times more data than before. Perhaps there are discussions of this somewhere in the literature.
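
            One way to keep that concern visible is to carry an indicator of which years were actually observed. The sketch below is only a bookkeeping illustration in Python, using the year layout described in the original question (every 5 years from 1980 to 2000, then annually to 2004); the variable names are hypothetical, and it does not say how degrees of freedom should be adjusted.

```python
# Years with a genuinely observed index value, per the layout in post #1.
observed_years = set(range(1980, 2001, 5)) | set(range(2000, 2005))

# The yearly grid that interpolation would fill in.
all_years = list(range(1980, 2005))

# Indicator to keep alongside the interpolated series.
is_observed = {year: year in observed_years for year in all_years}

n_rows = len(all_years)                 # rows a modelling command would see
n_observed = sum(is_observed.values())  # rows backed by real measurements
n_filled = n_rows - n_observed          # rows created by the interpolation rule

print(n_rows, n_observed, n_filled)
```

            For this layout the yearly grid has 25 rows, of which 9 are observed and 16 are filled in by whatever rule is chosen.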


            • #7
              NYT = New York Times?
