
  • Extrapolation within a range (ipolate)

    Hi there,

    Due to imputation problems, I would like to expolate my data, in particular variables with a fixed range of 1 to 7. Here are the summary statistics before using the ipolate command:

    Variable       Obs       Mean   Std. dev.     Min     Max

    plh0212_m      246   5.857724     1.00208       2       7
    plh0213_m      245    5.55102      1.1319       3       7
    plh0214_m      244   3.389344    1.390758       2       7
    plh0215_m      243   4.654321    1.228052       2       7
    plh0216_m      245   4.844898    1.382126       2       7

    plh0217_m      245   5.126531    1.136354       2       7
    plh0218_m      244   3.045082    1.277382       2       7
    plh0219_m      245   5.093878    1.249329       2       7
    plh0220_m      246   4.276423    1.534904       2       7
    plh0221_m      245   4.016327    1.333574       2       7

    plh0222_m      245   5.518367    1.034581       2       7
    plh0223_m      246   3.939024    1.402746       2       7
    plh0224_m      246   5.678862    .8562616       3       7
    plh0225_m      245   4.844898    1.355175       2       7
    plh0226_m      246   4.443089    1.310403       2       7

    plh0255_m      182   5.208791    1.107648       2       7

    If I apply the ipolate command with the expolate option, I get:

    Variable       Obs       Mean   Std. dev.     Min     Max

    c_plh0212_m    588   5.815901    1.019097     1.5    8.25
    c_plh0213_m    587   5.522658    1.143996       2    9.25
    c_plh0214_m    586   3.389164    1.439019      -1       7
    c_plh0215_m    575   4.735652     1.26408     -.5       8
    c_plh0216_m    587   4.862862    1.467777      .5       9

    c_plh0217_m    587   5.071976    1.160919       0       9
    c_plh0218_m    580   3.134138     1.31356      .5       9
    c_plh0219_m    587   5.051448    1.254386     -.5     9.5
    c_plh0220_m    588   4.403316    1.547585     .75     9.5
    c_plh0221_m    581   4.027194    1.413046      -1    7.75

    c_plh0222_m    583   5.513379    1.038425    1.25     8.5
    c_plh0223_m    588   3.964796    1.448953       1       9
    c_plh0224_m    588   5.646599     .958845     1.5     9.5
    c_plh0225_m    581   4.895353    1.412477       1       8
    c_plh0226_m    588   4.343537    1.305607       1       8

    c_plh0255_m    439   5.241458    1.178702     1.5       8

    As you can see, the extrapolated values fall outside the valid range of 1 to 7. Any ideas on how to solve this? I thought about restricting the range with:

    * Clamp to [1, 7]; skip missings, because max(1, .) returns 1 in Stata
    foreach var in c_plh0212_m c_plh0213_m c_plh0214_m c_plh0215_m c_plh0216_m c_plh0217_m c_plh0218_m c_plh0219_m c_plh0220_m c_plh0221_m c_plh0222_m c_plh0223_m c_plh0224_m c_plh0225_m c_plh0226_m c_plh0255_m {
        replace `var' = min(7, max(1, `var')) if !missing(`var')
    }

    What do you think? Is this a common approach?

    Thanks in advance!

    Best,
    Vera
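
Before clamping as proposed in #1, it may help to see how many extrapolated values actually fall outside 1 to 7. The following is only a sketch: it assumes that the wildcard c_plh02*_m matches exactly the 16 extrapolated variables listed above and nothing else in the dataset.

```stata
* Sketch: count non-missing values outside the valid 1-7 range.
* Assumption: c_plh02*_m matches only the extrapolated variables from #1.
foreach var of varlist c_plh02*_m {
    * inrange() returns 0 for missing values, so exclude them explicitly
    quietly count if !missing(`var') & !inrange(`var', 1, 7)
    display "`var': " r(N) " values outside [1, 7]"
}
```

If the counts are large, clamping would pile many observations up at exactly 1 or 7, which is worth knowing before deciding on it.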

  • #2
    I think you have several options here. You can clamp the values afterwards, as you suggested, or you can interpolate manually wherever both adjacent values are available:

    Code:
    bysort idcode (year): replace VAR = (VAR[_n-1] + VAR[_n+1]) / 2 if missing(VAR) & !missing(VAR[_n-1]) & !missing(VAR[_n+1])
    Any values that are still missing can then be filled with the mean within each idcode, for example:
    Code:
    bysort idcode (year): egen meanval = mean(VAR)
    replace VAR = meanval if missing(VAR)
    Best wishes



    • #3
      Let's get terminology straight. The command for linear interpolation is

      Code:
      ipolate yvar xvar [if] [in] , generate(newvar) [epolate]
      and so epolate is the name of the option that extrapolates under the ipolate command. Stata's abbreviations are just that, but expolate is neither a standard term nor the name of the Stata option in question.

      That said (pedantically, if you prefer), a much more crucial problem is that you give us no detail on the layout of your data, exactly how you are extrapolating, or what predictor(s) you are using.

      I've written a great deal about interpolation on this site, but I've come to think that it is likely to be a good imputation method if and only if you are filling small holes in a very smoothly changing time series. Your set-up does not sound like that at all, and linear extrapolation necessarily is unsuited to bounded outcomes (or to integer outcomes, if that is what you also have).

      So please give us a concrete reproducible example with (1) the original data and (2) the interpolation command syntax.
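
To illustrate the point in #3 that linear extrapolation is unsuited to bounded outcomes, here is a minimal sketch on toy data. The dataset and the names id, year, y, y_ip and y_ep are hypothetical and not taken from #1.

```stata
* Toy data (hypothetical): one respondent on a 1-7 scale, gaps at both ends
clear
input id year y
1 2001 .
1 2002 6
1 2003 7
1 2004 .
1 2005 .
end

* Interpolation only (the default): no gap lies between two observed
* values, so every missing value stays missing
bysort id (year): ipolate y year, generate(y_ip)

* With epolate, the rising 6 -> 7 trend is continued linearly past 2003,
* so the extrapolated values exceed the scale maximum of 7
bysort id (year): ipolate y year, generate(y_ep) epolate

list id year y y_ip y_ep, sepby(id)
```

Without epolate, interpolated values always lie between two observed neighbours and so cannot leave the observed range; only the extrapolation step can produce values below 1 or above 7.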



      • #4
        @Felix Bittmann: I don't see any information in #1 that allows or encourages your specific advice. But perhaps you recognise a standard large publicly available dataset from its variable names.

        On a different level, your first command can be just

        Code:
        bysort idcode (year): replace VAR = (VAR[_n-1] + VAR[_n+1]) / 2 if missing(VAR)
        as if either the previous or the following value is missing, their mean will be missing too, hence no change.

        Filling in with the mean used to be a common recommendation decades ago. I have every sense that it causes quite as many problems as it ever solves.
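
The simplification in #4 rests on the fact that arithmetic involving a missing value yields missing in Stata. A quick check, meant to be run from a do-file because of the // comments:

```stata
* Any arithmetic with a missing operand returns missing
display (. + 3) / 2            // shows .
display missing((. + 3) / 2)   // shows 1, i.e. true
```

The same applies at the start and end of each idcode group, where VAR[_n-1] or VAR[_n+1] refers to a nonexistent observation and evaluates to missing.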
