  • -mipolate- now available from SSC: new program for interpolation

    Thanks to Kit Baum as usual, a new program mipolate is now available from SSC.

    Use

    Code:
    ssc inst mipolate
    to install.

    Stata version 12 is required (but see below for a note for any people on
    version 10 or 11 who may be interested).

    mipolate is for interpolation, and extrapolation too, for
    one-dimensional series, replacing missing values with interpolated
    values in a copy of a variable. It is in effect an unofficial
    generalisation of the official command ipolate. It builds in the content
    of, and thus supersedes, the previously issued cipolate, csipolate,
    pchipolate and nnipolate (all SSC), and adds yet more methods.

    The by: prefix is allowed, as with ipolate. In particular, that means
    support for panel or longitudinal data.

    mipolate uses one of the following methods: linear, cubic, cubic spline,
    pchip (piecewise cubic Hermite interpolation), idw (inverse distance
    weighted), forward, backward, nearest neighbour, groupwise. The default
    method is linear.

    mipolate does not require tsset or xtset data and makes no check for, or
    use of, any such settings.
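
    As a minimal sketch of a call within groups: the syntax below is assumed
    to follow ipolate (yvar xvar, then generate()), and the toy data and the
    new variable names ylin and ynear are purely illustrative.

    Code:
    clear
    input id time y
    1 1 10
    1 2 .
    1 3 .
    1 4 40
    2 1 5
    2 2 .
    2 3 9
    2 4 .
    end
    * by: needs the data sorted on the group variable
    sort id time
    * no method named, so linear (the default) is used
    by id: mipolate y time, gen(ylin)
    * nearest neighbour, which also extrapolates
    by id: mipolate y time, gen(ynear) nearest
    list, sepby(id)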

    linear specifies linear interpolation using known values
    before and after any missing values. This is the default method.

    cubic specifies cubic interpolation, using exact fitting of a cubic
    curve to two data points before and two data points after each
    observation for which there is a missing value. Missing values are thus produced
    whenever fewer than two data points are present on either side. Note
    that this is not a spline method.

    spline specifies natural cubic spline interpolation. The method uses official
    Mata functions spline3() and spline3eval().

    pchip specifies piecewise cubic Hermite interpolation. This method uses
    piecewise cubics that join smoothly, so that both the interpolated
    function and its first derivative are continuous. In addition, the
    interpolant is shape-preserving in the sense that it cannot overshoot
    locally; sections in which the observed data are increasing, decreasing or
    constant remain so after interpolation, and local extremes
    (minima, maxima) also remain so. This interpolation method also
    extrapolates.

    idw[(power)] specifies inverse distance weighted interpolation. This
    method uses a weighted average of non-missing values, the weights being
    reciprocals of the powered distance between values, the power being zero
    or positive. The default power is 2; any other choice must be specified.
    Thus with power 2, values at distance 1 from a point with unknown values
    have weight 1, values at distance 2 from a point have weight 1/4,
    distance 3 weight 1/9, and so forth. If the power is 0, all known
    points have equal weight and the interpolant reduces to the average of
    all values. As the power becomes large, only those values that are
    nearest have appreciable weight. This interpolation method also
    extrapolates.
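
    The weights can be checked by hand. Suppose, purely for illustration,
    that the only known values are 10 at position 1 and 40 at position 4,
    and that the value at position 2 is missing. The distances are then 1
    and 2, so with the default power 2 the weights are 1 and 1/4:

    Code:
    * weighted average with weights 1/distance^2
    display (1 * 10 + (1/4) * 40) / (1 + 1/4)
    * power 0: equal weights, so just the mean of the known values
    display (10 + 40) / 2

    The first line displays 16 and the second 25, which is the plain average
    that the interpolant reduces to when the power is 0.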

    forward specifies forward interpolation, so that any known value just
    before one or more missing values is copied in cascade to provide
    interpolated values, constant within any such block.

    backward specifies backward interpolation, so that any known value just
    after one or more missing values is copied in cascade to provide
    interpolated values, constant within any such block.

    nearest specifies nearest neighbour interpolation, which means using
    known values either before or after missing values, depending on
    which is nearer. When known values before and after are
    equally distant from a missing value, there is a choice of rules that may
    be applied. The default rule uses the mean of the two values. The
    ties() option provides alternative rules. This method also
    extrapolates, as unknown values before the first known value and unknown
    values after the last known value are replaced by those respective known
    values.

    groupwise specifies that non-missing values be copied to missing values
    if, and only if, just one distinct non-missing value occurs in each
    group. Thus a group of values ., 42, ., . qualifies as 42 is not missing
    and is the only non-missing value in the group. Hence the missing values
    in the group will be replaced with 42 in the new variable. By the same
    rules 42, ., 42, . qualifies but 42, ., 43, . does not. Normally, but
    not necessarily, this option is used in conjunction with by:, which is
    how groups are specified; otherwise the (single) group is the entire set
    of observations being used.

    (So what about users of version 10 or 11? The code works fine in those
    versions. The problem is that some SMCL directives that work in Stata 12
    up will not work in 10 or 11. Anyone who downloaded the files from SSC,
    edited the version statement in the ado file and edited the help files
    would get a serviceable variant on mipolate if they did that correctly,
    but that's your responsibility.)


  • #2
    Thanks to Kit Baum as always, a new program stripolate for string interpolation has been added to the mipolate package, which is why it is announced here.

    To install, use

    Code:
    ssc install mipolate
    or

    Code:
    ssc install mipolate, replace
    as usual or
    Code:
    adoupdate
    if preferred.

    The essence of the matter should be conveyed by a sandbox and its treatment:

    Code:
     
    clear
    set obs 15
    gen id = ceil(_n/5)
    bysort id: gen time = _n
    gen foo = "A" if inlist(_n, 2, 4)
    replace foo = "B" if inlist(_n, 6, 8, 12)
    replace foo = "C" in 14 
    
    
    list, sepby(id)  
    
         +-----------------+
         | id   time   foo |
         |-----------------|
      1. |  1      1       |
      2. |  1      2     A |
      3. |  1      3       |
      4. |  1      4     A |
      5. |  1      5       |
         |-----------------|
      6. |  2      1     B |
      7. |  2      2       |
      8. |  2      3     B |
      9. |  2      4       |
     10. |  2      5       |
         |-----------------|
     11. |  3      1       |
     12. |  3      2     B |
     13. |  3      3       |
     14. |  3      4     C |
     15. |  3      5       |
         +-----------------+
    So we have a string variable with gaps. Interpolation is here about filling the gaps with neighbouring values, and stripolate supports three ways to do it, which it calls forward, backward and groupwise. Often, but optionally, this is to be done within groups. Many people here will think "panels" and that's fine so long as they know that stripolate neither requires nor uses any tsset or xtset specifications.
    Code:
    . by id: stripolate foo time, gen(barf) forward
    (2 missing values generated)
    
    . by id: stripolate foo time, gen(barb) backward
    (4 missing values generated)
    
    . by id: stripolate foo time, gen(barg) groupwise
    (3 missing values generated)
    
    . 
    . list, sepby(id)
    
         +--------------------------------------+
         | id   time   foo   barf   barb   barg |
         |--------------------------------------|
      1. |  1      1                   A      A |
      2. |  1      2     A      A      A      A |
      3. |  1      3            A      A      A |
      4. |  1      4     A      A      A      A |
      5. |  1      5            A             A |
         |--------------------------------------|
      6. |  2      1     B      B      B      B |
      7. |  2      2            B      B      B |
      8. |  2      3     B      B      B      B |
      9. |  2      4            B             B |
     10. |  2      5            B             B |
         |--------------------------------------|
     11. |  3      1                   B        |
     12. |  3      2     B      B      B      B |
     13. |  3      3            B      C        |
     14. |  3      4     C      C      C      C |
     15. |  3      5            C               |
         +--------------------------------------+
    I resisted writing this program because what it calls the forward and backward methods typically require only one or two lines of Stata code, as http://www.stata.com/support/faqs/da...ues/index.html has been explaining since 2000! I think it is (usually) a disservice to Stata users to supply programs to do what is otherwise available in as simple a form.
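
    For the sandbox above, those one or two lines might look like the following sketch. The names barf2, barb2 and negtime are illustrative; this is ordinary Stata code in the spirit of that FAQ, not the code inside stripolate.

    Code:
    * forward: copy the previous value into each gap, cascading within id
    clonevar barf2 = foo
    bysort id (time): replace barf2 = barf2[_n-1] if missing(barf2)

    * backward: the same trick after reversing time within id
    clonevar barb2 = foo
    gen negtime = -time
    bysort id (negtime): replace barb2 = barb2[_n-1] if missing(barb2)

    * restore the original order
    sort id time
    drop negtime

    The cascade works because replace proceeds observation by observation, so each filled value is available to the next observation in the same group.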

    What swung me was what is here called the groupwise method, or filling in gaps in a block of observations if and only if there is just one distinct non-missing value within that block. (In the example above, the method is applied for identifiers 1 and 2 but not 3, as different values are present.)
    Checking for that condition is trickier than many users want to have to work out each time, and not checking for it might be problematic too.
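
    To give a flavour of what that check involves when done by hand, here is one possible sketch for the sandbox above. It is not the code inside stripolate, and nonmiss, same and barg2 are illustrative names.

    Code:
    * mark known values, then test whether they all agree within each id
    gen byte nonmiss = !missing(foo)
    bysort id nonmiss (foo): gen byte same = foo[1] == foo[_N]

    * sort known values last within id; fill gaps only if the group has
    * at least one known value and all known values agree
    clonevar barg2 = foo
    bysort id (nonmiss foo): replace barg2 = foo[_N] if missing(barg2) & nonmiss[_N] & same[_N]

    sort id time
    drop nonmiss same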

    There's perhaps scope for forward-backward and backward-forward methods, namely

    1. copy forward, but also backward from the first non-missing value

    2. copy backward, but also forward from the last non-missing value

    but I wait for signals that people really want either.


    • #3
      There's perhaps scope for forward-backward and backward-forward methods, namely

      1. copy forward, but also backward from the first non-missing value

      2. copy backward, but also forward from the last non-missing value

      but I wait for signals that people really want either.
      But wouldn't 1 be just two applications of -stripolate-, first forward followed by backward, and 2 the same in reverse order? Am I missing something here?


      • #4
        Yes, I think that's right. The small question is whether either is implemented as a direct option.
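
        For the sandbox in #2, the two-pass version of method 1 might look like this, with fwd and fwdback as illustrative variable names:

        Code:
        * pass 1: copy forward within id
        by id: stripolate foo time, gen(fwd) forward
        * pass 2: copy backward, which fills what precedes the first known value
        by id: stripolate fwd time, gen(fwdback) backward

        Swapping the two methods gives method 2.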


        • #5
          I just want to thank Nick Cox for the very useful mipolate command. It would be great to see a "Speaking Stata" article in the Stata Journal on these various interpolation / extrapolation methods at some time.
          https://twitter.com/Kripfganz


          • #6
            Thanks for the appreciation. It is definitely something I might write given the time. Meanwhile, there is stuff not in the help file at http://www.stata.com/meeting/columbu...mbus15_cox.ppt


            • #7
              -mipolate- looks like a very useful function. Is there any way to constrain the interpolated or extrapolated function to be monotone increasing or nondecreasing? I don't see any mention of this in the documentation.


              • #8
                mipolate is a command, not a function. But yes: if the data are monotone, then the pchip option will respect that. This was mentioned in #1 and in the help.
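
                A quick way to check that on toy data, assuming the same syntax as ipolate (the data and the name ypchip are illustrative only):

                Code:
                clear
                set obs 10
                gen x = _n
                * monotone increasing data with gaps
                gen y = x^2 if inlist(x, 1, 2, 4, 7, 10)
                mipolate y x, gen(ypchip) pchip
                * data are in x order here, so compare each value with its predecessor
                assert ypchip >= ypchip[_n-1] if _n > 1

                If pchip behaves as described in #1, the assert should pass without comment.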


                • #9
                  Nick Cox Thanks for this very useful command.

                  Would it be possible to include the option to replace the original series rather than to generate a new one?


                  • #10
                    It is possible. I have to say that’s not part of my intentions at all, even as an option at user discretion. The models here are generate and egen, both of which are quite separate from replace. You could write your own wrapper command.
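
                    The core of such a wrapper need be no more than the following sketch, in which y, x and y_filled are hypothetical variable names and linear is just one choice of method:

                    Code:
                    * interpolate into a scratch copy
                    mipolate y x, gen(y_filled) linear
                    * overwrite only the gaps in the original, then tidy up
                    replace y = y_filled if missing(y)
                    drop y_filled

                    Note that this throws away the record of which values were observed and which interpolated.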
