-mipolate- now available from SSC: new program for interpolation

Nick Cox

Join Date: Mar 2014

Posts: 35724
#1


03 Sep 2015, 09:49

Thanks to Kit Baum as usual, a new program mipolate is now available from SSC.

Use

Code:

ssc inst mipolate

to install.

Stata version 12 is required (but see below for a note for any people on
version 10 or 11 who may be interested).

mipolate is for interpolation, and extrapolation too, for
one-dimensional series, replacing missing values with interpolated
values in a copy of a variable. It is in effect an unofficial
generalisation of the official command ipolate. It builds in the content
of, and thus supersedes, the previously issued cipolate, csipolate,
pchipolate and nnipolate (all SSC), and adds yet more methods.

The by: prefix is allowed, as with ipolate. In particular, that means
support for panel or longitudinal data.

mipolate uses one of the following methods: linear, cubic, cubic spline,
pchip (piecewise cubic Hermite interpolation), idw (inverse distance
weighted), forward, backward, nearest neighbour, groupwise. The default
method is linear.

mipolate does not require tsset or xtset data and makes no check for, or
use of, any such settings.
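A minimal sketch of a typical call, for illustration only. It assumes that mipolate takes the same form as the stripolate calls shown later in this thread (variable to interpolate, then the position variable, then a generate() option) and that the package is installed. The data and the names id, time, y and y_lin are hypothetical.

```stata
* Hypothetical toy panel: two ids, gaps in y
clear
input id time y
1 1 1
1 2 .
1 3 3
2 1 10
2 2 .
2 3 .
2 4 40
end

* by: needs sorted data; no tsset or xtset is needed
sort id time

* No method named, so the default (linear) is used
by id: mipolate y time, generate(y_lin)

list, sepby(id)
```

With linear interpolation the gap in id 1 should be filled with 2 and the gaps in id 2 with 20 and 30.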

linear specifies linear interpolation using known values
before and after any missing values. This is the default method.

cubic specifies cubic interpolation, using exact fitting of a cubic
curve to two data points before and two data points after each
observation for which there is a missing value. Missing values are thus
produced whenever fewer than two data points are present on either side.
Note that this is not a spline method.

spline specifies natural cubic spline interpolation. The method uses official
Mata functions spline3() and spline3eval().

pchip specifies piecewise cubic Hermite interpolation. This method uses
piecewise cubics that join smoothly, so that both the interpolated
function and its first derivative are continuous. In addition, the
interpolant is shape-preserving in the sense that it cannot overshoot
locally; sections in which the observed series is increasing, decreasing
or constant remain so after interpolation, and local extremes
(maxima, minima) also remain so. This interpolation method also
extrapolates.

idw[(power)] specifies inverse distance weighted interpolation. This
method uses a weighted average of non-missing values, the weights being
reciprocals of the powered distance between values, the power being zero
or positive. The default power is 2; any other choice must be specified.
Thus with power 2, values at distance 1 from a point with unknown values
have weight 1, values at distance 2 from a point have weight 1/4,
distance 3 weight 1/9, and so forth. If the power is 0, all known
points have equal weight and the interpolant reduces to the average of
all values. As the power becomes large, only those values that are
nearest have appreciable weight. This interpolation method also
extrapolates.
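The weighting can be checked by hand. The sketch below does not call mipolate; it only reproduces the arithmetic described above for a hypothetical case with two known values, 10 at position 1 and 40 at position 4, and an unknown value at position 2.

```stata
* Hand calculation of an inverse distance weighted value
* Hypothetical data: y = 10 at x = 1, y = 40 at x = 4, unknown at x = 2

* Power 2: weight = 1 / distance^2
scalar w1 = 1 / abs(2 - 1)^2
scalar w2 = 1 / abs(2 - 4)^2
scalar idw2 = (w1 * 10 + w2 * 40) / (w1 + w2)
display idw2

* Power 0: every weight is 1, so the result is the plain mean
scalar idw0 = (1 * 10 + 1 * 40) / (1 + 1)
display idw0
```

The weights are 1 and 1/4, so power 2 gives (10 + 10)/1.25 = 16, closer to the nearer value, while power 0 gives the mean, 25.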

forward specifies forward interpolation, so that any known value just
before one or more missing values is copied in cascade to provide
interpolated values, constant within any such block.

backward specifies backward interpolation, so that any known value just
after one or more missing values is copied in cascade to provide
interpolated values, constant within any such block.

nearest specifies nearest neighbour interpolation, which means using
known values either before or after missing values, depending on
which is nearer. When known values before and after are
equally distant from a missing value, there is a choice of rules that may
be applied. The default rule uses the mean of the two values. The
ties() option provides alternative rules. This method also
extrapolates, as unknown values before the first known value and unknown
values after the last known value are replaced by those respective known
values.

groupwise specifies that non-missing values be copied to missing values
if, and only if, just one distinct non-missing value occurs in each
group. Thus a group of values ., 42, ., . qualifies as 42 is not missing
and is the only non-missing value in the group. Hence the missing values
in the group will be replaced with 42 in the new variable. By the same
rules 42, ., 42, . qualifies but 42, ., 43, . does not. Normally, but
not necessarily, this option is used in conjunction with by:, which is
how groups are specified; otherwise the (single) group is the entire set
of observations being used.

(So what about users of version 10 or 11? The code works fine in those
versions. The problem is that some SMCL directives that work in Stata 12
up will not work in 10 or 11. Anyone who downloaded the files from SSC,
edited the version statement in the ado file and edited the help files
would get a serviceable variant on mipolate if they did that correctly,
but that's your responsibility.)

Nick Cox

Join Date: Mar 2014
Posts: 35724

20 Dec 2016, 04:22

Thanks to Kit Baum as always, a new program stripolate for string interpolation has been added to the mipolate package, which is why it is announced here.

To install, use

Code:

ssc install mipolate

To update an existing installation, use

Code:

ssc install mipolate, replace

as usual or

Code:

adoupdate

if preferred.

The essence of the matter should be conveyed by a sandbox and its treatment:

Code:

 
clear
set obs 15
gen id = ceil(_n/5)
bysort id: gen time = _n
gen foo = "A" if inlist(_n, 2, 4)
replace foo = "B" if inlist(_n, 6, 8, 12)
replace foo = "C" in 14 


list, sepby(id)  

     +-----------------+
     | id   time   foo |
     |-----------------|
  1. |  1      1       |
  2. |  1      2     A |
  3. |  1      3       |
  4. |  1      4     A |
  5. |  1      5       |
     |-----------------|
  6. |  2      1     B |
  7. |  2      2       |
  8. |  2      3     B |
  9. |  2      4       |
 10. |  2      5       |
     |-----------------|
 11. |  3      1       |
 12. |  3      2     B |
 13. |  3      3       |
 14. |  3      4     C |
 15. |  3      5       |
     +-----------------+

So we have a string variable with gaps. Interpolation is here about filling the gaps with neighbouring values, and stripolate supports three ways to do it, which it calls forward, backward and groupwise. Often, but optionally, this is to be done within groups. Many people here will think "panels" and that's fine so long as they know that stripolate neither requires nor uses any tsset or xtset specifications.

Code:

. by id: stripolate foo time, gen(barf) forward
(2 missing values generated)

. by id: stripolate foo time, gen(barb) backward
(4 missing values generated)

. by id: stripolate foo time, gen(barg) groupwise
(3 missing values generated)

. 
. list, sepby(id)

     +--------------------------------------+
     | id   time   foo   barf   barb   barg |
     |--------------------------------------|
  1. |  1      1                   A      A |
  2. |  1      2     A      A      A      A |
  3. |  1      3            A      A      A |
  4. |  1      4     A      A      A      A |
  5. |  1      5            A             A |
     |--------------------------------------|
  6. |  2      1     B      B      B      B |
  7. |  2      2            B      B      B |
  8. |  2      3     B      B      B      B |
  9. |  2      4            B             B |
 10. |  2      5            B             B |
     |--------------------------------------|
 11. |  3      1                   B        |
 12. |  3      2     B      B      B      B |
 13. |  3      3            B      C        |
 14. |  3      4     C      C      C      C |
 15. |  3      5            C               |
     +--------------------------------------+

I resisted writing this program because what it calls the forward and backward methods typically require only one or two lines of Stata code, as http://www.stata.com/support/faqs/da...ues/index.html has been explaining since 2000! I think it is (usually) a disservice to Stata users to supply programs to do what is otherwise available in as simple a form.
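For comparison, here is a minimal sketch of the one or two lines in question, applied to the same sandbox. The variable names fwd, bwd and negtime are made up for this illustration; the technique is the standard cascade with replace and a subscript under by:.

```stata
* Sandbox as above
clear
set obs 15
gen id = ceil(_n/5)
bysort id: gen time = _n
gen foo = "A" if inlist(_n, 2, 4)
replace foo = "B" if inlist(_n, 6, 8, 12)
replace foo = "C" in 14

* Forward: copy the previous value down in cascade within each id
gen fwd = foo
bysort id (time): replace fwd = fwd[_n-1] if missing(fwd)

* Backward: reverse the time order so that "previous" means "next in time"
gen negtime = -time
gen bwd = foo
bysort id (negtime): replace bwd = bwd[_n-1] if missing(bwd)

sort id time
list id time foo fwd bwd, sepby(id)
```

The cascade works because replace proceeds observation by observation, so each filled value is available to the next observation in the block. The results should match barf and barb in the listing above.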

What swung me was what is here called the groupwise method, or filling in gaps in a block of observations if and only if there is just one distinct non-missing value within that block. (In the example above, the method is applied for identifiers 1 and 2 but not 3, as different values are present.)
Checking for that condition is trickier than many users want to have to work out each time, and not checking for it might be problematic too.
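One way to make that check by hand is sketched below. This is not the code inside stripolate, just an illustration of the condition; the names tag, ndistinct and filled are made up here.

```stata
* Sandbox as above
clear
set obs 15
gen id = ceil(_n/5)
bysort id: gen time = _n
gen foo = "A" if inlist(_n, 2, 4)
replace foo = "B" if inlist(_n, 6, 8, 12)
replace foo = "C" in 14

* Tag the first occurrence of each distinct non-missing value within id
bysort id foo: gen byte tag = (_n == 1) & !missing(foo)

* Count the distinct non-missing values in each id
by id: egen ndistinct = total(tag)

* Fill only where exactly one distinct value exists; empty strings
* sort first, so foo[_N] is that value
gen filled = foo
bysort id (foo): replace filled = foo[_N] if ndistinct == 1

sort id time
list id time foo filled, sepby(id)
```

Identifiers 1 and 2 should be filled completely, while identifier 3 is left untouched because it contains both B and C.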

There's perhaps scope for forward-backward and backward-forward methods, namely

1. copy forward, but also backward from the first non-missing value

2. copy backward, but also forward from the last non-missing value

but I wait for signals that people really want either.


Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#3

20 Dec 2016, 10:00

There's perhaps scope for forward-backward and backward-forward methods, namely

1. copy forward, but also backward from the first non-missing value

2. copy backward, but also forward from the last non-missing value

but I wait for signals that people really want either.

But wouldn't 1 be just two applications of -stripolate-, first forward followed by backward, and 2 the same in reverse order? Am I missing something here?
Nick Cox

Join Date: Mar 2014

Posts: 35724
#4

20 Dec 2016, 11:04

Yes, I think that's right. The small question is whether either is implemented as a direct option.
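A minimal sketch of the two-step version of method 1, using the stripolate syntax shown in #2 and assuming the package is installed. The names fwd and fwdbwd are made up for this illustration.

```stata
* Sandbox as in #2
clear
set obs 15
gen id = ceil(_n/5)
bysort id: gen time = _n
gen foo = "A" if inlist(_n, 2, 4)
replace foo = "B" if inlist(_n, 6, 8, 12)
replace foo = "C" in 14

* Step 1: copy forward
by id: stripolate foo time, gen(fwd) forward

* Step 2: copy backward whatever is still missing at the start of each id
by id: stripolate fwd time, gen(fwdbwd) backward

list id time foo fwd fwdbwd, sepby(id)
```

If stripolate behaves as in the listing in #2, no gaps should remain, and identifier 3 should read B, B, B, C, C. Swapping the two steps gives method 2.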
Sebastian Kripfganz

Join Date: May 2014

Posts: 2595
#5

27 Apr 2017, 08:03

I just want to thank Nick Cox for the very useful mipolate command. It would be great to see a "Speaking Stata" article in the Stata Journal on these various interpolation / extrapolation methods at some time.

https://www.kripfganz.de/stata/
Nick Cox

Join Date: Mar 2014

Posts: 35724
#6

27 Apr 2017, 11:01

Thanks for the appreciation. It is definitely something I might write given the time. Meanwhile, there is stuff not in the help file at http://www.stata.com/meeting/columbu...mbus15_cox.ppt
paulvonhippel

Join Date: Apr 2014

Posts: 502
#7

29 Oct 2018, 15:02

-mipolate- looks like a very useful function. Is there any way to constrain the interpolated or extrapolated function to be monotone increasing or nondecreasing? I don't see any mention of this in the documentation.
Nick Cox

Join Date: Mar 2014

Posts: 35724
#8

29 Oct 2018, 16:56

mipolate is a command, not a function. But yes: if the data are monotone, then the pchip option will respect that. This was mentioned in #1 and in the help.
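A minimal sketch of such a check, with hypothetical data and variable names. It assumes the same call form as the stripolate examples in #2 (variable, position variable, generate() option) plus the pchip method named in #1.

```stata
* Hypothetical nondecreasing series with gaps
clear
input x y
1 1
2 .
3 2
4 .
5 .
6 10
7 .
8 11
end

mipolate y x, generate(y_pchip) pchip

* The interpolated series should still be nondecreasing
sort x
assert y_pchip >= y_pchip[_n-1] if _n > 1

list
```

The assert stops with an error if any interpolated value falls below its predecessor, so a silent pass is the confirmation.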
Roman Goossens

Join Date: Nov 2015

Posts: 11
#9

30 Nov 2018, 14:24

Nick Cox Thanks for this very useful command.

Would it be possible to include the option to replace the original series rather than to generate a new one?
Nick Cox

Join Date: Mar 2014

Posts: 35724
#10

30 Nov 2018, 14:33

It is possible. I have to say that’s not part of my intentions at all, even as an option at user discretion. The models here are generate and egen, which are kept quite separate from replace. You could write your own wrapper command.
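A minimal sketch of what such a wrapper might do, written as do-file lines and not as a finished command. It assumes the mipolate call form used for stripolate in #2; the data and variable names are hypothetical. Overwriting the original loses the record of which values were observed, so keep a copy if that matters.

```stata
* Hypothetical toy panel with gaps in y
clear
input id time y
1 1 1
1 2 .
1 3 3
2 1 10
2 2 .
2 3 .
2 4 40
end
sort id time

* Interpolate into a temporary variable, then copy into the gaps only
tempvar new
by id: mipolate y time, generate(`new')
replace y = `new' if missing(y)

list, sepby(id)
```

The temporary variable disappears when the do-file ends, so only y is changed.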