Contradictions in Assertion: trouble with ipolate/mipolate for panel data

Sam Volpe

Join Date: Jun 2021
Posts: 36

26 Aug 2021, 09:03

Hi everyone,

dataex sample data is included at the end of this post. I am trying to interpolate panel data using this code:

Code:

 
 ipolate v2pesecsch year, gen(ipolate1) epolate

Code:

 
 mipolate v2pesecsch year, gen(ipolate2) pchip epolate

Code:

 
 mipolate v2pesecsch year, gen(ipolate3) idw epolate

I'm not sure which of these lines to use in the end. I'm also not sure which variable to use as my x variable. The ipolate help page says "interpolation requires that yvar be a function of xvar", and I think year is the closest thing I have to that, but my yvar "v2pesecsch" is just panel data for school enrollment rates. Here is the syntax for ipolate:

Code:

ipolate yvar xvar [if] [in] , generate(newvar) [epolate]

THE PROBLEM I'M HAVING:
Regardless of which line I use, the variable that it generates has made-up data even in the non-missing rows (1999-2010) which does NOT match the original data of the original variable for those rows. You can see this in the first handful of rows: ipolate1 and v2pesecsch have different recorded observations for the years 1999 to 2010 (and in my complete dataset, from 1960-2010). Yet ipolate1, ipolate2, and ipolate3 all record the same observations as each other in those rows. So I cannot really trust the interpolated data in the missing-data rows 2011-2020.

When I check the generated variable against the original with assert, I get the error "5,437 contradictions in 5,437 observations", which makes sense.

Code:

  assert v2pesecsch==ipolate1 if !mi(v2pesecsch)
5,437 contradictions in 5,437 observations

assert v2pesecsch==ipolate1 if mi(v2pesecsch)
4,741 contradictions in 4,741 observations

(I'm not sure what difference the "!" point makes. As far as I understand, "!" means "not", and I WANT to fill in the missing observations, not the "not missing observations". )

When I looked at others' examples using ipolate and mipolate, I noticed that their generated variables have the same observations as their original variable everywhere except the missing sections, as it should be. I have been trying to follow said examples and I am not sure where I am going wrong. I would be much obliged to anyone who could point me in the right direction for how to have my interpolation variable match the original variable in the non-missing rows.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str32 country_name double(year v2pesecsch ipolate1 ipolate2 ipolate3)
"Afghanistan" 1999 23.074 61.86512962962961  61.86512962962964 61.865129629629614
"Afghanistan" 2000 23.618 63.61587155963303 63.615871559633064 63.615871559633035
"Afghanistan" 2001 24.271 64.51291666666668  64.51291666666665   64.5129166666667
"Afghanistan" 2002 24.924 65.55614814814813  65.55614814814814  65.55614814814817
"Afghanistan" 2003 25.576 66.59934259259259  66.59934259259263  66.59934259259259
"Afghanistan" 2004 26.229  67.6425277777778  67.64252777777783  67.64252777777781
"Afghanistan" 2005 26.882 68.92466972477065  68.92466972477061  68.92466972477061
"Afghanistan" 2006  28.21 69.67728703703705  69.67728703703703  69.67728703703705
"Afghanistan" 2007 29.538 70.66885185185188  70.66885185185181  70.66885185185184
"Afghanistan" 2008 30.866 71.66034259259256   71.6603425925926  71.66034259259258
"Afghanistan" 2009 32.194 72.65192592592594  72.65192592592592  72.65192592592592
"Afghanistan" 2010 33.522 73.85347706422019  73.85347706422021   73.8534770642202
"Afghanistan" 2011      . 75.05502820251422  75.24489396645133   71.9520302301615
"Afghanistan" 2012      . 76.25657934080846  76.76587050954683   69.9676005772284
"Afghanistan" 2013      . 77.45813047910269  78.35610057043426  68.40773784965263
"Afghanistan" 2014      . 78.65968161739693  79.95527802604123   67.1291844948158
"Afghanistan" 2015      . 79.86123275569116  81.50309675329525  66.04587261740855
"Afghanistan" 2016      .  81.0627838939854   82.9392506291239  65.10682636237689
"Afghanistan" 2017      . 82.26433503227963  84.20343353045476  64.27936957990778
"Afghanistan" 2018      . 83.46588617057387  85.23533933421537 63.541141519250935
"Afghanistan" 2019      .  84.6674373088681   85.9746619173333  62.87606305245505
"Afghanistan" 2020      . 85.86898844716234  86.36109515673611   62.2721250292932
end

P.S. Sorry if you've already seen this. I had to repost it because I had some trouble with my previous post and forgot to include some information.

Tags: interpolation, ipolate, mipolate, missing data imputation, missing observations

Nick Cox

Join Date: Mar 2014
Posts: 36057

26 Aug 2021, 09:22

mipolate is from SSC, as you are asked to explain.

It is extraordinary to think that these data justify that number of decimal places.

It seems to me that the biggest problem here in your example -- given extra poignancy by current events -- is the need to extrapolate wildly from 12 years of data over a further 10 years of gaps. To the question of which method should be trusted, my answer is none of them. They are all wild guesses in the dark, and qualitative information makes any guesses dubious if not fantastic. I know that won't help your project. But despite being the author of mipolate I will defend interpolation only for filling in small gaps in very well-behaved series. Otherwise it is an analytic gun that can cause injury to any analysis.

But the core of your question is why there are (tiny, tiny) differences between known and interpolated values. The problem is that I can't reproduce what you show.

With this code I get silent assent to the assert:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str32 country_name double(year v2pesecsch ipolate1 ipolate2 ipolate3)
"Afghanistan" 1999 23.074 61.86512962962961  61.86512962962964 61.865129629629614
"Afghanistan" 2000 23.618 63.61587155963303 63.615871559633064 63.615871559633035
"Afghanistan" 2001 24.271 64.51291666666668  64.51291666666665   64.5129166666667
"Afghanistan" 2002 24.924 65.55614814814813  65.55614814814814  65.55614814814817
"Afghanistan" 2003 25.576 66.59934259259259  66.59934259259263  66.59934259259259
"Afghanistan" 2004 26.229  67.6425277777778  67.64252777777783  67.64252777777781
"Afghanistan" 2005 26.882 68.92466972477065  68.92466972477061  68.92466972477061
"Afghanistan" 2006  28.21 69.67728703703705  69.67728703703703  69.67728703703705
"Afghanistan" 2007 29.538 70.66885185185188  70.66885185185181  70.66885185185184
"Afghanistan" 2008 30.866 71.66034259259256   71.6603425925926  71.66034259259258
"Afghanistan" 2009 32.194 72.65192592592594  72.65192592592592  72.65192592592592
"Afghanistan" 2010 33.522 73.85347706422019  73.85347706422021   73.8534770642202
"Afghanistan" 2011      . 75.05502820251422  75.24489396645133   71.9520302301615
"Afghanistan" 2012      . 76.25657934080846  76.76587050954683   69.9676005772284
"Afghanistan" 2013      . 77.45813047910269  78.35610057043426  68.40773784965263
"Afghanistan" 2014      . 78.65968161739693  79.95527802604123   67.1291844948158
"Afghanistan" 2015      . 79.86123275569116  81.50309675329525  66.04587261740855
"Afghanistan" 2016      .  81.0627838939854   82.9392506291239  65.10682636237689
"Afghanistan" 2017      . 82.26433503227963  84.20343353045476  64.27936957990778
"Afghanistan" 2018      . 83.46588617057387  85.23533933421537 63.541141519250935
"Afghanistan" 2019      .  84.6674373088681   85.9746619173333  62.87606305245505
"Afghanistan" 2020      . 85.86898844716234  86.36109515673611   62.2721250292932
end

ipolate v2 year, gen(wanted1) 

assert v2 == wanted1 if v2 < .

Something else is going on that you have not told us. Perhaps you are jumping back and forth between float and double versions of the variable. Note that recast double can never restore information that was lost when a variable was read in as float.
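The float/double point can be checked with a minimal sketch. The variable names x_float and x_double are made up for illustration and are not in the posted data; the value 23.074 is just the first observation from the example.

Code:

```stata
clear
set obs 1

* Hypothetical variables: the same literal stored at two precisions
generate float  x_float  = 23.074
generate double x_double = 23.074

* A float cannot hold 23.074 exactly, so it differs from the double literal
assert x_float != 23.074
assert x_double == 23.074

* Rounding the literal to float precision makes the comparison succeed
assert x_float == float(23.074)

* recast double widens the storage type but does not restore lost digits
recast double x_float
assert x_float != x_double
```

If a variable was ever held as float, comparisons against its double counterpart can fail in this way even though the displayed values look identical.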


William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

26 Aug 2021, 10:51

"I am trying to interpolate panel data using this code:"

In that case, the command to start with is

Code:

by country: ipolate v2pesecsch year, gen(ipolate1) epolate

following the advice given in the second example presented in the output of help ipolate. The code you have run does not automatically recognize and accommodate panel data, as is the case for most commands that don't begin with xt.
Sam Volpe

Join Date: Jun 2021

Posts: 36
#4

26 Aug 2021, 11:23

Nick Cox Thank you for your response. In my data browser there aren't that many decimal places. I'm not sure why so many were displayed when I used dataex. I appreciate your opinion about the danger of this method. If it helps, I'm trying to interpolate 10 years of data from 50 years of data, but I only displayed 21 rows here. I'm also looking at mi impute, do you think that would be any better? These are not for final results, it's more of an experiment to see how it alters results I have without interpolation, sort of like a robustness check. I have never heard of float and double versions of variables, I will look into that. I promise I'm not hiding anything - if something else is going on, I'm unaware of it and I'll have to figure that out.

To reiterate, the differences between the known and interpolated values are not tiny. For example, in the first row, the known value is 23.074 and the interpolated value is 61.86. That is what perplexes me most.

As always, very grateful for your insights.
Sam Volpe

Join Date: Jun 2021

Posts: 36
#5

26 Aug 2021, 11:25

William Lisowski Thank you! Differentiating categories in panel data trips me up sometimes.
Nick Cox

Join Date: Mar 2014

Posts: 36057
#6

26 Aug 2021, 12:12

I see now.

I was eyeballing the interpolated values for years when you have data, which are very similar to each other, but not to the original, as you correctly emphasise.

Also, you did give your code clearly in #1. Frankly, and this is certainly my fault, I was not studying it carefully but tacitly assuming that you were using by: as a framework. I didn't need to do that in #2 because my example was for a single panel.

But the mystery is now clear. You were interpolating using all countries and in that case even when values are known the interpolated value is the mean across countries; it is not the original value for each panel.
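A toy illustration of the difference, as a sketch only: the panels "A" and "B" and their values are invented for this example and are not from the data in #1. Without by:, observations that share a year are ties on xvar, so the result in known rows should be the mean across panels rather than each panel's own value.

Code:

```stata
clear
input str1 id year y
"A" 2000 10
"A" 2001 20
"A" 2002  .
"B" 2000 30
"B" 2001 40
"B" 2002  .
end

* Pooled: both panels are treated as one series on year
ipolate y year, gen(pooled) epolate

* By panel: each id is interpolated separately, in year order
bysort id (year): ipolate y year, gen(bypanel) epolate

list, sepby(id)

* Known values are reproduced only in the by-panel version
assert bypanel == y if !missing(y)
count if pooled != y & !missing(y)
```

The final count reports how many known rows the pooled version fails to reproduce; the assert on bypanel should pass silently.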

I need to look at the code for mipolate to see if there is any inconsistency in storage type used.
Sam Volpe

Join Date: Jun 2021

Posts: 36
#7

26 Aug 2021, 12:51

Nick Cox No worries! You seem to answer an impossibly high number of posts every day, it makes sense that you would recognize patterns and make small and very reasonable assumptions - I should have known by now to use

Code:

by: .

Using

Code:

by country_name:

as William Lisowski suggested has made a world of difference. All the columns (of original and interpolated data) match for the non-missing rows! Now I'm just experimenting with different options of ipolate and mipolate. Thank you both so much for your help so far.