Filling in a missing year using interpolation?

Andrew Lin

Join Date: Sep 2014

Posts: 7
#1

Filling in a missing year using interpolation?

26 Jan 2016, 13:48

I have data by different counties for the years 1981 and 2011. Each variable is in it's own column.

For example: One column is Population (1981), with values for multiple counties. Another column is Population (2011) with values for those same counties.

..............| Pop80 | Pop11 | Pop01
--------------------------------------------------
County A | 1234.. | 2234.. | ..?
--------------------------------------------------
County B | 9876.. | 10876 | ..?

Is there some way to fill in an approximation of what the population might have been in 2001?

I've seen some PDFs about linear interpolation and the tsfill command, but those seem to have a separate column for year, which is not how my data is constructed. Would love to hear any suggestions, and please let me know if my situation needs further clarifying
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30163
#2

26 Jan 2016, 14:06

Well, linear interpolation of population is probably not a very good way to go. But putting that aside, the trick is to first -reshape- your data to the long layout that has a separate variable for year. Then you can apply -ipolate-. The names of your variables, where 80 represents 1980 and 11 represents 2011 adds a slight complication to the process. But here's how it would work:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str9 country float(pop80 pop11) "Country A" 1234 2234 "Country B" 9876 10876 end gen pop01 = . reshape long pop, i(country) j(yr) string destring yr, replace recode yr (80 = 1980) (11 = 2011) (1 = 2001), gen(year) drop yr by country, sort:ipolate pop year, gen(pop_interp)

And then, if you really want your data back in wide layout, you can do that. But remember that almost anything you are going to do with your data in Stata will be easier in long layout anyway, so going back to wide layout is likely to be a waste of time.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35773
#3

26 Jan 2016, 14:24

I agree with Clyde. I'd add, however, that linear interpolation is in principle not the only method. But if you really have just two known points, there aren't many alternatives, except that geometric interpolation seems a more natural first approximation.
1 like
Comment
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#4

27 Jan 2016, 12:43

If you really find the wide layout for the data most usable, then you can calculate the linear interpolation for any year between 1980 and 2011 as a simple function of pop11 and pop80.
1 like
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35773

27 Jan 2016, 13:30

Let's spell it out. For geometric interpolation, we just interpolate logarithms linearly and then exponentiate:

Code:

clear 
input str4 county pop80  pop11 
A  1234  2234  
B  9876  10876 
test 42 42 
end 
gen pop01_lin = round(pop80 + (21/31) * (pop11 - pop80)) 
gen pop01_geo = round(exp(ln(pop80) + (21/31) * (ln(pop11) - ln(pop80)))) 

list 

     +----------------------------------------------+
     | county   pop80   pop11   pop01_~n   pop01_~o |
     |----------------------------------------------|
  1. |      A    1234    2234       1911       1845 |
  2. |      B    9876   10876      10553      10543 |
  3. |   test      42      42         42         42 |
     +----------------------------------------------+

Comment

Andrew Lin

Join Date: Sep 2014

Posts: 7
#6

02 Feb 2016, 14:58

Thanks all for your help!

Originally posted by Nick Cox View Post

Let's spell it out. For geometric interpolation, we just interpolate logarithms linearly and then exponentiate:

Code:

clear input str4 county pop80 pop11 A 1234 2234 B 9876 10876 test 42 42 end gen pop01_lin = round(pop80 + (21/31) * (pop11 - pop80)) gen pop01_geo = round(exp(ln(pop80) + (21/31) * (ln(pop11) - ln(pop80)))) list +----------------------------------------------+ | county pop80 pop11 pop01_~n pop01_~o | |----------------------------------------------| 1. | A 1234 2234 1911 1845 | 2. | B 9876 10876 10553 10543 | 3. | test 42 42 42 42 | +----------------------------------------------+

When I use those commands, it doesn't give me any decimals - just integers. Is there any way around this?

Also, one variable I'm trying to calculate is students per capita. Would it be best to use this code on the population and the student pop measures, and then derive students per capita from the resulting 2001 variables? Or would it be better to just calculate students per capita for 1980 and 2011, then use those to derive 2001 population using the syntax you provided?

Thanks again for your help!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35773
#7

02 Feb 2016, 16:25

That's because I used round(); just omit it if you want the decimal places!

I would just interpolate on students per capita. It's simpler that way.
Comment

Announcement

Filling in a missing year using interpolation?

Comment

Comment

Comment

Comment

Comment

Comment