  • Population data question

    Hi all, I have a panel dataset spanning 10 years (2005-2014), with population data for 2005 and 2010 only. What is an appropriate method for filling in the missing years? Many thanks in advance.

  • #2
    Could you get yearly data from elsewhere? That would be the most accurate, of course. If not, you'll have to estimate. You could simply interpolate linearly (a minimal sketch is below), but there is a lot to consider. For a more in-depth treatment, take a look at Fukuda, Kosei (2010), Interpolation and forecasting of population census data.
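
    In Stata that could be as simple as the following sketch (id and population are hypothetical variable names, since we haven't seen your data):
    Code:
    * linear interpolation of population within each panel unit
    ipolate population year, by(id) gen(pop_li)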

    • #3
      Mer:
      welcome to the list.
      If you are missing the population size (and Andrew's helpful solution is not feasible), you may also want to consider retrieving the official annual population growth rate from 2010 onwards for the countries you're interested in and estimating the missing values for population accordingly, as in the sketch below.
      Otherwise, please give more detail about what is the matter with your panel dataset via an excerpt/example (-dataex-, findable via -search dataex-, would be the way to go).
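      A minimal sketch of that approach (the variable names id, pop, and growth_rate are hypothetical, and growth_rate is assumed to be a proportion, e.g. 0.02 for 2%):
      Code:
      * pick up each unit's observed 2010 population
      bysort id: egen pop2010 = max(cond(year == 2010, pop, .))
      * carry it forward with the official annual growth rate
      gen pop_est = pop2010 * (1 + growth_rate)^(year - 2010) if year > 2010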
      Kind regards,
      Carlo
      (Stata 19.0)

      • #4
        If you are talking about the population of countries, you can extract data from the World Development Indicators (WDI) database or from the CIA World Factbook. WDI is more reliable and provides population data for more than ten categories, so it should be more useful.

        • #5
          Hi all! Thanks for your replies. My data is more disaggregated than country-level (it is a global grid of 55x55 km), so using a different source is not really an option. I will check out the Fukuda piece!

          • #6
            Hi all,

            I am running into a similar problem (in fact, I believe I am using the same data referenced by Mer Kravitz): population data for grids defined as 55x55km areas. I have data for 1990, 1995, 2000, 2005, and 2010, and am particularly interested in estimating values for 2013 and 2014 (strictly extrapolation, since those years lie beyond my last observation). Here is an example of the data, where pop_gpw_sum is population and gid identifies the 55x55km cell.
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input long gid int year double pop_gpw_sum
            49182 1990 5.731053
            49182 1991        .
            49182 1992        .
            49182 1993        .
            49182 1994        .
            49182 1995 6.682834
            49182 1996        .
            49182 1997        .
            49182 1998        .
            49182 1999        .
            49182 2000 7.660807
            49182 2001        .
            49182 2002        .
            49182 2003        .
            49182 2004        .
            49182 2005 8.662068
            49182 2006        .
            49182 2007        .
            49182 2008        .
            49182 2009        .
            49182 2010 9.683701
            49183 1990 8.163606
            49183 1991        .
            49183 1992        .
            49183 1993        .
            49183 1994        .
            49183 1995 9.519372
            49183 1996        .
            49183 1997        .
            49183 1998        .
            end

            Given the considerations involved in linear interpolation, would I proceed by creating observations for 2013 and 2014 and then using

            ipolate pop_gpw_sum year, gen(y1)?

            Thank you!


            • #7
              There are many more possibilities than linear interpolation, which doesn't strike me as the best default for population change! (Remember Malthus.)

              Searching the forum for mentions of mipolate (SSC) will show some. Here, however, I think I would recommend interpolating linearly in logarithms.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input long gid int year double pop_gpw_sum
              49182 1990 5.731053
              49182 1991        .
              49182 1992        .
              49182 1993        .
              49182 1994        .
              49182 1995 6.682834
              49182 1996        .
              49182 1997        .
              49182 1998        .
              49182 1999        .
              49182 2000 7.660807
              49182 2001        .
              49182 2002        .
              49182 2003        .
              49182 2004        .
              49182 2005 8.662068
              49182 2006        .
              49182 2007        .
              49182 2008        .
              49182 2009        .
              49182 2010 9.683701
              49183 1990 8.163606
              49183 1991        .
              49183 1992        .
              49183 1993        .
              49183 1994        .
              49183 1995 9.519372
              49183 1996        .
              49183 1997        .
              49183 1998        .
              end
              
              * interpolate (and, with -epolate-, extrapolate) linearly on the log scale
              gen log_pop = log(pop_gpw_sum)
              ipolate log_pop year, by(gid) epolate gen(pop_int)
              * back-transform to the population scale
              replace pop_int = exp(pop_int)
              
              * compare observed and interpolated values by cell
              scatter pop_gpw_sum pop_int year, ms(Oh +) by(gid)
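
              Note that -ipolate- (with or without -epolate-) only fills in existing observations, so to obtain 2013 and 2014 you would first need rows for 2011-2014, added before the steps above. One possible sketch (assuming, as in the full data, that each gid is observed annually through 2010):
              Code:
              * add one empty row per gid for each of 2011-2014
              tempfile orig
              save `orig'
              keep gid
              duplicates drop
              expand 4
              bysort gid: gen int year = 2010 + _n
              append using `orig'
              sort gid year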

              • #8
                Dear Nick Cox, can you please tell me why you preferred to interpolate in logarithms rather than directly in the given values? I want to understand this method so that I can decide whether it can be applied to my missing values for population, number of schools, number of employed persons, and GDP at the county level.
                Thank you.

                • #9
                  Population growth is expected to be exponential as a first approximation, not linear. This is in essence what Malthus pointed out, although he wasn't the first to say so. Therefore I'd interpolate linearly in the logarithms. This doesn't have to be an article of faith, and one can and should check patterns in the data you have, but even if -- as is common -- the percent increase (or even decrease) is fluctuating over time, you'd still expect exponential change locally to be a better approximation than linear.
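
                  For intuition, compare the two kinds of midpoint for a made-up population that doubles from 100 to 200 over a decade:
                  Code:
                  * linear midpoint vs. geometric (log-scale) midpoint of 100 and 200
                  display "linear:    " (100 + 200)/2
                  display "geometric: " exp((log(100) + log(200))/2)

                  Linear interpolation gives 150; interpolating in logarithms gives about 141.4, the value consistent with a constant growth rate between the two years.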

                  More at https://www.springer.com/gp/book/9780857291141 -- which I own but haven't read yet.

                  • #10
                    Thank you so much, Nick Cox! That was really helpful.
