  • Population data question

    Hi all, I have a panel dataset spanning 10 years (2005-2014), with population data for 2005 and 2010 only. What is an appropriate method for filling in the missing years? Many thanks in advance.

  • #2
    Could you get yearly data from elsewhere? That would be the most accurate, of course. If not, you'll have to estimate. You could simply interpolate linearly (a minimal sketch is below), but there is a lot to consider. For a more in-depth treatment, take a look at Fukuda, Kosei (2010), Interpolation and forecasting of population census data.
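
    In Stata that could be as simple as the following sketch (id and population are hypothetical variable names, since we haven't seen your data):
    Code:
    * linear interpolation of population within each panel unit
    ipolate population year, by(id) gen(pop_li)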

    • #3
      Mer:
      welcome to the list.
      If you are missing the population size (and Andrew's helpful solution is not feasible), you may also want to consider retrieving the official annual population growth rate from 2010 onwards for the countries you're interested in and estimating the missing values for population accordingly, as in the sketch below.
      Otherwise, please give more detail about what is the matter with your panel dataset via an excerpt/example (-dataex-, findable via -search dataex-, would be the way to go).
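      A minimal sketch of that approach (the variable names id, pop, and growth_rate are hypothetical, and growth_rate is assumed to be a proportion, e.g. 0.02 for 2%):
      Code:
      * pick up each unit's observed 2010 population
      bysort id: egen pop2010 = max(cond(year == 2010, pop, .))
      * carry it forward with the official annual growth rate
      gen pop_est = pop2010 * (1 + growth_rate)^(year - 2010) if year > 2010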
      Kind regards,
      Carlo
      (Stata 19.0)

      • #4
        If you are talking about the population of countries, you can extract data from the World Development Indicators (WDI) database or from the CIA World Factbook. WDI is more reliable and provides population data for more than ten categories, so it should be more useful.

        • #5
          Hi all! Thanks for your replies. My data is more disaggregated than country-level (it is a global grid of 55x55 km), so using a different source is not really an option. I will check out the Fukuda piece!

          • #6
            Hi all,

            I am running into a similar problem (in fact, I believe I am using the same data referenced by Mer Kravitz): population data for grids defined as 55x55km areas. I have data for 1990, 1995, 2000, 2005, and 2010, and am particularly interested in estimating values for 2013 and 2014 (strictly extrapolation, since those years lie beyond my last observation). Here is an example of the data, where pop_gpw_sum is population and gid identifies the 55x55km cell.
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input long gid int year double pop_gpw_sum
            49182 1990 5.731053
            49182 1991        .
            49182 1992        .
            49182 1993        .
            49182 1994        .
            49182 1995 6.682834
            49182 1996        .
            49182 1997        .
            49182 1998        .
            49182 1999        .
            49182 2000 7.660807
            49182 2001        .
            49182 2002        .
            49182 2003        .
            49182 2004        .
            49182 2005 8.662068
            49182 2006        .
            49182 2007        .
            49182 2008        .
            49182 2009        .
            49182 2010 9.683701
            49183 1990 8.163606
            49183 1991        .
            49183 1992        .
            49183 1993        .
            49183 1994        .
            49183 1995 9.519372
            49183 1996        .
            49183 1997        .
            49183 1998        .
            end

            Given the considerations involved in linear interpolation, would I proceed by creating observations for 2013 and 2014 and then using

            ipolate pop_gpw_sum year, gen(y1)?

            Thank you!


            • #7
              There are many more possibilities than linear interpolation, which doesn't strike me as the best default for population change! (Remember Malthus.)

              Searching the forum for mentions of mipolate (SSC) will show some. Here, however, I think I would recommend interpolating linearly in logarithms.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input long gid int year double pop_gpw_sum
              49182 1990 5.731053
              49182 1991        .
              49182 1992        .
              49182 1993        .
              49182 1994        .
              49182 1995 6.682834
              49182 1996        .
              49182 1997        .
              49182 1998        .
              49182 1999        .
              49182 2000 7.660807
              49182 2001        .
              49182 2002        .
              49182 2003        .
              49182 2004        .
              49182 2005 8.662068
              49182 2006        .
              49182 2007        .
              49182 2008        .
              49182 2009        .
              49182 2010 9.683701
              49183 1990 8.163606
              49183 1991        .
              49183 1992        .
              49183 1993        .
              49183 1994        .
              49183 1995 9.519372
              49183 1996        .
              49183 1997        .
              49183 1998        .
              end
              
              * interpolate (and, with -epolate-, extrapolate) linearly on the log scale
              gen log_pop = log(pop_gpw_sum)
              ipolate log_pop year, by(gid) epolate gen(pop_int)
              * back-transform to the population scale
              replace pop_int = exp(pop_int)
              
              * compare observed and interpolated values by cell
              scatter pop_gpw_sum pop_int year, ms(Oh +) by(gid)
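
              Note that -ipolate- (with or without -epolate-) only fills in existing observations, so to obtain 2013 and 2014 you would first need rows for 2011-2014, added before the steps above. One possible sketch (assuming, as in the full data, that each gid is observed annually through 2010):
              Code:
              * add one empty row per gid for each of 2011-2014
              tempfile orig
              save `orig'
              keep gid
              duplicates drop
              expand 4
              bysort gid: gen int year = 2010 + _n
              append using `orig'
              sort gid year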

              • #8
                Dear Nick Cox, can you please tell me why you preferred to interpolate in logarithms rather than directly in the given values? I want to understand this method so that I can decide whether it can be applied to my missing values for population, number of schools, number of employed persons, and GDP at the county level.
                Thank you.

                • #9
                  Population growth is expected to be exponential as a first approximation, not linear. This is in essence what Malthus pointed out, although he wasn't the first to say so. Therefore I'd interpolate linearly in the logarithms. This doesn't have to be an article of faith, and one can and should check patterns in the data you have, but even if -- as is common -- the percent increase (or even decrease) is fluctuating over time, you'd still expect exponential change locally to be a better approximation than linear.
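
                  For intuition, compare the two kinds of midpoint for a made-up population that doubles from 100 to 200 over a decade:
                  Code:
                  * linear midpoint vs. geometric (log-scale) midpoint of 100 and 200
                  display "linear:    " (100 + 200)/2
                  display "geometric: " exp((log(100) + log(200))/2)

                  Linear interpolation gives 150; interpolating in logarithms gives about 141.4, the value consistent with a constant growth rate between the two years.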

                  More at https://www.springer.com/gp/book/9780857291141 -- which I own but haven't read yet.

                  • #10
                    Thank you so much, Nick Cox! That was really helpful.
