Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Filling in a missing year using interpolation?

    I have data by different counties for the years 1981 and 2011. Each variable is in it's own column.

    For example: One column is Population (1981), with values for multiple counties. Another column is Population (2011) with values for those same counties.

    ..............| Pop80 | Pop11 | Pop01
    --------------------------------------------------
    County A | 1234.. | 2234.. | ..?
    --------------------------------------------------
    County B | 9876.. | 10876 | ..?

    Is there some way to fill in an approximation of what the population might have been in 2001?

    I've seen some PDFs about linear interpolation and the tsfill command, but those seem to have a separate column for year, which is not how my data is constructed. Would love to hear any suggestions, and please let me know if my situation needs further clarifying

  • #2
    Well, linear interpolation of population is probably not a very good way to go. But putting that aside, the trick is to first -reshape- your data to the long layout that has a separate variable for year. Then you can apply -ipolate-. The names of your variables, where 80 represents 1980 and 11 represents 2011 adds a slight complication to the process. But here's how it would work:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str9 country float(pop80 pop11)
    "Country A" 1234  2234
    "Country B" 9876 10876
    end
    
    gen pop01 = .
    reshape long pop, i(country) j(yr) string
    destring yr, replace
    recode yr (80 = 1980) (11 = 2011) (1 = 2001), gen(year)
    drop yr
    
    by country, sort:ipolate pop year, gen(pop_interp)
    And then, if you really want your data back in wide layout, you can do that. But remember that almost anything you are going to do with your data in Stata will be easier in long layout anyway, so going back to wide layout is likely to be a waste of time.

    Comment


    • #3
      I agree with Clyde. I'd add, however, that linear interpolation is in principle not the only method. But if you really have just two known points, there aren't many alternatives, except that geometric interpolation seems a more natural first approximation.

      Comment


      • #4
        If you really find the wide layout for the data most usable, then you can calculate the linear interpolation for any year between 1980 and 2011 as a simple function of pop11 and pop80.

        Comment


        • #5
          Let's spell it out. For geometric interpolation, we just interpolate logarithms linearly and then exponentiate:

          Code:
          clear 
          input str4 county pop80  pop11 
          A  1234  2234  
          B  9876  10876 
          test 42 42 
          end 
          gen pop01_lin = round(pop80 + (21/31) * (pop11 - pop80)) 
          gen pop01_geo = round(exp(ln(pop80) + (21/31) * (ln(pop11) - ln(pop80)))) 
          
          list 
          
               +----------------------------------------------+
               | county   pop80   pop11   pop01_~n   pop01_~o |
               |----------------------------------------------|
            1. |      A    1234    2234       1911       1845 |
            2. |      B    9876   10876      10553      10543 |
            3. |   test      42      42         42         42 |
               +----------------------------------------------+

          Comment


          • #6
            Thanks all for your help!

            Originally posted by Nick Cox View Post
            Let's spell it out. For geometric interpolation, we just interpolate logarithms linearly and then exponentiate:

            Code:
            clear
            input str4 county pop80 pop11
            A 1234 2234
            B 9876 10876
            test 42 42
            end
            gen pop01_lin = round(pop80 + (21/31) * (pop11 - pop80))
            gen pop01_geo = round(exp(ln(pop80) + (21/31) * (ln(pop11) - ln(pop80))))
            
            list
            
            +----------------------------------------------+
            | county pop80 pop11 pop01_~n pop01_~o |
            |----------------------------------------------|
            1. | A 1234 2234 1911 1845 |
            2. | B 9876 10876 10553 10543 |
            3. | test 42 42 42 42 |
            +----------------------------------------------+
            When I use those commands, it doesn't give me any decimals - just integers. Is there any way around this?

            Also, one variable I'm trying to calculate is students per capita. Would it be best to use this code on the population and the student pop measures, and then derive students per capita from the resulting 2001 variables? Or would it be better to just calculate students per capita for 1980 and 2011, then use those to derive 2001 population using the syntax you provided?

            Thanks again for your help!

            Comment


            • #7
              That's because I used round(); just omit it if you want the decimal places!

              I would just interpolate on students per capita. It's simpler that way.

              Comment

              Working...
              X