  • Errors when interpolating panel data using mipolate

    Hello, Statalist,

    Long-time listener, first-time caller, here.

    I have panel data at the U.S. county x year level. Several variables (time-varying county characteristics, namely the percent of the population in a particular age bucket, e.g., percent of population aged 5-14) are observed only by decade (in 1940, 1950, 1960, 1970, and 1980) because they come from the Decennial Census. I would like to interpolate values for the years between decades, within county. I have successfully generated linear interpolations using ipolate. However, I would like to try other functional forms, so I would like to interpolate using mipolate and its pchip, spline, and cubic options. With almost every one of these options, I run into a different problem.

    Pchip gives me this error:
    Code:
    . mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_pchip) pchip
               pchipslopes():  3201  vector required
                     pchip():     -  function returned error
                pchipolate():     -  function returned error
                     <istmt>:     -  function returned error

    When I run cubic, it interpolates for some decades, but not all decades. See example scatterplot for Connecticut counties. (A separate, less important problem: the way that the green points are connected with a line seems to be screwy.)

    Code:
    . mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_cubic) cubic
    (194535 missing values generated)
    
    
    . twoway connected percent5_14_i_cubic year if statefip==9, ms(+) sort || scatter percent5_14 year if
    > statefip==9, ///
    > legend(order(1 "guessed" 2 "known"))  xtitle("") yla(, ang(h)) ytitle("Percent of Pop., Age 5-14")  
    > name( cubicp, replace)

    [Image: percent_5_14_i_cubic CT.png]


    Spline seems to run successfully:

    Code:
    mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_spline) spline
    
    
    twoway connected percent5_14_i_spline year if statefip==9, ms(+) sort || scatter percent5_14 year if statefip==9, ///
    legend(order(1 "guessed" 2 "known"))  xtitle("") yla(, ang(h)) ytitle("Percent of Pop., Age 5-14") name( spline, replace)
    graph export "/home/nwb/hcproductivity/13_2/percent_5_14_i_spline CT.png", as(png) replace

    [Image: percent_5_14_i_spline CT.png]


    I am running Stata 14.2 MP on a Linux server.

    Any advice and help to get me out of this jam, I would greatly appreciate!

    Thanks,
    Nate


    Code:
    foreach i in percent5_14 percent1524 percent2534 percent3544 percent4554 percent5564 percent6574 percent75 {
        bysort fcounty1: ipolate `i' year, gen(i`i')
    }
    
    
    xtset fcounty1 year
    format year %ty
    
    sort fcounty1 year
    mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_spline) spline
    mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_pchip) pchip
    mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_cubic) cubic
    
    set scheme s1color
    
    twoway connected percent5_14_i_spline year if statefip==9, ms(+) sort || scatter percent5_14 year if statefip==9, ///
    legend(order(1 "guessed" 2 "known"))  xtitle("") yla(, ang(h)) ytitle("Percent of Pop., Age 5-14") name( spline, replace)
    graph export "/home/nwb/hcproductivity/13_2/percent_5_14_i_spline CT.png", as(png) replace
    
    twoway connected percent5_14_i_pchip year if statefip==9, ms(+) sort || scatter percent5_14 year if statefip==9, ///
    legend(order(1 "guessed" 2 "known"))  xtitle("") yla(, ang(h)) ytitle("Percent of Pop., Age 5-14") name( pchip, replace)
    graph export "/home/nwb/hcproductivity/13_2/percent_5_14_i_pchip CT.png", as(png) replace
    
    twoway connected percent5_14_i_cubic year if statefip==9, ms(+) sort || scatter percent5_14 year if statefip==9, ///
    legend(order(1 "guessed" 2 "known"))  xtitle("") yla(, ang(h)) ytitle("Percent of Pop., Age 5-14")  name( cubicp, replace)
    graph export "/home/nwb/hcproductivity/13_2/percent_5_14_i_cubic CT.png", as(png) replace
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double fcounty1 float(year percent5_14 statefip)
       0    .         . .
    1000    .         . .
    1001 1927         . 1
    1001 1928         . 1
    1001 1929         . 1
    1001 1930         . 1
    1001 1931         . 1
    1001 1932         . 1
    1001 1933         . 1
    1001 1934         . 1
    1001 1935         . 1
    1001 1936         . 1
    1001 1937         . 1
    1001 1938         . 1
    1001 1939         . 1
    1001 1940 .23730753 1
    1001 1941         . 1
    1001 1942         . 1
    1001 1943         . 1
    1001 1944         . 1
    1001 1945         . 1
    1001 1946         . 1
    1001 1947         . 1
    1001 1948         . 1
    1001 1949         . 1
    1001 1950  .2252282 1
    1001 1951         . 1
    1001 1952         . 1
    1001 1953         . 1
    1001 1954         . 1
    1001 1955         . 1
    1001 1956         . 1
    1001 1957         . 1
    1001 1958         . 1
    1001 1959         . 1
    1001 1960  .2381664 1
    1001 1961         . 1
    1001 1962         . 1
    1001 1963         . 1
    1001 1964         . 1
    1001 1965         . 1
    1001 1966         . 1
    1001 1967         . 1
    1001 1968         . 1
    1001 1969         . 1
    1001 1970 .25179887 1
    1001 1971         . 1
    1001 1972         . 1
    1001 1973         . 1
    1001 1974         . 1
    1001 1975         . 1
    1001 1976         . 1
    1001 1977         . 1
    1001 1978         . 1
    1001 1979         . 1
    1001 1980 .19021048 1
    1001 1981         . 1
    1001 1982         . 1
    1001 1983         . 1
    1001 1984         . 1
    1001 1985         . 1
    1001 1986         . 1
    1001 1987         . 1
    1001 1988         . 1
    1001 1989         . 1
    1001 1990         . 1
    1001 1991         . 1
    1001 1992         . 1
    1001 1993         . 1
    1001 1994         . 1
    1001 1995         . 1
    1001 1996         . 1
    1001 1997         . 1
    1001 1998         . 1
    1001 1999         . 1
    1001 2000         . 1
    1001 2001         . 1
    1001 2002         . 1
    1001 2003         . 1
    1001 2004         . 1
    1001 2005         . 1
    1001 2006         . 1
    1001 2007         . 1
    1003 1927         . 1
    1003 1928         . 1
    1003 1929         . 1
    1003 1930         . 1
    1003 1931         . 1
    1003 1932         . 1
    1003 1933         . 1
    1003 1934         . 1
    1003 1935         . 1
    1003 1936         . 1
    1003 1937         . 1
    1003 1938         . 1
    1003 1939         . 1
    1003 1940  .2143299 1
    1003 1941         . 1
    1003 1942         . 1
    1003 1943         . 1
    end

  • #2
    mipolate is from SSC, as you are asked to explain: FAQ Advice #12 asks you to state the provenance of any community-contributed commands you refer to.

    Thanks for your data example.

    The easiest problem to explain here is with twoway connected:

    the way that the green points are connected with a line seems to be screwy.
    Not so: you've selected one state and so in this example several counties and plotted data for several counties against year and insisted on a sort (by year), when your data and interpolation were separately by county. So, different values for different counties and the same year are connected up and down in most cases.
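    A minimal sketch of one way to keep the lines within counties, using made-up toy data (the county codes and values below are hypothetical, not from the thread). It assumes the data are sorted by county and then year, and that the graph's sort option is left off so that order is preserved:

```stata
* Toy data: two hypothetical counties, three years each
clear
input fcounty1 year y
9001 1940 .20
9001 1950 .22
9001 1960 .21
9003 1940 .25
9003 1950 .24
9003 1960 .26
end

* Sort by county, then year; do NOT add the sort option to the graph,
* which would re-sort by year alone and mix the counties together
sort fcounty1 year

* connect(L) joins successive points only while year is ascending,
* so the line breaks where one county ends and the next begins
twoway connected y year, connect(L) ms(+) xtitle("") yla(, ang(h))
```

    With many counties on one graph this can still be hard to read, so it is a starting point rather than a finished plot.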

    In your data example only the data for county 1001 are any use. mipolate can't perform miracles, but there is no bug obvious here.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double fcounty1 float(year percent5_14 statefip)
    1001 1927         . 1
    1001 1928         . 1
    1001 1929         . 1
    1001 1930         . 1
    1001 1931         . 1
    1001 1932         . 1
    1001 1933         . 1
    1001 1934         . 1
    1001 1935         . 1
    1001 1936         . 1
    1001 1937         . 1
    1001 1938         . 1
    1001 1939         . 1
    1001 1940 .23730753 1
    1001 1941         . 1
    1001 1942         . 1
    1001 1943         . 1
    1001 1944         . 1
    1001 1945         . 1
    1001 1946         . 1
    1001 1947         . 1
    1001 1948         . 1
    1001 1949         . 1
    1001 1950  .2252282 1
    1001 1951         . 1
    1001 1952         . 1
    1001 1953         . 1
    1001 1954         . 1
    1001 1955         . 1
    1001 1956         . 1
    1001 1957         . 1
    1001 1958         . 1
    1001 1959         . 1
    1001 1960  .2381664 1
    1001 1961         . 1
    1001 1962         . 1
    1001 1963         . 1
    1001 1964         . 1
    1001 1965         . 1
    1001 1966         . 1
    1001 1967         . 1
    1001 1968         . 1
    1001 1969         . 1
    1001 1970 .25179887 1
    1001 1971         . 1
    1001 1972         . 1
    1001 1973         . 1
    1001 1974         . 1
    1001 1975         . 1
    1001 1976         . 1
    1001 1977         . 1
    1001 1978         . 1
    1001 1979         . 1
    1001 1980 .19021048 1
    1001 1981         . 1
    1001 1982         . 1
    1001 1983         . 1
    1001 1984         . 1
    1001 1985         . 1
    1001 1986         . 1
    1001 1987         . 1
    1001 1988         . 1
    1001 1989         . 1
    1001 1990         . 1
    1001 1991         . 1
    1001 1992         . 1
    1001 1993         . 1
    1001 1994         . 1
    1001 1995         . 1
    1001 1996         . 1
    1001 1997         . 1
    1001 1998         . 1
    1001 1999         . 1
    1001 2000         . 1
    1001 2001         . 1
    1001 2002         . 1
    1001 2003         . 1
    1001 2004         . 1
    1001 2005         . 1
    1001 2006         . 1
    1001 2007         . 1
    end
    
    set scheme s1color 
    
    foreach m in pchip cubic spline { 
       mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_`m') `m' 
       twoway line *`m'  year || scatter percent5_14 year, name(`m', replace) subtitle(`m') xtitle("") legend(off) 
    }
    
    graph combine pchip cubic spline

    [Image: mipolate_1001.png]


    mipolate can disappoint when the number of data points is very small, but as yet I can't put my finger on bugs.

    I'd need to see a data example that allows reproduction of the other problems you're asserting.



    • #3
      Thanks for taking a look, Nick, and producing those graphs. I am able to reproduce what you shared, running the interpolations on only one county, FIPS code 1001. My data set has over 3,000 counties, so perhaps one or some of the panels is/are causing the problem. To check, I ran the interpolation on each county individually:

      Code:
      xtset fcounty1 year
      format year %ty
      tsfill
      
      levelsof fcounty1, local(levels)
      foreach l of local levels {
          di "FIPS ="
          di `l'
          foreach m in pchip cubic spline {
              capture noisily mipolate percent5_14 year if fcounty==`l', by(fcounty1) gen(percent5_14_i_`m') `m'
              cap drop percent5_14_i_`m'
          }
      }
      I let it run from 1001 to 26000. I ran some descriptives on various counties to look for a pattern. Many of the counties that get the error

      Code:
                 pchipslopes():  3201  vector required
                       pchip():     -  function returned error
                  pchipolate():     -  function returned error
                       <istmt>:     -  function returned error
      r(3201);
      had only three observations of the variable of interest, percent5_14. For example:

      Code:
      . tab year percent5_14 if fcounty1==2070, m
      
                 |                 percent5_14
            year |  .2042894   .2303678   .3061693          . |     Total
      -----------+--------------------------------------------+----------
            1940 |         0          0          0          1 |         1
            1941 |         0          0          0          1 |         1
            1942 |         0          0          0          1 |         1
            1943 |         0          0          0          1 |         1
            1944 |         0          0          0          1 |         1
            1945 |         0          0          0          1 |         1
            1946 |         0          0          0          1 |         1
            1947 |         0          0          0          1 |         1
            1948 |         0          0          0          1 |         1
            1949 |         0          0          0          1 |         1
            1950 |         0          0          0          1 |         1
            1951 |         0          0          0          1 |         1
            1952 |         0          0          0          1 |         1
            1953 |         0          0          0          1 |         1
            1954 |         0          0          0          1 |         1
            1955 |         0          0          0          1 |         1
            1956 |         0          0          0          1 |         1
            1957 |         0          0          0          1 |         1
            1958 |         0          0          0          1 |         1
            1959 |         0          0          0          1 |         1
            1960 |         0          1          0          0 |         1
            1961 |         0          0          0          1 |         1
            1962 |         0          0          0          1 |         1
            1963 |         0          0          0          1 |         1
            1964 |         0          0          0          1 |         1
            1965 |         0          0          0          1 |         1
            1966 |         0          0          0          1 |         1
            1967 |         0          0          0          1 |         1
            1968 |         0          0          0          1 |         1
            1969 |         0          0          0          1 |         1
            1970 |         0          0          1          0 |         1
            1971 |         0          0          0          1 |         1
            1972 |         0          0          0          1 |         1
            1973 |         0          0          0          1 |         1
            1974 |         0          0          0          1 |         1
            1975 |         0          0          0          1 |         1
            1976 |         0          0          0          1 |         1
            1977 |         0          0          0          1 |         1
            1978 |         0          0          0          1 |         1
            1979 |         0          0          0          1 |         1
            1980 |         1          0          0          0 |         1
               . |         0          0          0          1 |         1
      -----------+--------------------------------------------+----------
           Total |         1          1          1         39 |        42
      Counties that ran but had notes saying "note: at least 3 values needed in any interpolation", lo and behold, only had one observation of percent5_14.

      Counties that ran without any message had observations for all five decade-years: 1940, 1950, 1960, 1970, and 1980. For example:

      Code:
      . tab year percent5_14 if fcounty1==1003, m
      
                 |                            percent5_14
            year |  .1705408   .2143299   .2201172   .2204064   .2318489          . |     Total
      -----------+------------------------------------------------------------------+----------
            1940 |         0          1          0          0          0          0 |         1
            1941 |         0          0          0          0          0          1 |         1
            1942 |         0          0          0          0          0          1 |         1
            1943 |         0          0          0          0          0          1 |         1
            1944 |         0          0          0          0          0          1 |         1
            1945 |         0          0          0          0          0          1 |         1
            1946 |         0          0          0          0          0          1 |         1
            1947 |         0          0          0          0          0          1 |         1
            1948 |         0          0          0          0          0          1 |         1
            1949 |         0          0          0          0          0          1 |         1
            1950 |         0          0          0          1          0          0 |         1
            1951 |         0          0          0          0          0          1 |         1
            1952 |         0          0          0          0          0          1 |         1
            1953 |         0          0          0          0          0          1 |         1
            1954 |         0          0          0          0          0          1 |         1
            1955 |         0          0          0          0          0          1 |         1
            1956 |         0          0          0          0          0          1 |         1
            1957 |         0          0          0          0          0          1 |         1
            1958 |         0          0          0          0          0          1 |         1
            1959 |         0          0          0          0          0          1 |         1
            1960 |         0          0          0          0          1          0 |         1
            1961 |         0          0          0          0          0          1 |         1
            1962 |         0          0          0          0          0          1 |         1
            1963 |         0          0          0          0          0          1 |         1
            1964 |         0          0          0          0          0          1 |         1
            1965 |         0          0          0          0          0          1 |         1
            1966 |         0          0          0          0          0          1 |         1
            1967 |         0          0          0          0          0          1 |         1
            1968 |         0          0          0          0          0          1 |         1
            1969 |         0          0          0          0          0          1 |         1
            1970 |         0          0          1          0          0          0 |         1
            1971 |         0          0          0          0          0          1 |         1
            1972 |         0          0          0          0          0          1 |         1
            1973 |         0          0          0          0          0          1 |         1
            1974 |         0          0          0          0          0          1 |         1
            1975 |         0          0          0          0          0          1 |         1
            1976 |         0          0          0          0          0          1 |         1
            1977 |         0          0          0          0          0          1 |         1
            1978 |         0          0          0          0          0          1 |         1
            1979 |         0          0          0          0          0          1 |         1
            1980 |         1          0          0          0          0          0 |         1
      So I did an experiment. I artificially changed the values of percent5_14 to "." for a county that ran fine, like the one right above, and checked if it would run.

      Code:
      . replace percent5_14=. if fcounty1==1003&year==1940
      (1 real change made, 1 to missing)
      
      . do "/tmp/SD04577.000000"
      
      .         foreach m in pchip cubic spline {
        2.                 capture noisily mipolate percent5_14 year if fcounty==1003, by(fcounty1) gen(percent5_14_i_`m') `m'
        3.                 cap drop percent5_14_i_`m'
        4.         }
      (268055 missing values generated)
      (268123 missing values generated)
      (268105 missing values generated)
      And it did run. I then replaced one more observation with missing, so that the county had only three observed values of percent5_14, just like the counties that had problems. But the county still ran:

      Code:
      . replace percent5_14=. if fcounty1==1003&year==1950
      (1 real change made, 1 to missing)
      
      . do "/tmp/SD04577.000000"
      
      .         foreach m in pchip cubic spline {
        2.                 capture noisily mipolate percent5_14 year if fcounty==1003, by(fcounty1) gen(percent5_14_i_`m') `m'
        3.                 cap drop percent5_14_i_`m'
        4.         }
      (268055 missing values generated)
      (268133 missing values generated)
      (268115 missing values generated)
      So I think the number of observations is not what's causing the error.
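      A rough sketch of how this check could be run across all counties at once, rather than one county at a time: record the return code from pchip for each county next to the count of known values, then cross-tabulate. The toy data and the names n_known, rc_pchip, tmp_pchip, and tag are hypothetical, and the sketch assumes mipolate is installed from SSC.

```stata
* Requires mipolate from SSC: ssc install mipolate
* Toy data: county 1 has five known values, county 2 has three
clear
input fcounty1 year percent5_14
1 1940 .21
1 1950 .22
1 1960 .23
1 1970 .22
1 1980 .17
2 1960 .23
2 1970 .31
2 1980 .20
end
xtset fcounty1 year
tsfill, full

* Number of non-missing values of percent5_14 within each county
bysort fcounty1: egen n_known = count(percent5_14)

* Run pchip county by county and store the return code (0 = no error)
gen int rc_pchip = .
levelsof fcounty1, local(levels)
foreach l of local levels {
    capture mipolate percent5_14 year if fcounty1 == `l', gen(tmp_pchip) pchip
    local rc = _rc
    quietly replace rc_pchip = `rc' if fcounty1 == `l'
    capture drop tmp_pchip
}

* One row per county: known values against return code
egen byte tag = tag(fcounty1)
tabulate n_known rc_pchip if tag, missing
```

      If the failing counties do not line up with a particular value of n_known, that would support the idea that something other than the count of known values is involved.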

      I'm not sure what else to try. I have tried to upload my data set of all counties and years, but Statalist gives me the error "Invalid file data main_13_2 for Statalist 12Aug2019.dta." I tried with a zip and with a dta, and then with a zip of a file containing only observations from one state, and I get the same error.

      As for the problem with twoway connected, I understand your explanation, Nick, thank you. I have been able to plot trends for separate counties by using by(fcounty1), which produces a plot for each county.

      Code:
      preserve
      keep if statefip==9
      set scheme s1color
      sort fcounty1 year
      twoway connected percent5_14_i_spline year if statefip==9, ms(+) sort || scatter percent5_14 year if statefip==9, ///
      legend(order(1 "guessed" 2 "known"))  xtitle("") yla(, ang(h)) ytitle("Percent of Pop., Age 5-14") name( spline, replace) by(fcounty1)
      restore
      But is it possible to plot county-specific trends all on one plot? This article written for Stata suggests that it is, but perhaps this is out of date? Or I am misunderstanding it:

      If you are lucky, you will need to type no more than
      Code:
      . sort baby_id age
      . graph weight age, c(L)
      We are putting the data in order of babies and, within each baby, the age. Then we are connecting the points from left to right. This will work if the youngest age of each baby is younger than the oldest age of the baby that precedes it....
      "How do I connect points only within groups?" by Nicholas Cox
      https://www.stata.com/support/faqs/g...within-groups/



      • #4
        Thanks for your detailed report. There's a bundle of issues here and it's hard to tease them apart.

        Working backwards: I am unsure of what your graphical problem is. Evidently you have a few thousand counties and even looking at individual states still leaves you with a risk of spaghetti. https://www.statalist.org/forums/for...using-linkplot is the most recent summary that I can remember writing. Conversely, the FAQ you cite is flagged "Note: This FAQ is relevant for users of releases prior to Stata 8." which is an understatement for saying it's irrelevant to anyone else (unless they are perversely minded to use graph7).

        If your data example is typical, your use of interpolation is, what shall we say, optimistic, as 5 data points must remain ambiguous: they are consistent with many, many different interpolated patterns. Interpolation necessarily works best with small gaps, a lot of information in the rest of the data, and underlying smooth change.
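        As an illustration of that ambiguity, here is a small sketch that compares linear and spline interpolation for county 1001, using the five known values from the data example above. The variable names p_linear, p_spline, and gap are made up for the sketch, and it assumes mipolate is installed from SSC.

```stata
* Requires mipolate from SSC: ssc install mipolate
* The five known values for county 1001 from the data example
clear
input fcounty1 year percent5_14
1001 1940 .23730753
1001 1950 .2252282
1001 1960 .2381664
1001 1970 .25179887
1001 1980 .19021048
end

* Fill in the years between the census years
xtset fcounty1 year
tsfill

* Linear interpolation (official ipolate) and cubic spline (mipolate)
ipolate percent5_14 year, gen(p_linear)
mipolate percent5_14 year, gen(p_spline) spline

* How far apart are the two guesses in the years between censuses?
gen double gap = abs(p_spline - p_linear)
summarize gap if missing(percent5_14), detail
```

        The size of gap relative to the decade-to-decade changes in the known values gives a rough sense of how much the answer depends on the method rather than on the data.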

        That said, you're replicating my occasional experience that pchip falls over sometimes with small datasets. I have poked at this a few times without ever finding a precise problem.

        Rather unusually the code is my translation from MATLAB code, and I don't really know MATLAB, but I still translated it. There are several possibilities and I don't give here my prior probabilities beyond hinting that they aren't equal.

        1. There is a bug in the MATLAB code.

        2. There is a limitation in the algorithm used in the MATLAB code.

        3. I introduced errors in translating to Mata.

        4. I introduced errors in the surrounding Stata code.

        With your data -- and as yet I still can't see most of it -- I might stick to linear interpolation if obliged to use interpolation. At least that's easy to explain.



        • #5
          The article you point to on trend line spaghetti is very helpful. Thanks for the reference.

          Your point is of course well taken, that any of these methods would use very little information to draw a lot of intermediate points. I just thought I would explore this. I do think I will just use the linear interpolation, if I use this variable.

          Thanks very much for taking all the time to wade through my example, trying it out, and telling me what you think. I appreciate it.

          Also, thanks very much for all of the questions by other people you have answered that I have stumbled upon through Google searches through the years. I have learned a lot.
