Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replacing missing value by linear intervals between tow points across time.

    Dear Statalist,

    I am current facing a bit of dilemma with missing variables. I have merge two dataset, with my original dataset having employment and the second having mean disposable income.

    My observation are separated by country, age categories and years across 29 nations. However, mean disposable income data is only available for some countries and some of the years. For example, I only have the mean disposable income for age groups (4 in total) in Australia for 2000, 2004, 2008, 2010, 2012 and 2014. Whilst for Denmark I have data for the years 2000 and then 2005-2014. Since these gaps tend to be only a few years (and thus the mean income difference isn't massive) that I could input mean wages by linear intervals (so if in 2000 the mean income was 1900 euros and in 2002 it was 2100, the mean income estimated for 2001 would be 2000). I hope this description is clear. I have attached my data set below:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double country_n float(age_groups year) double(m_employmentratio converted_inc_2)
    1 1  1       61.905171585 18383.639404296875
    1 2  1        75.74068471      17202.3203125
    1 3  1        78.66788015    18233.490234375
    1 4  1  54.09412395333334    16983.509765625
    1 1  2       61.392848445                  .
    1 2  2  75.91077349666666                  .
    1 3  2        78.38746349                  .
    1 4  2  54.60154921666666                  .
    1 1  3 61.208336079999995                  .
    1 2  3  76.02173251666666                  .
    1 3  3        79.07494859                  .
    1 4  3  55.98275992333333                  .
    1 1  4       61.894535415                  .
    1 2  4  76.63595531666667                  .
    1 3  4       79.375593615                  .
    1 4  4        57.40405967                  .
    1 1  5       62.136433595    21524.943359375
    1 2  5  76.73670893666667    21407.939453125
    1 3  5       79.519111255    21940.904296875
    1 4  5 58.573308759999996              19854
    1 1  6 63.170592385000006                  .
    1 2  6  77.87832071000001                  .
    1 3  6       80.937467405                  .
    1 4  6  60.31463328666666                  .
    1 1  7 63.493052445000004                  .
    1 2  7  78.43671636666667                  .
    1 3  7        81.06239939                  .
    1 4  7        61.89366863                  .
    1 1  8       63.896869845                  .
    1 2  8  79.28298021666667                  .
    1 3  8       81.826284695                  .
    1 4  8  63.00212025666667                  .
    1 1  9       64.198685575    30072.904296875
    1 2  9        79.70918984    29797.384765625
    1 3  9  82.06593212499999          30276.875
    1 4  9  63.93898066666666    29603.291015625
    1 1 10       60.697728655                  .
    1 2 10        78.14557024                  .
    1 3 10        81.02484167                  .
    1 4 10  65.10186506000001                  .
    1 1 11       59.934069255    38245.728515625
    1 2 11  78.33510748666667     37506.33984375
    1 3 11       81.161630255      37909.0390625
    1 4 11        66.59804761      36584.0078125
    1 1 12 59.755601799999994                  .
    1 2 12  79.18128143333333                  .
    1 3 12        81.01241693                  .
    1 4 12  66.88963640666667                  .
    1 1 13        58.92321488     46610.48828125
    1 2 13        78.83483037     47564.55859375
    end
    label values country_n country1
    label def country1 1 "Australia", modify
    label values age_groups age_groups_lbl
    label def age_groups_lbl 1 "15-24", modify
    label def age_groups_lbl 2 "26-39", modify
    label def age_groups_lbl 3 "40-49", modify
    label def age_groups_lbl 4 "50-65", modify
    label values year year_n
    label def year_n 1 "2000", modify
    label def year_n 2 "2001", modify
    label def year_n 3 "2002", modify
    label def year_n 4 "2003", modify
    label def year_n 5 "2004", modify
    label def year_n 6 "2005", modify
    label def year_n 7 "2006", modify
    label def year_n 8 "2007", modify
    label def year_n 9 "2008", modify
    label def year_n 10 "2009", modify
    label def year_n 11 "2010", modify
    label def year_n 12 "2011", modify
    label def year_n 13 "2012", modify
    Thank you,

    Hugo

  • #2
    This is generally called linear interpolation and the magic word is

    Code:
    help ipolate
    Sometimes you can do better, e.g. it may be a better idea to take logarithms, interpolate on that scale, and then exponentiate; or use cubic polynomials, cubic splines, etc.

    The implications for statistical analysis are tricky. Once you've interpolated Stata won't hold that against you, but there is a strong sense in which the number of degrees of freedom are fewer than you or Stata thinks. Goodness knows how would be evaluated except perhaps through simulation.

    (There is an odd division of labour whereby interpolation almost never appears in statistics courses or texts, but is regarded as entirely within the province of numerical analysis.)
    Last edited by Nick Cox; 22 Oct 2021, 09:48.

    Comment


    • #3
      Thank you Nick, that was the all the information I needed, managed to fill in the blanks from there. I had toyed with the idea of logarithms but for the this dataset, linear jumps work just fine.

      Should someone come across this and need the full syntax I used:

      Code:
      ipolate converted_inc_2 year, gen(con_inc) epolate by(country_n age_groups)

      Comment

      Working...
      X