Creating an index with missing data in time series

Ryan Marapin

Join Date: Apr 2017
Posts: 28

Creating an index with missing data in time series

09 Apr 2019, 09:48

Dear Statalist forum,

I would like to ask the following. I have a time series with missing data for certain years, and I would like to calculate the growth rate for the variable median for each year based on a specific base year (the first year for which data for median is available). In other words, I would like create an index series for the median variable (base year=100). I have calculated the growth rate of median using log first differences, where the code looks as follows:

Code:

gen logmedian=log(median)
gen growthmedian= logmedian[_n] - logmedian[_n-1] ///
if country[_n] == country[_n-1]

Which gives me the growth rate of median illustrated in the data example below.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str38 country double year float PPPrealgdi double(gdpcap hhexp median) float(logmedian growthmedian)
"Brazil" 1981             .                 .                .            1531.68  7.33412           .
"Brazil" 1982             .                 .                . 1541.3999999999999 7.340446  .006326199
"Brazil" 1983             .                 .                .            1266.72 7.144186  -.19626045
"Brazil" 1984             .                 .                . 1275.3600000000001 7.150984   .00679779
"Brazil" 1985             .                 .                . 1454.6399999999999 7.282514   .13152981
"Brazil" 1986             .                 .                .            2435.64 7.797965   .51545143
"Brazil" 1987             .                 .                . 1858.3200000000002 7.527428   -.2705369
"Brazil" 1988             .                 .                .            1741.08 7.462261  -.06516743
"Brazil" 1989             .                 .                .            1877.16 7.537515   .07525444
"Brazil" 1990             . 10344.86438961035 720147699317.049 1616.3999999999999 7.387957  -.14955854
"Brazil" 1991             . 10319.92933521456  779750705063.36                  .        .           .
"Brazil" 1992             . 10099.77561998418 774717151172.519            1625.52 7.393583           .
"Brazil" 1993             . 10398.16035761199 809876320756.008            1670.88 7.421106   .02752304
"Brazil" 1994             . 10776.43311211532 870158244588.953                  .        .           .
"Brazil" 1995   8.49598e+11  11072.7252903522 945180706038.132 2219.3999999999996 7.704992           .
"Brazil" 1996  8.860042e+11 11137.58926300037 974140377103.669            2253.48 7.720231  .015238762
"Brazil" 1997  9.292225e+11 11334.82075340726 1003685800287.16            2228.88 7.709254  -.01097679
"Brazil" 1998  9.614585e+11 11197.70193788836 996459125418.385            2311.44 7.745626  .036371708
"Brazil" 1999  9.802891e+11 11081.65462053813 1000228307626.96            2222.04 7.706181  -.03944492
"Brazil" 2000  9.933006e+11 11370.97093925103 1040563838430.28                  .        .           .
"Brazil" 2001   9.76703e+11 11368.46644957097 1048589775648.83            2529.36 7.835721           .
"Brazil" 2002 1.0052062e+12 11559.56059660686 1062419071054.17               2562 7.848544   .01282215
"Brazil" 2003  1.016476e+12 11541.97503234461 1056618339648.85 2492.2799999999997 7.820953 -.027590275
"Brazil" 2004 1.1615509e+12  12058.2077119511 1098074697769.69            2569.32 7.851397   .03044319
"Brazil" 2005 1.1970837e+12 12298.58794846535 1146630208256.88 2717.7599999999998 7.907563   .05616665
"Brazil" 2006 1.2698414e+12 12643.56500898064 1207226170587.02            2960.64 7.993161   .08559752
"Brazil" 2007  1.374465e+12 13268.41390244003 1284202177161.98            3172.32 8.062219   .06905794
"Brazil" 2008 1.4525213e+12 13802.81799242376 1367216409264.71            3394.44 8.129894   .06767559
"Brazil" 2009 1.5419716e+12 13649.86584029567 1428144990218.13            3544.68 8.173203   .04330921
"Brazil" 2010  1.787638e+12 14537.56719309033 1517109456849.07                  .        .           .
"Brazil" 2011 1.9028176e+12 14973.09847367468 1590210762172.73            3903.12 8.269531           .
"Brazil" 2012 2.0173075e+12 15119.91213817678  1645859403527.4            4244.76  8.35344   .08390903
"Brazil" 2013 2.1333167e+12 15432.89363081413  1702987902524.9             4472.4 8.405681   .05224037
"Brazil" 2014 2.2029756e+12  15374.2615071807 1741310567302.85            4621.08 8.438384    .0327034
"Brazil" 2015 2.1500094e+12 14702.59126803164 1685301439554.04 4398.4800000000005 8.389014  -.04936981
"Brazil" 2016             . 14077.12140855444 1612150747594.78                  .        .           .
"Brazil" 2017             . 14103.45153134036 1627686451678.68                  .        .           .
end

Having calculated my growth rate, I am not sure now how to proceed to obtain the index series for median, one that also takes into account the missing data for the growthmedian variable. If anyone has any suggestions on how to do this, I would greatly appreciate it.

Thank you for your time.

Best,

Ryan

Last edited by Ryan Marapin; 09 Apr 2019, 09:52.

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30192
#2

09 Apr 2019, 12:46

An index series is just gotten by dividing the current value by the first value, and multiplying by 100--but you know that already. Where the value of median is missing, the value of the series must also be missing. So the only issue presented here is identifying the earliest non-missing value of median. In the example data you give, that's just the first value. I don't know how representative that example is. Perhaps for other countries the first (and perhaps some immediately after that) are missing.

So you can better see and understand how the code works, I have written it to generate an index for the variable gdpcap, as that one actually has some missing values at the beginning of the series. You can adapt it by just globally replacing gdpcap with median.

Code:

gen byte missing_value = !missing(gdpcap) by country missing_value (year), sort: gen first_nm_value = gdpcap[1] by country (first_nm_value), sort: replace first_nm_value = first_nm_value[1] sort country year gen gdpcap_index = 100*gdpcap/first_nm_value

That said, I want to take issue with your definition of the growthmedian variable. The formula you are relying on, log(x_t) - log(x_t_-1) is just an approximation to the actual growth, which is (x_t/x_t-1)-1. It is a fairly decent approximation if the growth rate is small, but if the growth rate strays appreciably from zero, you get increasingly inaccurate results. Not only that, it is much more computationally intensive. I do not understand why anybody uses or teaches this approximation. It might have made sense in the pre-calculator pre-computer era when dividing two numbers was more laborious than looking up and subtracting logarithms by hand. But in the modern era this is a pointlessly inaccurate and inefficient formula.
Comment
Ryan Marapin

Join Date: Apr 2017

Posts: 28
#3

09 Apr 2019, 13:42

Dear Mr. Schechter,

Thank you very much for your reply, the code worked perfectly. Regarding your point about estimating growth rates using logarithms, I do this for the following reasons. From what I learned, this enables symmetry: a 10% increase followed by 10% decrease brings you back to where you started, which is not the case when using the (x_t/x_t-1)-1 formula. Moreover, I believe it is also more feasible to use as it is a continuously compounded growth rate, which I think is better to capture growth rates of economic variables?

I am happy that you mention this ofcourse, because I am not too sure which of the two methods are better, in the end I am just using what is taught to me. You raise a good point however, I will consider using this technique instead if that is more accurate than taking logs. Thank you very much.

Best,

Ryan Marapin
Comment
Ryan Marapin

Join Date: Apr 2017

Posts: 28
#4

10 Apr 2019, 19:00

Dear Mr. Schechter,

I would like to ask a follow up question regarding this same time series. I would also like to estimate compound annual growth rates for these variables, where in the case where there are missing data for a variable, stata still calculates a value for CAGR median, using data for the years for which there is data available. For example, if there is data on median for 2010 and 2013, it calculates the CAGR for 2011,2012 ,and 2013.

I thank you for your time.

Best,

Ryan Marapin
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30192

12 Apr 2019, 09:40

You can derive this from the index, provided you also compute the value of the first year in which there is non-missing data. So the whole thing looks like this:

Code:

gen byte missing_value = !missing(gdpcap)
by country missing_value (year), sort: gen first_nm_value = gdpcap[1]
by country missing_value (year): gen first_nm_year = year[1]
by country (first_nm_value), sort: replace first_nm_value = first_nm_value[1]
by country (first_nm_value): replace first_nm_year = first_nm_year[1]
sort country year
gen gdpcap_index = 100*gdpcap/first_nm_value

gen gdpcap_comp_growth = (gdpcap_index/100)^(1/(year-first_nm_year)) - 1

The italicized lines of code are added to the original code. The resulting variable presents growth as a rate, i.e. as 0.04, not 4%. If you want it as a percentage, just multiply the result by 100.

Comment

Ryan Marapin

Join Date: Apr 2017

Posts: 28
#6

03 Jun 2019, 11:10

Dear Mr. Schechter,

Sorry for the late reply, I have taken a look at the code and it works perfectly thanks!

Best,

Ryan Marapin
Comment

Announcement

Creating an index with missing data in time series

Comment

Comment

Comment

Comment

Comment