Cross-Sectional regression to forecast earnings

Markus Dietz

Join Date: Aug 2017

Posts: 16
#1

Cross-Sectional regression to forecast earnings

21 Aug 2017, 02:55

Hello everyone,

I’m working on a project which aims to use a cross-sectional model based approach to forecast earnings for years t+1, t+2 and t+3. The model is based on Hou et al. (2012) and looks like this:
E_i,t+_τ⁼^α₀⁺^α₁^A_i,t⁺^α₂^D_i,t⁺^α₃^DD_i,t⁺^α₄^E_i,t⁺^α₅^NegE_i,t⁺^α₆^AC_i,t⁺^ε_i,t+_τ
E_i,t+τ = earnings of firm i in year t+τ (τ = 1 to3)

^A_i,t= total assets

^D_i,t= dividend payment

^DD_i,t= dummy variable that equals 1 for dividend payers and 0 otherwise

^NegE_i,t = Dummy variable that equals 0 for firms with negative earnings and 1 otherwise

^AC_i,t= Accruals

All explanatory variables are measures as of year t.
The actual forecasting is a two-step process:
For each year, starting in 2007, the model above should estimate the pooled cross-sectional regressions using the previous 5 years of data.

In the next step the coefficients are being used to forecast earnings for years t+1 ,t+2 and t+3.

For example, if 2010 is the year t, data of year 2005 to 2009 are used to calculate the coefficients
for estimating future earnings for year 2011 (year t + 1). Similarly, data from 2004 to 2008
are used to calculate the coefficient estimates that will be used to calculate the forecast for year
2012 (year t + 2). Finally, the estimates of the cross-sectional regression for the years 2003 to
2007 are used to forecast earnings for the year 2013 (year t + 3).
Example derived from Azevedo and Gerhart (2016).

The earnings forecast for up to three years into the future are calculated by multiplying the independent variables as of year t with the coefficients from the pooled regressions estimated using the previous five years of data.

I have a large cross section of data with many companie (identified by id) for each year. Sometimes companies drop out of the sample due to bankruptcy etc.

I first tried to use -rollreg- (user written command) but had no success since my data has gaps.

In my latest approach, I tried:

qui rolling _b, window(5) saving(coeff.dta, replace): reg F1.E_w A_w D_w DD E_w NegE AC_w
It would be great if someone could comment on this since I’m not sure if my approach even is headed into the right direction!

Thanks!

Last edited by Markus Dietz; 21 Aug 2017, 03:54.
Tags: None
Markus Dietz

Join Date: Aug 2017

Posts: 16
#2

24 Aug 2017, 03:10

Is there anybody who could help me with this?

In a more recent paper [Azevedo et al., 2016] I found an equation which makes it maybe more clear what I try to achieve:

E(i,t) = α0 + α1eA(i,t−τ) + α2D(i,t−τ) + α3DD(i,t−τ) + α4E(i,t−τ) + α5NegE(i,t−τ) + α6Ac(i,t−τ) + ^ε_i(i,t) [Equation 1]

(t − τ with τ = 1 to 3) with τ for forecasting years t+1,t+2 and t+3

First, I try to estimate the cross-sectional regressions using a rolling window pooled regression (in-sample) using the previous five years of data. I try to regress the dependent variable earnings (E(i,t)) for all firms (i) in year (t) on the independent variables (x1, x2, · · · , xn) for all the firms (i) in the relevant year (t−τ with τ = 1 to 3). (^ε(i,t)) is the error term for period (t)

Second, I need to forecast earnings (E(i,t+τ)) (out-of-sample) for year (t + τ ). The forecast should be obtained by multiplying the independent variables for each firm (i) of year (t) with the coefficients (α0, α1, α2, · · · , αn) from the pooled regression from Equation 1. The advantage of this approach is that there are no strict survivorship requirements as firms only to have sufficient account data for year (t) to forecast earnings.

Example [from Azevedo et al. 2016, p.6]:

"Assume that 2010 is year (t) and we want to forecast the earnings for 2011 (t+τ with τ = 1). First, we run a pooled regression dependent variable data for the period 2001-2010 (from year t − 9 to year t) on the independent variables for the period 2000-2009 (from year (t − 9 − τ ) to year (t − τ ) with τ = 1) to estimate the regression coefficients. Then, we multiply these coefficients (α0, α1, α2, · · · , αn) on the independent variables (x1, x2, ..., xn) from year 2010 (year = t) to estimate the earnings for 2011 (year t + τ with τ = 1)."

It would be great if someone could comment on how to tackle a problem like this!

Thanks!!
Comment
Markus Dietz

Join Date: Aug 2017

Posts: 16
#3

27 Aug 2017, 03:08

Anyone?
Comment
em lowthian

Join Date: Jun 2017

Posts: 51
#4

27 Aug 2017, 03:29

Markus - I don't have any idea what your model is. But I do find using the code, or your output allows people to be able to understand and answer your question better. Might be worth explaining in the most simplistic terms what you have done and what you want to see. From where I'm sitting this looks very complex and would take an enormous amount of effort to understand this and people answering these questions are volunteers. Do this, but if you fail to get an answer perhaps cross-post on stat-exchange, quora or look on youtube.
Comment

Markus Dietz

Join Date: Aug 2017
Posts: 16

28 Aug 2017, 05:04

You are right em, I tried to cram to much information into the post.

So I think it is a good start to provide an example of my data. I cleaned all variables I don't need for the regression and provide only the first 100 observations.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int Year float(FFF DD NegE id Earnings_w Dividends_w Accruals_w TotalAssets_w)
2000 . 1 0  1  145000  44140.06    31000  1573000
2001 . 1 0  1  123000  45293.79   -57000  1707000
2002 . 1 0  1  125000   47430.6  -128000  1609000
2003 . 1 0  1  154000   54828.8   -97000  1689000
2004 . 1 0  1  171000  56614.95   -57000  1711000
2005 . 1 0  1  179000   77348.5   -21000  2111000
2006 . 1 0  1  259524     83804   -49206  2251528
2007 1 1 0  1  223976     84408    88431  2363250
2008 1 1 1  1  -23818     35366   -41511  1946498
2009 1 1 0  1   64524     35366   -67391  1661938
2010 1 1 0  1  115907   44207.5   256813  1777113
2011 1 1 0  1   55807   43832.5   -27904  1822692
2012 1 1 0  1   78252  43478.75   -42612  1740368
2013 1 1 0  1   80453   43482.5   -84965  1531712
2014 1 1 0  1   94668   46961.1   -20448  1823535
2015 1 1 0  1  141439     52179   -48960  1921206
2003 . 0 0  2   26184         .   -51569  3219480
2004 . 0 0  2   33670         .   -70428  3561096
2005 . 0 0  2   49938         .  -217817  5040470
2006 . 0 0  2  191351         .  -255000  5482955
2007 . 1 0  2  170229     62700  -323760  7647286
2008 . 1 0  2  157020     62700  -532865  9626986
2009 . 1 0  2  161457     57000  -953640  9479606
2010 . 1 0  2  174857     62700  -515560 10167808
2011 . 1 0  2  194995     63135  -306151 10212330
2012 . 1 0  2   60631   20617.6  -208170  9940072
2013 . 1 0  2  113558     46170  -577162 10343502
2014 . 1 0  2  127969     51300  -686644  9997414
2015 . 1 0  2  156286     66690 -1096362 10436940
2008 . 0 0  3   33274         .   -38700  1101928
2009 . 0 0  3   31192         .   -88094   816687
2010 . 0 0  3   47181         .   -28232   808211
2011 . 1 0  3   86020     52896   -18486   850104
2012 0 1 0  3   71312   21158.4   -46111   854304
2013 0 1 0  3   56028   21158.4   -66206   906199
2014 0 1 0  3   59212   42316.8   -35939  1052512
2015 0 1 0  3   42697   42316.8   -67181  1075303
2002 . 0 0  4    9180         .   -31458   130478
2003 . 0 0  4    6449         .   -31378   272825
2004 . 0 0  4   25049         .   -48174   799876
2005 . 0 0  4   33405         .    52249   972413
2006 . 1 0  4   77186     19800   -61921  1512604
2007 . 0 0  4   40376         0    52668  2845265
2008 . 0 1  4  -33881         .  -133559  2697420
2009 . 0 0  4   54193         0   111919  2550053
2010 . 0 1  4 -270478         0  -436766   900347
2012 . 0 0  4  303551         0        .    55518
2006 . 0 0  5    7449         .    -6872    97212
2007 . 1 0  5    8426      4870     4695   132432
2008 . 1 0  5   11031      5357     1241   135013
2009 . 0 0  5    4597         0    -9990   138868
2010 . 0 1  5  -17257         0   -24016   154887
2011 . 0 1  5  -12377         0   -23183   142687
2012 . 0 1  5   -2491         0    -2698   130689
2013 . 0 1  5  -28297         0     4052   131698
2014 . 0 1  5  -27024         0   -19530    95071
2015 . 0 1  5  -13397         0    -5026    88274
2010 . 0 0  6   27868         .   -24196   253603
2011 . 0 0  6   28144         .    13993   283872
2012 . 0 0  6   10630         .    -3847   330385
2013 . 0 0  6   21145         .     1729   429410
2014 . 0 0  6   28894         0     -786   547185
2015 . 0 1  6   -9594         0    -6672   718245
2010 . 0 0  7    6617         0    -3534    48338
2011 . 0 0  7    5273         0    -4109    55964
2012 . 1 0  7    5124   2496.55     4482    55453
2013 . 1 0  7    3647   2496.55    -4175    52639
2014 . 1 0  7    3484   2496.55       23    52003
2015 . 1 0  7    4156    3108.6       71    52970
2000 . 1 0  8   73100     43890   -27832   718445
2001 . 1 0  8   66267     39900   -32026   733097
2002 . 1 0  8   70291     42000   -50634   762538
2003 . 1 0  8   70842     42000   -79527   823709
2004 . 1 0  8   71688     42000  -122665   942954
2005 . 1 0  8   74304     42000   -83712  1181687
2006 . 1 0  8   76785     46200   -64777  1479892
2007 . 1 0  8   87678     52500   -76735  1559660
2008 . 1 0  8   91149     54600   -57247  1735284
2009 . 1 0  8   73361     44100   -82113  1860932
2010 . 1 0  8   75718     42000   -94025  1998496
2011 . 1 0  8   31602     21000  -147311  2150158
2012 . 1 0  8   71890     22008  -107781  2061803
2013 . 1 0  8   73291     27300  -131063  1953937
2014 . 1 0  8   82482     34608  -138109  1892223
2015 . 1 0  8  100434     42000  -127979  1909656
2004 . 0 1  9   -8631         0    -7035    18773
2005 . 0 1  9  -28612         0    -4565    75004
2006 . 0 1  9  -57292         0    -9533   107981
2007 . 0 1  9  -24070         0    -3228   143510
2008 . 0 1  9  -61600         0   -20294   111697
2004 . 0 1 10   -4082         .    -8714   137814
2005 . 0 0 10    2902         0    -7916   136419
2006 . 0 0 10    3298         0     8069   160097
2007 . 0 0 10    2309         0    17860   224168
2008 . 0 1 10  -27558         0   -29127   229999
2009 . 0 1 10  -36698         0   -31435   127016
2000 . 1 0 11   22544 12025.374   -18069   857314
2001 . 1 0 11   37574  12014.04   -48432   845426
2002 . 1 0 11   44325 19845.834   -16742   855988
2003 . 1 0 11   65382 19845.834   -26614   897086
end

What I now try to do is the following: I want to use the variables DD, NegE, Earnings_w, Dividends_w, Accruals_w, TotalAssets_w to estimate coefficents for years t+1 t+2 and t+3.

For the t+1 forecast I need to use the previous 5 years of data lagged by one year L1.

qui rolling _b, window(5) saving(TestSmallt+1, replace): reg Earnings_w L1.Earnings_w L1.Dividends_w L1.DD L1.NegE L1.Accruals_w L1.TotalAssets_w

For the t+2 forecast I need to use the previous 5 years of data lagged by two years L2.

qui rolling _b, window(5) saving(TestSmallt+2, replace): reg Earnings_w L2.Earnings_w L2.Dividends_w L2.DD L2.NegE L2.Accruals_w L2.TotalAssets_w

For the t+3 forecast I need to use the previous 5 years of data lagged by two years L3.

qui rolling _b, window(5) saving(TestSmallt+3, replace): reg Earnings_w L3.Earnings_w L3.Dividends_w L3.DD L3.NegE L3.Accruals_w L3.TotalAssets_w

Later, the averages of the coefficients (for the t+1, t+2 and t+3 estimates) should be multiplied with the variables in year t to get the forecasted values.

My questions:
1. Is there anything obvious wrong with my approach?
2. Is there a more efficient way than just "rolling"? It takes hours when I use the full data set.
3. How can I combine the three regression operations and store the estimated coefficients for years t+1 t+2 and t+3 in one file?

Thanks

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#6

28 Aug 2017, 13:49

I can't help about the validity of your approach but rangestat (from SSC) is several orders of magnitude faster than rolling at performing regressions over a rolling window of time.
Comment
Markus Dietz

Join Date: Aug 2017

Posts: 16
#7

29 Aug 2017, 03:44

Thank you for the suggestion Robert!

Using he helpfile I tried to build a rolling regression model over a 5 year window. All independent variables in the regression need to be lagged by one period.
Since rangestat does not allow time series operators, how can I solve this problem?

I tried to adjust the interval bounds but this doesn't seem to work

. rangestat (reg) Earnings_w L1.TotalAssets_w L1.Dividends_w L1.DD L1.Earnings_w L1.NegE L1.Accruals_w, interval(Year -4 0) by(id)
factor variables and time-series operators not allowed
r(101);

. rangestat (reg) Earnings_w TotalAssets_w Dividends_w DD Earnings_w NegE Accruals_w, interval(Year -5 1) by(id)
no result for all obs: reg Earnings_w TotalAssets_w Dividends_w DD Earnings_w NegE Accruals_w
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#8

29 Aug 2017, 09:01

Just create new variables with the lagged values. For example:

Code:

gen L1_TotalAssets_w = L1.TotalAssets_w
Comment
Markus Dietz

Join Date: Aug 2017

Posts: 16
#9

01 Sep 2017, 03:00

So I ran a few regressions and the speed of rangestat is just incredible! Thanks to all developers for that!

I ran the following regression without the identifier "by (id)" and the results seem to make sense.

rangestat (reg) Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w , interval(Year -3 0)

Does rangestat account for panel data structure?

However, when I try to run the regression WITH the indentifier "by (id)" for individual companies

rangestat (reg) Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w , interval(Year -3 0) by (id)

I get the following error

no result for all obs: reg Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w

Does someone knows what might cause this?

Also, is there a way to display t-statistics in the output?

Thanks!
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#10

01 Sep 2017, 09:45

If I use the data example you posted in #5 and run what you tried in #9, I get:

Code:

. xtset id Year
       panel variable:  id (unbalanced)
        time variable:  Year, 2000 to 2015, but with a gap
                delta:  1 unit

. foreach v of varlist TotalAssets_w Dividends_w DD Earnings_w NegE Accruals_w {
  2.         qui gen L1_`v' = L1.`v'
  3. }

. 
. rangestat (reg) Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w , interval(Year -3 0) by(id)
no result for all obs: reg Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w

rangestat is just a hyper efficient way to loop over each observation and compute some statistics. If the computed statistic is missing for each observation, you get the error you observed. As explained in the help file, you can always use standard Stata commands to generate results for any given observation. Say we want to manually calculate results for observation 5, the code would look like:

Code:

regress Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w ///
    if id == id[5] & inrange(Year,Year[5]-3,Year[5])

and the results:

Code:

. regress Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w ///
>         if id == id[5] & inrange(Year,Year[5]-3,Year[5])
note: L1_Dividends_w omitted because of collinearity
note: L1_DD omitted because of collinearity
note: L1_NegE omitted because of collinearity

      Source |       SS           df       MS      Number of obs   =         4
-------------+----------------------------------   F(3, 0)         =         .
       Model |  1.6288e+09         3   542916667   Prob > F        =         .
    Residual |           0         0           .   R-squared       =    1.0000
-------------+----------------------------------   Adj R-squared   =         .
       Total |  1.6288e+09         3   542916667   Root MSE        =         0

----------------------------------------------------------------------------------
      Earnings_w |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
L1_TotalAssets_w |  -.0337041          .        .       .            .           .
  L1_Dividends_w |          0  (omitted)
           L1_DD |          0  (omitted)
   L1_Earnings_w |   1.034911          .        .       .            .           .
         L1_NegE |          0  (omitted)
   L1_Accruals_w |  -.3327771          .        .       .            .           .
           _cons |   36270.53          .        .       .            .           .
----------------------------------------------------------------------------------

.

So if you perform the rolling regressions within panels, each regression has at most 4 observations and you have collinearity issues. rangestat will not report regression results if one or more variable is omitted because the results are for a different model.

As for the t-statistics, you can compute them using the following code:

Code:

foreach v in L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w cons {
    gen double t_`v' = b_`v' / se_`v'
}

Last edited by Robert Picard; 01 Sep 2017, 09:48.

Comment

Markus Dietz

Join Date: Aug 2017

Posts: 16
#11

13 Sep 2017, 05:54

Thank you Robert!

so a pooled cross sectional regression on the last 5 years (excluding year t) would look like this:

Code:

rangestat (reg) Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w , interval(Year -5 -1)

Since it is a pooled regression I don't need the by(id) option right?
Does adding the by (Year) option make any sense?

Also: Can I somehow use Newey West t test?
Comment
Markus Dietz

Join Date: Aug 2017

Posts: 16
#12

19 Sep 2017, 12:47

Hello,

does someone know if it is possible to apply Newey West t-statistics when using rangestat for regression?
Comment

Announcement