Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cross-Sectional regression to forecast earnings

    Hello everyone,

    I’m working on a project which aims to use a cross-sectional model based approach to forecast earnings for years t+1, t+2 and t+3. The model is based on Hou et al. (2012) and looks like this:
    Ei,t+τ=α0+α1Ai,t+α2Di,t+α3DDi,t+α4Ei,t+α5NegEi,t+α6ACi,t+εi,t+τ
    • Ei,t+τ = earnings of firm i in year t+τ (τ = 1 to3)
    • Ai,t = total assets
    • Di,t = dividend payment
    • DDi,t = dummy variable that equals 1 for dividend payers and 0 otherwise
    • NegEi,t = Dummy variable that equals 0 for firms with negative earnings and 1 otherwise
    • ACi,t = Accruals
    All explanatory variables are measures as of year t.
    The actual forecasting is a two-step process:
    1. For each year, starting in 2007, the model above should estimate the pooled cross-sectional regressions using the previous 5 years of data.
    1. In the next step the coefficients are being used to forecast earnings for years t+1 ,t+2 and t+3.
    For example, if 2010 is the year t, data of year 2005 to 2009 are used to calculate the coefficients
    for estimating future earnings for year 2011 (year t + 1). Similarly, data from 2004 to 2008
    are used to calculate the coefficient estimates that will be used to calculate the forecast for year
    2012 (year t + 2). Finally, the estimates of the cross-sectional regression for the years 2003 to
    2007 are used to forecast earnings for the year 2013 (year t + 3).
    Example derived from Azevedo and Gerhart (2016).

    The earnings forecast for up to three years into the future are calculated by multiplying the independent variables as of year t with the coefficients from the pooled regressions estimated using the previous five years of data.

    I have a large cross section of data with many companie (identified by id) for each year. Sometimes companies drop out of the sample due to bankruptcy etc.

    I first tried to use -rollreg- (user written command) but had no success since my data has gaps.

    In my latest approach, I tried:

    qui rolling _b, window(5) saving(coeff.dta, replace): reg F1.E_w A_w D_w DD E_w NegE AC_w
    It would be great if someone could comment on this since I’m not sure if my approach even is headed into the right direction!

    Thanks!
    Last edited by Markus Dietz; 21 Aug 2017, 03:54.

  • #2
    Is there anybody who could help me with this?

    In a more recent paper [Azevedo et al., 2016] I found an equation which makes it maybe more clear what I try to achieve:

    E(i,t) = α0 + α1eA(i,t−τ) + α2D(i,t−τ) + α3DD(i,t−τ) + α4E(i,t−τ) + α5NegE(i,t−τ) + α6Ac(i,t−τ) + εi(i,t) [Equation 1]

    (t − τ with τ = 1 to 3) with τ for forecasting years t+1,t+2 and t+3

    First, I try to estimate the cross-sectional regressions using a rolling window pooled regression (in-sample) using the previous five years of data. I try to regress the dependent variable earnings (E(i,t)) for all firms (i) in year (t) on the independent variables (x1, x2, · · · , xn) for all the firms (i) in the relevant year (t−τ with τ = 1 to 3). (ε(i,t)) is the error term for period (t)

    Second, I need to forecast earnings (E(i,t+τ)) (out-of-sample) for year (t + τ ). The forecast should be obtained by multiplying the independent variables for each firm (i) of year (t) with the coefficients (α0, α1, α2, · · · , αn) from the pooled regression from Equation 1. The advantage of this approach is that there are no strict survivorship requirements as firms only to have sufficient account data for year (t) to forecast earnings.

    Example [from Azevedo et al. 2016, p.6]:

    "Assume that 2010 is year (t) and we want to forecast the earnings for 2011 (t+τ with τ = 1). First, we run a pooled regression dependent variable data for the period 2001-2010 (from year t − 9 to year t) on the independent variables for the period 2000-2009 (from year (t − 9 − τ ) to year (t − τ ) with τ = 1) to estimate the regression coefficients. Then, we multiply these coefficients (α0, α1, α2, · · · , αn) on the independent variables (x1, x2, ..., xn) from year 2010 (year = t) to estimate the earnings for 2011 (year t + τ with τ = 1)."

    It would be great if someone could comment on how to tackle a problem like this!

    Thanks!!

    Comment


    • #3
      Anyone?

      Comment


      • #4
        Markus - I don't have any idea what your model is. But I do find using the code, or your output allows people to be able to understand and answer your question better. Might be worth explaining in the most simplistic terms what you have done and what you want to see. From where I'm sitting this looks very complex and would take an enormous amount of effort to understand this and people answering these questions are volunteers. Do this, but if you fail to get an answer perhaps cross-post on stat-exchange, quora or look on youtube.

        Comment


        • #5
          You are right em, I tried to cram to much information into the post.

          So I think it is a good start to provide an example of my data. I cleaned all variables I don't need for the regression and provide only the first 100 observations.

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input int Year float(FFF DD NegE id Earnings_w Dividends_w Accruals_w TotalAssets_w)
          2000 . 1 0  1  145000  44140.06    31000  1573000
          2001 . 1 0  1  123000  45293.79   -57000  1707000
          2002 . 1 0  1  125000   47430.6  -128000  1609000
          2003 . 1 0  1  154000   54828.8   -97000  1689000
          2004 . 1 0  1  171000  56614.95   -57000  1711000
          2005 . 1 0  1  179000   77348.5   -21000  2111000
          2006 . 1 0  1  259524     83804   -49206  2251528
          2007 1 1 0  1  223976     84408    88431  2363250
          2008 1 1 1  1  -23818     35366   -41511  1946498
          2009 1 1 0  1   64524     35366   -67391  1661938
          2010 1 1 0  1  115907   44207.5   256813  1777113
          2011 1 1 0  1   55807   43832.5   -27904  1822692
          2012 1 1 0  1   78252  43478.75   -42612  1740368
          2013 1 1 0  1   80453   43482.5   -84965  1531712
          2014 1 1 0  1   94668   46961.1   -20448  1823535
          2015 1 1 0  1  141439     52179   -48960  1921206
          2003 . 0 0  2   26184         .   -51569  3219480
          2004 . 0 0  2   33670         .   -70428  3561096
          2005 . 0 0  2   49938         .  -217817  5040470
          2006 . 0 0  2  191351         .  -255000  5482955
          2007 . 1 0  2  170229     62700  -323760  7647286
          2008 . 1 0  2  157020     62700  -532865  9626986
          2009 . 1 0  2  161457     57000  -953640  9479606
          2010 . 1 0  2  174857     62700  -515560 10167808
          2011 . 1 0  2  194995     63135  -306151 10212330
          2012 . 1 0  2   60631   20617.6  -208170  9940072
          2013 . 1 0  2  113558     46170  -577162 10343502
          2014 . 1 0  2  127969     51300  -686644  9997414
          2015 . 1 0  2  156286     66690 -1096362 10436940
          2008 . 0 0  3   33274         .   -38700  1101928
          2009 . 0 0  3   31192         .   -88094   816687
          2010 . 0 0  3   47181         .   -28232   808211
          2011 . 1 0  3   86020     52896   -18486   850104
          2012 0 1 0  3   71312   21158.4   -46111   854304
          2013 0 1 0  3   56028   21158.4   -66206   906199
          2014 0 1 0  3   59212   42316.8   -35939  1052512
          2015 0 1 0  3   42697   42316.8   -67181  1075303
          2002 . 0 0  4    9180         .   -31458   130478
          2003 . 0 0  4    6449         .   -31378   272825
          2004 . 0 0  4   25049         .   -48174   799876
          2005 . 0 0  4   33405         .    52249   972413
          2006 . 1 0  4   77186     19800   -61921  1512604
          2007 . 0 0  4   40376         0    52668  2845265
          2008 . 0 1  4  -33881         .  -133559  2697420
          2009 . 0 0  4   54193         0   111919  2550053
          2010 . 0 1  4 -270478         0  -436766   900347
          2012 . 0 0  4  303551         0        .    55518
          2006 . 0 0  5    7449         .    -6872    97212
          2007 . 1 0  5    8426      4870     4695   132432
          2008 . 1 0  5   11031      5357     1241   135013
          2009 . 0 0  5    4597         0    -9990   138868
          2010 . 0 1  5  -17257         0   -24016   154887
          2011 . 0 1  5  -12377         0   -23183   142687
          2012 . 0 1  5   -2491         0    -2698   130689
          2013 . 0 1  5  -28297         0     4052   131698
          2014 . 0 1  5  -27024         0   -19530    95071
          2015 . 0 1  5  -13397         0    -5026    88274
          2010 . 0 0  6   27868         .   -24196   253603
          2011 . 0 0  6   28144         .    13993   283872
          2012 . 0 0  6   10630         .    -3847   330385
          2013 . 0 0  6   21145         .     1729   429410
          2014 . 0 0  6   28894         0     -786   547185
          2015 . 0 1  6   -9594         0    -6672   718245
          2010 . 0 0  7    6617         0    -3534    48338
          2011 . 0 0  7    5273         0    -4109    55964
          2012 . 1 0  7    5124   2496.55     4482    55453
          2013 . 1 0  7    3647   2496.55    -4175    52639
          2014 . 1 0  7    3484   2496.55       23    52003
          2015 . 1 0  7    4156    3108.6       71    52970
          2000 . 1 0  8   73100     43890   -27832   718445
          2001 . 1 0  8   66267     39900   -32026   733097
          2002 . 1 0  8   70291     42000   -50634   762538
          2003 . 1 0  8   70842     42000   -79527   823709
          2004 . 1 0  8   71688     42000  -122665   942954
          2005 . 1 0  8   74304     42000   -83712  1181687
          2006 . 1 0  8   76785     46200   -64777  1479892
          2007 . 1 0  8   87678     52500   -76735  1559660
          2008 . 1 0  8   91149     54600   -57247  1735284
          2009 . 1 0  8   73361     44100   -82113  1860932
          2010 . 1 0  8   75718     42000   -94025  1998496
          2011 . 1 0  8   31602     21000  -147311  2150158
          2012 . 1 0  8   71890     22008  -107781  2061803
          2013 . 1 0  8   73291     27300  -131063  1953937
          2014 . 1 0  8   82482     34608  -138109  1892223
          2015 . 1 0  8  100434     42000  -127979  1909656
          2004 . 0 1  9   -8631         0    -7035    18773
          2005 . 0 1  9  -28612         0    -4565    75004
          2006 . 0 1  9  -57292         0    -9533   107981
          2007 . 0 1  9  -24070         0    -3228   143510
          2008 . 0 1  9  -61600         0   -20294   111697
          2004 . 0 1 10   -4082         .    -8714   137814
          2005 . 0 0 10    2902         0    -7916   136419
          2006 . 0 0 10    3298         0     8069   160097
          2007 . 0 0 10    2309         0    17860   224168
          2008 . 0 1 10  -27558         0   -29127   229999
          2009 . 0 1 10  -36698         0   -31435   127016
          2000 . 1 0 11   22544 12025.374   -18069   857314
          2001 . 1 0 11   37574  12014.04   -48432   845426
          2002 . 1 0 11   44325 19845.834   -16742   855988
          2003 . 1 0 11   65382 19845.834   -26614   897086
          end
          What I now try to do is the following: I want to use the variables DD, NegE, Earnings_w, Dividends_w, Accruals_w, TotalAssets_w to estimate coefficents for years t+1 t+2 and t+3.


          For the t+1 forecast I need to use the previous 5 years of data lagged by one year L1.

          qui rolling _b, window(5) saving(TestSmallt+1, replace): reg Earnings_w L1.Earnings_w L1.Dividends_w L1.DD L1.NegE L1.Accruals_w L1.TotalAssets_w


          For the t+2 forecast I need to use the previous 5 years of data lagged by two years L2.

          qui rolling _b, window(5) saving(TestSmallt+2, replace): reg Earnings_w L2.Earnings_w L2.Dividends_w L2.DD L2.NegE L2.Accruals_w L2.TotalAssets_w

          For the t+3 forecast I need to use the previous 5 years of data lagged by two years L3.

          qui rolling _b, window(5) saving(TestSmallt+3, replace): reg Earnings_w L3.Earnings_w L3.Dividends_w L3.DD L3.NegE L3.Accruals_w L3.TotalAssets_w


          Later, the averages of the coefficients (for the t+1, t+2 and t+3 estimates) should be multiplied with the variables in year t to get the forecasted values.

          My questions:
          1. Is there anything obvious wrong with my approach?
          2. Is there a more efficient way than just "rolling"? It takes hours when I use the full data set.
          3. How can I combine the three regression operations and store the estimated coefficients for years t+1 t+2 and t+3 in one file?

          Thanks

          Comment


          • #6
            I can't help about the validity of your approach but rangestat (from SSC) is several orders of magnitude faster than rolling at performing regressions over a rolling window of time.

            Comment


            • #7
              Thank you for the suggestion Robert!

              Using he helpfile I tried to build a rolling regression model over a 5 year window. All independent variables in the regression need to be lagged by one period.
              Since rangestat does not allow time series operators, how can I solve this problem?

              I tried to adjust the interval bounds but this doesn't seem to work

              . rangestat (reg) Earnings_w L1.TotalAssets_w L1.Dividends_w L1.DD L1.Earnings_w L1.NegE L1.Accruals_w, interval(Year -4 0) by(id)
              factor variables and time-series operators not allowed
              r(101);

              . rangestat (reg) Earnings_w TotalAssets_w Dividends_w DD Earnings_w NegE Accruals_w, interval(Year -5 1) by(id)
              no result for all obs: reg Earnings_w TotalAssets_w Dividends_w DD Earnings_w NegE Accruals_w

              Comment


              • #8
                Just create new variables with the lagged values. For example:
                Code:
                gen L1_TotalAssets_w = L1.TotalAssets_w

                Comment


                • #9
                  So I ran a few regressions and the speed of rangestat is just incredible! Thanks to all developers for that!

                  I ran the following regression without the identifier "by (id)" and the results seem to make sense.
                  rangestat (reg) Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w , interval(Year -3 0)
                  Does rangestat account for panel data structure?

                  However, when I try to run the regression WITH the indentifier "by (id)" for individual companies

                  rangestat (reg) Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w , interval(Year -3 0) by (id)
                  I get the following error

                  no result for all obs: reg Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w
                  Does someone knows what might cause this?

                  Also, is there a way to display t-statistics in the output?

                  Thanks!

                  Comment


                  • #10
                    If I use the data example you posted in #5 and run what you tried in #9, I get:
                    Code:
                    . xtset id Year
                           panel variable:  id (unbalanced)
                            time variable:  Year, 2000 to 2015, but with a gap
                                    delta:  1 unit
                    
                    . foreach v of varlist TotalAssets_w Dividends_w DD Earnings_w NegE Accruals_w {
                      2.         qui gen L1_`v' = L1.`v'
                      3. }
                    
                    . 
                    . rangestat (reg) Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w , interval(Year -3 0) by(id)
                    no result for all obs: reg Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w
                    rangestat is just a hyper efficient way to loop over each observation and compute some statistics. If the computed statistic is missing for each observation, you get the error you observed. As explained in the help file, you can always use standard Stata commands to generate results for any given observation. Say we want to manually calculate results for observation 5, the code would look like:
                    Code:
                    regress Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w ///
                        if id == id[5] & inrange(Year,Year[5]-3,Year[5])
                    and the results:
                    Code:
                    . regress Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w ///
                    >         if id == id[5] & inrange(Year,Year[5]-3,Year[5])
                    note: L1_Dividends_w omitted because of collinearity
                    note: L1_DD omitted because of collinearity
                    note: L1_NegE omitted because of collinearity
                    
                          Source |       SS           df       MS      Number of obs   =         4
                    -------------+----------------------------------   F(3, 0)         =         .
                           Model |  1.6288e+09         3   542916667   Prob > F        =         .
                        Residual |           0         0           .   R-squared       =    1.0000
                    -------------+----------------------------------   Adj R-squared   =         .
                           Total |  1.6288e+09         3   542916667   Root MSE        =         0
                    
                    ----------------------------------------------------------------------------------
                          Earnings_w |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                    -----------------+----------------------------------------------------------------
                    L1_TotalAssets_w |  -.0337041          .        .       .            .           .
                      L1_Dividends_w |          0  (omitted)
                               L1_DD |          0  (omitted)
                       L1_Earnings_w |   1.034911          .        .       .            .           .
                             L1_NegE |          0  (omitted)
                       L1_Accruals_w |  -.3327771          .        .       .            .           .
                               _cons |   36270.53          .        .       .            .           .
                    ----------------------------------------------------------------------------------
                    
                    .
                    So if you perform the rolling regressions within panels, each regression has at most 4 observations and you have collinearity issues. rangestat will not report regression results if one or more variable is omitted because the results are for a different model.

                    As for the t-statistics, you can compute them using the following code:
                    Code:
                    foreach v in L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w cons {
                        gen double t_`v' = b_`v' / se_`v'
                    }
                    Last edited by Robert Picard; 01 Sep 2017, 09:48.

                    Comment


                    • #11
                      Thank you Robert!

                      so a pooled cross sectional regression on the last 5 years (excluding year t) would look like this:

                      Code:
                      rangestat (reg) Earnings_w L1_TotalAssets_w L1_Dividends_w L1_DD L1_Earnings_w L1_NegE L1_Accruals_w , interval(Year -5 -1)
                      Since it is a pooled regression I don't need the by(id) option right?
                      Does adding the by (Year) option make any sense?

                      Also: Can I somehow use Newey West t test?

                      Comment


                      • #12
                        Hello,

                        does someone know if it is possible to apply Newey West t-statistics when using rangestat for regression?

                        Comment

                        Working...
                        X