
  • Doubts on autocorrelation

    Dear Statalist users,
    I am fairly new to Stata and I'd really appreciate your help.
    I have been trying to find a correlation between the Fed's Quantitative Easing program (my main independent variable) and aggregate consumption (my dependent variable). In particular, I regress total (logged) monthly consumption on the total (logged) monthly assets held by the Fed (as you know, QE consists of asset purchases), plus a number of control variables. My data therefore consist of a monthly time series from 2003 to 2019. Please note that all of the data are seasonally adjusted and/or inflation adjusted, in order to get more precise estimates.
    Unfortunately, I cannot paste here the code for my regressions because of character limits.
    Anyway, my simple OLS regression has an R-sq. of 0.99 and most of the regressors are statistically significant. I conducted a Durbin-Watson test and the result was 0.75, suggesting that there is some autocorrelation.

    Therefore I tried to perform a first-difference regression. The R-sq. is much smaller (0.16) and, more worryingly, most of the regressors are now insignificant and the coefficients very small (on the order of 0.00), making it hard to draw conclusions from the results.

    My questions for you are the following: is it true in this case that the regression with the lower R-sq. is preferable to the one with a very high R-sq. because of autocorrelation? Would it be a good idea to use the simple OLS regression with Newey-West standard errors, considering that autocorrelation does not bias the estimates but makes the standard errors incorrect?

    I hope I have been clear enough.

    Thank you very much in advance
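
    A minimal sketch of this setup in Stata (the variable names are the ones from the output I post below; mdate is an assumed monthly date variable):
    Code:
    tsset mdate, monthly
    regress pce_log assets_log consumer_credit_log govt_exp_log
    estat dwatson    // Durbin-Watson statistic for the last regression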

  • #2
    Welcome to the Stata Forum / Statalist,

    This is far from my field. That said, I believe a Durbin-Watson statistic < 1 does not mean just "there is some autocorrelation": it is much worse than "some".

    Last but not least, the huge R-squared in the OLS regression would make me worry about (intense) overfitting.

    Hopefully you'll get further help from fellows in the field.
    Best regards,

    Marcos




      • #4
        Without seeing an example of your dataset and your code, it will be difficult to give you a good answer. As far as I understand your text, one problem could be that you regress a logged first-differenced variable on another logged variable. This makes the interpretation of the parameters a bit more challenging; at least, I often get confused by it. So, it might be that your coefficients are simply hard to interpret in this form.

        The large R-sq. in the simple OLS regression is rather typical when your dependent variable follows a unit-root process, meaning that the autocorrelation is close to 1.



        • #5
          Thanks Marcos Almeida and Sven-Kristjan Bormann for your replies. I will try to post my code separately, hoping it will be useful.
          Simple OLS:

          Code:
               Source |       SS           df       MS      Number of obs   =       192
          -------------+----------------------------------   F(8, 183)       =   5947.22
                 Model |  1.47539526         8  .184424407   Prob > F        =    0.0000
              Residual |  .005674862       183   .00003101   R-squared       =    0.9962
          -------------+----------------------------------   Adj R-squared   =    0.9960
                 Total |  1.48107012       191  .007754294   Root MSE        =    .00557
          
          ---------------------------------------------------------------------------------------------
                              pce_log |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          ----------------------------+----------------------------------------------------------------
                           assets_log |   .0089369   .0034726     2.57   0.011     .0020854    .0157884
                  consumer_credit_log |   .1120439   .0219158     5.11   0.000     .0688037     .155284
                             djia_log |   .0156524   .0071305     2.20   0.029     .0015837     .029721
          median_household_income_log |   .1305874   .0387227     3.37   0.001      .054187    .2069878
                         govt_exp_log |   .2510923   .0119902    20.94   0.000     .2274355    .2747492
                          umcsent_log |   .0267592   .0064449     4.15   0.000     .0140434     .039475
                           unrate_log |  -.0672251    .008629    -7.79   0.000    -.0842502      -.0502
                           indpro_log |   .0745467   .0293487     2.54   0.012     .0166414    .1324519
                                _cons |   4.307414   .5004063     8.61   0.000     3.320107    5.294722
          ---------------------------------------------------------------------------------------------
          Last edited by Paolo Giovanetti; 22 May 2020, 05:58.



          • #6
            First difference:

            Code:
            . reg d.(pce_log assets_log consumer_credit_log djia_log median_household_income_log govt_exp_log umcsent_log unrate_log indpro_log)
            
                  Source |       SS           df       MS      Number of obs   =       191
            -------------+----------------------------------   F(8, 182)       =      4.57
                   Model |  .000290392         8  .000036299   Prob > F        =    0.0000
                Residual |  .001445489       182  7.9422e-06   R-squared       =    0.1673
            -------------+----------------------------------   Adj R-squared   =    0.1307
                   Total |  .001735881       190  9.1362e-06   Root MSE        =    .00282
            
            ---------------------------------------------------------------------------------------------
                              D.pce_log |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            ----------------------------+----------------------------------------------------------------
                             assets_log |
                                    D1. |  -.0090239   .0048382    -1.87   0.064    -.0185701    .0005222
                                        |
                    consumer_credit_log |
                                    D1. |    .029443   .0396524     0.74   0.459    -.0487946    .1076805
                                        |
                               djia_log |
                                    D1. |   .0091208   .0058011     1.57   0.118    -.0023252    .0205668
                                        |
            median_household_income_log |
                                    D1. |   .0356397   .0322021     1.11   0.270    -.0278978    .0991772
                                        |
                           govt_exp_log |
                                    D1. |   .0128488   .0158355     0.81   0.418    -.0183961    .0440936
                                        |
                            umcsent_log |
                                    D1. |   .0017164   .0040362     0.43   0.671    -.0062475    .0096802
                                        |
                             unrate_log |
                                    D1. |  -.0129227   .0085588    -1.51   0.133    -.0298099    .0039644
                                        |
                             indpro_log |
                                    D1. |   .1115001   .0311824     3.58   0.000     .0499746    .1730255
                                        |
                                  _cons |   .0015385   .0002589     5.94   0.000     .0010276    .0020493
            ---------------------------------------------------------------------------------------------



            • #7
              And Sven-Kristjan Bormann, I understand what you're saying, but what would you suggest I do?



              • #8
                First of all, you should modify your second regression. At the moment, you regress the first-differenced log consumption on other first-differenced logged variables. I would have no clue how to interpret the parameters.

                I suggest that you pick up a book with an introduction to time-series analysis and try to understand the basic concepts like unit roots, autocorrelation and maybe log differences as they apply to your problem. Then you can come back with new code.
                The reason I suggest this is that your description of your problem and your estimations give me the impression that you don't yet know how to do time-series analysis.


                What I would do in your case is to first test all my variables for a unit root. Then, if necessary, take first differences and test again. Only after these steps would I start thinking about taking the logarithm of the first difference of the consumption variable, if I were interested in what effect the Quantitative Easing program has on the growth rate of consumption. You have done it exactly the other way around: you first took the logarithm of the variables and then applied first differences to all variables. These two operations are not interchangeable.
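
                A minimal sketch of those first steps, assuming the data are tsset on a monthly date variable (mdate is a hypothetical name; the lag length is illustrative):
                Code:
                tsset mdate, monthly
                dfuller pce_log, trend lags(12)    // augmented Dickey-Fuller test in levels
                dfuller D.pce_log, lags(12)        // test again after first-differencing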

                After the first OLS regression, I would run the usual diagnostic tests to see whether my residuals are autocorrelated or heteroskedastic, or do not look like they come from a normal distribution. Depending on my exact model, I would add lags of the dependent and independent variables or estimate an ARIMAX model. The rest then depends on the dataset and your theory.
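
                Such checks might look roughly like this (a sketch; estat bgodfrey requires the data to be tsset):
                Code:
                regress pce_log assets_log consumer_credit_log
                estat bgodfrey, lags(1/12)    // Breusch-Godfrey test for serial correlation
                estat hettest                 // Breusch-Pagan test for heteroskedasticity
                predict resid, residuals
                sktest resid                  // skewness/kurtosis test for normality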



                • #9
                  I would add to Sven-Kristjan Bormann's excellent advice that you should also look up what an error-correction (or equilibrium-correction) model is.



                  • #10
                    Thanks a lot Sven-Kristjan Bormann. I have run unit-root tests for every variable, and they all appeared to follow a unit-root process. So I took first differences; for each variable I did something like
                    Code:
                    gen govt_exp_d = d.govt_exp
                    (is that correct?), and after that the tests no longer showed unit roots. Then I took the logs of every first-differenced variable and came up with the following regression:

                    Code:
                    regress pce_d_log assets_d_log djia_d_log consumer_credit_d_log umcsent_d_log unrate_d_log median_household_income_d_log govt_exp_d_log indpro_d_log
                    
                          Source |       SS           df       MS      Number of obs   =       184
                    -------------+----------------------------------   F(8, 175)       =      5.14
                           Model |  2.80743264         8   .35092908   Prob > F        =    0.0000
                        Residual |  11.9555091       175  .068317195   R-squared       =    0.1902
                    -------------+----------------------------------   Adj R-squared   =    0.1531
                           Total |  14.7629417       183  .080671813   Root MSE        =    .26138
                    
                    -----------------------------------------------------------------------------------------------
                                        pce_d_log |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                    ------------------------------+----------------------------------------------------------------
                                     assets_d_log |  -.0488455   .0183674    -2.66   0.009    -.0850957   -.0125953
                                       djia_d_log |   .1101222   .0864956     1.27   0.205    -.0605867    .2808311
                            consumer_credit_d_log |   .6868448   .2526387     2.72   0.007     .1882339    1.185456
                                    umcsent_d_log |   .0394395   .0451708     0.87   0.384    -.0497101    .1285892
                                     unrate_d_log |  -.0601625   .0694266    -0.87   0.387    -.1971838    .0768588
                    median_household_income_d_log |   .1642969   .1361114     1.21   0.229    -.1043343    .4329281
                                   govt_exp_d_log |   .0357742   .0171152     2.09   0.038     .0019954    .0695529
                                     indpro_d_log |   .4486871   .1356479     3.31   0.001     .1809706    .7164036
                                            _cons |  -.9359631   1.537259    -0.61   0.543    -3.969917    2.097991
                    -----------------------------------------------------------------------------------------------
                    Although the R-sq. is low, I am quite satisfied with the coefficients and with the fact that at least some of the variables are significant. Any thoughts on that?

                    Finally, regarding your last comment, what exactly do you mean by "add lags of the dependent and independent variables"? Does it mean running the regression with the lagged variables?

                    I really appreciate your help considering that this is my first time series model and the learning curve is pretty steep...



                    • #11
                      I am not an expert in time-series analysis. Maybe ask the same question on ResearchGate or https://stats.stackexchange.com/. I am not sure what the best approach is if all your variables have a unit root.
                      My guess is that you should use a Vector Error Correction Model (VECM) and test whether there is a cointegration (long-run equilibrium) relation between your dependent variable and your independent variable of interest. But there might also be cointegration relationships among your independent variables.
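
                      A rough sketch of how such a test might look in Stata, assuming tsset monthly data (the lag order and rank here are purely illustrative):
                      Code:
                      varsoc pce_log assets_log                  // choose a lag order
                      vecrank pce_log assets_log, lags(2)        // Johansen test for the cointegrating rank
                      vec pce_log assets_log, lags(2) rank(1)    // fit a VECM if one relation is found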

                      Your current OLS approach estimates what happens to the growth rate of your dependent variable if the growth rates of the independent variables change. If that is what you are interested in, then you can probably stick with it.

                      What I meant with my last comment is that I would run the regression with the growth rates and the lags of the growth rates. But I would only do that if the residuals of my first regression were still autocorrelated, because that indicates that I missed some part of the dynamics.
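
                      With Stata's time-series operators, that might look like this (a sketch; which lags to include depends on your diagnostics):
                      Code:
                      regress D.pce_log LD.pce_log D.assets_log LD.assets_log
                      estat bgodfrey, lags(1/12)    // recheck the residuals for autocorrelation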



                      • #12
                        Sven-Kristjan Bormann : I have been thinking about the first-difference model and I have a "formal" doubt. As you suggested, I first took the first difference of my variables and then their logarithm. So, if I had to write the equation, it would look like: lnΔconsumption = lnΔassets + lnΔdjia... etc. However, in textbooks and online I have always found something like this: Δln(y) = Δln(x)...
                        In words that would read: "the difference of the logarithm of y is equal to the difference of the logarithm of x". But what I did with my model sounds more like "the logarithm of the difference". Would it therefore be formally correct to write my equation as lnΔvariable or, as I have found in every example, as Δln(variable)?
                        Thanks again
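
                        The two orderings really do differ; a small sketch (assuming the undifferenced level series, e.g. pce, is still in the dataset):
                        Code:
                        gen dln_pce = D.pce_log    // difference of the log: approximately the growth rate
                        gen lnd_pce = ln(D.pce)    // log of the difference: missing whenever D.pce <= 0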



                        • #13
                          Paolo Giovanetti : I have to apologise to you. I realised only now that I was the one who was confused about the order of applying the logarithm and differencing. You were correct in the beginning to first take the logarithm and then take differences to get the growth rates, if your goal is to estimate your model for the growth rates of consumption against the growth rates of the Quantitative Easing program.

                          So yes, you are correct with your formal description. I am truly sorry for the confusion that I caused. And your last OLS regression points in the right direction, I hope. But there are definitely some people in this forum who are experts in time-series analysis.

                          Nevertheless, you could also look at ARDL (autoregressive distributed lag) models to estimate your desired correlation.
                          In Stata, you can estimate these models with the help of the user-provided ardl command by Sebastian Kripfganz and Daniel Schneider. You can find the slides of their London Stata Conference 2018 presentation of the command here. Besides that, I suggest looking at how others have estimated similar models, whether they used an ARDL or a VECM.
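
                          A minimal sketch of the ardl command (install from SSC first; I believe the aic and ec options select lags by AIC and report the error-correction form, but check the help file):
                          Code:
                          ssc install ardl                // user-written command by Kripfganz & Schneider
                          ardl pce_log assets_log, aic    // select lag lengths by AIC
                          ardl pce_log assets_log, ec     // error-correction parameterization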
