Specify interaction/square terms with xtabond2 & xtdpdgmm

Huaxin Wanglu started a topic Specify interaction/square terms with xtabond2 & xtdpdgmm

05 Mar 2021, 17:51

Specify interaction/square terms with xtabond2 & xtdpdgmm

Dear all,

First of all, I would like to confirm that I have searched and read many posts here but no extant solution could be found.

I am now working with xtabond2 to conduct two-step sys-GMM estimation. I have read Roodman (2009) and Prof. Sebastian Kripfganz's presentation slides. But my case is a bit uncommon, so I still cannot figure out all the issues by exploring these materials.

To clarify, I do not have a lagged dependent variable in the right-side equation. The reason I run GMM estimation is because for the purpose of robustness check, I have to address endogeneity while I cannot find proper external instrument variables.

My observations in total are more than 600,000 with a time span of 22 years. My core predictor is a macro-level variable (i.e. yearly difference △Xt, △Xt-1, △Xt-2, etc.) and the dependent variable is a micro-level variable (i.e. individual choice). In my OLS & fixed-effect model, I find a U-shaped relationship (convexity), so I want to add the square term of my core predictor to the GMM estimation. But by specifying it as GMM-style instruments, the Hansen test is always significant (fairly below 0.25, just around 0.01 most of time). I tried all the positions it could be placed in, and have found that by treating it as exogenous and putting it in the IV-style instrument, I obtain statistically significant results and a decent Hansen test p-value (>0.40).

1. My first confusion is, I treat the core predictor as endogenous, and put it in the GMM-style instrument with its second- and higher-orders (lag2-lag21). In this way, can I treat its square term as exogenous?

2. Arellano-Bond test rejects the null until AR(6), is it still okay for me to include lags of 1-5 as instruments? Since I don't have lagged dependent variable in the model, so I am unsure whether Arellano-Bond test still applies to my case.

3. From Prof. Sebastian Kripfganz's slides, I learn that dummy variables are usually treated as exogenous and put in the IV-style instrument with the level option, but how about the interaction term between endogenous / predetermined variables and dummies? If Hansen test and Difference-in-Hansen tests are all satisfied (fairly >0.25), is it justifiable to treat the interaction terms as exogenous?

Lastly, I have run my specification with xtdpdgmm command before, but due to the number of my observations is quite large, I cannot obtain the result even after waiting for more than 30 minutes. Is there any way that I can speed up running xtdpdgmm?

Hereby, I leave my codes:

Code:

xtabond2 migrate i.a2003 co_age dy_schooling marriage hukou_type a2025b InIncome ///
c.L.gap_jobdiff3ex##c.L.gap_jobdiff3ex gap_ppden gap_unemploy gap_enterprise gap_med gap_highedu i.yr2-yr22 , ///
gmmstyle(gap_jobdiff3ex, lag(2 .) orthogonal collapse) ///
gmmstyle(gap_ppden gap_enterprise gap_unemploy , lag(1 .) collapse) ///
ivstyle(gap_highedu gap_med) ///
ivstyle(c.L.gap_jobdiff3ex#c.L.gap_jobdiff3ex i.a2003 co_age dy_schooling marriage hukou_type a2025b InIncome i.yr2-yr22 , eq(level)) ///
small twostep artests(6) cluster(dest_code)

Note: i.a2003 co_age dy_schooling marriage hukou_type a2025b InIncome are time-invariant variables. I confirm that I realize that to include them, a stronger assumption is imposed on the estimation.

Here is the test results:

Code:

------------------------------------------------------------------------------
Group variable: numeric_un~e                    Number of obs      =    670476
Time variable : time                            Number of groups   =     57429
Number of instruments = 94                      Obs per group: min =         1
F(30, 272)    =    109.21                                      avg =     11.67
Prob > F      =     0.000                                      max =        17
------------------------------------------------------------------------------

------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z =  -7.40  Pr > z =  0.000
Arellano-Bond test for AR(2) in first differences: z =  -3.58  Pr > z =  0.000
Arellano-Bond test for AR(3) in first differences: z =  -7.87  Pr > z =  0.000
Arellano-Bond test for AR(4) in first differences: z =  -3.47  Pr > z =  0.001
Arellano-Bond test for AR(5) in first differences: z =  -3.06  Pr > z =  0.002
Arellano-Bond test for AR(6) in first differences: z =  -0.95  Pr > z =  0.342
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(63)   =89629.95 Prob > chi2 =  0.000
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(63)   =  62.20  Prob > chi2 =  0.505
  (Robust, but weakened by many instruments.)

Difference-in-Hansen tests of exogeneity of instrument subsets:
  GMM instruments for levels
    Hansen test excluding group:     chi2(59)   =  59.02  Prob > chi2 =  0.475
    Difference (null H = exogenous): chi2(4)    =   3.18  Prob > chi2 =  0.528
  gmm(gap_jobdiff3ex, collapse orthogonal lag(2 .))
    Hansen test excluding group:     chi2(49)   =  52.77  Prob > chi2 =  0.331
    Difference (null H = exogenous): chi2(14)   =   9.43  Prob > chi2 =  0.802
  gmm(gap_ppden gap_enterprise gap_unemploy, collapse lag(1 .))
    Hansen test excluding group:     chi2(10)   =  12.31  Prob > chi2 =  0.265
    Difference (null H = exogenous): chi2(53)   =  49.89  Prob > chi2 =  0.596
  iv(gap_highedu gap_med)
    Hansen test excluding group:     chi2(61)   =  60.85  Prob > chi2 =  0.481
    Difference (null H = exogenous): chi2(2)    =   1.35  Prob > chi2 =  0.509
  iv(cL.gap_jobdiff3ex#cL.gap_jobdiff3ex 0b.a2003 1.a2003 co_age dy_schooling marriage hukou_type a2025b InIncome 0b.yr2 1.yr2 0b.yr3 1.yr3 0b.yr4 1.yr4 0b.yr5 1.yr5 0b.yr6 1.yr6 0b.yr7 1.yr7 0b.yr8 1.yr8 0b.yr9 1.yr9 0b.yr10 1.yr10 0b.yr 11 1.yr11 0b.yr12 1.yr12 0b.yr13 1.yr13 0b.yr14 1.yr14 0b.yr15 1.yr15 0b.yr16 1.yr16 0b.yr17 1.yr17 0b.yr18 1.yr18 0b.yr19 1.yr19 0b.yr20 1.yr20 0b.yr21 1.yr21 0b.yr22 1.yr22, eq(level))
    Hansen test excluding group:     chi2(39)   =  39.88  Prob > chi2 =  0.431
    Difference (null H = exogenous): chi2(24)   =  22.32  Prob > chi2 =  0.560

Thanks for any comments!

Last edited by Huaxin Wanglu; 05 Mar 2021, 18:51.

Tags: None

Huaxin Wanglu replied

20 Jan 2022, 16:25
Originally posted by Sebastian Kripfganz View Post

I am afraid I do not have a good explanation for your observations. As I am not familiar with the specific peculiarities of your data and research question, there is not much I can say. You may want to put more weight on the economic model behind the regression if your econometric model specification search is inconclusive. Keep in mind that all models are just approximations and that specification tests are not perfectly accurate either.

Happy New Year to you, too.

Thanks so much for the reply, Sir!
Leave a comment:
Sebastian Kripfganz replied

04 Jan 2022, 03:21
I am afraid I do not have a good explanation for your observations. As I am not familiar with the specific peculiarities of your data and research question, there is not much I can say. You may want to put more weight on the economic model behind the regression if your econometric model specification search is inconclusive. Keep in mind that all models are just approximations and that specification tests are not perfectly accurate either.

Happy New Year to you, too.
Leave a comment:
Huaxin Wanglu replied

03 Jan 2022, 10:23
Originally posted by Sebastian Kripfganz View Post

I do not know the migration literature very well. Is your dependent variable a binary dummy variable? If an individual decides to migrate in period t-1, is that individual still observed in the data set in period t? If so, doesn't the decision to migrate in t-1 affect the migration state in t?

I understand however the concern of the other researcher. A model with a lagged dependent variable may not be the natural choice for your application, depending on how exactly your dependent variable is constructed. It might be more promising to include lags of the independent variables instead of a lagged dependent variable, i.e. to use a distributed lag model. The migration decision at t might well depend on exogenous factors in the previous periods. Adding lags of the independent variables can also help to alleviate autocorrelation concerns. Moreover, without a lagged dependent variable, autocorrelation may not be a concern for the consistency of the estimator anymore if the independent variables are strictly exogenous. It might be sufficient to just use the usual panel-robust standard errors for correct inference.

Dear Dr. Kripfganz,
Hello, sorry for asking you again and again. I am checking my last specifications for report. The creation of my core predictor involves 3 time periods, t, t-1, t-2, and when I do not lag other control variables, AR(2) and AR(4) are well above 0.1 but AR(3) is well below. So, I lag the control variables (they are state variables measured at the end of each year, so this choice is justifiable), then AR(2)--AR(4) are all well above 0.1. But I do not very understand what happen to AR(3) and why I lag the control variables, AR(3) turns out to be no serial correlation?

I also tried using distributed lag model as you suggested, AR test results are good, but Hansen J test result is often around 0.15 and the control variables both lagged- and non-lagged terms are statistically insignificant. And by using distributed lag model, the GMM specification is not consistent with my FE models.

BTW, wish you and your family a Happy New Year!
Leave a comment:
Huaxin Wanglu replied

02 Dec 2021, 12:49
One specification passes at AR(4), others at AR(2). For the one cannot pass at AR(2), I revised the IV to be lagged by one period, and the results now are very good! Shall I also include the non-lagged term? Both L. OR L(0/1) returns good results!

Last edited by Huaxin Wanglu; 02 Dec 2021, 13:30.
Leave a comment:
Huaxin Wanglu replied

02 Dec 2021, 11:36
Originally posted by Sebastian Kripfganz View Post

I do not know the migration literature very well. Is your dependent variable a binary dummy variable? If an individual decides to migrate in period t-1, is that individual still observed in the data set in period t? If so, doesn't the decision to migrate in t-1 affect the migration state in t?

I understand however the concern of the other researcher. A model with a lagged dependent variable may not be the natural choice for your application, depending on how exactly your dependent variable is constructed. It might be more promising to include lags of the independent variables instead of a lagged dependent variable, i.e. to use a distributed lag model. The migration decision at t might well depend on exogenous factors in the previous periods. Adding lags of the independent variables can also help to alleviate autocorrelation concerns. Moreover, without a lagged dependent variable, autocorrelation may not be a concern for the consistency of the estimator anymore if the independent variables are strictly exogenous. It might be sufficient to just use the usual panel-robust standard errors for correct inference.

Highly appreciated for answering me!

Yes, exactly as you understand, my DV is a binary dummy variable indicating migrated or not. If a migrant moved at time t-1, he/she would be still in my sample and coded as 1 at time t. I also personally think that it is reasonable to justify that migration at t-1 affects the status at t, but since I couldn't know if they return afterwards, so probably this design is problematic. After I revised my DV to be 1 at the year of migration and 0 before and afterwards, the DV is no longer a persistent variable.

"It might be more promising to include lags of the independent variables instead of a lagged dependent variable, i.e. to use a distributed lag model." This sounds promising. I need to explore the literature. Thank you so much for the suggestion! I was thinking that I needed to remove all the GMM parts and was quite upset since I spent lots of time in learning and applying it over the past few months.

Do tests usually should be reported have any changes? I have just tested with my revised DV and without the lagged term, the results are quite good! I pass AR and Hansen's!

Last edited by Huaxin Wanglu; 02 Dec 2021, 12:30.
Leave a comment:
Sebastian Kripfganz replied

02 Dec 2021, 02:18
I do not know the migration literature very well. Is your dependent variable a binary dummy variable? If an individual decides to migrate in period t-1, is that individual still observed in the data set in period t? If so, doesn't the decision to migrate in t-1 affect the migration state in t?

I understand however the concern of the other researcher. A model with a lagged dependent variable may not be the natural choice for your application, depending on how exactly your dependent variable is constructed. It might be more promising to include lags of the independent variables instead of a lagged dependent variable, i.e. to use a distributed lag model. The migration decision at t might well depend on exogenous factors in the previous periods. Adding lags of the independent variables can also help to alleviate autocorrelation concerns. Moreover, without a lagged dependent variable, autocorrelation may not be a concern for the consistency of the estimator anymore if the independent variables are strictly exogenous. It might be sufficient to just use the usual panel-robust standard errors for correct inference.
Leave a comment:
Huaxin Wanglu replied

01 Dec 2021, 08:09
Originally posted by Sebastian Kripfganz View Post

Those results are not directly comparable. When you make the model dynamic, you change the meaning of the coefficients. When the dependent variable is quite persistent, it is not uncommon that you have significant coefficients in the static model but insignificant ones in the dynamic model. The other way round is more surprising but I do not see a reason why this should not happen. If there is an omitted variable, much depends on how this variable is related to the included variables. For example, if you add a lag of the variable with questionable coefficient as another regressor to the dynamic model, you might observe that both coefficients are statistically significant with opposite signs and that the effects cancel out when you add them up.

Hello, Dr. Kripfganz. I encountered a new problem that a researcher suggested to me that adding lagged dependent variable to the model changes its meaning and it is problematic because my DV is an individual-level variable (migration decision). So, migration decision at time t shouldn't depend on its status at time t-1. But as I initially proposed in this post, excluding the lagged term, I cannot pass AR(2) test. I am wondering about the implication on adding a lagged term in GMM?
Leave a comment:
Huaxin Wanglu replied

08 Nov 2021, 11:48
Originally posted by Sebastian Kripfganz View Post

Those results are not directly comparable. When you make the model dynamic, you change the meaning of the coefficients. When the dependent variable is quite persistent, it is not uncommon that you have significant coefficients in the static model but insignificant ones in the dynamic model. The other way round is more surprising but I do not see a reason why this should not happen. If there is an omitted variable, much depends on how this variable is related to the included variables. For example, if you add a lag of the variable with questionable coefficient as another regressor to the dynamic model, you might observe that both coefficients are statistically significant with opposite signs and that the effects cancel out when you add them up.

Thanks a lot for the reply. I tried to add a lagged dependent to my OLS model and found that predictors turn to be statistically significant. I guess the problem arises because my predictor is a three-year volatility and my dependent variable is persistent.
Leave a comment:
Sebastian Kripfganz replied

31 Oct 2021, 06:19
Those results are not directly comparable. When you make the model dynamic, you change the meaning of the coefficients. When the dependent variable is quite persistent, it is not uncommon that you have significant coefficients in the static model but insignificant ones in the dynamic model. The other way round is more surprising but I do not see a reason why this should not happen. If there is an omitted variable, much depends on how this variable is related to the included variables. For example, if you add a lag of the variable with questionable coefficient as another regressor to the dynamic model, you might observe that both coefficients are statistically significant with opposite signs and that the effects cancel out when you add them up.
1 like
Leave a comment:
Huaxin Wanglu replied

30 Oct 2021, 07:36
Originally posted by Sebastian Kripfganz View Post

I would not worry too much about this problem. in the output you have shown, the AR(3) and AR(4) p-values still appear to be acceptable. Also, the test might generally become less reliable for higher orders of autocorrelation.

Hello, Dr. Kripfganz. I am recently revising my models and I have newly found a weird situation--the explanatory variable of interest is significant in GMM (consistently significant with different lag lengths) but insignificant in static models. To be honest, in this case, I am quite uncertain if the results are indeed significant or not. And conversely, another variable is significant in static models, but insignificant in GMM. I don't understand why the discrepancy arises and which result is more reliable? Is there any way to justify it?
Leave a comment:
Huaxin Wanglu replied

23 May 2021, 12:01
Originally posted by Sebastian Kripfganz View Post

I would not worry too much about this problem. in the output you have shown, the AR(3) and AR(4) p-values still appear to be acceptable. Also, the test might generally become less reliable for higher orders of autocorrelation.

Ah, thanks a lot! I can be at ease to use such lag strategies.
Leave a comment:
Sebastian Kripfganz replied

22 May 2021, 04:07
I would not worry too much about this problem. in the output you have shown, the AR(3) and AR(4) p-values still appear to be acceptable. Also, the test might generally become less reliable for higher orders of autocorrelation.
Leave a comment:

Huaxin Wanglu replied

21 May 2021, 15:08

Originally posted by Sebastian Kripfganz View Post

That seems to be a matter about efficiency in the (implicit) first-stage regressions of the regressors on the instruments. These level instruments might be informative for some variables but less informative for others. Adding further informative instruments helps to improve the first-stage fit, while adding further uninformative (weak) instruments worsens the first-stage fit. Adding more (instrumental) variables is not always better, even in large samples.

Hello, dear Dr. Kripfganz, I am trying to elaborate my specification and want to include a few interaction terms between my core predictor and time-invariant dummies. When I tested an interaction effect between my continuous variable of interest and gender dummy, I found although AR(2) is well above 0.1, but either AR(3) or AR(4) becomes significantly lower and even just around 0.05. I tried various lag strategies but none of them solves this problem, I am completely confused about why it happens. May I have your advice?

Code:

xtabond2 migrate L.migrate a2003 c.co_age##c.co_age dy_schooling marriage hukou_type a2025b InIncome  ///
c.gap_jobdiff3ex##c.gap_jobdiff3ex##i.a2003  gap_Inwage gap_ppden gap_unemploy gap_enterprise gap_med gap_highedu i.yr2-yr22, ///
gmmstyle(migrate, lag(1 1) eq(level) ) /// lagged dependent
gmmstyle(migrate, lag(2 5) eq(diff) collapse) ///
gmmstyle(c.gap_jobdiff3ex##c.gap_jobdiff3ex, lag(1 1) eq(level)) /// endogenous
gmmstyle(c.gap_jobdiff3ex##c.gap_jobdiff3ex, lag(2 5) eq(diff) collapse) ///
gmmstyle(a2003#c.gap_jobdiff3ex a2003#c.gap_jobdiff3ex#c.gap_jobdiff3ex, lag(1 1) eq(level)) /// endogenous
gmmstyle(a2003#c.gap_jobdiff3ex a2003#c.gap_jobdiff3ex#c.gap_jobdiff3ex, lag(2 5) eq(diff) collapse) ///
gmmstyle(gap_Inwage gap_ppden gap_enterprise gap_unemploy,lag(0 0) eq(level)) /// predetermined
gmmstyle(gap_Inwage gap_ppden gap_enterprise gap_unemploy,lag(1 4) eq(diff) collapse) ///
ivstyle(gap_highedu gap_med ) /// exogenous
ivstyle(a2003 dy_schooling co_age marriage hukou_type a2025b InIncome yr2-yr22 , eq(level)) /// exogenous
small twostep artests(4) cluster(dest_code)

Here, a2003 is the gender dummy.

Code:

------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z = -56.89  Pr > z =  0.000
Arellano-Bond test for AR(2) in first differences: z =  -1.03  Pr > z =  0.301
Arellano-Bond test for AR(3) in first differences: z =   1.67  Pr > z =  0.095
Arellano-Bond test for AR(4) in first differences: z =   1.37  Pr > z =  0.170
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(172)  =1810.05  Prob > chi2 =  0.000
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(172)  = 187.35  Prob > chi2 =  0.200
  (Robust, but weakened by many instruments.)

Announcement

Specify interaction/square terms with xtabond2 & xtdpdgmm

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment: