  • #16
    Dear Sebastian Kripfganz and other wonderful Statalisters,

    First, thanks for your invaluable contribution to the community by offering solutions to the challenges faced by users like me! This forum is my go-to place for everything Stata.

    That said, I am encountering some challenges modelling my dataset using the system GMM estimation technique. My dataset measures Sale for roughly 10k employees over 60 months (unbalanced panel). I'm using the following syntax:

    xi: xtabond2 Sale L.Sale L.Dperform Dtarget MGrwth Incentv1 Incentv2 c.L.Sale#c.Incentv1 c.L.Sale#i.Incentv2 zone* yr*, ///
        gmm(L.Sale L.Dperform Dtarget MGrwth c.L.Sale#c.L.Dperform c.L.Sale#c.Incentv1 c.L.Sale#i.Incentv2, collapse) ///
        iv(zone* yr*, eq(level)) or twostep rob small

    Here, Sale is the dependent variable and L.Sale is included as a regressor. L.Dperform is the lag of the division's performance (% of target achieved), Dtarget is the target for the employee's division for the period, and MGrwth is the month-over-month market growth of the employee's division. Incentv1 (continuous) and Incentv2 (dummy) are employee incentives, and zone* and yr* are dummies for zone and year, respectively. L.Sale is endogenous for obvious reasons; hence, I'm treating its interactions with the incentives as endogenous as well. Further, L.Dperform, Dtarget, and MGrwth are also endogenous or predetermined. I use robust SEs as my employees are nested within divisions. However, when I estimate this model, I encounter the following challenges:

    1. First, I get a warning message: "Warning: Two-step estimated covariance matrix of moments is singular. Using a generalized inverse to calculate optimal weighting matrix for two-step estimation." Is this something to be worried about?

    2. Next, although I use the collapse option to restrict the number of instruments, the model still uses a large number of them (over 400), though fewer than the number of groups (over 5,000) in my dataset. My assumption is that as long as the number of instruments is lower than the number of groups, it is acceptable. However, is 400+ still too large?

    3. My understanding is that I should include all variables that are either endogenous or predetermined in the gmmstyle option (and therefore all interactions with these variables). Is that correct? If yes, do I include these endogenous/predetermined variables in the gmm() option exactly as they appear in the regressor list (i.e., the syntax creates the lagged differences by itself)? Or should I include only the lags of these variables; that is, if Dtarget is included as a regressor, should I include Dtarget or L.Dtarget in the gmm() option? Also, do I include all exogenous regressors in the ivstyle option, or only the time-specific dummies?

    4. The AR(2) test for my model is not significant, suggesting no second-order serial correlation in the first-differenced errors. However, both the Sargan test and the Hansen test are highly significant, suggesting that my instruments are not valid?

    ------------------------------------------------------------------------------
    Arellano-Bond test for AR(1) in first differences: z = -8.59 Pr > z = 0.000
    Arellano-Bond test for AR(2) in first differences: z = 1.27 Pr > z = 0.204
    ------------------------------------------------------------------------------
    Sargan test of overid. restrictions: chi2(422) =8485.23 Prob > chi2 = 0.000
    (Not robust, but not weakened by many instruments.)
    Hansen test of overid. restrictions: chi2(422) =1377.92 Prob > chi2 = 0.000
    (Robust, but weakened by many instruments.)


    If the Sargan test is only appropriate after a difference-GMM estimator, should I base my decision on the Hansen test result? What does a significant Hansen test statistic point toward in my model, and what are some possible ways to resolve this?

    Here's a snapshot of my results. I sincerely appreciate your time and contributions to help me resolve these issues.
    [Attachment: Capture.JPG — snapshot of the estimation output]

    Best,
    Ash



    • #17
      1. You could possibly ignore this warning message. It generally indicates that your number of instruments might be too large. Even though your data set has a very large number of groups, 441 instruments seem to be too many.
      2. See point 1. The number of instruments should be much smaller than the number of groups. There is no specific threshold, but with up to 58 time periods I would normally restrict the lag length for the instruments. For example, consider the 50th lag of a variable: such a high lag order inevitably leads to weak instruments, because the 50th lag will only be poorly correlated with the current observation. Again, there is no clear guidance on where to draw the line.
      3. By default, the variables you put in the gmm() option are assumed to be predetermined. This is normally a valid assumption for the lagged dependent variable, but it will not be sufficient for other variables which you consider to be endogenous. You may have to put endogenous variables into a separate gmm() option with suboption lag(2 .), where you can replace the . with an appropriate upper limit for the lag order in line with point 2. Variables which you put into the iv() option for the level model are implicitly assumed to be uncorrelated with the unobserved group-specific "fixed effects". This is effectively a random-effects assumption, which is stronger than the usual exogeneity assumption you may have in mind. You might want to put them into a gmm() option with lag(0 .) instead; see the sketch after this list.
      4. The effective misclassification of some endogenous variables as predetermined could be one reason for the rejection of the Hansen test. However, notice that with such a large number of observations even a tiny misspecification (e.g. some omitted regressors) can be picked up by the test. It can generally be very hard to obtain a high p-value for the Hansen test with such a large data set.
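
      As a rough illustration of points 2 and 3, here is a minimal sketch using the variable names from post #16 (interactions omitted for brevity; the upper lag limit of 4 and the classification of the incentive variables are arbitrary assumptions for the example, not recommendations):

      xtabond2 Sale L.Sale L.Dperform Dtarget MGrwth Incentv1 Incentv2 zone* yr*, ///
          gmm(L.Sale, lag(1 4) collapse)                    /// predetermined (the default); the upper limit caps the instrument count
          gmm(L.Dperform Dtarget MGrwth, lag(2 4) collapse) /// endogenous: instruments start at the second lag
          gmm(Incentv1 Incentv2, lag(0 4) collapse)         /// exogenous w.r.t. the idiosyncratic error, but not assumed uncorrelated with the "fixed effects"
          iv(zone* yr*, eq(level))                          /// dummies assumed uncorrelated with the "fixed effects"
          twostep robust small
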
      More on dynamic panel data GMM estimation in Stata:
      https://twitter.com/Kripfganz



      • #18
        Dear Sebastian,

        Many thanks for your quick and insightful response!

        I've revised the model to reflect your inputs and restrict the number of instruments.

        xi: xtabond2 Ln_Sale l.Ln_Sale Dtarget l.Dperform Mgrwth i.Incentv1 Incentv2 c.l.Ln_Sale#c.l.Dperform c.l.Ln_Sale#i.Incentv1 c.l.Ln_Sale#c.Incentv2 zone1-zone4 yr2-yr7, ///
        gmm(l.Ln_Sale, eq(diff) lag(2 5) collapse) ///
        gmm(Dtarget Mgrwth, lag(0 1) collapse) ///
        gmm(c.l.Ln_Sale#c.l.Dperform c.l.Ln_Sale#i.Incentv1 c.l.Ln_Sale#c.Incentv2, lag(2 5) collapse) ///
        iv(zone1-zone4 yr2-yr7, eq(level)) or twostep rob small nodiffsargan


        However, I have a few questions:

        1. Do you recommend using a separate gmm() option to include the interactions of the lagged DV? Should these interactions then be declared in the difference equation only, or in both the difference and level equations?

        2. Shouldn't I declare the year and zone effects using the iv() option in the level equation only?

        Here's a snapshot of the output using the above syntax. While the number of instruments is much smaller, the AR(2) test and the Hansen statistic are still highly significant.

        [Attachment: Capture.JPG — snapshot of the revised estimation output]



        • #19
          1. You do not have to specify separate gmm() options if you treat the variables the same way, i.e. as predetermined. The interaction terms can be added as instruments for the level model in the same fashion as you would for the lagged dependent variable.
          2. It is normally absolutely sufficient to specify those dummies simply as instruments for the level model.

          I could not see any instruments for i.Incentv1 Incentv2. You may want to add them.
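
          For illustration only, here is one way the command from #18 could be adjusted along these lines; it is a sketch reusing that post's variable names and lag limits, and whether those limits (and the lag(0 1) treatment of the incentive variables) are appropriate is an assumption you would still need to justify:

          xi: xtabond2 Ln_Sale l.Ln_Sale Dtarget l.Dperform Mgrwth i.Incentv1 Incentv2 ///
                  c.l.Ln_Sale#c.l.Dperform c.l.Ln_Sale#i.Incentv1 c.l.Ln_Sale#c.Incentv2 zone1-zone4 yr2-yr7, ///
              gmm(l.Ln_Sale c.l.Ln_Sale#c.l.Dperform c.l.Ln_Sale#i.Incentv1 c.l.Ln_Sale#c.Incentv2, lag(2 5) collapse) /// lagged DV and its interactions treated alike, instrumenting both equations
              gmm(Dtarget Mgrwth, lag(0 1) collapse)        ///
              gmm(i.Incentv1 Incentv2, lag(0 1) collapse)   /// instruments for the incentive variables
              iv(zone1-zone4 yr2-yr7, eq(level))            /// zone and year dummies: level equation only
              or twostep rob small nodiffsargan
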
          https://twitter.com/Kripfganz



          • #20
            Thanks, Sebastian. I appreciate your help!



            • #21
              Hi Sebastian,

              I have to report all results in my thesis. I was concerned about the Sargan test, but that problem is solved by your clarification. It was great.

              What I want to know is this: my two-step system GMM estimation shows that AR(1) is significant, with a p-value of about 0.000, but Roodman says it should be more than 0.25, and others say between 5% and 10%. Also, the articles I follow report only AR(2), so I did the same. If I now have to report AR(1), how can I explain it? AR(1) is significant with a p-value of about 0.000. Does this make the results invalid? If not, what should I do?

              Another thing: if the constant term of the two-step system GMM is insignificant, is there any problem? May I report it as a valid result?



              • #22
                If there is no serial correlation in the level errors, the first-differenced errors would have first-order serial correlation but no higher-order serial correlation. Therefore, you expect to reject the AR(1) test but not the AR(2) test; a very small p-value for the AR(1) test is indeed what you want.
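
                To spell out the logic with a small sketch (writing e_it for the idiosyncratic level error and assuming it is serially uncorrelated): the differenced error is Δe_it = e_it − e_i,t−1, so Cov(Δe_it, Δe_i,t−1) = −Var(e_i,t−1) ≠ 0, while Cov(Δe_it, Δe_i,t−2) = 0. Rejection by the AR(1) test is therefore expected by construction; it is the AR(2) test (and higher orders) that bears on the validity of the lagged instruments.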

                If I remember correctly, Roodman's statement is about the overidentification test, not the serial-correlation test.

                Whether the constant term is statistically significant or not does not indicate any statistical problem.
                https://twitter.com/Kripfganz
