Stata does not drop observations despite perfect collinearity

Tom Englaro

Join Date: Nov 2016

Posts: 13
#1


02 Dec 2016, 04:08

Hello everyone,

I work with a panel dataset comprising 46 countries over a time period of 55 years (1960 to 2014). I estimate this model with country-specific fixed effects and country-specific time trends.

I first created the fixed effects and time trends manually (including only 45 instead of 46 for both; I hope that was correct?) and estimated the regression with the -regress- command. As a check, I re-estimated the regression using the -xtreg- command with the fe option, this time including only the country-specific time trends. iccountry* are the fixed effects and icyear* are the country-specific time trends. Here's the code and the output (I only display the first five fixed effects and time trends, as the entire output would have been too large).

Code:

. quietly regress confrontation_incidence temp rainfalllog iccountry* icyear*,vce(cluster country1)
. est store eq1
. quietly xtreg confrontation_incidence temp rainfalllog icyear*, fe vce(cluster country1)
. est store eq2
. esttab eq1 eq2,b(4)

--------------------------------------------
                      (1)             (2)
             confrontat~e    confrontat~e
--------------------------------------------
temp              -0.0122         -0.0154
                  (-0.58)         (-0.75)

rainfalllog       -0.0004         -0.0232
                  (-0.01)         (-0.88)

iccountry1        -0.5452
                  (-1.40)

iccountry2        -0.6038
                  (-1.27)

iccountry3         1.2843*
                   (2.69)

iccountry4        -0.4390
                  (-1.98)

iccountry5        -0.1841
                  (-0.67)

icyear1            0.0005          0.0007
                   (0.53)          (0.78)

icyear2            0.0004          0.0006
                   (0.57)          (0.76)

icyear3           -0.0333***      -0.0331***
                 (-31.38)        (-31.92)

icyear4            0.0005          0.0007
                   (0.52)          (0.78)

icyear5           -0.0034***      -0.0033***
                  (-3.83)         (-3.80)

_cons              0.6151          0.6968
                   (0.97)          (1.34)
--------------------------------------------
N                    2158            2158

I now have three questions related to this:

First, the confrontation_incidence variable (binary dependent variable) does not change over time for a number of countries. If I understood the fixed-effect estimation procedure correctly, there should hence be perfect collinearity (perfect prediction) between the time-invariant dependent variable and the fixed-effect for these countries. Consequently, these observations should be omitted in my opinion. However, Stata does not do that and the number of observations is 2158 which is the total number of my observations. Why is that?

Second, why do the coefficients on my regressors (temperature, rainfall, icyear*) differ between the -regress- and the -xtreg- commands, even though the only difference between them is that I computed the fixed effects manually for the -regress- command?

Third, why isn't the constant omitted as I am using fixed effects?

Thank you for your consideration.

Last edited by Tom Englaro; 02 Dec 2016, 04:29.
Nick Cox

Join Date: Mar 2014

Posts: 35708
#2

02 Dec 2016, 04:15

I address just one of your points.

Stata attacks collinearity by sometimes omitting variables from a model. They remain part of the dataset. (The word "drop" is best avoided, because its primary meaning is in terms of the drop command.)

What makes you think that Stata should omit observations? Where have you seen that documented?

(I know very little about this kind of model, so I ask in ignorance.)
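The distinction between omitting a variable and omitting observations can be seen in a small sketch. It uses the auto dataset that ships with Stata; the variable weight2 is made up here purely to force perfect collinearity and is not part of the original question.

```stata
sysuse auto, clear                  // example dataset shipped with Stata
generate double weight2 = 2*weight  // hypothetical regressor, exact multiple of weight
regress price weight weight2 mpg    // Stata notes that one variable is omitted
display "observations used: " e(N) " of " _N
display "model degrees of freedom: " e(df_m)
```

One of the two collinear regressors is omitted from the model, but every observation is still used in the estimation.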

Last edited by Nick Cox; 02 Dec 2016, 04:22.
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#3

02 Dec 2016, 04:24

I can answer the third question - the constant displayed is the average fixed effect, it is not actually in the regression equation. It is hard to say why you get different results without having access to the data. However, I don't understand why you only create 45 of 46 fixed effects and time trends? A more natural approach is to create all of them (or even better, use Stata's factor variables) and estimate the standard OLS without a constant.

As for the first question, the dependent variable might be time invariant, but the independent variables are not. So unless I'm mistaken, those observations still inform the coefficient estimates. Additionally, I'm fairly sure Stata doesn't omit observations even if they are uninformative, and I also don't see why it should.
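The point about the constant can be checked on simulated data. This is only a sketch under made-up assumptions: the panel, the variable names (id, year, x, y, alpha) and the coefficients are all invented for illustration. It shows that the estimated fixed effects from -xtreg, fe- average to zero over the sample, so the reported _cons is the overall level rather than an extra regressor.

```stata
clear
set seed 12345
set obs 20                          // 20 hypothetical panel units
generate id = _n
generate alpha = rnormal()          // unit-specific intercept
expand 10                           // 10 periods per unit
bysort id: generate year = _n
generate x = rnormal() + alpha      // regressor correlated with the unit effect
generate y = 1 + 2*x + alpha + rnormal()
xtset id year

xtreg y x, fe
scalar b_x   = _b[x]
scalar b_con = _b[_cons]
predict double u_i, u               // estimated fixed effects

summarize u_i, meanonly
scalar u_mean = r(mean)
summarize y, meanonly
scalar ybar = r(mean)
summarize x, meanonly
scalar xbar = r(mean)

display "mean of estimated fixed effects: " u_mean
display "_cons: " b_con "   ybar - b*xbar: " ybar - b_x*xbar
```

The reported constant coincides with the overall mean of y minus the slope times the overall mean of x, which is what one gets when the fixed effects are normalized to average zero.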
Tom Englaro

Join Date: Nov 2016

Posts: 13
#4

02 Dec 2016, 04:45

Nick Cox, as per your recommendation I replaced "drop" by "omit" in my initial post.

As to your question, to be honest I have not seen it documented anywhere. I assumed Stata would operate in the same way as it does when there are missing values for certain variables, but I now see that my assumption was wrong.

I guess I have to rephrase my question then. Stata does certainly omit fixed effects when there is perfect collinearity between them and other regressors. Hence, shouldn't it proceed in the same way if there is perfect collinearity between fixed effects and the dependent variable?

Last edited by Tom Englaro; 02 Dec 2016, 04:55.
Tom Englaro

Join Date: Nov 2016

Posts: 13
#5

02 Dec 2016, 04:51

Jesse Wursten I was writing my answer to Nick's post when you replied so I saw it too late, sorry about that!

I only included 45 as I thought that one should always include one less dummy variable than there are groups. (Same with female and male where you also just include one).

As to the omission of observations, I got that one wrong but the question remains why Stata does not omit the fixed effects in said case.
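The "one dummy fewer than groups" rule and the "all dummies, no constant" alternative can be compared directly. The sketch below uses simulated data with invented names (id, x, y); it relies on factor-variable notation, where i.id leaves out a base level and ibn.id keeps every level.

```stata
clear
set seed 99
set obs 6                           // 6 hypothetical groups
generate id = _n
expand 10
generate x = rnormal() + id
generate y = 1 + 0.5*x + id + rnormal()

regress y x i.id                    // constant plus 5 dummies (group 1 is the base)
scalar b_base = _b[x]
regress y x ibn.id, noconstant      // all 6 dummies, no constant
scalar b_full = _b[x]
display "slope with base level: " b_base "   slope with all dummies: " b_full
```

Both parameterizations span the same set of group intercepts, so the slope on x is the same; only the interpretation of the dummy coefficients changes.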

Last edited by Tom Englaro; 02 Dec 2016, 04:55.
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#6

02 Dec 2016, 06:06

I have to admit I don't fully understand what your question is. Why would the fixed effects be perfectly collinear with the dependent variable? They might be linearly dependent if you only look at a subset of the vector (e.g. the time-invariant countries), but the full vector is not perfectly collinear.
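A small simulation illustrates this. Collinearity checks concern the right-hand-side variables only, so an outcome that never changes within some units does not lead to anything being omitted in a linear model. The data and names (id, x, y) below are invented for the sketch.

```stata
clear
set seed 2016
set obs 10                          // 10 hypothetical units
generate id = _n
expand 8                            // 8 periods per unit
generate x = rnormal()
generate byte y = runiform() < 0.5  // binary outcome
replace y = 0 if id <= 3            // outcome never varies for units 1 to 3

regress y x i.id
display "observations used: " e(N) " of " _N
display "model degrees of freedom: " e(df_m)
```

All 80 observations are used and all nine non-base dummies are estimated, even though the outcome is constant for three of the units.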
Tom Englaro

Join Date: Nov 2016

Posts: 13
#7

02 Dec 2016, 06:39

Jesse Wursten Of course, you're right. Thank you for that!

Tom Englaro

Join Date: Nov 2016
Posts: 13
#8

02 Dec 2016, 07:22

Jesse Wursten One last question as to the number of dummies used. I now included all of them as you said and decided to leave out the constant. However, according to this thread from the archives, if they don't always sum to 1 (due to an unbalanced set for example) Stata will not necessarily omit one: http://www.stata.com/statalist/archi.../msg00206.html

Now I am wondering whether it is even necessary to use the -nocons- option in my case, as my panel is also highly unbalanced. I am asking because doing so drastically changes the significance of one of my variables (see the t-statistic on temp below). On top of that, Stata omits the same fixed effects and time trends regardless of whether I use the option or not.

Code:

 
regress confrontation_incidence temp rainfalllog gdp_per_capita_lagged iccountry* icyear*,vce(cluster country1)
note: iccountry7 omitted because of collinearity
note: iccountry36 omitted because of collinearity
note: icyear7 omitted because of collinearity
note: icyear36 omitted because of collinearity

---------------------------------------------------------------------------------------
                      |               Robust
confrontation_incid~e |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------------+----------------------------------------------------------------
                 temp |  -.0046604   .0220425    -0.21   0.834    -.0491132    .0397925
          rainfalllog |  -.0189373   .0256728    -0.74   0.465    -.0707114    .0328368
gdp_per_capita_lagged |  -5.90e-09   1.00e-06    -0.01   0.995    -2.02e-06    2.01e-06



regress confrontation_incidence temp rainfalllog gdp_per_capita_lagged iccountry* icyear*,vce(cluster country1) nocons
note: iccountry7 omitted because of collinearity
note: iccountry36 omitted because of collinearity
note: icyear7 omitted because of collinearity
note: icyear36 omitted because of collinearity

--------------------------------------------------------------------------------------
                      |               Robust
confrontation_incid~e |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------------+----------------------------------------------------------------
                 temp |   .0286975   .0129728     2.21   0.032     .0025353    .0548597
          rainfalllog |   .0221812   .0218853     1.01   0.316    -.0219547    .0663171

Any recommendations as to what's the better solution here?
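One way to see when the -nocons- option matters is to check whether the dummies sum to one for every observation in the estimation sample. The sketch below is not based on the data in this thread: it simulates an unbalanced panel with invented names (id, x, y, d1 to d5) and deliberately leaves one unit without its own dummy, which is only an assumed scenario.

```stata
clear
set seed 7
set obs 5                           // 5 hypothetical units
generate id = _n
expand 12
drop if runiform() < 0.2            // make the panel unbalanced
generate x = rnormal() + id
generate y = 3 + x + 2*id + rnormal()

tabulate id, generate(d)            // creates d1 ... d5
drop d5                             // mimic a unit that has no dummy of its own
egen byte dsum = rowtotal(d1-d4)
tabulate dsum                       // 0 for unit 5, 1 for all other units

regress y x d1-d4                   // the constant serves as unit 5's intercept
scalar b_cons = _b[x]
regress y x d1-d4, noconstant       // unit 5's intercept is forced to zero
scalar b_nocons = _b[x]
display "with constant: " b_cons "   without constant: " b_nocons
```

When some observations have all dummies equal to zero, dropping the constant changes the model and hence the slope. When the dummies sum to one everywhere, the constant is redundant and the slope is unaffected by -nocons-.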


Jesse Wursten

Join Date: Jan 2016

Posts: 915
#9

02 Dec 2016, 08:44

Ah, in the case of unbalanced panels my story is no longer true (I think so anyway). Personally, I would avoid defining the trends and dummies yourself and instead use Stata's factor variables.

Code:

use http://www.stata-press.com/data/r14/abdata.dta, clear
reg n w k ys id#c.(year) i.id
xtreg n w k ys i.id#c.(year), fe
reghdfe n w k ys, absorb(id#c.(year) id)

The three regressions above should all give the same result. Can you check if the same is true when you use your own data? The last command (reghdfe) is my favourite, because it doesn't display the fixed effects/trend coefficients and it's crazy fast.
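If the example dataset cannot be loaded, the same comparison can be sketched on simulated data without a network connection. Everything in the block is invented for illustration (the panel, the names id, year, x, y, slope, and the coefficients). Only -regress- and -xtreg- are used, since reghdfe is a community-contributed command that has to be installed separately (for example with ssc install reghdfe).

```stata
clear
set seed 314
set obs 8                           // 8 hypothetical units
generate id = _n
generate slope = 0.2*rnormal()      // unit-specific trend
expand 12
bysort id: generate year = 2000 + _n
drop if runiform() < 0.25           // make the panel unbalanced
generate x = rnormal() + 0.1*id
generate y = 1 + 0.5*x + id + slope*(year - 2000) + rnormal()
xtset id year

regress y x i.id i.id#c.year        // dummies and unit-specific trends by hand
scalar b_reg = _b[x]
xtreg y x i.id#c.year, fe           // fixed effects absorbed, trends explicit
scalar b_xt = _b[x]
display "regress: " b_reg "   xtreg, fe: " b_xt
```

With the same units and the same trends in both commands, the slope on x should agree up to numerical precision, also in an unbalanced panel.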
Tom Englaro

Join Date: Nov 2016

Posts: 13
#10

02 Dec 2016, 09:13

Thank you for the code but Stata told me that the file was not in Stata format. I then went onto the site and downloaded it to my computer. Same problem. (Strange seeing as it's a .dta file)

I have decided to go with reg anyways but don't know if I should include a constant or not. I kind of have a preference for the solution with the constant though.
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#11

02 Dec 2016, 09:21

It's probably because it's a Stata 14 .dta file. The dataset is probably shipped with your edition too; look for abdata in help q_cross. You should include a constant, but it's still worrying that you got different results from reg than from xtreg, which indicates that something is wrong with your variables somewhere.

Tom Englaro

Join Date: Nov 2016
Posts: 13

#12

02 Dec 2016, 10:32

Well if I use a constant and all 46 dummies, then the differences are minimal as you can see here:

Code:

quietly regress confrontation_incidence temp rainfalllog gdp_per_capita_lagged iccountry* icyear*,vce(cluster country1)
est store eq1
quietly xtreg confrontation_incidence temp rainfalllog gdp_per_capita_lagged icyear*,fe vce(cluster country1)
est store eq2
esttab eq1 eq2,b(4) p r2 starlevels(* 0.10 ** 0.05 *** 0.001) title(Main climate indicators)

Main climate indicators
--------------------------------------------
                      (1)             (2)   
             confrontat~e    confrontat~e   
--------------------------------------------
temp              -0.0047         -0.0045   
                  (0.834)         (0.838)   

rainfalllog       -0.0189         -0.0188   
                  (0.465)         (0.463)   

gdp_per_ca~d      -0.0000         -0.0000   
                  (0.995)         (0.995)

The results got worse when I started to use only 45 dummies as shown here:

Code:

Main climate indicators
--------------------------------------------
                      (1)             (2)   
             confrontat~e    confrontat~e   
--------------------------------------------
temp               0.0026         -0.0015   
                  (0.911)         (0.945)   

rainfalllog        0.0082         -0.0179   
                  (0.813)         (0.482)   

gdp_per_ca~d       0.0000          0.0000   
                  (0.901)         (0.912)

As far as my data go I have specified them based on the .dta file of a much cited paper. It's publicly accessible so here's the link: https://dataverse.harvard.edu/datase...7910/DVN/28109
And then just click on the stata .dta file.

I managed to open abdata just like you said. Their fixed effects variable "id" is specified the same way as my "iccountry*" variable, so the problem should not stem from there. As far as their time trends go, I have a problem identifying which would reflect a common time trend and which a country-specific time trend, which is what I would like to have.
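On the difference between a common and a unit-specific trend: in factor-variable notation, c.year alone is a common trend, while i.id#c.year gives each unit its own slope on year. Unlike a dummy, a unit-specific trend is not redundant given the constant, so leaving one out restricts that unit's trend to zero. The sketch below uses simulated data with invented names (id, year, x, y, tr2 to tr6) to show the effect; it is not a diagnosis of the data in this thread.

```stata
clear
set seed 42
set obs 6                           // 6 hypothetical units
generate id = _n
expand 15
bysort id: generate year = 1999 + _n
generate x = rnormal() + 0.1*(year - 2000)   // regressor that itself trends
generate y = 0.5*x + 0.5*id*(year - 2000) + id + rnormal()

regress y x i.id i.id#c.year        // a separate trend for every unit
scalar b_all = _b[x]

forvalues g = 2/6 {
    generate double tr`g' = (id == `g') * year   // hand-made trends for units 2 to 6 only
}
regress y x i.id tr2-tr6            // unit 1 is left without a trend
scalar b_drop = _b[x]
display "all trends: " b_all "   one trend left out: " b_drop
```

Leaving out one dummy when a constant is present is harmless, but leaving out one trend changes the model, and the slope on x moves accordingly.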


Tom Englaro

Join Date: Nov 2016

Posts: 13
#13

02 Dec 2016, 13:25

Jesse Wursten Unfortunately I have to write you another post as it won't let me edit my previous one. So I have played around a little bit and I have come to the conclusion that the way I specified the fixed effects or the trends is not at fault, and here's why:

First, as seen above, the problem is basically non-existent when I use all 46 dummies no matter how many/few controls I use.
Second, as soon as I drop one control the problem appears but only in those specifications where I use up to two control variables (in addition to my climate variables of interest). As soon as I add more, the coefficients are identical again for all the variables.

If you know why this is the case, I'd appreciate your input. If not, then I'll stick to my initial specification.

Thank you again for your help and enjoy your weekend,
Tom
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#14

05 Dec 2016, 04:16

I think it's alright the way it is. Depending on how big the data is, you might want to look into Common Correlated Effects models (Pesaran, 2006; Bai, 2009), which make your estimation more robust to unobserved common correlated effects. These methods are relatively new though, so they are not something that will usually be expected.

Pesaran, M. Hashem. 2006. “Estimation and Inference in Large Heterogeneous Panels with a Multifactor Error Structure.” Econometrica
Bai, Jushan. 2009. “Panel Data Models With Interactive Fixed Effects.” Econometrica
