IV regression with fixed effects; Warning: variance matrix is non symmetric or highly singular

Konstantina Maragkou

Join Date: Dec 2017

Posts: 12
#1


13 Oct 2019, 12:12

Dear Statalist experts

I am running an IV regression with a binary dependent variable and primary school fixed effects in Stata 16. The command has the form:

ivregress 2sls y (endogenousvar = instrumentalvar) $controls i.primaryschool, first cluster(primaryschool)

Stata returns "Warning: variance matrix is non symmetric or highly singular". I get the same warning with vce(robust), with cluster(primaryschool), and with no standard-error option at all. From an older Statalist post, I understand that the cause may be that many primary schools contribute only one observation. However, the non-IV version of the same regression runs without any problem. My model needs both the IV and the fixed effects, so any suggestion on how to overcome this would be much appreciated.

Thanks a lot in advance
Konstantina
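One way to check the singleton-school conjecture mentioned above is to count how many schools contribute exactly one observation. The following is only a minimal sketch on simulated data: the variable name primaryschool follows the post, while n_in_school, one_per_school and the data themselves are hypothetical.

```stata
* Simulated stand-in for the real data (hypothetical values)
clear
set seed 1
set obs 500
gen primaryschool = ceil(500*runiform())

* Number of observations in each school
bysort primaryschool: gen n_in_school = _N

* Flag one observation per school so each school is counted once
egen byte one_per_school = tag(primaryschool)

* Total number of schools
count if one_per_school
scalar n_schools = r(N)

* Number of schools with a single observation
count if one_per_school & n_in_school == 1
scalar n_single = r(N)

display "schools: " n_schools "   singleton schools: " n_single
```

On real data, only the lines from bysort onward would be needed. A large share of singleton schools means many dummies that each fit one observation perfectly.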
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2168
#2

13 Oct 2019, 14:19

Konstantina: Is this a cross-sectional data set, with students grouped by school?
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2168
#3

13 Oct 2019, 14:25

If so, I think you'll be better off "tricking" Stata into thinking it is a panel data set so that you can use xtivreg, which should be more stable because it uses within-school deviations rather than putting in lots of dummy variables.
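A minimal sketch of this suggestion, using Stata's panel IV command xtivreg on simulated data. All variable names (school, y, w, x, z) and the data-generating process are hypothetical. The vce(cluster ...) option is assumed to be accepted by your release of xtivreg; if it is not, drop the option.

```stata
* Simulated data: 200 schools with 10 students each (hypothetical)
clear
set seed 12345
set obs 2000
gen school = ceil(_n/10)

* School-level unobserved effect, constant within school
gen a0 = rnormal()
bysort school: gen a = a0[1]

gen z = rnormal()
gen u = rnormal()
gen x = rnormal() + 0.5*a
* w is endogenous because it depends on u, which also enters y
gen w = 0.8*z + 0.5*u + a + rnormal()
gen y = 1 + 0.5*w + 0.3*x + a + u

* Declare the school as the "panel" unit; no time variable is needed
xtset school

* Fixed-effects (within) 2SLS: w is instrumented by z
xtivreg y x (w = z), fe vce(cluster school)
```

With the variables from the original post, the last two lines would become xtset primaryschool followed by xtivreg y $controls (endogenousvar = instrumentalvar), fe.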
Konstantina Maragkou

Join Date: Dec 2017

Posts: 12
#4

14 Oct 2019, 10:01

Dear Jeff

Thank you so much for your response. This seems to work, but it creates another "problem". When I estimate the (non-IV) model with dummy variables instead of the fe specification, the R-squared is much higher. For my models it is important to show the R-squared value and how much it changes across specifications. Is there a way to adjust the estimation so that the FE specification reports the same R-squared as the dummy-variable specification?

P.S. my dataset is a panel dataset
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2168
#5

14 Oct 2019, 13:08

Konstantina: I can't endorse reporting the R-squared for either the least squares dummy variable regression or its IV extension. Those R-squareds are almost always very large because most of the variation is across individual units. They tell you nothing about how well X explains Y. Why should we get credit for simply putting in cross-sectional dummies?

Nothing important can hinge on those R-squareds. From what I can tell, you're doing some sort of interesting causal analysis of a policy intervention. Don't muddy it up by reporting a misleading R-squared. If you must report an R-squared, use the within R-squared in both cases, although that has its own problems. In particular, the IV R-squared could be bigger than the within R-squared for regular fixed effects. This cannot happen without fixed effects, so I'm a bit unsure why that happens. Stata must not be computing it correctly.
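To illustrate the gap between the two R-squareds, here is a minimal sketch on simulated data in which the unit effect dominates the variation in y. The variable names and the data-generating process are hypothetical. areg reports the dummy-variable (LSDV) R-squared in e(r2), and xtreg, fe stores the within R-squared in e(r2_w).

```stata
* Simulated data: 200 units with 10 observations each (hypothetical)
clear
set seed 2019
set obs 2000
gen id = ceil(_n/10)

* Large unit-level effect, constant within unit
gen a0 = rnormal()
bysort id: gen a = 3*a0[1]

gen x = rnormal()
gen y = a + 0.2*x + rnormal()

* Dummy-variable (LSDV) fit: R-squared gets credit for the unit dummies
areg y x, absorb(id)
scalar r2_lsdv = e(r2)

* Fixed-effects fit: within R-squared uses only within-unit variation
xtset id
xtreg y x, fe
scalar r2_within = e(r2_w)

display "LSDV R-squared:   " r2_lsdv
display "within R-squared: " r2_within
```

Both commands give the same coefficient on x. Only the R-squared differs, because the LSDV version counts the variation absorbed by the dummies as explained.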

You might do this "by hand" to see what's happening. Assuming no missing data, you can obtain the time-demeaned variables by hand. In a simple case with one control x, one endogenous variable w, and one IV, z:

Code:

* Within-unit means and deviations from them for each variable
egen ybar = mean(y), by(id)
gen ydd = y - ybar
egen wbar = mean(w), by(id)
gen wdd = w - wbar
egen xbar = mean(x), by(id)
gen xdd = x - xbar
egen zbar = mean(z), by(id)
gen zdd = z - zbar
* OLS and 2SLS on the demeaned data, clustering by unit
reg ydd wdd xdd, vce(cluster id)
ivregress 2sls ydd (wdd = zdd) xdd, vce(cluster id)

These remove the within-unit time averages, and the R-squareds from these regressions seem the best to me.
Konstantina Maragkou

Join Date: Dec 2017

Posts: 12
#6

15 Oct 2019, 03:16

Dear Jeff

I can't thank you enough for the time you dedicate to responding to me. Let me explain my model in more detail so that you can see why I need the R-squared value. I am estimating a peer effects model for a single cohort of secondary school students. The individual outcome (a dummy variable) is on the left-hand side of the equation, and peers' average outcome is on the right-hand side together with a detailed vector of individual controls. The peer group consists of all the other students in an individual's secondary school. My dataset lets me observe these individuals from primary school onward, so I can construct a pre-determined peer characteristic, based on the secondary school peers who attended a different primary school from the individual, and use it as an instrument for the secondary school peers. This way I can overcome correlated effects (common shocks that individuals and their peers experience at the same time) and the reflection problem (an individual affects their peers just as the peers affect the individual).

Having dealt with correlated effects and the reflection problem, I am still left with selection into secondary schools, which would bias the peer estimate. As I have a single peer observation per school, I cannot use secondary school fixed effects, which would be the ideal way to account for that selection. So my identification strategy relies on primary school fixed effects and a vector of control variables for the secondary school. To check whether this strategy gives peer estimates that are as close to unbiased as possible, I thought I could compare two models. The first has the individual outcome on the left-hand side and, on the right-hand side, a vector of individual characteristics plus primary and secondary school fixed effects (using dummy variables). The second replaces the secondary school fixed effects with a vector of secondary school characteristics (including the peer measure). Comparing the R-squared of the two models would then tell me how much of the variation in aspirations is explained by the secondary school controls relative to the secondary school fixed effects.

The value of the R-squared is very important for my identification strategy, which is why I would like to use the R-squared from the dummy-variable regression rather than the one from the fe specification. I still have doubts about the validity of my method, so I would greatly appreciate your advice on whether comparing the R-squared of the two models is an effective way to understand how much variation the secondary school controls capture relative to the secondary school fixed effects.

Thank you so much in advance for all your help.