  • Variable omitted for collinearity

    Hello,

    I am running an OLS regression on a very large dataset. To save time, I am running the regression on a sample of 2.5 million observations out of the 59 million in total.

    There are many variables in the model (circa 1,000), and I am using an iterative t-selection process to obtain the final set of variables. Once I have a final list of variables, I re-run the model on the full dataset of 59 million observations.

    My problem is this: I have a series of dummy variables for geographic areas, and I have omitted one area to serve as the reference group in the regression. However, Stata drops one of the remaining areas due to collinearity. This happens only in the sample dataset.

    I have checked this geographic variable against the other 190 geographic variables: it is 1 only where the others are 0 as expected. All observations are covered by these variables i.e. all observations have a value of 0 or 1. There is no overlap.

    I have tried to identify what is causing the collinearity that makes Stata drop the variable at the start of the regression. With so many variables in the model, it is difficult to look at all of them at once, and I think VIF will only work on variables that have not been dropped from the model?

    I have noticed there are a couple of variables that are entered into the model that are always 0 where the geographic variable is 1 - would that be enough for Stata to drop the geographic variable?

    Is there any method I can use to find out why this geographic variable is being dropped, bearing in mind how large the dataset is and how many explanatory variables I have?

  • #2
    I can think of a few ways...

    Use the dropped variable to build a new regression model

    Try using the dropped geographic variable from the sample as the new dependent variable, with the rest of the 1000 independent variables as regressors. (The original dependent variable can be set aside; plain linear regression is fine, since getting the estimates is not the goal here.) Then see which variables get no standard error. Those should be the perfect predictors and likely the culprits.
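
    As a minimal sketch of this idea (not the poster's actual data), the example below uses the region variable in nhanes2 as a stand-in for the geographic areas; the area1-area4 dummies are names made up for the illustration. Because area1 equals 1 minus the other three dummies, regressing it on them should give an R-squared of 1, coefficients of -1 on the dummies involved, and a coefficient of essentially 0 on the unrelated variable age.

    ```stata
    webuse nhanes2, clear
    * Hypothetical stand-in for the geographic dummies: one indicator per region
    tabulate region, generate(area)
    * Treat the "dropped" dummy as the outcome; the original outcome is set aside
    regress area1 area2 area3 area4 age
    * Regressors taking part in the exact dependency carry non-zero coefficients
    display "R-squared = " e(r2)
    ```

    On the real data, the regressors with clearly non-zero coefficients in a perfect-fit regression like this are probably the ones to inspect first.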

    Hack the VIF

    VIF may still be used. Omission is triggered when the collinearity is perfect. As long as it is slightly imperfect, the variable will stay and you'll get a VIF. Try creating a new version of the dropped variable with a little noise added. Here is an example:

    Code:
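    webuse nhanes2, clear
    * Fix the seed so the random noise is reproducible
    set seed 12345

    * An exact copy is perfectly collinear, so Stata omits it
    gen sex_identical = sex
    reg age sex sex_identical

    * A slightly noisy copy is kept, so estat vif can report on it
    gen sex_jitter = sex + rnormal(0, 0.01)
    reg age sex sex_jitter
    estat vif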


    • #3
      Hi Daniel,
      It could be helpful to review the lasso command in the Stata manual.
      "lasso selects covariates and fits linear, logistic, probit, and Poisson models. Results from lasso can be used for prediction and model selection"
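
      A minimal sketch of what that might look like, assuming Stata 16 or later and using nhanes2 variables purely as placeholders rather than the original poster's model:

      ```stata
      webuse nhanes2, clear
      * Placeholder outcome and candidate covariates, chosen only for illustration
      lasso linear bpsystol age weight height, rseed(12345)
      * List the covariates that lasso kept
      lassocoef
      ```

      With around 1000 candidate variables, the covariate list would replace the three placeholders here; rseed() only makes the cross-validation folds reproducible.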
      Benedicte
