Hello,
I am running an OLS regression on a very large dataset. To save time, I am running the regression on a sample of 2.5 million observations out of 59 million in total.
There are many variables (circa 1,000) in the model, and I am using an iterative t-selection process to obtain the final set of variables. Once I have the final list of variables, I re-run the model on the full dataset of 59 million observations.
My problem is this: I have a series of dummy variables that relate to geographic areas, and I have omitted one area to serve as the reference group in the regression. However, another area is dropped by Stata due to collinearity. This happens only in the sample dataset.
I have checked this geographic variable against the other 190 geographic variables: as expected, it is 1 only where the others are 0. All observations are covered by these variables, i.e. all observations have a value of 0 or 1, and there is no overlap.
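For reference, the check described above amounts to something like the following sketch. It is written in Python with NumPy rather than Stata and runs on made-up toy data. The function name `check_dummies` and the array layout (one column per area, including the reference area) are illustrative assumptions. The `empty_areas` entry is an extra check not mentioned above: it lists any area that has no observations at all in the data being examined.

```python
import numpy as np

def check_dummies(D):
    """Summarise a set of area dummies.

    D is assumed to be an (n_obs, n_areas) array of 0/1 values with one
    column per geographic area, including the reference area.
    """
    D = np.asarray(D)
    row_sums = D.sum(axis=1)  # number of areas flagged for each observation
    col_sums = D.sum(axis=0)  # number of observations in each area
    return {
        # every entry is exactly 0 or 1
        "is_binary": bool(np.isin(D, [0, 1]).all()),
        # observations flagged in more than one area
        "overlapping_rows": int((row_sums > 1).sum()),
        # observations flagged in no area at all
        "uncovered_rows": int((row_sums == 0).sum()),
        # column indices of areas with no observations
        "empty_areas": np.flatnonzero(col_sums == 0).tolist(),
    }

# Toy data: six observations assigned to three hypothetical areas.
areas = np.array([0, 2, 1, 2, 0, 1])
D = np.eye(3, dtype=int)[areas]  # one-hot encode the area index

report = check_dummies(D)
print(report)
```

A clean set of dummies gives zero overlapping rows, zero uncovered rows and no empty areas. Running the same summary on the sample and on the full dataset would show whether the two differ in any of these respects.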
I have tried to identify what is causing the collinearity that makes Stata drop the variable at the start of the regression. With so many variables in the model, it is difficult to implement an approach that looks at all variables at once, and I think VIF will only work on variables that have not been dropped from the model?
I have noticed that a couple of the variables entered into the model are always 0 where the geographic variable is 1. Would that be enough for Stata to drop the geographic variable?
Is there any method I can use to find out why this geographic variable is being dropped, bearing in mind how large the dataset is and how many explanatory variables I have?
