  • Stata is not dropping a binary variable from a set of binary variables to avoid collinearity

    Hi Statalist,

    I am using the IPUMS CPS dataset for 2012q1-2023q2. I have defined an industry variable, which amounts to a set of 12 different dummy variables. The primary goal is to look at the technology sector, so tech = 1 if industry == technology and tech = 0 if industry != technology. I am using the following line of code:
    Code:
     reghdfe employed i.qtime i.qtime#tech i.educ_dum i.marst_dum age ageSQ female nchild i.industry [aweight = wtfinl] if Race == 1, absorb(statecensus) cluster(time)
    I intend to chart employment trends in the technology industry by race (Race: 1 White, 2 Black, 3 Hispanic, 4 Mixed/Other) by quarter over my timeframe. When I run the above code, Stata includes 2012q1#1.tech instead of dropping it, although it does drop the overall quarter effect for 2012q1 (qtime == 2012). I have checked whether it drops another i.qtime#tech dummy instead, but it does not. For the i.industry dummies, Stata drops the first one by default and also drops the technology dummy. But when I run:
    Code:
     reghdfe employed i.qtime##tech i.educ_dum i.Race i.marst_dum age ageSQ female nchild i.industry [aweight = wtfinl] if Race == 1, absorb(statecensus) cluster(time)
    Stata does drop 2012q1#1.tech and also includes tech in the explanatory variables because of the double ##. It again drops the first industry by default and drops technology from the industry fixed effects.

    I am confused about why the first regression is not working as I would expect. Mainly, I want to know why Stata is not dropping 2012q1#1.tech. My dataset has over 5M observations over this period.

    Edit 1: Employed is a dummy variable that indicates whether individual i residing in state s is employed at time t.
    Last edited by Yatharth Garg; 19 Aug 2023, 10:33.

  • #2
    What Stata is doing is one correct way of handling your model. You are asking for trouble by setting up a model with collinearity baked in from the start. You have engineered the model to include collinearity between tech and industry. When you do that, Stata will break that collinearity in any way it chooses, and you do not have control over that. It can do it by eliminating any of the variables involved in the collinearity. The results are, for modeling purposes, equivalent regardless of the particular variables Stata uses to break the collinearity. You will get the same values from -predict-. All model-level statistics will be the same. And all coefficients of variables not implicated in the collinearity will also be the same.

    The coefficients of the variables that are implicated in the collinearity depend on which variables are omitted (or otherwise linearly constrained) to break the collinearity. Consequently, they are not valid estimates of any effects: such effects are inherently unidentifiable in a model that includes collinear variables anyway. If estimating the effects of one of these variables is necessary for your research goals, then you have to redesign your model so that it does not include other variables with which it is collinear. In particular, you cannot estimate the effect of technology in any model that also includes a complete series of industry indicators ("dummies"). You have two choices, really: get rid of the industry indicators, or, if they are really necessary for other reasons, get rid of the technology variable and just use the coefficient of the industry indicator that corresponds to technology for your tests and technology-effect estimation.
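
    The claim that -predict- gives the same values either way can be checked on a toy example. This is only a sketch using Stata's shipped auto data, not the CPS extract: rep78 is a hypothetical stand-in for industry, and the tech variable created here flags one of its categories, just as tech flags one industry in the original model.

    ```stata
    * Toy illustration with the shipped auto data (not the CPS extract).
    sysuse auto, clear

    * Hypothetical stand-in: rep78 plays the role of industry, and
    * tech flags one of its categories.
    generate byte tech = (rep78 == 5) if !missing(rep78)

    * Model A: tech plus the full set of category indicators (collinear by design).
    regress price mpg tech i.rep78
    predict double xb_a if e(sample)

    * Model B: the category indicators alone.
    regress price mpg i.rep78
    predict double xb_b if e(sample)

    * The fitted values agree even if the reported coefficients differ.
    assert reldif(xb_a, xb_b) < 1e-8 if e(sample)
    ```

    Whichever term Stata flags as omitted in Model A, the assert should pass, because both specifications span the same set of regressors.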


    • #3
      Thanks for your answer, Prof. Schechter. I understand the collinearity issue for the industry variable, which I am not as worried about. My main concern is that when I run the first model, the interaction term between the technology dummy and 2012Q1 is included in my results. Normally, Stata should drop this term by default, as it is the first in a series of interaction terms that would otherwise be perfectly collinear. I understand that the technology dummy would (and should) be dropped regardless. My research is mainly focused on the interaction terms, which will allow me to map out employment effects on technology industry workers by year-quarter.

      Again, when I run the second model, Stata drops the interaction term between 2012Q1 and the technology dummy. So I am puzzled as to why this does not happen when I run the first model.


      • #4
        "Again, when I run the second model, Stata drops the interaction term between 2012Q1 and the technology dummy. So I am puzzled as to why this does not happen when I run the first model."
        The two models are actually the same model. This is because in the ## version, the ## interaction gets expanded into the two "main" effects and the # interaction. But the first model contains all of those terms other than the "tech" main effect, which you omitted. But because the tech main effect is collinear with i.industry, Stata omitted it from the ## model anyway. Stata is just choosing a different subset of the collinear variables to get to a model with identifiable coefficients.
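
        The dependency can be written out explicitly. As a sketch in notation that does not appear in the posts above: let Q_q be the indicator for quarter q, T the tech dummy, and D_k the indicator for industry k, with k* denoting the technology industry. Every observation falls in exactly one quarter, so:

        ```latex
        \sum_{q} Q_q = 1, \qquad T = D_{k^*}
        \quad\Longrightarrow\quad
        \sum_{q} Q_q \, T \;=\; T \;=\; D_{k^*}.
        ```

        The full set of i.qtime#tech interactions therefore sums to the technology industry indicator, so at least one of those columns has to go. Omitting the technology industry indicator and keeping every interaction, or keeping it and omitting the 2012q1 interaction, are both admissible ways to break the tie.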

        The most important thing is that both of these versions of the model fail to identify a tech effect in 2012Q1. Where 1.tech#2012Q1 is omitted, it is being arbitrarily constrained to a zero effect. Where 1.tech#2012Q1 is shown, that coefficient is just an artifact of which other variables were omitted in order to resolve the collinearity; it is not the effect of anything.

        This is linear algebra. You cannot include i.tech and a complete set of industry indicators in the same model and reach any valid conclusions about tech. So you need to step away from the keyboard and figure out what variables you actually need in order to answer your research questions.
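
        The equivalence of the two specifications can be checked on simulated data. This is a minimal sketch, assuming Stata 14 or later for runiformint(). The variable names mirror the thread, but the data, the four industries, and the three periods are made up, and plain regress stands in for reghdfe:

        ```stata
        * Simulated data only; nothing here comes from the CPS extract.
        clear
        set seed 12345
        set obs 2000

        generate byte industry = runiformint(1, 4)   // hypothetical: 4 industries
        generate byte tech     = (industry == 4)     // industry 4 plays "technology"
        generate byte qtime    = runiformint(1, 3)   // hypothetical: 3 periods
        generate double y      = rnormal() + 0.5*tech*(qtime == 3)

        * First specification: # interaction without the tech main effect.
        regress y i.qtime i.qtime#tech i.industry
        predict double xb1

        * Second specification: ## expands to both main effects plus the interaction.
        regress y i.qtime##tech i.industry
        predict double xb2

        * Same fitted values: the two regressor sets span the same space.
        assert reldif(xb1, xb2) < 1e-8
        ```

        The omitted terms and the individual interaction coefficients can differ between the two outputs while the predictions match, which is the sense in which the two are the same model.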
