  • reasons for adding control variables

    I have often wondered why we add control variables to our models. The reason I give in my papers is that these variables are related to the independent variable, to the dependent variable, or to both. Is that right?

    I have also observed that some people like to add control variables in groups: they first regress the dependent variable on the independent variable alone, and then add the controls group by group to see how they change the relationship. That makes sense to me when the relationship starts out strong and significant, but shrinks in magnitude and loses statistical significance once the controls are added. Sometimes, though, I see people state that controls can suppress the relationship: the relationship between the independent and dependent variables becomes significant only after the controls are added. I don't understand this.

    Thanks.

  • #2
    Although the term "control variable" is widely used, it is an abuse of language. Unless you are doing an experiment, or using a matched design, you do not and cannot actually "control" these variables' effects. In observational studies all you can do is include them in the model to adjust for them. They are properly referred to as adjustment variables, as potential (or actual) confounding variables, or simply as covariates.

    If a variable is associated with both the predictor of interest and the outcome, then the predictor coefficient can change considerably when the variable is added to the model. Whether the predictor coefficient grows or shrinks, or even changes sign altogether, is generally hard to predict: pretty much anything is possible. This phenomenon is known in my discipline, epidemiology, as confounding, and in econometrics is usually referred to as omitted variable bias. That last term, too, is an abuse of language because sometimes it is the model that includes the confounding variable that is biased! It depends on what you are trying to estimate. The models with and without the confounding variable(s) actually both provide unbiased estimates, but they are estimating different things. One is a crude effect and the other is the adjusted effect. Which one is of interest depends on the specific research question.
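    To make the crude-versus-adjusted distinction concrete, here is a minimal simulated sketch (all variable names are invented for illustration; z plays the role of the confounder):
    Code:
    clear
    set seed 2018
    set obs 1000
    gen z = rnormal()              // the confounding variable
    gen x = 0.8*z + rnormal()      // predictor, correlated with z
    gen y = x + 2*z + rnormal()    // outcome depends on both x and z

    regress y x      // crude effect: coefficient on x well above 1
    regress y x z    // adjusted effect: coefficient on x close to 1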

    The change in coefficient that occurs when new variables are added to the model is a version of Simpson's paradox. Wikipedia has a really good page explaining Simpson's paradox and I recommend you read it. The examples given there use discrete variables only, but the principles involved are exactly the same when continuous variables are involved. (With continuous variables, it is sometimes called Lord's paradox.)

    Sometimes a variable is unrelated to the predictor but is associated with the outcome variable. In that case, adding it to the model does not systematically change the regression coefficient of the predictor: it will generally remain about the same (provided missing data does not change the estimation sample). But sometimes adding the variable will absorb some of the "noise" (i.e. residual variance) in the outcome, resulting in more precise (i.e. lower standard error) estimation of the predictor coefficient.
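    A quick sketch of that precision gain, again with invented names (w is unrelated to x but affects y):
    Code:
    clear
    set seed 2018
    set obs 1000
    gen x = rnormal()
    gen w = rnormal()              // unrelated to x, but affects y
    gen y = x + 2*w + rnormal()

    regress y x      // coefficient on x near 1, larger standard error
    regress y x w    // coefficient still near 1, smaller standard error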

    Adding to the model a variable that is related to the predictor but unrelated to the outcome is usually a bad idea: it doesn't improve anything about the model, but it may result in less precise (i.e. higher standard error) estimation of the coefficient of the predictor as it "steals" variance from the predictor itself. (Added: Do not, however, confuse this with the less common situation where a variable is added to the model for the purpose of examining the joint effect of the original predictor and the variable. This can be one way of strengthening a predictive model, as the pair of predictors may provide better information than either one alone.)
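    And the mirror image of that, sketching the "stolen variance" problem (v is related to x but has no effect of its own on y):
    Code:
    clear
    set seed 2018
    set obs 1000
    gen v = rnormal()
    gen x = 0.8*v + rnormal()      // v is correlated with x ...
    gen y = x + rnormal()          // ... but has no direct effect on y

    regress y x      // precise estimate of x's coefficient
    regress y x v    // similar coefficient, larger standard error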

    As for adding predictors in groups, this is, indeed, a common practice. Sometimes there are good reasons for doing it, particularly if the different parameters estimated by the different models are all of interest. Too often, though, I see papers where this is done for no apparent reason and just serves to clutter the paper with additional pointless numbers.
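    When there is a good reason for it, one tidy way to present such a sequence is to store the nested models and tabulate them side by side (a sketch with invented names):
    Code:
    regress y x
    estimates store m1
    regress y x c1 c2              // add the first group of controls
    estimates store m2
    regress y x c1 c2 c3 c4        // add the second group
    estimates store m3

    estimates table m1 m2 m3, se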

    tl;dr It's pretty complicated.
    Last edited by Clyde Schechter; 02 Apr 2018, 18:30.



    • #3
      Hi Clyde,
      Thank you for your reply. I have read about Simpson's paradox on Wikipedia. I actually have a situation that I am not sure is related. In the results of my xtlogit random-effects model, there is a positive relationship between A and B. However, when A is interacted with gender, the relationship is not significant for men and is negative for women. I am not sure how I am supposed to interpret this.

      Thanks!



      • #4
        So, the first thing you need to check is whether you are doing the estimations on the same sample. When you add a new variable to a model, if that variable has missing values in observations that were included in the original regression, then those observations will be excluded from the new analysis. If the models are estimated on different samples, you can't really compare them at all. The simplest check is to look at the N's for both models: if they're the same, then the estimation samples are the same.
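        A more direct check than comparing N's is to compare the e(sample) indicators after each estimation (a sketch with invented variable names):
        Code:
        regress y x
        gen byte insample1 = e(sample)
        regress y x z
        gen byte insample2 = e(sample)

        count if insample1 != insample2    // 0 means the samples are identical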

        Assuming that you are in fact estimating both models on the same sample, this would be an instance of Simpson's paradox. In most contexts where the extra variable is sex, we would be more interested in the sex-specific results than in the results not adjusted for sex. That's because sex is usually easily ascertained and, hence, the predictions of the interaction model can always be readily applied. If the additional variable were something that is difficult to observe in ordinary conditions (e.g. a biochemical marker that can only be identified with expensive testing), then one might want to rely instead on the model that omits it.
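        In Stata terms, the sex-specific effects from the kind of model described in #3 might be obtained along these lines (a sketch only; it assumes the data are already xtset and uses the hypothetical names A, B, and gender):
        Code:
        xtlogit B c.A##i.gender, re
        margins gender, dydx(A) predict(pu0)    // effect of A by gender, on the
                                                // probability scale with the random
                                                // effect set to zero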

        The situation you are in is somewhat similar to the statistical issues raised in the well-known study of alleged sex discrimination in graduate admissions at the University of California, Berkeley. If you just looked at overall admission rates to graduate school by sex, the male acceptance rate was higher than the female one. But if you then broke it down by department, it turned out that within each department female admission rates were as high as, or often higher than, males'. The females, however, were applying in much larger numbers to the departments that admit the smallest proportion of their applicants, i.e. the most competitive ones. Here department plays the role that sex plays in your situation. So there is no inconsistency in saying that female applicants were less likely to be admitted to the graduate school overall, while in any specific department there was either no difference or females had higher acceptance rates. Which model would you use? It depends. If a woman were just asking about her chances of being admitted and didn't have any particular department in mind, you would tell her that her chances are lower than a male's. But if she is interested in some specific department, you would tell her that her chances are as good as or better than those of males applying to the same department.



        • #5
          Thank you again, Clyde. I conducted listwise deletion, so the sample sizes are the same. I now know how to interpret my results. I appreciate your help.



          • #6
            I wonder what the difference is between using a continuous covariate and a categorical covariate in a logit regression. For instance, what changes if I include income as a continuous variable versus recoding it into a categorical variable?



            • #7
              Well, if you use a large number of categories, each of which is narrow, it makes little difference, although the categorical-variable approach increases the degrees of freedom used by the model and makes the calculations more cumbersome. But usually when people talk about categorizing a continuous variable they mean using a small number of categories, say 5 or fewer.

              This is usually a bad idea. There are a number of drawbacks. The boundary between categories becomes a place where the model "jumps." With a continuous variable, 1 is 1, 2 is 2, and 1000 is 1000; 1 and 2 are close to each other but distant from 1000. But with a categorical representation, say with categories 0-10000, 10001-50000, 50001-100000, and 100001 and above, strange things start to happen. This representation says that a person with an income of 50,001 currency units is really the same as a person with an income of 100,000 currency units, but is radically different from a person with an income of 50,000 units. That's clearly wrong for almost any purpose. It also means that you lose any possibility of capturing how your outcome varies with income within those categories. And it means that any measurement error when the true value of the continuous variable is close to one of the cutpoints has drastic consequences instead of small, graded ones.
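              A hypothetical coding of those four categories makes the jumps concrete (outcome and income are placeholder names):
              Code:
              gen byte income_cat = 1 if inrange(income, 0, 10000)
              replace income_cat = 2 if inrange(income, 10001, 50000)
              replace income_cat = 3 if inrange(income, 50001, 100000)
              replace income_cat = 4 if income > 100000 & !missing(income)

              logit outcome i.income_cat    // the fitted effect jumps at each cutpoint
              logit outcome c.income        // by contrast, a smooth income effect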

              Now, here are the advantages of using a categorized version of a continuous variable:






              OK, I'm exaggerating. For example, data-privacy concerns might make you not want to have a person's actual age, height, and number of children in a data set that is going to be widely seen: those values make people more identifiable. So to protect privacy you might replace these variables with categories, to make it harder for somebody to guess who is who. But notice that this privacy advantage is precisely the consequence of a major problem: you are throwing out information. People who previously were distinguishable in the data now have the same values for their variables, so any data analysis is now starting out with less information. Consequently, any analytic results will have less precision. So yes, there is this privacy advantage, but it comes at a price.


              So, don't categorize continuous variables unless, in the real world, your outcome actually exhibits a discontinuous jump at some cutoff value. (Such situations do exist, but they are very, very uncommon.) Or if you really need to suppress information for the sake of privacy.



              • #8
                Thank you again, Clyde. I see now why it is better to use continuous covariates. But I also want to know how, technically, the "constraints" between various independent variables arise in a model. My current understanding is that if I have, for instance, both gender and age groups in my model, then they constrain each other. Or do they? But if I have continuous variables in my model, how do they "interact" with other covariates?



                • #9
                  My current understanding is that if I have, for instance, both gender and age groups in my model, then they constrain each other. Or do they?
                  I don't know what you mean by "constrain each other" here. Offhand, I cannot think of any meaning that would make this a true statement, but as I am not sure what you have in mind, I don't want to make a definite assertion.

                  But if I have continuous variables in my model, how do they "interact" with other covariates?
                  If by "interact" you mean interaction terms, it is handled the same way that discrete variables are handled in interaction terms, except that you must use the c. prefix with them. The interpretation of interaction terms including a continuous component is a bit more complicated. It would be a textbook chapter to cover that in generality. If you have a specific example in mind, post it and I'd be happy to comment on it.



                  • #10
                    I guess I am thinking of the covariates as different vectors, but I don't really understand the calculation process of the model.



                    • #11
                      I guess I can put it this way. Say I am examining giving to charities and mental health: is it different to compute a charities/income (percentage) variable, or to include them separately, using income as a bivariate?



                      • #12
                        It's not clear to me what you have in mind here. Are you talking about mental health as an outcome and giving to charities as a predictor, and then wondering about how income plays a role in it? So, putting it in Stata terms, are you talking about
                        Code:
                        regress mental_health charities_as_percent_income
                        
                        // AS OPPOSED TO
                        
                        regress mental_health charities income
                        And when you refer to income as a "bivariate", do you mean dichotomizing income into "high" and "low", or something like that?

                        Let me assume that these are correct interpretations of your questions and I'll respond accordingly.

                        As I think I made clear in #7, there are very few circumstances under which I would endorse turning a continuous variable like income into a dichotomy. This definitely is not one of them.

                        Whether it is better to construct a ratio of charities to income, or to include charities and income as separate variables, depends on the substance. I don't know anything about this content area, but I can discuss some of the modeling issues it raises.

                        If you model -regress mental_health charities income-, you are implying that charitable giving and income can each, possibly independently, affect mental health. The effect of a given difference in income will be the same regardless of the level of charities (assuming charities is unchanged). By contrast, if you model -regress mental_health charities_as_percent_income-, you imply that the two factors are not separately associated with mental health; the association arises only through their ratio. So the effect on mental health associated with a particular change in income would depend on what the level of charities was at the start.

                        Which of these is more realistic in the world, I do not know. But there may be a literature on it you can consult. Or, if not, I would explore this graphically in the data. And I might try both models to see which is a better fit.
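                        A hedged sketch of both suggestions, reusing the variable names from the code above:
                        Code:
                        lowess mental_health income    // graphical exploration

                        regress mental_health charities income
                        estat ic    // AIC/BIC for the separate-terms model
                        regress mental_health charities_as_percent_income
                        estat ic    // AIC/BIC for the ratio model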

                        I should also point out that there is a "compromise" position that incorporates both of these models:

                        Code:
                        gen inverse_income = 1/income
                        regress mental_health c.charities##c.inverse_income
                        By creating a ## interaction of charities with 1/income, you are entering charities, 1/income, and the charities/income ratio into the model. Consequently, if, say, income affects mental health both directly on its own and also through the charities-as-a-percentage-of-income ratio, you would capture both of these in this model.
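                        For clarity, the ## operator in the code above is just shorthand for entering the main effects and their product explicitly:
                        Code:
                        regress mental_health c.charities c.inverse_income c.charities#c.inverse_income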



                        • #13
                          Thank you very much, Clyde. I meant to write "covariates", not "bivariate".
                          I see that an interaction is a good idea, but the problem is that I am already interacting charitable giving with something else. Still, your suggestion of trying different models is a good one.
                          I recoded my income variable to income/1000. Is 1/income a better option?
                          Thanks.



                          • #14
                            The reason I suggested 1/income is because you expressed interest in modeling the charities/income percentage. So a 1/income variable interacted with charities would accomplish that. But I don't have any thoughts about whether income is directly or inversely related to the outcome variable you are modeling, so I don't have an opinion about which would be a better specification of an income effect in your model. That is, again, a content issue that I'm not able to advise you about. If there is nothing in the literature to go on, I would explore this graphically.



                            • #15
                              Thanks Clyde.
                              I would explore this graphically in the data.
                              What kind of graph do you suggest?

