
  • Regression: Results become insignificant after adding control variables

    Dear all,

    I am running a binary logistic regression with happiness as the dependent variable (happy=1, unhappy=2) for the year 1981 for the country Belgium.

    logit A170 i.X047 i.sex i.X028 age i.X011, or
    where X047 is the income scale, X028 is employment status, and X011 is the number of children.

    My variable of interest / main predictor is X047, and in this regression its results are statistically significant (number of observations: 835).
    When I add the variable marital status (which has 6 categories), my results for X047 become insignificant.
    How do I interpret this?

    Even if I ran it for all countries in 1981, what would it mean if adding a control variable changes the results?

    Thanks so much for any help.
    Kind regards,
    Olivia
    Last edited by olivia schuter; 02 Oct 2017, 08:33. Reason: reg

  • #2
    First, you should understand that the difference between statistically significant and statistically insignificant is not, itself, statistically significant. p-values can jump around all over the place. They are not measures of strength of association or effect size, and every p-value vs effect size curve has a steep part somewhere, so that a small change in effect size can throw the p-value across the arbitrary .05 threshold (or whatever threshold you are using). So the first thing you should do is stop looking at the p-values and look at the regression coefficients. Are they appreciably different in the two models? They may not be. If so, then nothing important has changed and you should just move on.
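
    For instance, a minimal sketch of how to put the two models side by side in Stata (the marital status variable name below is a placeholder, since it was not given in the thread):

    * Fit both models and store the estimates
    logit A170 i.X047 i.sex i.X028 age i.X011
    estimates store base
    logit A170 i.X047 i.sex i.X028 age i.X011 i.marital_status   // placeholder name
    estimates store with_marital
    * Show the odds ratios from both models in one table for comparison
    estimates table base with_marital, eform b(%9.3f) se

    If the odds ratios for X047 are similar in both columns, the shift across the .05 threshold is not telling you anything substantive.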

    It is quite possible, however, that there actually is a major change in the regression coefficients. There is nothing unusual or surprising about this. It is often the case that the association of a predictor with an outcome is different when you control for other variables. In fact, any kind of change is possible, including a change to a large, significant, value with the opposite sign. This is known as Simpson's paradox. The Wikipedia page on Simpson's paradox is quite good, and I recommend you read it. It presents it in the context of simple contingency tables rather than regression, but the principles and reasoning are the same.
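
    As a toy illustration of the paradox (using the well-known kidney-stone figures that are often quoted to demonstrate it), the snippet below builds a small dataset in which one group does better within every stratum yet worse in the pooled data:

    * Simpson's paradox in miniature: treat==1 has the higher success
    * rate within each stratum, but the lower success rate overall
    clear
    input byte treat byte stratum long success long fail
    1 0  81   6
    0 0 234  36
    1 1 192  71
    0 1  55  25
    end
    generate total = success + fail
    generate rate = success/total
    list treat stratum success total rate, sepby(stratum)
    * Pool over the strata: the comparison reverses
    collapse (sum) success total, by(treat)
    generate rate = success/total
    list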

    As for your particular situation, you will need to decide whether the model that includes marital status or the one that excludes it is the one that best reflects your research question. That's a substantive issue, not a statistical one. What you should not do is make that decision based on whether the results confirm or contradict your prior beliefs.

    As a complete aside, it is better programming practice to rename your variables so that they have mnemonic value. If you have to return to this work several months from now (say a reviewer asks you to make some changes or has questions requiring additional calculations), how likely is it that you will remember what X011 and the like are? It will take you unnecessary time to refresh your memory. Also, because these variables look so much like each other, it is hard to spot errors that result from using one where another should have been used. Large-scale surveys often name their variables this way because it is difficult or impossible to come up with good mnemonic names for everything. But when you work with those data, you typically use only a subset of the variables that is small enough to support distinctive mnemonic names. Renaming the variables should be one of the first steps in data management. It will save you a lot of time in the long run and, even more important, it will prevent errors and make some of the errors you do make easier to find and fix.
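
    For example, using the variable meanings given earlier in the thread:

    * Give the survey variables mnemonic names as a first data-management step
    rename A170 happiness
    rename X047 income_scale
    rename X028 employment_status
    rename X011 n_children
    label variable income_scale "Income scale (1-10)"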



    • #3
      Dear Mr Schechter,
      Thank you for the comprehensive answer. I will rename the variables! Thanks for the hint!
      I looked at the coefficients and they change by between 0.5 and 2 points (for example, from 5.2 to 4.1 after adding all control variables). Apparently it is now the variable health_status which is essential for the model: my variable of interest, income_scale (which runs from 1 to 10), has two statistically significant levels in the final model, the 4th and the 6th income step. Before, when I did not include health_status, the 5th and the 10th steps of income_scale were also significant.

      As I aim to find out whether the relation between income_scale and happiness is positive or negative, can I just use these two significant results for the 4th and the 6th income level to draw my conclusion? I mean, the effect at the 6th level is 1.5 points higher than at the 4th level. This actually indicates a positive relation, although I don't know whether two numbers are enough to draw such a conclusion (bearing in mind that the variable actually has 10 levels)... Thanks for your help.



      • #4
        Olivia:
        three remarks about your last post:
        - as Clyde excellently pointed out, hunting for <0.05 p-values is seldom rewarding: adding or removing predictors affects all the remaining predictors, so it is no wonder that the coefficients change and that what is statistically significant in one model stops behaving that way in the next. The best approach is to make your regression specification give a true and fair view of the data-generating process, guided by the literature in your research field.
        - a major concern might affect the relationship between happiness and income_scale: it is not my research field but, if it were, I would check whether happiness itself predicts higher income (or better employment status), just to rule out reverse causality (i.e., a form of endogeneity);
        - finally, it is better to start numbering the levels of a categorical variable from 0 (rather than from 1). See the short sketch below.
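
        A minimal sketch of such a recoding (relevant here because Stata's logit treats 0 as failure and any other nonzero value as success, so a 1/2-coded outcome needs recoding anyway):

        * Recode the outcome so that 0 = unhappy and 1 = happy
        generate byte happy01 = 2 - A170
        label define happy01 0 "unhappy" 1 "happy"
        label values happy01 happy01
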
        Kind regards,
        Carlo
        (Stata 18.0 SE)



        • #5
          I agree with everything Carlo has to say except:

          as Clyde excellently pointed out, hunting for <0.05 p-values is seldom rewarding
          I would say that the problem with hunting for <0.05 p-values is that it is far too often rewarding, but the reward is fool's gold.



          • #6
            Hello Olivia. If your library has this book by Mosteller & Tukey (1977), I recommend Chapter 13, Woes of Regression Coefficients.

            HTH.
            --
            Bruce Weaver
            Email: [email protected]
            Web: http://sites.google.com/a/lakeheadu.ca/bweaver/
            Version: Stata/MP 18.0 (Windows)



            • #7
              Thanks a lot for all your answers! I will keep them in mind and incorporate them in my ongoing work! Good evening.



              • #8
                Admittedly, I misunderstood Clyde's statement.
                Sorry for this.
                Kind regards,
                Carlo
                (Stata 18.0 SE)



                • #9
                  Dear Mr Lazzaro and Mr Schechter,

                  So can I still interpret the coefficients although none of them is statistically significant?


                  Thanks again.



                  • #10
                    Yes. A coefficient is a point estimate of the magnitude and direction of an effect. Whether it is "significant" or not, there is a range of uncertainty around it, given by the confidence interval. It is a common misconception that coefficients that are not "statistically significant" are somehow meaningless or indicative of the absence of any effect. That is not true. It is reasonable to present the coefficients along with their confidence intervals so you can say: our best estimate of this effect is X, with an uncertainty from Y to Z. If 0 happens to be inside the confidence interval, then the interpretation is that the current data do not enable us to confidently determine the direction of the effect, and the magnitude of the effect is, in either case, somewhat limited.
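
                    In Stata the confidence intervals are already part of the default estimation output; a minimal sketch of how to pull them out (variable names as in the original post):

                    * Odds ratios with their 95% confidence intervals
                    logit A170 i.X047 i.sex i.X028 age i.X011, or
                    * The full matrix of estimates, standard errors, and CI
                    * bounds is available afterwards in r(table)
                    matrix list r(table)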

                    But don't forget that you still have to answer for yourself the question of which regression model is the appropriate one for your research goals, the one that includes marital status or the one that leaves it out.



                    • #11
                      Okay, great, thanks for the comprehensive answer! I will figure it out for myself!



                      • #12
                        Hey again,

                        Besides happiness, I also test life satisfaction (the endogenous variable) and income.
                        Again using a binary logistic regression, I have several control variables such as age, sex, education, health, and marital status. When I include the independent variable job satisfaction, the results lose their statistical significance, and in 50% of the 112 analysed countries the previously positive relation between income and life satisfaction turns negative.
                        I have read more about confounding variables on the Internet, but I am not sure whether job satisfaction should be included in the model or whether it shouldn't, as it may have a simultaneous causality with life satisfaction. What do you say?
                        Thanks!



                        • #13
                          That's way out of my area of expertise. I'm an epidemiologist.

                          If job satisfaction lies on the causal pathway between your variables of interest and the outcome, then it should not be included in the model. On the other hand, if it is not on the causal pathway, then you would probably need to include it. But I'm in no position to say what's on the causal pathway here. It's really a content-area question, and I think it needs to be answered by an expert in your discipline.



                          • #14
                            Okay, thank you.



                            • #15
                              Olivia:
                              - you neither say what dependent variable you considered in the logistic regression, nor do you post what you typed and what Stata gave you back (as per the FAQ). Hence, I find it difficult to give useful advice;
                              - I would also check your model to rule out endogeneity: you reported education and income among your happiness predictors. The usual example is that individual ability (which lurks in the residuals) is correlated with both educational attainment and income level. If it is also correlated with happiness (something you can gather from the literature in your research field), you may have an endogeneity issue.
                              Kind regards,
                              Carlo
                              (Stata 18.0 SE)

