  • Bad control variables

    Hi,

    I am working on a regression where I want to know the relationship between X and Y, and I am including some controls Z. I know that failing to include controls that are determined before the treatment can lead to omitted-variable bias, while including controls that are determined after the treatment can create bad controls, and the latter is exactly what I am worried about.

    Let’s say that X impacts Y positively: a higher value of X leads to a higher value of Y.
    I then add a control Z. Z can be considered a bad control because X impacts Y through Z. If Z is positively correlated with both X and Y and the coefficient on X is insignificant, we could say the true effect is perhaps underestimated, since Z is perhaps picking up part of the effect.

    What would be the case, however, if the relationship between X and Z (or the relationship between Z and Y) were negative while the relationship between X and Y stayed positive? Could we then say that the effect is overestimated, just like with omitted-variable bias?
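    For reference, the textbook omitted-variable-bias sign logic I have in mind is $\hat{\beta}_X^{short} = \beta_X + \beta_Z\,\delta_{ZX}$, where $\delta_{ZX}$ is the slope from regressing Z on X, so the bias takes the sign of $\beta_Z\,\delta_{ZX}$. I am essentially asking whether the same sign logic carries over, in reverse, when Z is a mediator rather than an omitted variable.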

    Last edited by Dana Baade; 07 Aug 2022, 13:13.

  • #2
    What you are referring to as "bad control" I would term "multicollinearity between independent variables" where X and Z are collinear. It's a bit of a mouthful, but I think it's more descriptive. I don't really know what the answer to this question is analytically speaking, but I wonder if we can use a simple simulation study to find out. I've written the following program to generate some variables for a linear regression.

    Code:
    clear
    set obs 1000
    generate Z = rnormal(5, 3)                      // Z has mean 5, sd 3
    local theta = 1                                 // noise scale: controls corr(X, Z)
    generate X = (`theta' * rnormal()) - Z          // X is built to be negatively related to Z
    generate Y = 1 + (2 * X) + (3 * Z) + rnormal()  /* the true model is defined here! */
    correlate Y X Z
    regress Y X Z
    As you can see, X and Z are negatively correlated, and you can adjust theta to get different absolute values of the correlation. If theta equals zero, the correlation is exactly -1; greater values lower the magnitude of the correlation. Don't ask me why I named it theta, it's not any kind of convention, I just thought it sounded good. When I run this over and over again, I see the coefficients are sometimes overestimated and sometimes underestimated. Notice also that if you get rid of the correlation between X and Z (by not subtracting Z from X), you still don't recover the coefficients exactly. That is just finite-sample noise: we have 1,000 draws with a random error term, not perfect continuous, infinite normal distributions. Increasing the number of observations should increase the accuracy of the estimates, and decreasing it should decrease their accuracy, all else equal. I'd be willing to bet that lowering the number of observations will also increase the amount of bias from multicollinearity, but I haven't actually tested that yet.
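
    If you want to test that last conjecture, a quick Monte Carlo along these lines might work (a rough sketch on top of the toy model above; the program name onedraw is just for illustration):

    Code:
    capture program drop onedraw
    program define onedraw, rclass
        clear
        set obs 100                                    // vary this to study sample size
        generate Z = rnormal(5, 3)
        generate X = rnormal() - Z                     // theta = 1
        generate Y = 1 + (2 * X) + (3 * Z) + rnormal()
        regress Y X Z
        return scalar bX = _b[X]
    end

    simulate bX = r(bX), reps(500) nodots: onedraw
    summarize bX    // mean should be near the true value of 2 if there is no bias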

    Hope that helps.

    • #3
      Daniel Schaefer's analysis and calculations are correct.

      But I would offer a different interpretation. The situation you describe, with X -> Z -> Y, makes Z a mediator of the X -> Y relationship. It is not necessarily right, nor necessarily wrong, to include such a variable in the analysis: it depends on what your research question is. If you are interested solely in the overall effect of X on Y, including Z will distort that by absorbing (or, to coin a word, "disabsorbing", if the X -> Z relationship is negative) part of that effect. But if you are interested in a more fine-grained question that seeks to understand not just the overall X -> Y relationship but also "how it works" and how Z might be involved, then omitting Z would be a mistake. In general, however, if you want that kind of fine-grained understanding, it is best to model the mediation explicitly, for example by using -sem- with both the direct X -> Y equation and the X -> Z and Z -> Y equations in the model. With that you can then estimate the total effect of X on Y as the direct X -> Y coefficient plus the product of the X -> Z and Z -> Y coefficients. The product term is the indirect effect mediated by Z. And, yes, it is possible for the direct and indirect effects to have opposite signs, as in the kind of question you raise.
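
      In Stata code, a minimal sketch of that mediation model might look like this (with your generic variable names):

      Code:
      sem (Y <- X Z) (Z <- X)    // direct path X -> Y plus the mediated path X -> Z -> Y
      estat teffects             // decomposes the total effect into direct and indirect parts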

      The really bad kind of covariate, which you did not mention in your post, is a collider. Z is a collider on the X -> Y relationship if it is also true that X -> Z and Y -> Z. In that case, inclusion of Z introduces bias, potentially severe bias, into the analysis of X -> Y.
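
      A toy simulation makes the collider problem easy to see (my own sketch, not your setup):

      Code:
      clear
      set obs 10000
      generate X = rnormal()
      generate Y = 2*X + rnormal()       // true effect of X on Y is 2
      generate Z = X + Y + rnormal()     // collider: both X and Y cause Z
      regress Y X                        // coefficient on X is close to 2
      regress Y X Z                      // conditioning on the collider distorts it badly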

      • #4
        Thank you both!

        I am afraid I am not totally getting it, despite the nice explanations. In my model X and Y are positively related, and so are Z and Y. X and Z, however, are negatively related. I was thinking that including the variable would be a kind of 'reverse omitted-variable bias': since I have a mix of - and + signs, the overall effect would then be an overestimate of X when Z is included in the model. Anyway, including it or not does not make much of a difference. I was just looking for an explanation.

        For example, I have party support (X), spending on campaigning (Z), and taxes (Y). I expect that greater party support leads to less spending on campaigning, and that more campaigning leads to higher taxes. Let's say, for the sake of the explanation, that I have no reverse causality between the variables.

        More party support leads to higher taxes. If I have a model with just X and Y, I notice a positive relationship between X and Y, as I expected. I then include Z in the model to see whether the effect is truly not coming from an increase in campaigning, as I expected. I still see a positive sign on X, and the sign on Z is also positive, as expected. I was wondering, however: since the relationship between Z and X is negative, will including Z lead to a coefficient on X that is somehow too large, theoretically speaking? So can we apply the same under- and overestimation logic as with OVB, but in reverse? Is that what you call 'disabsorbing'?
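
        To make my example concrete, this is roughly the data-generating process I have in mind (signs chosen to match the story above):

        Code:
        clear
        set obs 10000
        generate X = rnormal()              // party support
        generate Z = -X + rnormal()         // more support -> less campaign spending
        generate Y = 2*X + Z + rnormal()    // taxes: direct effect of X is 2
        regress Y X      // total effect: 2 + (-1)*(1) = 1
        regress Y X Z    // direct effect: 2, so the X coefficient grows once Z is added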

        Also Clyde, I am very sorry, but I don’t get the notation. What exactly do you mean by the ‘-‘ sign in X->Y for example? X leads to lower Y?
        Last edited by Dana Baade; 07 Aug 2022, 23:08.

        • #5
          What exactly do you mean by the ‘-‘ sign in X->Y for example? X leads to lower Y?
          No, sorry that wasn't clear. The intended meaning is that -> is a single arrow symbol. So X -> Y just means that X influences (causes) an effect on Y. And it could be a positive or a negative effect. It's just meant to convey the existence of a (presumed or known) causal effect.

          Is that what you call ‘dis absorbing’?
          Yes.

          • #6
            Dear Prof. Clyde and Daniel,

            I have encountered a similar problem, though my context is different from #1. In my case, I examine the effects of parental education on child health, and I use two different specifications. The first is to run separate regressions for mother's and father's education, which can be formulated as follows:
            Code:
            * Mother
            ivregress 2sls health (mom_ed = mom_T) mom_run mom_interact region, robust
            
            Instrumental variables 2SLS regression            Number of obs   =        200
                                                              Wald chi2(4)    =       4.83
                                                              Prob > chi2     =     0.3054
                                                              R-squared       =     0.3102
                                                              Root MSE        =     .38743
            
            ------------------------------------------------------------------------------
                         |               Robust
                  health | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                  mom_ed |  -.1644155    .124598    -1.32   0.187    -.4086231    .0797921
                 mom_run |   .0007542    .002321     0.32   0.745    -.0037949    .0053033
            mom_interact |  -.0016864   .0029403    -0.57   0.566    -.0074493    .0040765
                  region |   .0769011   .0643013     1.20   0.232    -.0491271    .2029294
                   _cons |   2.117528   1.390044     1.52   0.128    -.6069078    4.841964
            ------------------------------------------------------------------------------
            Instrumented: mom_ed
             Instruments: mom_run mom_interact region mom_T
            
            
            * Father
            ivregress 2sls health (dad_ed = dad_T) dad_run dad_interact region, robust
            
            Instrumental variables 2SLS regression            Number of obs   =        200
                                                              Wald chi2(4)    =       3.41
                                                              Prob > chi2     =     0.4913
                                                              R-squared       =          .
                                                              Root MSE        =     .86635
            
            ------------------------------------------------------------------------------
                         |               Robust
                  health | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                  dad_ed |  -.2216177   .1586564    -1.40   0.162    -.5325784    .0893431
                 dad_run |   .0004701   .0018704     0.25   0.802    -.0031958     .004136
            dad_interact |  -.0008702   .0038075    -0.23   0.819    -.0083328    .0065925
                  region |   .0724355   .1229536     0.59   0.556    -.1685492    .3134201
                   _cons |   2.639388   1.719515     1.53   0.125       -.7308    6.009575
            ------------------------------------------------------------------------------
            Instrumented: dad_ed
             Instruments: dad_run dad_interact region dad_T
            where health is child health (1 = bad), mom_ed is the mother's years of schooling, mom_run is the mother's age, and mom_interact is the interaction between the treatment status (mom_T) and the mother's age. Similar notation applies for the father.

            In the second specification, I put both mother's and father's education into one model by running the following commands.
            Code:
                qui reg mom_ed mom_T mom_run mom_interact region, robust
                    cap drop yhat1
                    predict yhat1
                    
                qui reg dad_ed dad_T dad_run dad_interact region, robust
                    cap drop yhat2
                    predict yhat2
                    
                reg health yhat1 yhat2 mom_run mom_interact dad_run dad_interact region, robust
            
            Linear regression                               Number of obs     =        200
                                                            F(7, 192)         =       2.80
                                                            Prob > F          =     0.0086
                                                            R-squared         =     0.0715
                                                            Root MSE          =     .45875
            
            ------------------------------------------------------------------------------
                         |               Robust
                  health | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                   yhat1 |  -.0889556   .1454766    -0.61   0.542    -.3758932     .197982
                   yhat2 |  -.2272356   .0770165    -2.95   0.004    -.3791426   -.0753286
                 mom_run |   .0014083   .0027848     0.51   0.614    -.0040845     .006901
            mom_interact |  -.0002115   .0036503    -0.06   0.954    -.0074113    .0069884
                 dad_run |   .0003707   .0009268     0.40   0.690    -.0014573    .0021986
            dad_interact |  -.0017265   .0026438    -0.65   0.515    -.0069411    .0034882
                  region |   .0987698   .0803036     1.23   0.220    -.0596207    .2571603
                   _cons |   3.669981   1.666596     2.20   0.029     .3827939    6.957169
            ------------------------------------------------------------------------------
            I understand that the standard errors in the second specification should be adjusted, either by the bootstrap or by other methods, but for simplicity let's ignore that for now.
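            (For what it is worth, I imagine the bootstrap adjustment would wrap both steps in one program, roughly like this sketch:)
            Code:
            capture program drop twostep
            program define twostep
                reg mom_ed mom_T mom_run mom_interact region
                capture drop yhat1
                predict yhat1
                reg dad_ed dad_T dad_run dad_interact region
                capture drop yhat2
                predict yhat2
                reg health yhat1 yhat2 mom_run mom_interact dad_run dad_interact region
            end

            bootstrap _b, reps(200): twostep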

            Turning to the results of the two specifications: in the first one, neither father's education nor mother's education has a significant effect (in my real data I found negative and significant effects for both, but I cannot replicate those results using the sample generated by dataex). In the second specification, however, father's education (yhat2) turns significant. To explore a bit more, I ran the following correlations:

            Code:
            * Correlation between yhat1 and yhat2
            corr yhat1 yhat2
            (obs=200)
            
                         |    yhat1    yhat2
            -------------+------------------
                   yhat1 |   1.0000
                   yhat2 |   0.4898   1.0000
            
            * Between yhat1 and health
            corr yhat1 health
            (obs=200)
            
                         |    yhat1   health
            -------------+------------------
                   yhat1 |   1.0000
                  health |  -0.0981   1.0000
            
            
            * Between yhat2 and health
            corr yhat2 health
            (obs=200)
            
                         |    yhat2   health
            -------------+------------------
                   yhat2 |   1.0000
                  health |  -0.2449   1.0000
            Here we can see that yhat1 and yhat2 are positively correlated, though not strongly (in my real data the correlation coefficient is as high as 0.95). We can also see that yhat1 and yhat2 are both negatively correlated with child health, and that the correlation for yhat2 is stronger than the one for yhat1. I am quite confused by the results of the second specification: is yhat2 significant because its correlation with child health is stronger than yhat1's? Is there any advice on a simulation for my case? Thank you.
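
            I suppose I could also look at standard collinearity diagnostics after the second-stage regression, something like:

            Code:
            reg health yhat1 yhat2 mom_run mom_interact dad_run dad_interact region
            estat vif    // variance inflation factors for the second-stage regressors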

            Data example
            Code:
            clear
            input float id byte(child_dead m_edu) int m_R byte m_D int m_DR byte f_edu float(f_R f_D f_DR) byte region
              1 1  9  77 1  77 11  -47 0  0 1
              2 1  7 -15 0   0  7  -75 0  0 1
              3 1  7 -39 0   0  7 -116 0  0 1
              4 0 12 -52 0   0  7  -29 0  0 1
              5 1  7   9 1   9  7  -67 0  0 1
              6 0 12  37 1  37 12   23 1 23 1
              7 0 12 -55 0   0  7 -149 0  0 1
              8 1  5  26 1  26  5 -122 0  0 1
              9 1  1  -7 0   0  7  -79 0  0 1
             10 1  5  77 1  77  5   80 1 80 1
             11 0 12  45 1  45  7   42 1 42 1
             12 0 12 -19 0   0  8  -71 0  0 1
             13 1  6  54 1  54  5   -5 0  0 1
             14 0 17  36 1  36 17   -5 0  0 1
             15 1  5  13 1  13  5  -35 0  0 1
             16 1  7  12 1  12  7    9 1  9 1
             17 0 17 -14 0   0  5  -49 0  0 1
             18 1  5 -47 0   0  5  -30 0  0 1
             19 1  4 -40 0   0  8  -90 0  0 1
             20 0 12 -10 0   0 17  -14 0  0 1
             21 1  5 -55 0   0  5  -46 0  0 1
             22 0 15  -5 0   0 12 -136 0  0 1
             23 1  5 -43 0   0 10 -172 0  0 1
             24 0 12 -60 0   0  5  -92 0  0 1
             25 1  9  38 1  38  5   56 1 56 1
             26 1  5  42 1  42  5    5 1  5 1
             27 1  9  55 1  55  9  -15 0  0 1
             28 0 10  35 1  35  8    0 1  0 1
             29 0 12 -58 0   0  6 -180 0  0 1
             30 0 12 -53 0   0  5  -63 0  0 1
             31 1  5 -19 0   0  5 -136 0  0 1
             32 1  7  59 1  59  7  -15 0  0 1
             33 0 12  62 1  62  5   41 1 41 1
             34 1  7   9 1   9  7  -17 0  0 1
             35 1  5 -10 0   0  7  -17 0  0 1
             36 0 10  60 1  60  7 -103 0  0 1
             37 0 11 -49 0   0 11  -56 0  0 1
             38 0 16   9 1   9 17  -18 0  0 1
             39 0 12  -9 0   0 12  -83 0  0 1
             40 1  9 -38 0   0  5 -177 0  0 1
             41 0 16  -3 0   0 11 -157 0  0 1
             42 0 15  53 1  53  5   34 1 34 1
             43 1  5  52 1  52  5   54 1 54 1
             44 0 12 -45 0   0 15  -67 0  0 1
             45 1  5 -35 0   0  5  -58 0  0 1
             46 0 10  46 1  46  7    7 1  7 1
             47 0 12 -32 0   0 12  -49 0  0 1
             48 0 11  28 1  28 13  -44 0  0 1
             49 0 15  74 1  74 12   52 1 52 1
             50 0 16  31 1  31 16    1 1  1 1
             51 0 15  58 1  58 15   28 1 28 1
             52 0 12  51 1  51 12  -80 0  0 1
             53 0 11  64 1  64 12   38 1 38 1
             54 0 12  87 1  87  8   30 1 30 1
             55 0 14  47 1  47 14   31 1 31 1
             56 0 12  28 1  28 18   35 1 35 1
             57 0 15  70 1  70  5   42 1 42 1
             58 0 16  64 1  64 16   54 1 54 1
             59 0 15  25 1  25 16  -30 0  0 1
             60 1  5  21 1  21  5   -7 0  0 1
             61 0 16  88 1  88 16   31 1 31 1
             62 0 14  44 1  44 16   12 1 12 1
             63 1  5   7 1   7  5  -25 0  0 1
             64 0 16  10 1  10 16    0 1  0 1
             65 0 15  20 1  20 15   10 1 10 1
             66 1  9  73 1  73  5  -15 0  0 1
             67 1  5  40 1  40  5   17 1 17 1
             68 0 12  10 1  10 15  -10 0  0 1
             69 0 12  41 1  41 15   27 1 27 1
             70 1  5  17 1  17  5  -39 0  0 1
             71 0 16   8 1   8 16  -45 0  0 1
             72 0 16  56 1  56 16   72 1 72 1
             73 0 12   5 1   5 12   23 1 23 1
             74 0 12  15 1  15 12  -16 0  0 1
             75 0 12 -22 0   0 16  -26 0  0 1
             76 0 16  50 1  50 16    5 1  5 1
             77 0 16  64 1  64 16  -53 0  0 1
             78 0 15  27 1  27 17   32 1 32 1
             79 0 16   2 1   2 16  -38 0  0 1
             80 0 11  16 1  16  5  -75 0  0 1
             81 0 12 -59 0   0 12  -76 0  0 1
             82 0 16 -43 0   0 16  -81 0  0 1
             83 0 17 -54 0   0 11  -87 0  0 1
             84 0 10  50 1  50 10  -26 0  0 1
             85 0 11  60 1  60 16  -82 0  0 1
             86 0 12  -8 0   0 12  -34 0  0 1
             87 0 11 115 1 115 12   95 1 95 1
             88 0 12  24 1  24 12   14 1 14 1
             89 0 11 -38 0   0 18 -209 0  0 1
             90 0 12 -13 0   0 12  -44 0  0 1
             91 0 16   6 1   6 16   -1 0  0 1
             92 0 12  68 1  68 17   53 1 53 1
             93 0 18 -40 0   0 16  -63 0  0 1
             94 0 16   8 1   8 17   31 1 31 1
             95 0 17   5 1   5 17  -59 0  0 1
             96 0 15 -37 0   0 16  -95 0  0 1
             97 0 18 -56 0   0 18 -240 0  0 1
             98 0 16  43 1  43 16   50 1 50 1
             99 1  8  13 1  13 10  -62 0  0 1
            100 0 18  44 1  44 17   -6 0  0 2
            101 0 11  29 1  29 16  -89 0  0 2
            102 0 12  52 1  52 17   26 1 26 2
            103 0 18 -18 0   0 18  -15 0  0 2
            104 0 17 -59 0   0 18 -215 0  0 2
            105 0 11   1 1   1 11   -5 0  0 2
            106 0 16 -51 0   0 16  -64 0  0 2
            107 0 17 -48 0   0 18 -176 0  0 2
            108 0 11  13 1  13 11  -77 0  0 2
            109 0 11  31 1  31 11   27 1 27 2
            110 0 11  57 1  57 15    6 1  6 2
            111 0 16 -49 0   0 17 -100 0  0 2
            112 0 18   6 1   6 17  -17 0  0 2
            113 1  6  42 1  42  6  -15 0  0 2
            114 0 11  57 1  57 11   60 1 60 2
            115 0 18 -15 0   0 16  -80 0  0 2
            116 0 16  80 1  80 17   44 1 44 2
            117 0 12  14 1  14  5  -72 0  0 2
            118 0 12  66 1  66 12   27 1 27 2
            119 1  9  81 1  81 11  -24 0  0 2
            120 0 17  29 1  29 17   -3 0  0 2
            121 0 14   6 1   6  5  -61 0  0 2
            122 0 11   9 1   9 11  -70 0  0 2
            123 1  7 -44 0   0  4 -167 0  0 2
            124 1  5  42 1  42 11   82 1 82 2
            125 1  7  56 1  56 12  -90 0  0 2
            126 1  7  67 1  67  6   38 1 38 2
            127 1  6 -18 0   0  6  -28 0  0 2
            128 0 16 -16 0   0 17  -32 0  0 2
            129 0 16 -22 0   0 16  -71 0  0 2
            130 0 11  53 1  53  5  -40 0  0 2
            131 1  5 -35 0   0  5  -71 0  0 2
            132 0 11 103 1 103  6   24 1 24 2
            133 1  7 -43 0   0  5  -94 0  0 2
            134 0 10 -51 0   0  6 -125 0  0 2
            135 1  7 -17 0   0  5  -27 0  0 2
            136 1  5 -14 0   0 12  -69 0  0 2
            137 1  9  88 1  88  5   77 1 77 2
            138 1 12  50 1  50  8   20 1 20 2
            139 0 12 110 1 110 10   91 1 91 2
            140 0 10 -43 0   0 11  -86 0  0 2
            141 1  8 -27 0   0  7 -122 0  0 2
            142 0 16  47 1  47  5    9 1  9 2
            143 1  7 -18 0   0  7  -38 0  0 2
            144 1  5  55 1  55  7 -260 0  0 2
            145 1 12 -39 0   0 13  -47 0  0 2
            146 0 11  22 1  22  5    2 1  2 2
            147 0 16  28 1  28 16   22 1 22 2
            148 0 12  41 1  41 12   43 1 43 2
            149 1  8  31 1  31  6  -60 0  0 2
            150 1  3  17 1  17  7 -146 0  0 2
            151 0 17  38 1  38 17    8 1  8 2
            152 0 16  16 1  16  7 -106 0  0 2
            153 1  7  -4 0   0  5  -42 0  0 2
            154 0 15 -16 0   0 10  -58 0  0 2
            155 0 16  33 1  33 11  -17 0  0 2
            156 0 16 -23 0   0 17  -53 0  0 2
            157 0 11 -53 0   0  5 -103 0  0 2
            158 1  5  -8 0   0  5   -4 0  0 2
            159 0 11  19 1  19 17    9 1  9 2
            160 0 17  -4 0   0 17  -49 0  0 2
            161 1  4 -85 0   0 11   -3 0  0 2
            162 0 12  48 1  48 17   10 1 10 2
            163 1  5  28 1  28  5  -20 0  0 2
            164 1  9 -21 0   0  5 -108 0  0 2
            165 0 17 -22 0   0 16  -65 0  0 2
            166 0 16  28 1  28 16  -24 0  0 2
            167 0 16  37 1  37 16   47 1 47 2
            168 0 16  58 1  58 18  -43 0  0 2
            169 1  8  36 1  36 16  -76 0  0 2
            170 0 18 -18 0   0 18  -72 0  0 2
            171 0 15  69 1  69 16   21 1 21 2
            172 0 17   1 1   1 12  -10 0  0 2
            173 1  5 -24 0   0 16 -311 0  0 2
            174 0 12 -20 0   0 11  -67 0  0 2
            175 0 14  95 1  95 12   82 1 82 2
            176 0 12  60 1  60  5   20 1 20 2
            177 1  7 -56 0   0  5  -75 0  0 2
            178 0 12  80 1  80 17   28 1 28 2
            179 0 16 -36 0   0  5  -79 0  0 2
            180 0 12 -24 0   0 11  -65 0  0 2
            181 1  3  -1 0   0  9 -220 0  0 2
            182 1  9 -13 0   0  5  -60 0  0 2
            183 0 15  75 1  75 16   35 1 35 2
            184 1  5 -35 0   0 11  -77 0  0 2
            185 0 11   4 1   4  7 -226 0  0 2
            186 0 11  33 1  33 11   53 1 53 2
            187 0 17 -53 0   0 17  -88 0  0 2
            188 0 12  65 1  65  5  -59 0  0 2
            189 0 12  45 1  45 12   46 1 46 2
            190 1  9 -58 0   0  5 -160 0  0 2
            191 0 16  -2 0   0  5 -119 0  0 2
            192 0 12  21 1  21 12 -113 0  0 2
            193 0 12  69 1  69  5   14 1 14 2
            194 0 15  20 1  20 15  -59 0  0 2
            195 1  9  24 1  24  7  -68 0  0 2
            196 1  9  14 1  14  9  -49 0  0 2
            197 1  7  43 1  43  7  -53 0  0 2
            198 0 15   5 1   5 16  -13 0  0 2
            199 0 12 -29 0   0 12 -138 0  0 2
            200 0 12  40 1  40 11   10 1 10 3
            end

            • #7
              You say that in the example dataset neither mother's nor father's education is significant, and that father's education becomes significant when mother's education is included in the model as well. You've put predicted values for mother's education and father's education into the same model. Theoretically these two variables should be correlated, since we know there is a good deal of homophily among married couples in terms of education. That is, there is a process by which people select one another for marriage, and that process involves having similar levels of education. Indeed, you show that the predicted values have a moderate correlation of approximately 0.5 in the example data, and a strong correlation of approximately 0.95 in your real dataset. You also put the predictors of your predicted education variables into the combined model as well.

              This does indeed seem like a recipe for multicollinearity. My understanding is that you should expect inaccurate or otherwise exaggerated coefficients, relatively large standard errors, and relatively wide confidence intervals. The solution to your problem will depend to some extent on your research question and the relevant theory, but I would probably avoid using two variables with a correlation of 0.95 in the same model without a clear theoretical justification. I suppose if you really do need both variables, then you might create a latent "parent education" construct. However, when you already have mother's education in the model, you are simply not adding much more information by including father's education (and vice versa). If you only include one of these variables you have a more parsimonious model and you avoid some of the issues related to multicollinearity.
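
              To illustrate the latent-construct idea, one possible shape in -sem- would be something like the sketch below (this ignores your instrumenting step entirely, and the latent name ParentEd is just illustrative):

              Code:
              sem (ParentEd -> mom_ed dad_ed) ///
                  (health <- ParentEd mom_run mom_interact dad_run dad_interact region)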

              • #8
                Dear Daniel,

                Thank you so much for your insightful inputs. I am grateful for that.
                Theoretically these two variables should be correlated since we know there is a good deal of homophily among married couples in terms of education. That is, there is a process by which people select one another for marriage, and that process involves having similar levels of education
                You are right and I am aware of this. What you mentioned is known as assortative matching.

                The solution to your problem will depend to some extent on your research question and the relevant theory, but I would probably avoid using two variables with a correlation of 0.95 in the same model without a clear theoretical justification
                Here, I assume that child health is a function of both mother's education and father's education, because both may affect child health simultaneously. So, in the first specification, assuming that the effect of mother's education on child health is negative and significant, the coefficient on mom_ed may not be the true effect of mother's education, because father's education is omitted. Similar logic applies to dad_ed. That is why I came up with the second specification. However, the challenge of multicollinearity arises because in my real data the correlation between yhat1 and yhat2 is very strong (0.95). Still, I do not know how much the results of the second specification are affected by multicollinearity. Here are some scenarios I can think of (suppose I am talking about my real data, where both mom_ed and dad_ed are negative and significant in the first specification, but only yhat2 is significant in the second):
                i) the results show that yhat1 and yhat2 are negatively associated with child health, and that the correlation between yhat2 and health is stronger than that between yhat1 and health. Thus, when putting yhat1 and yhat2 into one model, it could be that yhat2 outperforms yhat1 because its effect size is larger and its association with health is stronger. That would mean yhat1 itself may still have some effect, but because of its strong correlation with yhat2 it is outperformed and its coefficient becomes insignificant.
                ii) on the other hand, it could be the case that yhat1 is significant in the first specification only because that regression omits an important variable, namely yhat2. So, when yhat2 is controlled for, the coefficient on yhat1 turns insignificant. That would mean yhat1 in fact has no effect on child health.

                To be honest, I do not know which of these it could be, or how to determine or examine these two possibilities. Can a simulation help? If so, I would be grateful if you could provide me with some code.

                I suppose if you really do need both variables, then you might create a latent "parent education" construct
                Thanks for your advice, it seems a very interesting approach, but could you please elaborate a bit more on how to create a latent "parent education" construct?

                Thank you so much for your time and help.
