  • Bad control variables

    Hi,

    I am working on a regression where I want to know the relationship between X and Y, and I am including some controls Z. I know that failing to include controls that are determined before the treatment can lead to omitted-variable bias, while including controls that are determined after the treatment can create bad controls, and the latter is exactly what I am worried about.

    Let’s say that X impacts Y positively: a higher value of X leads to a higher value of Y.
    I then add a control Z. Z can be considered a bad control because X impacts Y through Z. If Z is positively correlated with both X and Y and the coefficient on X is insignificant, we could say the true effect is perhaps underestimated, since Z is perhaps picking up part of the effect.

    What would be the case, however, if the relationship between X and Z (or the relationship between Z and Y) were negative while the relationship between X and Y stayed positive? Could we then say that the effect is overestimated, just like with omitted-variable bias?
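    For reference, the textbook omitted-variable-bias sign logic I have in mind is $\hat{\beta}_X^{short} = \beta_X + \beta_Z\,\delta_{ZX}$, where $\delta_{ZX}$ is the slope from regressing Z on X, so the bias takes the sign of $\beta_Z\,\delta_{ZX}$. I am essentially asking whether the same sign logic carries over, in reverse, when Z is a mediator rather than an omitted variable.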

    Last edited by Dana Baade; 07 Aug 2022, 13:13.

  • #2
    What you are referring to as "bad control" I would term "multicollinearity between independent variables" where X and Z are collinear. It's a bit of a mouthful, but I think it's more descriptive. I don't really know what the answer to this question is analytically speaking, but I wonder if we can use a simple simulation study to find out. I've written the following program to generate some variables for a linear regression.

    Code:
    clear
    set obs 1000
    generate Z = rnormal(5, 3)                      // Z has mean 5, sd 3
    local theta = 1                                 // noise scale: controls corr(X, Z)
    generate X = (`theta' * rnormal()) - Z          // X is built to be negatively related to Z
    generate Y = 1 + (2 * X) + (3 * Z) + rnormal()  /* the true model is defined here! */
    correlate Y X Z
    regress Y X Z
    As you can see, X and Z are negatively correlated, and you can adjust theta to get different absolute values of the correlation. If theta equals zero, the correlation is exactly -1; greater values lower the magnitude of the correlation. Don't ask me why I named it theta, it's not any kind of convention, I just thought it sounded good. When I run this over and over again, I see the coefficients are sometimes overestimated and sometimes underestimated. Notice also that if you get rid of the correlation between X and Z (by not subtracting Z from X), you still don't recover the coefficients exactly. That is just finite-sample noise: we have 1,000 draws with a random error term, not perfect continuous, infinite normal distributions. Increasing the number of observations should increase the accuracy of the estimates, and decreasing it should decrease their accuracy, all else equal. I'd be willing to bet that lowering the number of observations will also increase the amount of bias from multicollinearity, but I haven't actually tested that yet.
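
    If you want to test that last conjecture, a quick Monte Carlo along these lines might work (a rough sketch on top of the toy model above; the program name onedraw is just for illustration):

    Code:
    capture program drop onedraw
    program define onedraw, rclass
        clear
        set obs 100                                    // vary this to study sample size
        generate Z = rnormal(5, 3)
        generate X = rnormal() - Z                     // theta = 1
        generate Y = 1 + (2 * X) + (3 * Z) + rnormal()
        regress Y X Z
        return scalar bX = _b[X]
    end

    simulate bX = r(bX), reps(500) nodots: onedraw
    summarize bX    // mean should be near the true value of 2 if there is no bias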

    Hope that helps.

    • #3
      Daniel Schaefer's analysis and calculations are correct.

      But I would offer a different interpretation. The situation you describe, with X -> Z -> Y, makes Z a mediator of the X -> Y relationship. It is not necessarily right, nor necessarily wrong, to include such a variable in the analysis: it depends on what your research question is. If you are interested solely in the overall effect of X on Y, including Z will distort that by absorbing (or, to coin a word, "disabsorbing", if the X -> Z relationship is negative) part of that effect. But if you are interested in a more fine-grained question that seeks to understand not just the overall X -> Y relationship but also "how it works" and how Z might be involved, then omitting Z would be a mistake. In general, however, if you want that kind of fine-grained understanding, it is best to model the mediation explicitly, for example by using -sem- with both the direct X -> Y equation and the X -> Z and Z -> Y equations in the model. With that you can then estimate the total effect of X on Y as the direct X -> Y coefficient plus the product of the X -> Z and Z -> Y coefficients. The product term is the indirect effect mediated by Z. And, yes, it is possible for the direct and indirect effects to have opposite signs, as in the kind of question you raise.
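
      In Stata code, a minimal sketch of that mediation model might look like this (with your generic variable names):

      Code:
      sem (Y <- X Z) (Z <- X)    // direct path X -> Y plus the mediated path X -> Z -> Y
      estat teffects             // decomposes the total effect into direct and indirect parts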

      The really bad kind of covariate, which you did not mention in your post, is a collider. Z is a collider on the X -> Y relationship if it is also true that X -> Z and Y -> Z. In that case, inclusion of Z introduces bias, potentially severe bias, into the analysis of X -> Y.
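
      A toy simulation makes the collider problem easy to see (my own sketch, not your setup):

      Code:
      clear
      set obs 10000
      generate X = rnormal()
      generate Y = 2*X + rnormal()       // true effect of X on Y is 2
      generate Z = X + Y + rnormal()     // collider: both X and Y cause Z
      regress Y X                        // coefficient on X is close to 2
      regress Y X Z                      // conditioning on the collider distorts it badly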

      • #4
        Thank you both!

        I am afraid I am not totally getting it, despite the nice explanations. In my model X and Y are positively related, and so are Z and Y. X and Z, however, are negatively related. I was thinking that including the variable would be a kind of 'reverse omitted-variable bias': since I have a mix of - and + signs, the overall effect would then be an overestimate of X when Z is included in the model. Anyway, including it or not does not make much of a difference. I was just looking for an explanation.

        For example, I have party support (X), spending on campaigning (Z), and taxes (Y). I expect that greater party support leads to less spending on campaigning, and that more campaigning leads to higher taxes. Let's say, for the sake of the explanation, that I have no reverse causality between the variables.

        More party support leads to higher taxes. If I have a model with just X and Y, I notice a positive relationship between X and Y, as I expected. I then include Z in the model to see whether the effect is truly not coming from an increase in campaigning, as I expected. I still see a positive sign on X, and the sign on Z is also positive, as expected. I was wondering, however: since the relationship between Z and X is negative, will including Z lead to a coefficient on X that is somehow too large, theoretically speaking? So can we apply the same under- and overestimation logic as with OVB, but in reverse? Is that what you call 'disabsorbing'?
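
        To make my example concrete, this is roughly the data-generating process I have in mind (signs chosen to match the story above):

        Code:
        clear
        set obs 10000
        generate X = rnormal()              // party support
        generate Z = -X + rnormal()         // more support -> less campaign spending
        generate Y = 2*X + Z + rnormal()    // taxes: direct effect of X is 2
        regress Y X      // total effect: 2 + (-1)*(1) = 1
        regress Y X Z    // direct effect: 2, so the X coefficient grows once Z is added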

        Also Clyde, I am very sorry, but I don’t get the notation. What exactly do you mean by the ‘-‘ sign in X->Y for example? X leads to lower Y?
        Last edited by Dana Baade; 07 Aug 2022, 23:08.

        • #5
          What exactly do you mean by the ‘-‘ sign in X->Y for example? X leads to lower Y?
          No, sorry that wasn't clear. The intended meaning is that -> is a single arrow symbol. So X -> Y just means that X influences (causes) an effect on Y. And it could be a positive or a negative effect. It's just meant to convey the existence of a (presumed or known) causal effect.

          Is that what you call ‘dis absorbing’?
          Yes.

          • #6
            Dear Prof. Clyde and Daniel,

            I have encountered a similar problem, though my context is different from #1. In my case, I examine the effects of parental education on child health, and I use two different specifications. The first is to run separate regressions for mother's and father's education, which can be formulated as follows:
            Code:
            * Mother
            ivregress 2sls health (mom_ed = mom_T) mom_run mom_interact region, robust
            
            Instrumental variables 2SLS regression            Number of obs   =        200
                                                              Wald chi2(4)    =       4.83
                                                              Prob > chi2     =     0.3054
                                                              R-squared       =     0.3102
                                                              Root MSE        =     .38743
            
            ------------------------------------------------------------------------------
                         |               Robust
                  health | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                  mom_ed |  -.1644155    .124598    -1.32   0.187    -.4086231    .0797921
                 mom_run |   .0007542    .002321     0.32   0.745    -.0037949    .0053033
            mom_interact |  -.0016864   .0029403    -0.57   0.566    -.0074493    .0040765
                  region |   .0769011   .0643013     1.20   0.232    -.0491271    .2029294
                   _cons |   2.117528   1.390044     1.52   0.128    -.6069078    4.841964
            ------------------------------------------------------------------------------
            Instrumented: mom_ed
             Instruments: mom_run mom_interact region mom_T
            
            
            * Father
            ivregress 2sls health (dad_ed = dad_T) dad_run dad_interact region, robust
            
            Instrumental variables 2SLS regression            Number of obs   =        200
                                                              Wald chi2(4)    =       3.41
                                                              Prob > chi2     =     0.4913
                                                              R-squared       =          .
                                                              Root MSE        =     .86635
            
            ------------------------------------------------------------------------------
                         |               Robust
                  health | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                  dad_ed |  -.2216177   .1586564    -1.40   0.162    -.5325784    .0893431
                 dad_run |   .0004701   .0018704     0.25   0.802    -.0031958     .004136
            dad_interact |  -.0008702   .0038075    -0.23   0.819    -.0083328    .0065925
                  region |   .0724355   .1229536     0.59   0.556    -.1685492    .3134201
                   _cons |   2.639388   1.719515     1.53   0.125       -.7308    6.009575
            ------------------------------------------------------------------------------
            Instrumented: dad_ed
             Instruments: dad_run dad_interact region dad_T
            where health is child health (1 = bad), mom_ed is the mother's years of schooling, mom_run is the mother's age, and mom_interact is the interaction between the treatment status (mom_T) and the mother's age. Similar notation applies for the father.

            In the second specification, I put both mother's and father's education into one model by running the following commands.
            Code:
                qui reg mom_ed mom_T mom_run mom_interact region, robust
                    cap drop yhat1
                    predict yhat1
                    
                qui reg dad_ed dad_T dad_run dad_interact region, robust
                    cap drop yhat2
                    predict yhat2
                    
                reg health yhat1 yhat2 mom_run mom_interact dad_run dad_interact region, robust
            
            Linear regression                               Number of obs     =        200
                                                            F(7, 192)         =       2.80
                                                            Prob > F          =     0.0086
                                                            R-squared         =     0.0715
                                                            Root MSE          =     .45875
            
            ------------------------------------------------------------------------------
                         |               Robust
                  health | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                   yhat1 |  -.0889556   .1454766    -0.61   0.542    -.3758932     .197982
                   yhat2 |  -.2272356   .0770165    -2.95   0.004    -.3791426   -.0753286
                 mom_run |   .0014083   .0027848     0.51   0.614    -.0040845     .006901
            mom_interact |  -.0002115   .0036503    -0.06   0.954    -.0074113    .0069884
                 dad_run |   .0003707   .0009268     0.40   0.690    -.0014573    .0021986
            dad_interact |  -.0017265   .0026438    -0.65   0.515    -.0069411    .0034882
                  region |   .0987698   .0803036     1.23   0.220    -.0596207    .2571603
                   _cons |   3.669981   1.666596     2.20   0.029     .3827939    6.957169
            ------------------------------------------------------------------------------
            I understand that the standard errors in the second specification should be adjusted, either by the bootstrap or by other methods, but for simplicity let's ignore that for now.
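            (For what it is worth, I imagine the bootstrap adjustment would wrap both steps in one program, roughly like this sketch:)
            Code:
            capture program drop twostep
            program define twostep
                reg mom_ed mom_T mom_run mom_interact region
                capture drop yhat1
                predict yhat1
                reg dad_ed dad_T dad_run dad_interact region
                capture drop yhat2
                predict yhat2
                reg health yhat1 yhat2 mom_run mom_interact dad_run dad_interact region
            end

            bootstrap _b, reps(200): twostep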

            Turning to the results of the two specifications: in the first one, neither father's education nor mother's education has a significant effect (in my real data I found negative and significant effects for both, but I cannot replicate those results using the sample generated by dataex). In the second specification, however, father's education (yhat2) turns significant. To explore a bit more, I ran the following correlations:

            Code:
            * Correlation between yhat1 and yhat2
            corr yhat1 yhat2
            (obs=200)
            
                         |    yhat1    yhat2
            -------------+------------------
                   yhat1 |   1.0000
                   yhat2 |   0.4898   1.0000
            
            * Between yhat1 and health
            corr yhat1 health
            (obs=200)
            
                         |    yhat1   health
            -------------+------------------
                   yhat1 |   1.0000
                  health |  -0.0981   1.0000
            
            
            * Between yhat2 and health
            corr yhat2 health
            (obs=200)
            
                         |    yhat2   health
            -------------+------------------
                   yhat2 |   1.0000
                  health |  -0.2449   1.0000
            Here we can see that yhat1 and yhat2 are positively correlated, though not strongly (in my real data the correlation coefficient is as high as 0.95). We can also see that yhat1 and yhat2 are both negatively correlated with child health, and that the correlation for yhat2 is stronger than the one for yhat1. I am quite confused by the results of the second specification: is yhat2 significant because its correlation with child health is stronger than yhat1's? Is there any advice on a simulation for my case? Thank you.
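
            I suppose I could also look at standard collinearity diagnostics after the second-stage regression, something like:

            Code:
            reg health yhat1 yhat2 mom_run mom_interact dad_run dad_interact region
            estat vif    // variance inflation factors for the second-stage regressors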

            Data example
            Code:
            clear
            input float id byte(child_dead m_edu) int m_R byte m_D int m_DR byte f_edu float(f_R f_D f_DR) byte region
              1 1  9  77 1  77 11  -47 0  0 1
              2 1  7 -15 0   0  7  -75 0  0 1
              3 1  7 -39 0   0  7 -116 0  0 1
              4 0 12 -52 0   0  7  -29 0  0 1
              5 1  7   9 1   9  7  -67 0  0 1
              6 0 12  37 1  37 12   23 1 23 1
              7 0 12 -55 0   0  7 -149 0  0 1
              8 1  5  26 1  26  5 -122 0  0 1
              9 1  1  -7 0   0  7  -79 0  0 1
             10 1  5  77 1  77  5   80 1 80 1
             11 0 12  45 1  45  7   42 1 42 1
             12 0 12 -19 0   0  8  -71 0  0 1
             13 1  6  54 1  54  5   -5 0  0 1
             14 0 17  36 1  36 17   -5 0  0 1
             15 1  5  13 1  13  5  -35 0  0 1
             16 1  7  12 1  12  7    9 1  9 1
             17 0 17 -14 0   0  5  -49 0  0 1
             18 1  5 -47 0   0  5  -30 0  0 1
             19 1  4 -40 0   0  8  -90 0  0 1
             20 0 12 -10 0   0 17  -14 0  0 1
             21 1  5 -55 0   0  5  -46 0  0 1
             22 0 15  -5 0   0 12 -136 0  0 1
             23 1  5 -43 0   0 10 -172 0  0 1
             24 0 12 -60 0   0  5  -92 0  0 1
             25 1  9  38 1  38  5   56 1 56 1
             26 1  5  42 1  42  5    5 1  5 1
             27 1  9  55 1  55  9  -15 0  0 1
             28 0 10  35 1  35  8    0 1  0 1
             29 0 12 -58 0   0  6 -180 0  0 1
             30 0 12 -53 0   0  5  -63 0  0 1
             31 1  5 -19 0   0  5 -136 0  0 1
             32 1  7  59 1  59  7  -15 0  0 1
             33 0 12  62 1  62  5   41 1 41 1
             34 1  7   9 1   9  7  -17 0  0 1
             35 1  5 -10 0   0  7  -17 0  0 1
             36 0 10  60 1  60  7 -103 0  0 1
             37 0 11 -49 0   0 11  -56 0  0 1
             38 0 16   9 1   9 17  -18 0  0 1
             39 0 12  -9 0   0 12  -83 0  0 1
             40 1  9 -38 0   0  5 -177 0  0 1
             41 0 16  -3 0   0 11 -157 0  0 1
             42 0 15  53 1  53  5   34 1 34 1
             43 1  5  52 1  52  5   54 1 54 1
             44 0 12 -45 0   0 15  -67 0  0 1
             45 1  5 -35 0   0  5  -58 0  0 1
             46 0 10  46 1  46  7    7 1  7 1
             47 0 12 -32 0   0 12  -49 0  0 1
             48 0 11  28 1  28 13  -44 0  0 1
             49 0 15  74 1  74 12   52 1 52 1
             50 0 16  31 1  31 16    1 1  1 1
             51 0 15  58 1  58 15   28 1 28 1
             52 0 12  51 1  51 12  -80 0  0 1
             53 0 11  64 1  64 12   38 1 38 1
             54 0 12  87 1  87  8   30 1 30 1
             55 0 14  47 1  47 14   31 1 31 1
             56 0 12  28 1  28 18   35 1 35 1
             57 0 15  70 1  70  5   42 1 42 1
             58 0 16  64 1  64 16   54 1 54 1
             59 0 15  25 1  25 16  -30 0  0 1
             60 1  5  21 1  21  5   -7 0  0 1
             61 0 16  88 1  88 16   31 1 31 1
             62 0 14  44 1  44 16   12 1 12 1
             63 1  5   7 1   7  5  -25 0  0 1
             64 0 16  10 1  10 16    0 1  0 1
             65 0 15  20 1  20 15   10 1 10 1
             66 1  9  73 1  73  5  -15 0  0 1
             67 1  5  40 1  40  5   17 1 17 1
             68 0 12  10 1  10 15  -10 0  0 1
             69 0 12  41 1  41 15   27 1 27 1
             70 1  5  17 1  17  5  -39 0  0 1
             71 0 16   8 1   8 16  -45 0  0 1
             72 0 16  56 1  56 16   72 1 72 1
             73 0 12   5 1   5 12   23 1 23 1
             74 0 12  15 1  15 12  -16 0  0 1
             75 0 12 -22 0   0 16  -26 0  0 1
             76 0 16  50 1  50 16    5 1  5 1
             77 0 16  64 1  64 16  -53 0  0 1
             78 0 15  27 1  27 17   32 1 32 1
             79 0 16   2 1   2 16  -38 0  0 1
             80 0 11  16 1  16  5  -75 0  0 1
             81 0 12 -59 0   0 12  -76 0  0 1
             82 0 16 -43 0   0 16  -81 0  0 1
             83 0 17 -54 0   0 11  -87 0  0 1
             84 0 10  50 1  50 10  -26 0  0 1
             85 0 11  60 1  60 16  -82 0  0 1
             86 0 12  -8 0   0 12  -34 0  0 1
             87 0 11 115 1 115 12   95 1 95 1
             88 0 12  24 1  24 12   14 1 14 1
             89 0 11 -38 0   0 18 -209 0  0 1
             90 0 12 -13 0   0 12  -44 0  0 1
             91 0 16   6 1   6 16   -1 0  0 1
             92 0 12  68 1  68 17   53 1 53 1
             93 0 18 -40 0   0 16  -63 0  0 1
             94 0 16   8 1   8 17   31 1 31 1
             95 0 17   5 1   5 17  -59 0  0 1
             96 0 15 -37 0   0 16  -95 0  0 1
             97 0 18 -56 0   0 18 -240 0  0 1
             98 0 16  43 1  43 16   50 1 50 1
             99 1  8  13 1  13 10  -62 0  0 1
            100 0 18  44 1  44 17   -6 0  0 2
            101 0 11  29 1  29 16  -89 0  0 2
            102 0 12  52 1  52 17   26 1 26 2
            103 0 18 -18 0   0 18  -15 0  0 2
            104 0 17 -59 0   0 18 -215 0  0 2
            105 0 11   1 1   1 11   -5 0  0 2
            106 0 16 -51 0   0 16  -64 0  0 2
            107 0 17 -48 0   0 18 -176 0  0 2
            108 0 11  13 1  13 11  -77 0  0 2
            109 0 11  31 1  31 11   27 1 27 2
            110 0 11  57 1  57 15    6 1  6 2
            111 0 16 -49 0   0 17 -100 0  0 2
            112 0 18   6 1   6 17  -17 0  0 2
            113 1  6  42 1  42  6  -15 0  0 2
            114 0 11  57 1  57 11   60 1 60 2
            115 0 18 -15 0   0 16  -80 0  0 2
            116 0 16  80 1  80 17   44 1 44 2
            117 0 12  14 1  14  5  -72 0  0 2
            118 0 12  66 1  66 12   27 1 27 2
            119 1  9  81 1  81 11  -24 0  0 2
            120 0 17  29 1  29 17   -3 0  0 2
            121 0 14   6 1   6  5  -61 0  0 2
            122 0 11   9 1   9 11  -70 0  0 2
            123 1  7 -44 0   0  4 -167 0  0 2
            124 1  5  42 1  42 11   82 1 82 2
            125 1  7  56 1  56 12  -90 0  0 2
            126 1  7  67 1  67  6   38 1 38 2
            127 1  6 -18 0   0  6  -28 0  0 2
            128 0 16 -16 0   0 17  -32 0  0 2
            129 0 16 -22 0   0 16  -71 0  0 2
            130 0 11  53 1  53  5  -40 0  0 2
            131 1  5 -35 0   0  5  -71 0  0 2
            132 0 11 103 1 103  6   24 1 24 2
            133 1  7 -43 0   0  5  -94 0  0 2
            134 0 10 -51 0   0  6 -125 0  0 2
            135 1  7 -17 0   0  5  -27 0  0 2
            136 1  5 -14 0   0 12  -69 0  0 2
            137 1  9  88 1  88  5   77 1 77 2
            138 1 12  50 1  50  8   20 1 20 2
            139 0 12 110 1 110 10   91 1 91 2
            140 0 10 -43 0   0 11  -86 0  0 2
            141 1  8 -27 0   0  7 -122 0  0 2
            142 0 16  47 1  47  5    9 1  9 2
            143 1  7 -18 0   0  7  -38 0  0 2
            144 1  5  55 1  55  7 -260 0  0 2
            145 1 12 -39 0   0 13  -47 0  0 2
            146 0 11  22 1  22  5    2 1  2 2
            147 0 16  28 1  28 16   22 1 22 2
            148 0 12  41 1  41 12   43 1 43 2
            149 1  8  31 1  31  6  -60 0  0 2
            150 1  3  17 1  17  7 -146 0  0 2
            151 0 17  38 1  38 17    8 1  8 2
            152 0 16  16 1  16  7 -106 0  0 2
            153 1  7  -4 0   0  5  -42 0  0 2
            154 0 15 -16 0   0 10  -58 0  0 2
            155 0 16  33 1  33 11  -17 0  0 2
            156 0 16 -23 0   0 17  -53 0  0 2
            157 0 11 -53 0   0  5 -103 0  0 2
            158 1  5  -8 0   0  5   -4 0  0 2
            159 0 11  19 1  19 17    9 1  9 2
            160 0 17  -4 0   0 17  -49 0  0 2
            161 1  4 -85 0   0 11   -3 0  0 2
            162 0 12  48 1  48 17   10 1 10 2
            163 1  5  28 1  28  5  -20 0  0 2
            164 1  9 -21 0   0  5 -108 0  0 2
            165 0 17 -22 0   0 16  -65 0  0 2
            166 0 16  28 1  28 16  -24 0  0 2
            167 0 16  37 1  37 16   47 1 47 2
            168 0 16  58 1  58 18  -43 0  0 2
            169 1  8  36 1  36 16  -76 0  0 2
            170 0 18 -18 0   0 18  -72 0  0 2
            171 0 15  69 1  69 16   21 1 21 2
            172 0 17   1 1   1 12  -10 0  0 2
            173 1  5 -24 0   0 16 -311 0  0 2
            174 0 12 -20 0   0 11  -67 0  0 2
            175 0 14  95 1  95 12   82 1 82 2
            176 0 12  60 1  60  5   20 1 20 2
            177 1  7 -56 0   0  5  -75 0  0 2
            178 0 12  80 1  80 17   28 1 28 2
            179 0 16 -36 0   0  5  -79 0  0 2
            180 0 12 -24 0   0 11  -65 0  0 2
            181 1  3  -1 0   0  9 -220 0  0 2
            182 1  9 -13 0   0  5  -60 0  0 2
            183 0 15  75 1  75 16   35 1 35 2
            184 1  5 -35 0   0 11  -77 0  0 2
            185 0 11   4 1   4  7 -226 0  0 2
            186 0 11  33 1  33 11   53 1 53 2
            187 0 17 -53 0   0 17  -88 0  0 2
            188 0 12  65 1  65  5  -59 0  0 2
            189 0 12  45 1  45 12   46 1 46 2
            190 1  9 -58 0   0  5 -160 0  0 2
            191 0 16  -2 0   0  5 -119 0  0 2
            192 0 12  21 1  21 12 -113 0  0 2
            193 0 12  69 1  69  5   14 1 14 2
            194 0 15  20 1  20 15  -59 0  0 2
            195 1  9  24 1  24  7  -68 0  0 2
            196 1  9  14 1  14  9  -49 0  0 2
            197 1  7  43 1  43  7  -53 0  0 2
            198 0 15   5 1   5 16  -13 0  0 2
            199 0 12 -29 0   0 12 -138 0  0 2
            200 0 12  40 1  40 11   10 1 10 3
            end

            • #7
              You say that in the example dataset neither mother's nor father's education is significant, and that father's education becomes significant when mother's education is included in the model as well. You've put predicted values for mother's education and father's education into the same model. Theoretically these two variables should be correlated, since we know there is a good deal of homophily among married couples in terms of education. That is, there is a process by which people select one another for marriage, and that process involves having similar levels of education. Indeed, you show that the predicted values have a moderate correlation of approximately 0.5 in the example data, and a strong correlation of approximately 0.95 in your real dataset. You also put the predictors of your predicted education variables into the combined model as well.

              This does indeed seem like a recipe for multicollinearity. My understanding is that you should expect inaccurate or otherwise exaggerated coefficients, relatively large standard errors, and relatively wide confidence intervals. The solution to your problem will depend to some extent on your research question and the relevant theory, but I would probably avoid using two variables with a correlation of 0.95 in the same model without a clear theoretical justification. I suppose if you really do need both variables, then you might create a latent "parent education" construct. However, when you already have mother's education in the model, you are simply not adding much more information by including father's education (and vice versa). If you only include one of these variables you have a more parsimonious model and you avoid some of the issues related to multicollinearity.
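
              To illustrate the latent-construct idea, one possible shape in -sem- would be something like the sketch below (this ignores your instrumenting step entirely, and the latent name ParentEd is just illustrative):

              Code:
              sem (ParentEd -> mom_ed dad_ed) ///
                  (health <- ParentEd mom_run mom_interact dad_run dad_interact region)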

              • #8
                Dear Daniel,

                Thank you so much for your insightful inputs. I am grateful for that.
                Theoretically these two variables should be correlated since we know there is a good deal of homophily among married couples in terms of education. That is, there is a process by which people select one another for marriage, and that process involves having similar levels of education
                You are right and I am aware of this. What you mentioned is known as assortative matching.

                The solution to your problem will depend to some extent on your research question and the relevant theory, but I would probably avoid using two variables with a correlation of 0.95 in the same model without a clear theoretical justification
                Here, I assume that child health is a function of both mother's education and father's education, because both may affect child health simultaneously. So, in the first specification, assuming that the effect of mother's education on child health is negative and significant, the coefficient on mom_ed may not be the true effect of mother's education, because father's education is omitted. Similar logic applies to dad_ed. That is why I came up with the second specification. However, the challenge of multicollinearity arises because in my real data the correlation between yhat1 and yhat2 is very strong (0.95). Still, I do not know how much the results of the second specification are affected by multicollinearity. Here are some scenarios I can think of (suppose I am talking about my real data, where both mom_ed and dad_ed are negative and significant in the first specification, but only yhat2 is significant in the second):
                i) the results show that yhat1 and yhat2 are negatively associated with child health, and that the correlation between yhat2 and health is stronger than that between yhat1 and health. Thus, when putting yhat1 and yhat2 into one model, it could be that yhat2 outperforms yhat1 because its effect size is larger and its association with health is stronger. That would mean yhat1 itself may still have some effect, but because of its strong correlation with yhat2 it is outperformed and its coefficient becomes insignificant.
                ii) on the other hand, it could be the case that yhat1 is significant in the first specification only because that regression omits an important variable, namely yhat2. So, when yhat2 is controlled for, the coefficient on yhat1 turns insignificant. That would mean yhat1 in fact has no effect on child health.

                To be honest, I do not know which of these it could be, or how to determine or examine these two possibilities. Can a simulation help? If so, I would be grateful if you could provide me with some code.

                I suppose if you really do need both variables, then you might create a latent "parent education" construct
                Thanks for your advice, it seems a very interesting approach, but could you please elaborate a bit more on how to create a latent "parent education" construct?

                Thank you so much for your time and help.
