Regressing the results of regression: a case of the oozlum bird in statistics?

Nigel Moore

Join Date: Apr 2016

Posts: 79
#1

23 Aug 2017, 02:41

Hello everyone,

I have a -mixed- model regression of embryonic heart rate that includes a number of independent factor variables and a single continuous covariate (pH). If I apply -margins- to the model without the covariate I obtain results that are (almost) identical to those derived from simply calculating the mean group values. That is, my model accurately predicts the measured values when the covariate is not included. This makes sense to me, and makes me happy. It may be the happiness born of ignorance though.

When I add the covariate, the results change slightly, but not substantially. This is also a good thing, since it means that the pH-adjusted predictions are also close to the measured values.

To quantify the relationship between measured and predicted values, I ran a regression between the two (actually two regressions, with and without the pH covariate).

I have written this up as follows:

The predictivity of the mixed regression model, and the influence of within-sample and within-group pH variance, were determined by regressing mean predicted heart rate (HR′), estimated with and without pH as covariate, against mean measured heart rate (HR). Without pH as covariate in the model, the regression between HR′ and HR was close to unity; for GD 11 embryos the coefficient (slope) was 0.9999999 and the constant (intercept) was 0.0000166, while for GD 13 embryos the coefficient and constant were 1 and 3.62×10^‑7 respectively (R²=1.0000 in both cases). When pH was included as a covariate in the model, the coefficient and constant for GD 11 embryos were 1.034411 and -6.107584 (R²=0.9956), while for GD 13 embryos they were 0.9935429 and 1.383487 (R²=0.9979). In all cases, the influence of measured pH in the model was small and not statistically significant (p>0.05). Nevertheless, all HR′ reported herein were determined with measured sample pH included as covariate in the prediction model.

This makes sense to me. I have tried to demonstrate that the model accurately represents the measured values when the covariate is not applied, and therefore the values obtained when it is applied are reliable. But by regressing the predicted values on the very measurements they are meant to predict, have I simply shown that white is white and black is black?

Stata 14.2MP
OS X
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#2

23 Aug 2017, 08:55

I don't understand what you've done here. The relationship between the measured and predicted values of the outcome variable in a regression is given by that regression's R², and there is nothing gained by doing another regression between the observed and predicted values. Moreover, the fact that the model with no covariates accurately predicts the mean of the observed outcomes is always true, and it says nothing about the trustworthiness of other models that include covariates.

Perhaps I am misunderstanding what you did. Descriptions of analyses in words often lack detail or have ambiguities. If you want a more considered evaluation of what you've done and what it means, I suggest you post the code and output. (Be sure to bind it between code delimiters so it is easily readable.)
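
The point about the covariate-free model can be illustrated with a minimal sketch. It assumes Stata's bundled auto dataset rather than the heart-rate data, and the variable heavy is invented purely for illustration. When a linear model contains only factor variables and all of their interactions, the predictive margins for each cell reproduce the raw cell means by construction.

Code:

* Illustration only: auto.dta ships with Stata; heavy is a hypothetical grouping variable
sysuse auto, clear
generate byte heavy = weight > 3000

* Saturated model: both factors plus their interaction, no continuous covariate
regress mpg i.foreign##i.heavy

* Predictive margins for each foreign-by-heavy cell
margins foreign#heavy

* Raw cell means for comparison; these should match the margins above
tabulate foreign heavy, summarize(mpg) means

The agreement here comes from the model being saturated, so it carries no information about how well a model with an added continuous covariate fits.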

Nigel Moore

Join Date: Apr 2016
Posts: 79
#3

23 Aug 2017, 11:49

Hello Clyde,

Thank you for your reply. I may have missed something, but there is no R² in the -mixed- model output:

Code:

.  mixed hr ib179.sb10##time c.ph if gd==11 & treat==0 & !inrange(id, 9147, 9194) || id:

Performing EM optimization: 

Performing gradient-based optimization: 

Iteration 0:   log likelihood = -435.33309  
Iteration 1:   log likelihood = -435.33077  
Iteration 2:   log likelihood = -435.33077  

Computing standard errors:

Mixed-effects ML regression                     Number of obs     =        106
Group variable: id                              Number of groups  =         53

                                                Obs per group:
                                                              min =          2
                                                              avg =        2.0
                                                              max =          2

                                                Wald chi2(6)      =      84.12
Log likelihood = -435.33077                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
          hr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        sb10 |
          0  |   54.66829   35.83739     1.53   0.127    -15.57171    124.9083
         15  |    37.3623    23.2042     1.61   0.107    -8.117095    82.84169
             |
      1.time |  -4.075213   2.822213    -1.44   0.149    -9.606648    1.456223
             |
   sb10#time |
        0 1  |  -36.44387   8.415938    -4.33   0.000    -52.93881   -19.94894
       15 1  |  -37.57096   7.329471    -5.13   0.000    -51.93646   -23.20546
             |
          ph |   35.28616   20.87117     1.69   0.091    -5.620595    76.19291
       _cons |  -70.74118   152.6358    -0.46   0.643    -369.9019    228.4196
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Identity                 |
                  var(_cons) |   80.96055    33.8667      35.66206    183.7979
-----------------------------+------------------------------------------------
               var(Residual) |   149.8417   29.23925      102.2196      219.65
------------------------------------------------------------------------------
LR test vs. linear model: chibar2(01) = 6.85          Prob >= chibar2 = 0.0044

So what I have done is collect the output of -margins- and compare it with the measured values:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte gd double(measured predicted_noph) float predicted_ph
11 179.6364 179.9054 179.6364
11 185.8182 185.0371 185.8182
11 181.4545 181.4456 181.4545
11 182.1818 182.1728 182.1818
11    176.8 178.4986    176.8
11    176.4 175.5507    176.4
11      178 177.5754      178
11    176.4 175.9754    176.4
11 177.3333 178.4275 177.3333
11 172.4444 171.0979 172.4444
11 173.7778 173.1127 173.7778
11 174.6667 174.0016 174.6667
13 229.6667 229.7297 229.6667
13 218.6667 218.6692 218.6667
13      211 210.9672      211
13 191.3333 191.3005 191.3333
13      227 227.0617      227
13    235.8 235.7927    235.8
13    232.2 232.1728    232.2
13      221 220.9728      221
13    227.5 227.5544    227.5
13      233 232.9782      233
13      234 233.9837      234
13    227.5 227.4837    227.5
11 182.2859 186.5453 182.2857
11 150.2857 146.0262 150.2857
11 186.8571 188.8231 186.8571
11 149.1429 147.1769 149.1429
11 187.2821 187.7299 187.2821
11 184.1026 183.6547 184.1026
13 222.2222 223.7927 222.2222
13 216.8889 215.3184 216.8889
13    224.4 224.6817    224.4
13    218.8 218.5183    218.8
13 225.4444 225.0691 225.4444
13      230 230.3754      230
11 187.2821 187.7014 187.2821
11 184.1026 183.6832 184.1026
11 188.8889 189.0541 188.8889
11 181.7778 181.6126 181.7778
11      186 186.8591      186
11    169.6 168.7409    169.6
11      198 199.3837      198
11      124 122.6163      124
11    186.4 186.4165    186.4
11    174.8 174.7835    174.8
11      190 190.6443      190
11    164.4 163.7557    164.4
13 225.4444 225.1598 225.4444
13      230 230.2847      230
13 224.6667 224.3166 224.6667
13 215.3333 215.6834 215.3333
13    228.8 228.7457    228.8
13    201.6 201.6543    201.6
13      226 226.2209      226
13      200 199.7791      200
13      230 230.2625      230
13      196 195.7375      196
13      235 236.8422      235
13   174.25 172.4078   174.25
13      234 236.0278      234
13      184 181.9722      184
13 222.9091  222.728 222.9091
13 219.2727 219.4538 219.2727
13    226.8 226.8145    226.8
13    200.4 200.3855    200.4
13      236 237.2674      236
13       72 70.73264       72
11      184 184.7393      184
11    146.8 146.0607    146.8
11 190.8571 190.6741 190.8571
11 133.1429 133.3259 133.1429
11    193.2 193.0857    193.2
11    159.2 159.3143    159.2
11      186 185.9347      186
11    169.6 169.6653    169.6
11      190  189.951      190
11    164.4  164.449    164.4
11    184.4 184.3724    184.4
11    170.8 170.8276    170.8
11      190 189.8945      190
11    175.6 175.7055    175.6
13      234 228.4634      234
13      184 189.5366      184
13      236 232.5396      236
13       72 75.46035       72
13 227.7143 228.5123 227.7143
13 225.1429 224.3449 225.1429
13 229.7778 230.1073 229.7778
13 227.5556  227.226 227.5556
13    230.4 231.3657    230.4
13    189.6 188.6343    189.6
13    223.6 223.8608    223.6
13    194.4 194.1392    194.4
13    228.8 228.7276    228.8
13    201.6 201.6724    201.6
13      230 229.6716      230
13    226.4 226.7284    226.4
end

Code:

. regress predicted_noph measured if gd==11

      Source |       SS           df       MS      Number of obs   =        44
-------------+----------------------------------   F(1, 42)        =   9691.99
       Model |  11144.3094         1  11144.3094   Prob > F        =    0.0000
    Residual |  48.2936049        42  1.14984774   R-squared       =    0.9957
-------------+----------------------------------   Adj R-squared   =    0.9956
       Total |   11192.603        43  260.293093   Root MSE        =    1.0723

------------------------------------------------------------------------------
predicte~oph |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    measured |   1.034411   .0105072    98.45   0.000     1.013207    1.055615
       _cons |  -6.107584   1.857299    -3.29   0.002    -9.855765   -2.359404
------------------------------------------------------------------------------

Code:

. regress predicted_ph measured if gd==11

      Source |       SS           df       MS      Number of obs   =        44
-------------+----------------------------------   F(1, 42)        >  99999.00
       Model |  10415.1793         1  10415.1793   Prob > F        =    0.0000
    Residual |  3.7389e-08        42  8.9022e-10   R-squared       =    1.0000
-------------+----------------------------------   Adj R-squared   =    1.0000
       Total |  10415.1793        43  242.213472   Root MSE        =    3.0e-05

------------------------------------------------------------------------------
predicted_ph |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    measured |   .9999999   2.92e-07  3.4e+06   0.000     .9999993           1
       _cons |   .0000166   .0000517     0.32   0.750    -.0000877    .0001209
------------------------------------------------------------------------------

Etc.


Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#4

23 Aug 2017, 12:21

Yes, of course, you are right. There is no R² in -mixed-. So you are basically emulating that.

You don't say in detail how the variables predicted_ph and predicted_noph were created--you allude to output of -margins-, but without showing the specific -margins- command you used, it's hard to say what this means. Also in #1 you described a model with no covariates, but you don't show that model. So, all in all, I still don't have a clear picture of what you did, so I can't give a confident commentary.
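
For reference, an R²-like quantity can be obtained directly from the -mixed- fit instead of from the -margins- output. The following is a minimal sketch, assuming the variables from the command posted above; the name hr_xb is hypothetical. It is only one of several possible pseudo-R² definitions, not a statistic that -mixed- itself reports, and it uses the fixed portion of the prediction only, so it ignores the random intercept.

Code:

* Model as posted earlier in the thread
mixed hr ib179.sb10##time c.ph if gd==11 & treat==0 & !inrange(id, 9147, 9194) || id:

* Fixed-portion linear prediction for the estimation sample (hr_xb is a hypothetical name)
predict double hr_xb if e(sample), xb

* Squared correlation between observed and predicted heart rate
correlate hr hr_xb if e(sample)
display "squared correlation = " r(rho)^2

Because this works on the individual observations rather than on group means, the value would be expected to be well below the R² values obtained from regressing the margins on the group means.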

Nigel Moore

Join Date: Apr 2016
Posts: 79
#5

23 Aug 2017, 12:57

The problem is one that you have helped me with recently (here).

The independent variables are:

conc10 (10 × the drug concentration, which can be a float to 1 d.p.)
sb10 (10 × the sodium bicarbonate concentration, which is a float to 1 d.p.)
treat, an integer for the drug being used (0=control, 1=drug1, 2=drug2)
time

The covariate is the measured ph of the sample at each time point, c.ph.

Essentially, there are a number of models. Depending on the evaluation that we want to make, each is restricted to a subset of the records in the database. The -mixed- command is repeated for GD 11 and GD 13 embryos. In the particular cases below, the reference sodium bicarbonate concentration is 17.9 mM:

Code:

mixed hr ib179.sb10##time c.ph if gd==11 || id:
margins time, over(sb10)

mixed hr conc10##treat##ib179.sb10##time c.ph if gd==11 || id:
margins time, over(sb10 treat conc10)

Without the covariate, these become:

Code:

mixed hr ib179.sb10##time if gd==11 || id:
margins time, over(sb10)

mixed hr conc10##treat##ib179.sb10##time if gd==11 || id:
margins time, over(sb10 treat conc10)

So, running that model from my first post, with margins and the covariate:

Code:

. mixed hr ib179.sb10##time c.ph if gd==11 & treat==0 & !inrange(id, 9147, 9194) || id:

Performing EM optimization: 

Performing gradient-based optimization: 

Iteration 0:   log likelihood = -435.33309  
Iteration 1:   log likelihood = -435.33077  
Iteration 2:   log likelihood = -435.33077  

Computing standard errors:

Mixed-effects ML regression                     Number of obs     =        106
Group variable: id                              Number of groups  =         53

                                                Obs per group:
                                                              min =          2
                                                              avg =        2.0
                                                              max =          2

                                                Wald chi2(6)      =      84.12
Log likelihood = -435.33077                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
          hr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        sb10 |
          0  |   54.66829   35.83739     1.53   0.127    -15.57171    124.9083
         15  |    37.3623    23.2042     1.61   0.107    -8.117095    82.84169
             |
      1.time |  -4.075213   2.822213    -1.44   0.149    -9.606648    1.456223
             |
   sb10#time |
        0 1  |  -36.44387   8.415938    -4.33   0.000    -52.93881   -19.94894
       15 1  |  -37.57096   7.329471    -5.13   0.000    -51.93646   -23.20546
             |
          ph |   35.28616   20.87117     1.69   0.091    -5.620595    76.19291
       _cons |  -70.74118   152.6358    -0.46   0.643    -369.9019    228.4196
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Identity                 |
                  var(_cons) |   80.96055    33.8667      35.66206    183.7979
-----------------------------+------------------------------------------------
               var(Residual) |   149.8417   29.23925      102.2196      219.65
------------------------------------------------------------------------------
LR test vs. linear model: chibar2(01) = 6.85          Prob >= chibar2 = 0.0044

. margins time, over(sb10)

Predictive margins                              Number of obs     =        106

Expression   : Linear prediction, fixed portion, predict()
over         : sb10

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   sb10#time |
        0 0  |   186.5453   6.270516    29.75   0.000     174.2553    198.8352
        0 1  |   146.0262   6.270516    23.29   0.000     133.7362    158.3162
       15 0  |   188.8231   5.858661    32.23   0.000     177.3403    200.3059
       15 1  |   147.1769   5.858661    25.12   0.000     135.6941    158.6597
      179 0  |   187.7299   2.447076    76.72   0.000     182.9337    192.5261
      179 1  |   183.6547   2.447076    75.05   0.000     178.8585    188.4509
------------------------------------------------------------------------------

Now without the covariate:

Code:

. mixed hr ib179.sb10##time if gd==11 & treat==0 & !inrange(id, 9147, 9194) || id:

Performing EM optimization: 

Performing gradient-based optimization: 

Iteration 0:   log likelihood =  -436.7289  
Iteration 1:   log likelihood = -436.72557  
Iteration 2:   log likelihood = -436.72557  

Computing standard errors:

Mixed-effects ML regression                     Number of obs     =        106
Group variable: id                              Number of groups  =         53

                                                Obs per group:
                                                              min =          2
                                                              avg =        2.0
                                                              max =          2

                                                Wald chi2(5)      =      78.34
Log likelihood = -436.72557                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
          hr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        sb10 |
          0  |  -4.996337    6.29288    -0.79   0.427    -17.33016    7.337482
         15  |  -.4249084    6.29288    -0.07   0.946    -12.75873    11.90891
             |
      1.time |  -3.179487   2.842858    -1.12   0.263    -8.751387    2.392413
             |
   sb10#time |
        0 1  |  -28.82051    7.28761    -3.95   0.000    -43.10397   -14.53706
       15 1  |   -34.5348    7.28761    -4.74   0.000    -48.81825   -20.25134
             |
       _cons |   187.2821    2.45482    76.29   0.000     182.4707    192.0934
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Identity                 |
                  var(_cons) |   77.42346   33.98905      32.74865    183.0424
-----------------------------+------------------------------------------------
               var(Residual) |    157.596   30.61415      107.6944    230.6201
------------------------------------------------------------------------------
LR test vs. linear model: chibar2(01) = 6.09          Prob >= chibar2 = 0.0068

. margins time, over(sb10)

Predictive margins                              Number of obs     =        106

Expression   : Linear prediction, fixed portion, predict()
over         : sb10

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   sb10#time |
        0 0  |   182.2857   5.794325    31.46   0.000      170.929    193.6424
        0 1  |   150.2857   5.794325    25.94   0.000      138.929    161.6424
       15 0  |   186.8571   5.794325    32.25   0.000     175.5005    198.2138
       15 1  |   149.1429   5.794325    25.74   0.000     137.7862    160.4995
      179 0  |   187.2821    2.45482    76.29   0.000     182.4707    192.0934
      179 1  |   184.1026    2.45482    75.00   0.000     179.2912    188.9139
------------------------------------------------------------------------------


Nigel Moore

Join Date: Apr 2016
Posts: 79
#6

23 Aug 2017, 13:11

For comparison, the measured data look like this:

Code:

. bysort sb: summarize hr0 hr1 if gd==11 & treat==0 & !inrange(id, 9147, 9194)

----------------------------------------------------------------------------------------------------------------------
-> sb = 0

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         hr0 |          7    182.2857    7.609518        168        192
         hr1 |          7    150.2857    22.25395        112        176

----------------------------------------------------------------------------------------------------------------------
-> sb = 1.5

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         hr0 |          7    186.8571    12.79881        168        204
         hr1 |          7    149.1429    34.84934         96        192

----------------------------------------------------------------------------------------------------------------------
-> sb = 17.9

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         hr0 |         39    187.2821    13.66892        156        208
         hr1 |         39    184.1026     12.7976        152        208

Last edited by Nigel Moore; 23 Aug 2017, 13:16.


Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#7

23 Aug 2017, 13:38

Ah, yes, I thought the context seemed familiar, but I couldn't quite place it.

Well, I think that demonstrating a high correlation between the predicted margins and the measured values is one way of building confidence in a model. Demonstrating that the means are also in line is important, as you can have a high correlation between two variables that are widely offset or on different scales. So what you've done makes sense. The only part I'd disagree with is that I don't think that good results from one model are in any way a reason for confidence in some different model. To bolster a model, you need to present results from that model.

Other than that point, what you've done makes sense to me.
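
A quick way to check that the means are in line, and not just highly correlated, is to look at the differences between predicted and measured values directly. The following is a minimal sketch, assuming the -dataex- listing from earlier in the thread has been loaded; the diff_ variable names are hypothetical.

Code:

* Assumes the -dataex- example data (gd, measured, predicted_noph, predicted_ph) are in memory
foreach v of varlist predicted_noph predicted_ph {
    generate double diff_`v' = `v' - measured
}

* Mean and spread of the discrepancies, separately for GD 11 and GD 13
bysort gd: summarize diff_predicted_noph diff_predicted_ph

A mean difference near zero together with a small standard deviation indicates agreement in level as well as in correlation.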
Nigel Moore

Join Date: Apr 2016

Posts: 79
#8

23 Aug 2017, 14:03

Thank you once again, Clyde, that is very helpful.

"The only part I'd disagree with is that I don't think that good results from one model are in any way a reason for confidence in some different model. To bolster a model, you need to present results from that model."

Good point. The problem is that with CO₂-buffered culture conditions, a small change in pH is always expected. Even when you control pH, it will change. So I don't think that there's any realistic way of testing that model beyond what I have already done.

However, as part of this study, we made major changes to pH by changing the sodium bicarbonate concentration. That did result in large changes in heart rate. So the fact that minor fluctuations in pH result in only small changes to heart rate is consistent with that.

As long as I don't claim that this somehow 'validates' the covariate model, what I have now seems to be in order. All I have reported is a minor deviation between the model with the covariate and the one without. When I stated that "the influence of measured pH in the model was small and not statistically significant (p>0.05)", I was referring to the influence of pH in the -mixed- model, not a comparison of the with/without regressions. Sorry not to have made that clearer.
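
The influence of pH within the -mixed- model can also be assessed with a likelihood-ratio test between the two fits shown earlier in the thread, since both are estimated by maximum likelihood on the same 106 observations and differ only in the pH term. The following is a minimal sketch under those assumptions; the names with_ph, no_ph and in_ph_model are hypothetical.

Code:

* Model with the pH covariate, as posted earlier
mixed hr ib179.sb10##time c.ph if gd==11 & treat==0 & !inrange(id, 9147, 9194) || id:
estimates store with_ph

* Flag the estimation sample so the reduced model uses exactly the same observations
generate byte in_ph_model = e(sample)

* Same model without pH
mixed hr ib179.sb10##time if in_ph_model || id:
estimates store no_ph

* Likelihood-ratio test of the pH term (1 degree of freedom)
lrtest with_ph no_ph

* The same statistic by hand from the log likelihoods reported in the thread (about 2.79)
display 2*(-435.33077 - (-436.72557))
display chi2tail(1, 2*(-435.33077 - (-436.72557)))

This complements the Wald test on ph in the coefficient table (z = 1.69, p = 0.091) and should lead to a similar conclusion.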
