Understanding differences in marginsplot and predicted values

Julien Dagenais

Join Date: Sep 2016

Posts: 34
#1

20 Oct 2016, 13:16

Dear Statalist,

I'm running a multivariable logistic regression and using marginsplot to understand the relationship between aki2 (DV) and log_avl (IV) in an observational cohort. As a side note, log_avl was log transformed after assessing its distribution with ladder plots; it is now normally distributed and has an understandable and statistically significant relationship with aki2.

To better understand the specific nature of the relationship between aki2 and log_avl in the context of other covariates, I'm also generating scatterplots after using the predict command. There seems to be a discrepancy between the shape of the marginsplot and that of the scatterplot. Please see the code and graphs below:

logistic aki2 c.log_avl##i.it_type i.agecat male race i.bmicat i.cci_cat Auto_CKD_Preop i.renal i.clavien_cat
margins, at(log_avl=(-2(1)6))
predict fitted2
quietly margins, at(log_avl=(-2(0.5)6)) saving(file2, replace)
marginsplot
graph addplot scatter fitted2 log_avl, msym(oh) msize(vsmall) mcolor(cranberry*0.8) xlabel(-2(1)6)
graph addplot qfitci fitted2 log_avl

The marginsplot is in blue. The scatterplot is in red and is created by using the predict command. The quadratic fit to the scatterplot is in gray. A quadratic fit was chosen after running the fp command to determine the optimal fit between pr(aki2) and log_avl.

When I assess the scatterplot, it seems like the quadratic fit does a much better job of fitting the relationship. Is that because margins naturally attempts to fit the shape of a logistic regression, and so the tail ends of the curve flatten out? Which of the two curves is more appropriate? Is marginsplot really giving me the correct relationship between these two variables in the regression?

Thanks for any help!
Julien

Julien Dagenais

Join Date: Sep 2016
Posts: 34
#2

24 Oct 2016, 09:18

Dear Statalist,

I've done some follow-up work since the question was posed, and a few questions arise when trying to tease out the relationship between the outcome, pr(aki2), and the IV of interest, log_avl. I don't have a ton of experience with fractional polynomials, but they seem to describe the data most accurately, so any help with the questions below is appreciated.

After running margins and obtaining the values for pr(aki2), I then utilize the fracpoly command to best define the nature of the relationship, which doesn't appear to be linear per the scatterplot above. In fact, the best estimated model has 4df with powers of 0.5 and 2 (code and output below).

Code:

logistic aki2 c.log_avl##i.it_type i.agecat male race i.bmicat i.cci_cat Auto_CKD_Preop i.renal i.clavien_cat, vce(robust)
margins, at(log_avl=(-2(1)6)) vce(unconditional)
predict fitted2
quietly margins, at(log_avl=(-2(0.5)6)) saving(file2, replace) vce(unconditional)
fracpoly regress fitted2 log_avl

-> gen double Ilog___1 = X^.5-2.305790321 if e(sample)
-> gen double Ilog___2 = X^2-28.26696929 if e(sample)
(where: X = (log_avl+2.893064498901367))

      Source |       SS       df       MS              Number of obs =    1272
-------------+------------------------------           F(  2,  1269) =  647.20
       Model |  24.2111391     2  12.1055695           Prob > F      =  0.0000
    Residual |  23.7359434  1269  .018704447           R-squared     =  0.5050
-------------+------------------------------           Adj R-squared =  0.5042
       Total |  47.9470825  1271  .037723904           Root MSE      =  .13676

------------------------------------------------------------------------------
     fitted2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    Ilog___1 |  -.1411984    .036375    -3.88   0.000    -.2125601   -.0698367
    Ilog___2 |   .0137386   .0008466    16.23   0.000     .0120777    .0153995
       _cons |      .2866   .0043338    66.13   0.000     .2780978    .2951022
------------------------------------------------------------------------------
Deviance: -1454.50. Best powers of log_avl among 44 models fit: .5 2.

Fractional polynomial model comparisons:
------------------------------------------------------------------------------
log_avl          df      Deviance     Res. SD    Dev. dif.    P (*)    Powers
------------------------------------------------------------------------------
Not in model      0      -560.152     .194226     894.353     0.000
Linear            1     -1338.540     .143086     115.965     0.000    1
m = 1             2     -1439.778     .137504      14.727     0.001    3
m = 2             4     -1454.504     .136764        --         --     .5 2
------------------------------------------------------------------------------
(*) P-value from deviance difference comparing reported model with m = 2 model

I can then plot this estimated curve using Patrick Royston's mcp command (https://www.econ.uzh.ch/dam/jcr:0000...8a3/sj13-3.pdf , pg 110).

Code:

mcp log_avl (Ilog___1 Ilog___2), ci at1(-1.4(0.2)6) plotopts(title("Positive Margins") ytitle("Predicted Probability")) saving(mcp1, replace)

[Image: fitted2_aki.jpg]

Clearly, this is not the curve generated by marginsplot (in blue, #1, above). However, it seems to fit the data accurately and is borne out by the statistical testing above. Am I correct to move forward with using this higher-order relationship to define the relationship between pr(aki2) and log_avl? Would using margins and marginsplot to fit the data to the sigmoid shape of a logistic regression be incorrect? Had I not looked at the scatterplot and had I just generated a marginsplot curve, would I be oversimplifying and misconstruing the relationship? Finally, the confidence intervals seem awfully small here, and they hardly change if I resample with a bootstrap or jackknife. Should that raise concern that I'm overfitting?

Thanks!

Julien

Last edited by Julien Dagenais; 24 Oct 2016, 09:21.

Julien Dagenais

Join Date: Sep 2016

Posts: 34
#3

26 Oct 2016, 07:45

Any help here?
I apologize for any errors. My background is not as a statistician. I'll try to clarify my questions. Essentially, I'd like to know if anyone can please help address any of the following questions:

1) If marginsplot is meant to be a graphical representation of the adjusted predictions, why does the blue curve in #1 form a less than ideal fit with the corresponding scatterplot? Is it because it's being plotted under the constraints of a logistic regression and so is meant to approximate a sigmoid curve?

2) Is it wrong methodologically to run the model as is, and then take the adjusted predictions and fit them to a fractional polynomial curve if the estimated FP model appears to be a good fit per the model estimates in #2?

3) The confidence intervals of the FP curve in #2 seem awfully small. Obviously, if it's a better fit than the marginsplot, I'd expect smaller confidence intervals. But in the range of log_avl = 2 to 3, for example, both curves seem to fit the data approximately the same, but marginsplot has much larger confidence intervals. What does this say about how the CIs are calculated? Should I be concerned?

4) I've heard from some parties that a restricted cubic spline is better than a fractional polynomial. I'm not trying to open up a can of worms here, but is there some legitimacy to that claim for the above model?

Ultimately, the shape of the plotted curve is very important because it will potentially allow me to overlay various curves to demonstrate the nature of their relationships.

Again, thanks

Julien
Clyde Schechter

Join Date: Apr 2014

Posts: 30029
#4

26 Oct 2016, 10:33

-marginsplot- is doing what it is supposed to do; -qfit- is doing what it is supposed to do. They do different things and, unsurprisingly, give different results.

-marginsplot- just graphs the results from the preceding -margins- command. Those results are the average predicted probabilities that you would calculate if you went through the data set and replaced all values of log_avl by -2 and applied predict, then repeated that replacing all values of log_avl by -1.5 and applying predict, etc. This is not the same as calculating the predicted values with the actual values of log_avl because the distributions of the other variables may differ according to the value of log_avl, but -margins- obliterates that source of variation. Your -qfit- command takes the predicted values and then fits a quadratic curve through them.
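
To make the replace-then-predict description concrete, here is a minimal sketch of that calculation done by hand. It uses the lbw example dataset from Stata's web repository as a stand-in, so the variables (low, lwt, smoke, age), the simplified model, and the value 120 are all assumptions chosen for illustration and are not the variables in this thread.

```stata
* Illustration only: lbw is a stand-in dataset, not the data from this thread
webuse lbw, clear
logistic low c.lwt i.smoke age

* What -margins- reports at lwt = 120
margins, at(lwt=(120))
matrix m = r(b)

* The same number by hand: set everyone's lwt to 120, predict, then average
preserve
replace lwt = 120
predict double p120, pr
summarize p120, meanonly
display "by hand: " r(mean) "   margins: " m[1,1]
assert reldif(r(mean), m[1,1]) < 1e-6
restore
```

The by-hand mean reproduces only the point estimate; the standard errors that -margins- reports come from the delta method and are not recovered this way.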

I don't have much experience with fractional polynomial models, so I'm reluctant to comment on that aspect of things (questions 2 through 4).
Julien Dagenais

Join Date: Sep 2016

Posts: 34
#5

28 Oct 2016, 12:06

Thanks Clyde for your response.

My confusion is very likely related to the predict function. When I call predict (line 3, #1) as a post-estimation command, we know that it is generating a predicted probability for each and every patient. But is it true, then, that those predicted probabilities do not adjust for the other covariates?

Is that why the margins curve in #1 doesn't fit the scatterplot of predicted values very well? Because only margins is obliterating the source of variation? To illustrate that possible discrepancy, let us look at the log_avl value of 5. Do the scatterplot values for the predicted probabilities lie much higher than the marginsplot curve because there are other sources of variation at those higher log_avl values that increase the likelihood of AKI? From a clinical standpoint that makes sense. Higher values of log_avl generally occur with larger, more complex tumors in which the probability of AKI would be higher.

If so, we have 2 curves that represent 2 populations. Using margins, we have one in which the effects of all other covariates that might predict AKI are obliterated. This would allow us to take two populations that are completely average in all other respects (if using -atmeans), for example, and explore the influence of log_avl on AKI as a function of it_type. If, on the other hand, I use predict and fit a curve to the scatterplot (as I did in #2), we have a population in which other covariates that might alter the prediction of AKI are allowed to exert their influence.

Am I correct?

Best,
Julien
Clyde Schechter

Join Date: Apr 2014

Posts: 30029
#6

28 Oct 2016, 12:16

Everything you say in #5 is correct. -predict- calculates a predicted score for each observation and uses the observed values of every modeled variable for that observation. It makes no adjustments of any kind, and all sources of outcome variation contribute to predict in full force. This has all of the consequences you describe.
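
As a rough way to see this in practice, the sketch below checks that -predict- is just the inverse logit of the linear predictor evaluated at each observation's own covariate values, and then tabulates how the other covariates and the predictions shift across thirds of the variable of interest. It again uses the lbw example dataset as a stand-in; the variables, the simplified model, and the helper names phat, lp, and lwt3 are assumptions for illustration, not part of the analysis in this thread.

```stata
* Illustration only: lbw is a stand-in dataset, not the data from this thread
webuse lbw, clear
logistic low c.lwt i.smoke age

* -predict- evaluates the model at each observation's observed covariates
predict double phat, pr
predict double lp, xb
assert reldif(phat, invlogit(lp)) < 1e-8

* Do the other covariates shift across the range of lwt?
xtile lwt3 = lwt, nquantiles(3)
tabstat age smoke phat, by(lwt3) statistics(mean)
```

If the means of the other covariates differ across the thirds, the scatter of predicted values against the variable of interest carries their influence as well, which is the source of variation that -margins- removes.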