
  • Assess performance of logistic model in MI data

    Dear all,
    I have fitted a multivariable logistic model, as below, in MI data.
    mi estimate, or: logistic dep_var indep_var1 indep_var2 indep_var3
    Now I need to assess the performance of this logistic model, for example with a receiver operating characteristic (ROC) curve.
    Is this possible, and could I get help with the script?
    Many thanks
    Sunil Sampath

    PS: apologies, I created the account under a false user name; I have contacted the administrator to amend this.
    Last edited by Sunny Sam; 17 Feb 2017, 06:43.

  • #2
    So, you need to re-run the logistic regression saving the estimates, then calculate the model predictions, and then calculate the ROC curve.

    Code:
    // Fit the model, saving the MI estimation results for later prediction
    mi estimate, or saving(my_estimates, replace): logistic dep_var indep_var1 indep_var2 indep_var3
    // Compute the linear predictor from the pooled (Rubin's rules) coefficients
    mi predict xb using my_estimates
    // ROC analysis, using the linear predictor as the classifier
    roctab dep_var xb
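
    If you also want to see the curve itself, roctab can draw it; a minimal sketch with the same hypothetical variable names (the graph option plots the curve, and summary reports the area with its confidence interval):

    Code:
    // Plot the ROC curve and report the area under it with its CI
    roctab dep_var xb, graph summary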



    • #3
      When I use roctab dep_var xb, it only uses observations from the m=0 data, so it runs only on complete cases?
      Thanks



      • #4
        Originally posted by Sunny Sam View Post
        When I use roctab dep_var xb, it only uses observations from the m=0 data, so it runs only on complete cases?
        Thanks
        No. I think the program is putting the predicted probability in all the observations in m==0. Then, roctab is running off that complete set of probabilities. Note, though, that any confidence interval you calculate from roctab won't account for the greater uncertainty in those probabilities (because they come from the imputed data). It's probably fine for your purposes, though, as we don't normally pay that much attention to the standard error of an AUC estimate.

        Actually, Clyde Schechter - when you predict xb, that's the linear predictor, and I think Sunil has to transform that to the probability with the inverse logit function, right?

        If so, Sunil should do this:

        Code:
        mi estimate, or saving(my_estimates, replace): logistic dep_var indep_var1 indep_var2 indep_var3
        mi predict xb using my_estimates
        // transform the linear predictor to a probability with the inverse logit
        gen p = invlogit(xb)
        roctab dep_var p
        Last edited by Weiwen Ng; 17 Feb 2017, 09:10.


        • #5
          Well, I think there is some controversy about this. Here's my reasoning:

          You are looking to test the fit of your model. Although -mi estimate- first estimates a separate model for each of the imputed data sets, those are then combined by Rubin's rules into a single model. That single model is your MI-estimate of the model for your data. -mi predict- acknowledges this: although it does calculate predicted values in your imputed data sets as well as the original data, in fact the predicted values in each imputed data set are the same as the predicted values in your original data.

          It is also worth bearing in mind that your imputed data sets are not data. They are, to use a provocative term, fantasies. Useful fantasies for some purposes, but fantasies nonetheless. Indeed, they may very well contain imputed values that are not even possible values for the variables. While nothing but the time and effort required would stop you from calculating the area under the ROC curve in each of the imputed data sets and then combining the results in some way, perhaps emulating Rubin's rules, there is no particular reason to believe that the resulting statistic would be meaningful in any way, let alone as some estimate of the discriminating power of the -mi estimate- calculated model for any set of data in the real world.

          So my answer is, yes, one can only test model fit on real data, which, in this case, means restricted to complete cases. Others may disagree, but I do not understand what interpretation could be given to the results.



          • #6
            Weiwen Ng
            Actually, Clyde Schechter - when you predict xb, that's the linear predictor, and I think Sunil has to transform that to the probability with the inverse logit function, right?
            The usual way of calculating ROC curve areas after a logistic regression is based on the predicted probability, which is the inverse logit of the linear predictor. However, the ROC curve is constructed using only the ordinal properties of the predictor being assessed, and logit is a strictly monotone function. So working with xb instead of predicted probability will give the same results for an ROC curve, and it's simpler to code this way.
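
            To see this concretely, a minimal sketch (same hypothetical variable names as above; roctab leaves the area in r(area)):

            Code:
            // AUC computed from the linear predictor
            quietly roctab dep_var xb
            display "AUC from xb: " r(area)
            // AUC computed from the predicted probability - identical,
            // because invlogit() is strictly increasing
            generate double pr = invlogit(xb)
            quietly roctab dep_var pr
            display "AUC from pr: " r(area)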



            • #7
              Hi Clyde Schechter and Weiwen Ng,
              Thank you for your responses.
              Regarding using the linear predictor versus the invlogit function: as you have said, the AUC obtained is exactly the same with both approaches.

              I was also trying other goodness-of-fit tests, like the Hosmer-Lemeshow test, by typing "estat gof, group(10)" after the mi estimate command. But I get an error message. I guess the reasons are similar to what is discussed here: only 'real' data can be divided up into groups?



              • #8
                Yes. If you want to emulate the Hosmer-Lemeshow calculations you can do this with your -mi predict- results:

                Code:
                // IDENTIFY DECILES OF PREDICTED RISK
                // (DECILES OF xb AND OF THE PROBABILITY COINCIDE: invlogit IS MONOTONE)
                xtile decile = xb, nq(10)
                
                // EXPECTED SUCCESSES REQUIRE THE PROBABILITY, NOT THE LINEAR PREDICTOR
                gen p = invlogit(xb)
                
                // CALCULATE EXPECTED & OBSERVED SUCCESSES
                // IN EACH DECILE
                collapse (sum) predicted = p observed = dep_var, by(decile)
                Now if you want the H-L chi square statistic you can just calculate the sum of (observed - predicted)^2/predicted; a sketch of this follows below. For my part, I am not fond of the chi square statistic in this context. I generally prefer to graph the observed vs the predicted, overlaid on a diagonal line, to get a sense of whether the model is reasonably well calibrated all over, or whether there is some particular region of predicted risk where it fits less well. That might enable me to improve the model, whereas the chi-square test just gives a somewhat arbitrary up/down verdict. Also, if your sample is large, you probably should do this not with deciles but with a larger number of groups.

                Also, you didn't say what your three independent variables are. If each is a dichotomy, then, in fact, there are only 8 predicted values possible, so the standard H-L procedure makes no sense. You would be better off emulating the Pearson chi square for the model instead. The code would be similar to above, but the groups would be defined by individual predicted values rather than by "deciles" (which don't mean much when there are only 8 values).
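
                To make this concrete, a sketch of both calculations, picking up after the collapse above (hypothetical variable names; it assumes predicted was built by summing the probability p, as in the code block):

                Code:
                // H-L-STYLE CHI-SQUARE: SUM OVER GROUPS OF (OBSERVED - PREDICTED)^2 / PREDICTED
                generate double hl_term = (observed - predicted)^2 / predicted
                quietly summarize hl_term
                display "Chi-square = " r(sum)
                
                // CALIBRATION PLOT: OBSERVED VS PREDICTED SUCCESSES PER GROUP,
                // OVERLAID ON THE 45-DEGREE LINE OF PERFECT CALIBRATION
                quietly summarize predicted
                twoway (scatter observed predicted) ///
                    (function y = x, range(`r(min)' `r(max)')), ///
                    ytitle("Observed successes") xtitle("Predicted successes") legend(off)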



                • #9
                  Originally posted by Clyde Schechter View Post
                  The usual way of calculating ROC curve areas after a logistic regression is based on the predicted probability, which is the inverse logit of the linear predictor. However, the ROC curve is constructed using only the ordinal properties of the predictor being assessed, and logit is a strictly monotone function. So working with xb instead of predicted probability will give the same results for an ROC curve, and it's simpler to code this way.
                  Ah, you are quite right. Thanks!


                  • #10
                    Hi Clyde Schechter, thank you for your response. My sample size is around 2000. I actually have 6 independent variables: 3 continuous and 3 binary (and a binary dependent variable). Thanks



                    • #11
                      Hi Clyde, calculating the ROC on real data is fine. When reporting the results/methods in a publication, should we explicitly state that the ROCs are from the real data and not the pooled data?



                      • #12
                        Yes.



                        • #13
                          Originally posted by Clyde Schechter View Post
                          Well, I think there is some controversy about this. Here's my reasoning:

                          You are looking to test the fit of your model. Although -mi estimate- first estimates a separate model for each of the imputed data sets, those are then combined by Rubin's rules into a single model. That single model is your MI-estimate of the model for your data. -mi predict- acknowledges this: although it does calculate predicted values in your imputed data sets as well as the original data, in fact the predicted values in each imputed data set are the same as the predicted values in your original data.

                          It is also worth bearing in mind that your imputed data sets are not data. They are, to use a provocative term, fantasies. Useful fantasies for some purposes, but fantasies nonetheless. Indeed, they may very well contain imputed values that are not even possible values for the variables. While nothing but the time and effort required would stop you from calculating the area under the ROC curve in each of the imputed data sets and then combining the results in some way, perhaps emulating Rubin's rules, there is no particular reason to believe that the resulting statistic would be meaningful in any way, let alone as some estimate of the discriminating power of the -mi estimate- calculated model for any set of data in the real world.

                          So my answer is, yes, one can only test model fit on real data, which, in this case, means restricted to complete cases. Others may disagree, but I do not understand what interpretation could be given to the results.
                          I think Clyde is right that prediction should be done on complete cases, a.k.a. real cases.



                          • #14
                            Originally posted by Priyanka Pandhi View Post
                            Hi Clyde, calculating the ROC on real data is fine. When reporting the results/methods in a publication, should we explicitly state that the ROCs are from the real data and not the pooled data?
                            Dear readers,
                            I am encountering this same issue and am wondering how to interpret the ROC curve on the real data when the odds ratios are a pooled estimate from the imputation sets.
                            Should one just interpret it as: "The pooled OR was 1.26, corresponding with a 1.26-fold increase in the odds of developing <outcome> with each unit increase of the <predictor>. Based on the ROC analysis, a value of x of the predictor corresponded with a predicted risk of xx% on development of the outcome based on the original data."?

                            Thank you very much for your help
                            Last edited by Dennis Ton; 27 Aug 2024, 04:00.



                            • #15
                              Based on the ROC analysis, a value of x of the predictor corresponded with a predicted risk of xx% on development of the outcome based on the original data."?
                              For reasons having nothing to do with the issue of what can be done with the original data and what can be done with imputations, this sentence is very wrong. An ROC analysis says nothing whatsoever about the correspondence between the value of a predictor and the associated risk. An ROC analysis measures the ability of the model to discriminate positive and negative outcomes, that's all.
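
                              If what's wanted is the predicted risk at a given value of a predictor, that comes from the fitted model itself, not from the ROC analysis. A minimal sketch, assuming a single continuous predictor x and an illustrative value of 5 (hypothetical names; the post option of -mi estimate- makes the pooled coefficients available to _b[]):

                              Code:
                              // Post the pooled (Rubin's rules) coefficients to e(b) so _b[] refers to them
                              mi estimate, post: logistic dep_var x
                              // Predicted risk at x = 5 from the pooled model
                              display invlogit(_b[_cons] + _b[x]*5)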

