  • Evaluating the magnitude of AIC and BIC reductions

    Sorry for cross-posting (http://stats.stackexchange.com/quest...bic-reductions).
    I'm trying to get an answer to this question using Stata, and you have always been extremely helpful in the past.

    Reading the interesting post Interpretation of variance in multilevel logistic regression, I understood that I cannot compare multilevel logistic models by looking at the empty model and at variance reductions (as I would do in multilevel models with a continuous dependent variable).

    The suggestion is therefore to use AIC and BIC to make the comparison. In my case study I have about 20 groups and 18,000 observations (nested in groups). AIC and BIC for the empty model are both about 18,600. In the full model (where I add predictors) I obtain a reduction of AIC and BIC of about 1,000 (they are now about 17,650).

    Can you help me quantify this reduction? Is it small or large? What are the general rules for assessing the size of AIC and BIC reductions in a multilevel logit model, i.e. for saying whether they are big or small?

    Lastly, running these models in Stata, is there any way to obtain a percentage measure of the variance explained, and of the effect size of each predictor?

    Thanks a lot as always
    Andrea











  • #2
    The values of the AIC and BIC alone do not tell you much. They are only meaningful in the relative comparison of different model versions.
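    For concreteness, here is a minimal sketch of such a relative comparison in Stata; DEPENDENT, PREDICTORS and Group are placeholder names, not anything from this thread:
    Code:
    meqrlogit DEPENDENT || Group:              // empty (random-intercept only) model
    estimates store empty
    meqrlogit DEPENDENT PREDICTORS || Group:   // model with the predictors added
    estimates store full
    estimates stats empty full                 // AIC and BIC for both models, side by side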
    https://twitter.com/Kripfganz

    Comment


    • #3
      Thanks for the quick answer. But how big is the difference? How much better is one model than the other if AIC and BIC decrease by 1,000?

      And is there any way to obtain a percentage measure of the variance explained, and of the effect size of each predictor?

      Comment


      • #4
        Originally posted by Andrea Arancio View Post
        Thanks for the quick answer. But how big is the difference? How much better is one model than the other if AIC and BIC decrease by 1,000?

        And is there any way to obtain a percentage measure of the variance explained, and of the effect size of each predictor?
        I get the sense your question can't be answered easily. The absolute difference does not matter. Also, while there are more complicated explanations (none of which I understand), you can just choose the model with the lowest AIC or BIC and say that was your decision rule. Note, of course, that they must be nested models, i.e. one model must contain all the parameters of the other. You can also use a likelihood ratio test between nested models; this will show whether the improvement in the log likelihood is statistically significant (i.e. whether the model fits better).
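        As a sketch of that likelihood ratio test (placeholder names; both models must be fit on the same estimation sample):
        Code:
        meqrlogit DEPENDENT || Group:              // empty model
        estimates store empty
        meqrlogit DEPENDENT PREDICTORS || Group:   // full model, which nests the empty one
        estimates store full
        lrtest empty full                          // LR test of the jointly added predictors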

        The odds ratios for each predictor are already an estimate of effect size.

        Lastly, variance explained works a bit differently in logistic regression than in linear regression. I haven't worked with multilevel logistic models personally. However, if you want to get a sense of the predictive accuracy of the models, it turns out that the ROC comparison commands work on these models, as discussed in the thread linked below. In particular, check the sample code I posted there; you will see that the c-statistic from a plain logistic model leaving out the random effect is about 0.644, whereas a mixed model with a random intercept for district (I think) gives a c-statistic somewhere in the 0.7 range.

        http://www.statalist.org/forums/foru...effects-models
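        A minimal sketch of that kind of comparison (this is not the code from the linked thread, and DEPENDENT, PREDICTORS and Group are placeholders):
        Code:
        logit DEPENDENT PREDICTORS
        predict p_fixed, pr                        // predicted probabilities, no random effect
        meqrlogit DEPENDENT PREDICTORS || Group:
        predict p_mixed, mu                        // predicted probabilities including the random intercept
        roccomp DEPENDENT p_fixed p_mixed          // compare the two ROC areas (c-statistics)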
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

        When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

        Comment


        • #5
          A point of clarification. The ROC curve area is not really a measure of validity or goodness of fit of a model. It is a measure of the model's ability to discriminate cases with a success (1) outcome from those with a failure (0) outcome. To understand the difference, imagine we have a logistic model that assigns a 0.51 probability of success to every actual success observation, and a probability of 0.49 to every actual failure observation. If you were to set up such a model and compute the ROC curve area, it would actually be 1.0--perfect discrimination. The model always assigns a higher success probability to a success observation than it does to any failure observation, so the discriminatory power of the model is excellent. But the fit of the model to the data is not very good: of the observations to which it assigns a 0.51 probability of success, 100% are actually successes, and of the observations to which it assigns a 0.49 probability of success, 0% are. So despite the high level of discrimination it provides, the model is poorly calibrated.

          Calibration can be assessed using the -estat gof- procedure after Stata's -logistic- models. (And it is fairly easy to write code that calculates the same statistics after -melogit-.) A calibration measure looks at groups of observations that have the same (or a narrow range of) predicted probabilities and compares the predicted probability of success in each group with the actual proportion of those observations that have a success outcome. The Hosmer-Lemeshow statistic is probably the best known of these. While the Hosmer-Lemeshow chi square summary statistic has been the subject of considerable criticism (for good reasons), the general approach of comparing predicted and observed outcomes remains the cornerstone of assessing goodness of fit in any kind of statistical model.
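          A minimal sketch of that grouping approach after a mixed logit, with placeholder names (this produces a calibration table, not the formal Hosmer-Lemeshow chi square):
          Code:
          meqrlogit DEPENDENT PREDICTORS || Group:
          predict phat, mu                           // predicted probabilities
          xtile risk_decile = phat, nq(10)           // ten groups of increasing predicted risk
          tabstat DEPENDENT phat, by(risk_decile) statistics(mean n)
          * in each decile the observed proportion (mean of DEPENDENT) should track the mean of phat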

          Actually, the overall usefulness of any binomial-outcome model requires assessment of both discrimination and calibration, as a model can do well at one while doing poorly at the other. Depending on the use to which the model will be put, one or the other may suffice. But in most situations you will want evidence of both good discrimination and good calibration.

          Comment


          • #6
            Another point of clarification: You can also compare two models on the grounds of the AIC and BIC if one is not nested in the other. What is needed is that they are both nested in the same parent model, e.g. for the parent model
            \[y = x_1 \beta_1 + x_2 \beta_2 + x_3 \beta_3 + e\]
            you can compare the AIC and BIC between the following two sub-models:
            \[y = x_1 \beta_1 + x_2 \beta_2 + u\]
            \[y = x_1 \beta_1 + x_3 \beta_3 + v\]
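            A sketch of this comparison in Stata, with placeholder variables that mirror the equations above:
            Code:
            regress y x1 x2            // sub-model dropping x3
            estimates store m12
            regress y x1 x3            // sub-model dropping x2
            estimates store m13
            estimates stats m12 m13    // AIC and BIC for both; prefer the model with the smaller values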
            https://twitter.com/Kripfganz

            Comment


            • #7
              Thanks a lot everybody! I compare an empty model with a model where I add predictors, so I should not have a problem with the AIC and BIC comparison.

              However, Weiwen's and Clyde's answers were especially helpful for my purpose!
              To check if I understood correctly, first I get the ROC area as suggested by Weiwen:

              Code:
              meqrlogit DEPENDENT PREDICTORS || Group:, intpoints(10)
              predict A
              roctab DEPENDENT A
              And I get a value of the ROC area of about 0.8 (which is not that much of an improvement compared to the empty model, where this area is 0.76).

              Then I compare predicted and real outcomes. I do it this way (can you tell me if it's OK):

              Code:
              meqrlogit DEPENDENT PREDICTORS || Group:, intpoints(10)
              predict A
              gen B = 1 if A > .5 & !missing(A)  // could try other cutoffs to test for improvements
              replace B = 0 if !missing(A) & missing(B)
              tab DEPENDENT B
              Then from the table I calculate accuracy, sensitivity and specificity.

              Now my doubts are:

              - can I adjust the cutoff as I like, to see if I can get better results?

              - how do I calculate the Hosmer-Lemeshow test after meqrlogit?

              - What are acceptable values for the ROC area, the Hosmer-Lemeshow test, and the correct classification rate?

              - What are acceptable values for sensitivity and specificity?

              Thanks again!
              Last edited by Andrea Arancio; 17 Feb 2017, 02:25.

              Comment


              • #8
                And would it also make sense to assess the goodness of fit by looking at Cohen's Kappa?
                I could calculate it this way, if I'm not wrong:
                Code:
                kap DEPENDENT B

                Comment


                • #9
                  Some thoughts.

                  Cohen's Kappa is a measure of how often two binary variables agree, beyond chance. So, if I remember correctly, you can't simply use B as the other variable for kappa, because B is your predicted probability, which is continuous.

                  Anyway, kappa isn't a measure of goodness of fit. Neither are sensitivity and specificity; they are diagnostic measures. There are no absolute acceptable values for any of them: it depends on both the context and how frequent the outcome is.

                  And why do you need to calculate sensitivity and specificity? Do you actually intend to use the risk score as a classifier for some diagnostic test? If you do, remember that your random effects matter quite a bit here. When you predict a probability for an observation, the actual predicted probability also depends on the random intercept, so if you apply the model out of sample, your predicted probabilities will be off. Do sensitivity and specificity actually matter, or are you just trying to diagnose the model? If the latter, just focus on the AUC and the Hosmer-Lemeshow test.
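                  To see how much the random intercepts contribute to the predictions, one comparison might be the following sketch (placeholder names, assuming the fixedonly option of predict, mu after meqrlogit):
                  Code:
                  meqrlogit DEPENDENT PREDICTORS || Group:, intpoints(10)
                  predict p_full, mu               // fixed effects plus predicted random intercepts
                  predict p_fixed, mu fixedonly    // fixed portion only, as available out of sample
                  summarize p_full p_fixed         // the gap shows how much the group intercepts matter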

                  As to coding for that test, I have some work to do, and I may try to code this afterward. No guarantees.
                  Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

                  When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

                  Comment


                  • #10
                    Thanks a lot Weiwen.

                    I think using Kappa might be OK, as it is often used to assess the quality of classifiers (for instance after support vector machines).
                    B is binary indeed, after the commands

                    gen B = 1 if A > .5 & !missing(A)  // could try other cutoffs to test for improvements
                    replace B = 0 if !missing(A) & missing(B)

                    A correct classification would be important in my case (and I did not think about the problem of random intercepts). Perhaps for classification purposes it would be better to use a regular logit, ignoring the fact that these are repeated measures on the same subjects?

                    But you are right, what I mostly want is to diagnose the model. Your help with the HL test would be precious. Thanks a lot!


                    Comment


                    • #11
                      Any chance of getting that very useful piece of code?
                      That would be a great present for me.

                      Comment


                      • #12
                        Originally posted by Sebastian Kripfganz View Post
                        Another point of clarification: You can also compare two models on the grounds of the AIC and BIC if one is not nested in the other. What is needed is that they are both nested in the same parent model, e.g. for the parent model
                        \[y = x_1 \beta_1 + x_2 \beta_2 + x_3 \beta_3 + e\]
                        you can compare the AIC and BIC between the following two sub-models:
                        \[y = x_1 \beta_1 + x_2 \beta_2 + u\]
                        \[y = x_1 \beta_1 + x_3 \beta_3 + v\]
                        Hello,
                        I'm in this specific case. My question is: how can I tell whether one BIC (or AIC) is significantly better than another?
                        Thank you very much

                        Comment
