interpreting logistic regression model with interaction terms, data-sparsity and collinearity

Flo Bascombe

Join Date: Aug 2019

Posts: 3
#1

03 Aug 2019, 01:05

Hello,

First off, I'm sorry to post such a basic question and basically ask someone to do my work for me... BUT... issues with data sparsity and collinearity were not covered in my stats for epi course; in fact, they were only mentioned... by saying they would be covered in the advanced course! That's no help to me as I am now desperately trying to complete my public health thesis and have come a cropper at this model! I simply don't know how to interpret it!! I know that there is evidence that 'nquad' (number of quadrants) and 'ndox40e6' (positive E6 test) act as effect modifiers, but I am a little lost. I thought that using margins would help, but I still cannot interpret the results. I simply want to explain this output and present it in a table (it's fair to say that any further analysis would be too technical for me at this stage). I have only ever previously worked with a model with dichotomous variables as interaction terms, and there was no collinearity or lack of data.
Can anyone help?
Thanks
Flo
Flo Bascombe

Join Date: Aug 2019

Posts: 3
#2

03 Aug 2019, 16:56

Just bumping this up as I'm desperate for some help! Thanks!
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#3

03 Aug 2019, 18:20

This is rather a mess. Before we despair, let's clarify a few things to make sure this model is even what you should be looking at.

First, you state that nquad and ndox40e6 are known to be "effect modifiers." Specifically what effect(s) are they known to modify? In your model you have them modifying each other's effects. Is that what you intended? If not, then we're simply playing in the wrong ball park here and you need to formulate a different model.

Assuming that you really did mean that nquad and ndox40e6 modify each other's effects, by including ndox40e6 (without an i. prefix, to boot) as your first predictor variable, and then having ndox40e6##i.nquad appear later on (where ndox40e6 is, by default, taken to be discrete) you have told Stata that it is both continuous and discrete. Moreover, you have tried to get Stata to enter it twice. Stata responded by including it as a continuous variable, and then omitting the discrete version. I'm not sure if that actually materially affects the results you got from the logistic regression, but it might well affect the margins output. Even if it didn't introduce any erroneous results, it just makes the outputs that much more confusing. So, first step would be to correct that and rerun the model as:

Code:

logistic sclearance hpv16 i.ndox40e6##i.nquad if immu == 1 & ....

(The right side of the original command is cut off in the image of the results you show, so I cannot complete the command. This is one of the reasons why we ask people not to post images of their code and results, but rather to paste them directly into the Forum's editor and surround them with code delimiters. That will ensure complete and faithful reproduction in a readable format. If you are not familiar with code delimiters, please read the Forum FAQ with particular attention to #12.)

By the way, if hpv16 is a discrete variable, do represent it as i.hpv16. The substantive consequences are minimal, but this kind of consistency may pay off later if you want to do a bit more analysis with this variable and use -margins-. -margins- absolutely requires knowing what is i. and what is c. When you don't specify in a regression command, -margins- assumes c. (unless the variable appears in an interaction term, in which case -margins- assumes i.)

This may or may not improve things substantively, but it will at least make the output a little better organized, so we might be able to deal with it better.
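Once the model has been refit with consistent i. prefixes, -margins- can turn the coefficients into quantities that are easier to put in a table. The following is only a sketch: it assumes the commands are run immediately after the corrected -logistic- command, and it uses only the variable names that appear in that command.

```stata
* Run immediately after the corrected -logistic- command.

* Predicted probability of sclearance for each ndox40e6 x nquad combination
margins ndox40e6#nquad

* Difference in predicted probability for ndox40e6 (versus its base level)
* at each level of nquad
margins nquad, dydx(ndox40e6)
```

With data this sparse, expect some of these cells to be reported as not estimable; that is a property of the data, not a fault in the commands.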

All of that said, these data are going to be difficult to work with. Some combinations of ndox40e6 and nquad are apparently not instantiated in the data, and some of the remaining combinations perfectly predict just one possible sclearance outcome. Logistic regression cannot estimate effects for such data, because the maximum likelihood estimate of such an effect is infinite (positive or negative, in the log-odds ratio metric).
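To see exactly where the sparsity lies, it can help to tabulate the data before fitting anything. A minimal sketch, assuming the variable names from the command above. Only immu == 1 is shown because the rest of the original -if- condition is cut off in the posted image, so the remaining restrictions would have to be appended.

```stata
* Which combinations of ndox40e6 and nquad occur at all?
* Cells with a count of 0 are the uninstantiated combinations.
tabulate ndox40e6 nquad if immu == 1

* Outcome distribution within each combination that does occur.
* A combination that shows only one value of sclearance
* perfectly predicts the outcome.
bysort ndox40e6 nquad: tabulate sclearance if immu == 1
```

The empty cells and the one-outcome cells found this way should correspond to the terms that the regression omits or cannot estimate.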

One possibility, and I don't know if this is sensible from a scientific perspective, is to combine some of the categories of nquad. Perhaps if you just look at 0 and 1 vs 2 and 3 things will work out better. Or 0 vs 1, 2, 3, or 0, 1, 2 vs 3.
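As an illustration of the first of those groupings (0 and 1 vs 2 and 3), something along these lines could be tried. The new variable name nquad2 and its value labels are hypothetical, nquad is assumed to be coded 0 through 3, and the -if- condition again shows only the part that is visible in the original command.

```stata
* Collapse nquad into two groups; nquad2 is an illustrative name.
recode nquad (0 1 = 0 "0-1 quadrants") (2 3 = 1 "2-3 quadrants"), generate(nquad2)

* Check that the recode did what was intended.
tabulate nquad nquad2, missing

* Refit with the collapsed variable (append the rest of the original -if- condition).
logistic sclearance hpv16 i.ndox40e6##i.nquad2 if immu == 1
```

Whether any particular grouping is defensible is a scientific question, not a statistical one, and is best decided before looking at which grouping gives the tidiest output.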

Another possibility is to use a different approach, such as -firthlogit-. The latter fits a logistic regression model, but uses a penalized maximum likelihood estimation method that can give finite results when there is perfect prediction. As I am a very infrequent user of -firthlogit- myself, I do not know if -margins- is available after -firthlogit-. But -firthlogit- will not save you from the uninstantiated combinations of ndox40e6 and nquad in your data: nothing can do that! Ultimately, it sounds like your data may not be up to the task you are asking of them.
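For reference, a rough sketch of what trying -firthlogit- might look like. It is a community-contributed command available from SSC. The sketch assumes that the installed version accepts factor-variable notation and the or option; check -help firthlogit- after installing, and build indicator variables by hand if it does not. As before, only the visible part of the -if- condition is shown.

```stata
* Install the community-contributed command once.
ssc install firthlogit

* Penalized-likelihood (Firth) logistic regression, reported as odds ratios.
* Assumes factor-variable support; append the rest of the original -if- condition.
firthlogit sclearance hpv16 i.ndox40e6##i.nquad if immu == 1, or
```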

Anyway, those are some things you can think about.