Data 'not concave' with logit function

Michael Smiths

Join Date: Feb 2016

Posts: 1
#1

22 Feb 2016, 16:30

I have been analyzing some genetic data and recently ran into a problem trying to perform a multivariable analysis. I have a dichotomous outcome (disease or no disease), and I've identified 4 separate genes that meet a threshold for inclusion in a multivariable analysis. All are expressed as dichotomous data (present or absent). When I run a logistic regression, I am told that convergence was not achieved (see below):

. logistic disease var1 var2 var3 var4
convergence not achieved

Logistic regression                             Number of obs     =         23
                                                LR chi2(1)        =      22.72
                                                Prob > chi2       =     0.0000
Log likelihood = -2.7725887                     Pseudo R2         =     0.8038

------------------------------------------------------------------------------
     disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        var1 |   2.00e+08   4.01e+08     9.56   0.000      3974457    1.01e+10
        var2 |   1.34e+08          .        .       .            .           .
        var3 |   1.01e-16          .        .       .            .           .
        var4 |   6.73e-17          .        .       .            .           .
       _cons |   7.42e+07   1.05e+08    12.81   0.000      4639470    1.19e+09
------------------------------------------------------------------------------
Note: 14 failures and 5 successes completely determined.
convergence not achieved
r(430);

Any thoughts on the root of my problem would be appreciated.

Thanks.
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#2

22 Feb 2016, 16:41

Dear Michael,

I believe you have perfect predictors in your sample (or close to it); also, estimating 5 parameters with 23 observations is a big ask.

All the best,

Joao
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#3

22 Feb 2016, 16:49

Non-convergence can have many causes. Here's one possibility that is suggested (but not proved) by what you have posted. I notice that 19 observations were excluded from the analysis because they were perfectly predicted. Where there is fire, there may also be smoke. The estimates at the point where Stata gave up are extremely large (well beyond anything reasonably possible) for var1 and var2, and extremely small (in the sense of close to zero and, again, not remotely plausible in reality) for var3 and var4. The constant term is also extremely large. This configuration is compatible with the 23 observations in your estimation sample consisting of 22 cases with disease and 1 case without disease (or maybe a slightly less extreme split), where var1 and var2 being present almost guarantees disease, and var3 and var4 being present nearly guarantees its absence. Putting it in briefer, more technical terms, I think you have a nearly constant outcome accompanied by nearly perfect prediction of that outcome, and a small estimation sample to boot. If so, the maximum likelihood estimates for your coefficients will be plus and minus infinity, and Stata will never get there, no matter how long it tries.
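
To sketch why the estimates would run off to infinity, consider a simplified case (an assumption for illustration, not your actual data): a single 0/1 predictor x that predicts the outcome perfectly, so that y = x in every observation, with n_1 observations at x = 1 and n_0 at x = 0. The symbols alpha (constant), beta (coefficient on x), n_0 and n_1 are introduced here only for this sketch.

```latex
\ell(\alpha,\beta)
  = n_1 \log \Lambda(\alpha+\beta) + n_0 \log\bigl(1-\Lambda(\alpha)\bigr),
\qquad
\Lambda(t) = \frac{1}{1+e^{-t}} .
% Both terms are strictly negative for finite (\alpha,\beta), and
\lim_{\alpha\to-\infty,\ \alpha+\beta\to+\infty} \ell(\alpha,\beta) = 0 ,
% so the supremum of the log likelihood is approached but never attained.
```

In this simplified setting every step toward larger coefficients improves the log likelihood a little, so the optimizer has no finite maximum to settle on.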

Concrete advice: run -list disease var1 var2 var3 var4 if e(sample)- and see if my diagnosis is (at least close to) correct.
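
A minimal sketch of that check, assuming the variables are named as in the original post and coded 0/1. The names insample and pattern are hypothetical helper variables, and the complete-case marker is used here only as a stand-in in case the failed fit did not leave e(sample) available:

```stata
* Assumes disease var1 var2 var3 var4 are in memory, each coded 0/1.

* Mark complete cases by hand (insample is a hypothetical helper name).
generate byte insample = !missing(disease, var1, var2, var3, var4)

* List the raw rows that would enter the model, grouped by outcome.
sort disease var1 var2 var3 var4
list disease var1 var2 var3 var4 if insample, sepby(disease)

* Cross-tabulate each predictor against the outcome; a zero cell
* suggests that predictor (nearly) perfectly predicts disease status.
foreach v of varlist var1 var2 var3 var4 {
    tabulate `v' disease if insample
}

* Count outcomes within each distinct combination of the four genes
* (pattern is a hypothetical helper name).
egen pattern = group(var1 var2 var3 var4) if insample
tabulate pattern disease
```

If most covariate patterns contain only diseased or only non-diseased cases, that would be consistent with the diagnosis above.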

Finally, even if you sort this all out, it appears that your total potential sample here is the 23 cases in the estimation sample plus the 19 that got kicked out. With only 42 cases, fitting four predictors is really quite a stretch. I'm not optimistic you can get what you're looking for out of this data in any case. You may need to scale back your research objectives if you can't get more data (and better data).

Note added later: crossed in cyberspace with Joao who managed to say the same thing far more concisely!