
  • Data 'not concave' with logit function

    I have been analyzing some genetic data and recently ran into a problem trying to perform a multivariable analysis. I have a dichotomous outcome (disease or no disease), and I've identified 4 separate genes that meet a threshold for inclusion in a multivariable analysis. All are expressed as dichotomous data (present or absent). When I run a logistic regression, Stata reports that convergence was not achieved (see below):


    . logistic disease var1 var2 var3 var4
    convergence not achieved

    Logistic regression                               Number of obs   =         23
                                                      LR chi2(1)      =      22.72
                                                      Prob > chi2     =     0.0000
    Log likelihood = -2.7725887                       Pseudo R2       =     0.8038

    ------------------------------------------------------------------------------
         disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            var1 |   2.00e+08   4.01e+08     9.56   0.000      3974457    1.01e+10
            var2 |   1.34e+08          .        .       .            .           .
            var3 |   1.01e-16          .        .       .            .           .
            var4 |   6.73e-17          .        .       .            .           .
           _cons |   7.42e+07   1.05e+08    12.81   0.000      4639470    1.19e+09
    ------------------------------------------------------------------------------
    Note: 14 failures and 5 successes completely determined.
    convergence not achieved
    r(430);



    Any thoughts on the root of my problem would be appreciated.

    Thanks.

  • #2
    Dear Michael,

    I believe you have perfect predictors in your sample (or close to it); also, estimating 5 parameters with 23 observations is a big ask.

    All the best,

    Joao



    • #3
      Non-convergence can be due to many problems. Here's one possibility that is suggested (but not proved) by what you have posted. I notice that 19 observations were excluded from the analysis because they were perfectly predicted. Where there is fire, there may also be smoke. The odds ratio estimates at the point where Stata gave up are extremely large (and well beyond anything reasonably possible) for var1 and var2, and are extremely small (in the sense of close to zero, and, again, not remotely plausible in reality) for var3 and var4. The constant term is also extremely high. This configuration is compatible with your 23 observations in the estimation sample consisting of 22 cases with disease and 1 case of non-disease (or maybe a slightly less extreme split), where var1 and var2 being positive almost guarantee disease, and var3 and var4 nearly guarantee its absence. Putting it in briefer, more technical terms, I think you have a nearly constant outcome accompanied by nearly perfect prediction of that outcome, and a small estimation sample to boot. If so, the maximum likelihood estimates for your coefficients will be plus and minus infinity, and Stata will never get there, no matter how long it tries.

      Concrete advice: run -list disease var1 var2 var3 var4 if e(sample)- and see if my diagnosis is (at least close to) correct.
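
      A complementary check, offered as a minimal sketch rather than a definitive recipe: cross-tabulating the outcome against each predictor can make separation easier to spot than a raw listing. It assumes the variable names from the original post and that e(sample) is still set from the -logistic- run shown above.

      ```stata
      * Assumes -logistic disease var1 var2 var3 var4- was just run,
      * so that e(sample) marks the estimation sample.
      * One 2x2 table of outcome by predictor, per predictor.
      foreach v of varlist var1 var2 var3 var4 {
          tabulate disease `v' if e(sample)
      }
      ```

      An empty cell in any of these tables means that one level of that predictor occurs with only one outcome value, which is the kind of perfect (or near-perfect) prediction described above.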

      Finally, even if you sort this all out, it appears that your total potential sample here is the 23 cases in the estimation sample plus the 19 that got kicked out. With only 42 cases, fitting four predictors is really quite a stretch. I'm not optimistic you can get what you're looking for out of this data in any case. You may need to scale back your research objectives if you can't get more data (and better data).

      Note added later: crossed in cyberspace with Joao who managed to say the same thing far more concisely!
