
  • How to deal with separation in logistic regressions in STATA?

    I'm running a binary logistic regression with 15 independent variables on 180 observations in Stata (version 11). I do this for four different groups, i.e. four dependent variables. For three of them it works fine, but for the fourth Stata warns that one IV "predicts success perfectly" and that the IV is "dropped and 96 obs not used". This IV is a dummy (and is considered important for this one of the four groups).
    This leaves me with 84 observations. Compared with the other three estimations, the SEs seem somewhat inflated, but the p-values are fine (if I drop the problematic IV, the SEs go down, but the p-values go up). The LR chi-squared is significant at the 10% level and the pseudo-R-squared is 0.191.

    I suppose this is a case of (quasi-)complete separation. My questions are:

    1. whether I can still use the regression estimates after such a warning, given that Stata dropped the problematic cases (the observations for which the dummy perfectly predicts the outcome),
    2. or whether I should instead drop the problematic IV (which would be unfortunate, even though it would restore the original sample size of 180),
    3. and, if neither 1. nor 2., how I can otherwise deal with this problem.

    I know that 180 observations are not many for a model with 15 IVs. However, I'm hoping for advice on how to solve the problem other than by trimming an overfitted model - Stata-specific knowledge would be of especially great help.

    Cheers!
    Last edited by Annette Becker; 10 Feb 2015, 16:39.

  • #2
    Cross-posted at http://stats.stackexchange.com/quest...ssion-in-stata

    Please read the FAQ Advice for our policy on cross-posting, which is that you should tell us about it.



    • #3
      Consider using exlogistic. The help says

      "exlogistic is an alternative to logistic, the standard maximum-likelihood-based logistic regression estimator; see [R] logistic. exlogistic produces more-accurate inference in small samples because it does not depend on asymptotic results and exlogistic can better deal with one-way causation, such as the case where all females are observed to have a positive outcome."

      15 IVs is an awful lot, though. Perhaps the condvars option will give you more of a fighting chance.

      Also consider using firthlogit, available from SSC. The help says

      "firthlogit fits logistic models by penalized maximum likelihood regression. The method originally was proposed to reduce bias in maximum likelihood estimates in generalized linear models. It also has utility in logistic regression in circumstances in which "separation" is problematic."

      I suspect you should simplify your model, but you can try these options first and see if they seem to work.
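
      A minimal sketch of what trying the two estimators might look like (the outcome and predictor names here are hypothetical; firthlogit must be installed from SSC first):

      Code:
      * one-time installation of firthlogit from SSC
      ssc install firthlogit

      * exact logistic regression (can be slow and memory-hungry with many IVs)
      exlogistic y x1-x15

      * penalized maximum likelihood (Firth) logistic regression
      firthlogit y x1-x15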
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      Stata Version: 17.0 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam



      • #4
        Thanks, Richard, for the quick advice!
        I tried the command, but unfortunately it stops with the error "exceeded memory limit of 10.0M bytes". I guess I'll have to try it on one of the computers at my university tomorrow, then.



        • #5
          exlogistic has a memory option. You can go as high as 2g (assuming your machine has at least 2g). See the help for exlogistic. The condvars option can also save you memory, e.g. if 13 of your vars are just there as controls and you don't care about their coefficients then specify them using condvars.
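
          For instance (hypothetical variable names, with 13 of the 15 IVs treated as nuisance controls and the memory ceiling raised):

          Code:
          * condition out the controls c1-c13 and allow up to 2 GB of memory
          exlogistic y x1 x2, condvars(c1-c13) memory(2g)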
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          Stata Version: 17.0 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam



          • #6
            I'd try -firthlogit- (SSC) first, which uses penalized likelihood to deal with this very situation.

            Another alternative is to use a robust Poisson regression or -binreg- for RRs; the maximization routines are a bit different which might help to get (different) effect size measures.
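
            A rough sketch of those two alternatives (outcome and predictor names are hypothetical):

            Code:
            * modified Poisson regression with robust standard errors, reporting risk ratios
            poisson y x1-x15, vce(robust) irr

            * binomial GLM with a log link, also reporting risk ratios
            binreg y x1-x15, rr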

            http://www.ats.ucla.edu/stat/stata/f...ative_risk.htm

            http://www.statalist.org/forums/foru...al-survey-data

            http://www.statisticalhorizons.com/w...n.StatComp.pdf
            Last edited by Andrew Lover; 11 Feb 2015, 06:39. Reason: added link.
            ____________________________________________________
            Assistant Professor, Department of Biostatistics and Epidemiology
            School of Public Health and Health Sciences
            University of Massachusetts- Amherst



            • #7
              Many thanks - firthlogit worked very well. I wonder whether I can run any postestimation commands, though. margins didn't work, and I would like to include postestimation tests such as goodness of fit and Pearson's chi-squared.



              • #8
                firthlogit wasn't written with margins in mind, and isn't designed to work with it. I think that margins relies upon the Hessian (or its negative inverse) for computing standard errors. For the same reason that coefficient standard errors should be ignored and Wald tests should be avoided after firthlogit, analogous standard errors and tests with margins after firthlogit should be avoided, even if the two commands could be made to work with each other.

                As for the Pearson goodness-of-fit after firthlogit, you can compute the analogous test statistic manually using the method described in the user's manual for the entry "estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test". See below for an example of how. You might want to examine the test's operating characteristics before relying upon it as a test of goodness-of-fit after firthlogit. (I'd recommend the same for any testing with firthlogit, before relying upon it.)

                In general, for goodness-of-fit, I recommend a calibration plot in lieu of testing. It's also illustrated below.

                Code:
                version 13.1
                
                clear *
                set more off
                set seed `=date("2015-02-12", "YMD")'
                quietly set obs 180
                
                generate byte response = runiform() > 0.5
                forvalues i = 1/15 {
                    generate byte predictor`i' = runiform() > 0.5
                }
                
                logistic response i.predictor*, nolog
                estat gof
                
                *
                * Begin here
                *
                preserve
                
                predict double ML, pr
                label variable ML "Maximum Likelihood"
                
                tempname df
                scalar define `df' = e(df_m) + 1
                
                tempfile dataset predictor_patterns
                quietly save `dataset'
                
                contract predictor*, freq(n)
                generate int pattern = _n
                quietly save `predictor_patterns'
                merge 1:m predictor* using `dataset', assert(match) nogenerate noreport
                
                bysort pattern: egen int positives = sum(response)
                bysort pattern: keep if _n == 1
                generate double xi = (positives - n * ML)^2 / n / ML / (1 - ML)
                summarize xi, meanonly
                display in smcl as text "Pearson's chi-square = " as result chi2tail(r(N) - `df', r(sum))
                
                *
                * Analogously for firthlogit
                *
                use `dataset', clear
                firthlogit response predictor*, nolog
                
                predict double FL, xb
                quietly replace FL = invlogit(FL)
                label variable FL "Penalized Maximum Likelihood"
                
                scalar define `df' = e(df_m) + 1
                quietly save `dataset', replace
                
                merge m:1 predictor* using `predictor_patterns', assert(match) nogenerate noreport
                
                bysort pattern: egen int positives = sum(response)
                bysort pattern: keep if _n == 1
                generate double xi = (positives - n * FL)^2 / n / FL / (1 - FL)
                summarize xi, meanonly
                display in smcl as text "Pearson's chi-square = " as result chi2tail(r(N) - `df', r(sum))
                
                *
                * Here's an alternative goodness-of-fit test
                *
                use `dataset', clear
                pause on
                
                lowess response ML, mcolor(black) msize(vsmall) ///
                    lineopts(lcolor(black)) ///
                    addplot((line ML ML, sort lcolor(black) lpattern(dash))) ///
                    ylabel( , angle(horizontal) nogrid) ///
                    ytitle("Observed") xtitle("Predicted") title("") legend(off)
                pause
                
                lowess response FL, mcolor(black) msize(vsmall) ///
                    lineopts(lcolor(black)) ///
                    addplot((line FL FL, sort lcolor(black) lpattern(dash))) ///
                    ylabel( , angle(horizontal) nogrid) ///
                    ytitle("Observed") xtitle("Predicted") title("") legend(off)
                pause
                
                *
                * What firthlogit's doing
                *
                local ytitle : variable label FL
                graph twoway scatter FL ML, mcolor(black) msize(vsmall) || ///
                    line ML ML, sort lcolor(black) lpattern(dash) ///
                    ytitle(`ytitle') ylabel( , angle(horizontal) nogrid) legend(off)
                
                exit



                • #9
                  Since Firth's method isn't straight maximum likelihood, options are somewhat limited - the only discussion of GOF or R2 for Firth logistic regression I've seen is here, using Tjur's R2:

                  http://www.statisticalhorizons.com/r2logistic

                  Or, if you have enough data, using a subset for model validation might be good.
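
                  For reference, Tjur's R2 is just the difference in mean predicted probability between successes and failures, so it can be computed by hand after firthlogit (a sketch with hypothetical variable names):

                  Code:
                  firthlogit y x1-x15
                  predict double xb, xb
                  generate double pr = invlogit(xb)
                  summarize pr if y == 1, meanonly
                  scalar mean1 = r(mean)
                  summarize pr if y == 0, meanonly
                  display "Tjur's R2 = " mean1 - r(mean)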
                  ____________________________________________________
                  Assistant Professor, Department of Biostatistics and Epidemiology
                  School of Public Health and Health Sciences
                  University of Massachusetts- Amherst



                  • #10
                    Thanks a lot - Tjur's R2 worked out very well and I have integrated it into my analysis. I read that higher values are better (http://www3.nd.edu/~rwilliam/stats3/L05.pdf), but could anyone provide a rough benchmark? Mine is 0.535.



                    • #11
                      Hi,
                      In the output of exlogistic, where it reports the p-value, it says 2*Pr(Suff.). Does the number 2 mean anything specific?

