
  • How to deal with separation in logistic regressions in STATA?

    I'm running a binary logistic regression with 15 independent variables on 180 observations in Stata (version 11). I do this for four different groups, i.e. four dependent variables. For three of them it works fine, but for the fourth Stata warns that one IV "predicts success perfectly" and that the IV is "dropped and 96 obs not used". This IV is a dummy (and is considered important for this one of the four groups).
    This leaves me with 84 observations. Compared with the other three estimations, the SEs seem somewhat inflated, but the p-values are fine (if I drop the problematic IV, the SEs go down, but the p-values go up). The LR chi-squared is significant at the 10% level and the pseudo-R-squared is 0.191.

    I suppose this is a case of (quasi-)complete separation. My questions are:

    1. whether I can still use the regression estimates after such a warning, given that Stata dropped the problematic cases (the observations for which the dummy perfectly predicts the outcome),
    2. or whether I should instead drop the problematic IV (which would be unfortunate, even though it would restore the original sample size of 180),
    3. and, if neither 1. nor 2., how I can otherwise deal with this problem.

    I know that 180 observations are not many for a model with 15 IVs. However, I'm hoping for advice on how to solve the problem other than by trimming an overfitted model - Stata-specific knowledge would be of especially great help.

    Cheers!
    Last edited by Annette Becker; 10 Feb 2015, 16:39.

  • #2
    Cross-posted at http://stats.stackexchange.com/quest...ssion-in-stata

    Please read the FAQ Advice for our policy on cross-posting, which is that you should tell us about it.



    • #3
      Consider using exlogistic. The help says

      "exlogistic is an alternative to logistic, the standard maximum-likelihood-based logistic regression estimator; see [R] logistic. exlogistic produces more-accurate inference in small samples because it does not depend on asymptotic results and exlogistic can better deal with one-way causation, such as the case where all females are observed to have a positive outcome."

      15 IVs is an awful lot, though. Perhaps the condvars option will give you more of a fighting chance.

      Also consider using firthlogit, available from SSC. The help says

      "firthlogit fits logistic models by penalized maximum likelihood regression. The method originally was proposed to reduce bias in maximum likelihood estimates in generalized linear models. It also has utility in logistic regression in circumstances in which "separation" is problematic."

      I suspect you should simplify your model, but you can try these options first and see if they seem to work.
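
      A minimal sketch of what trying the two estimators might look like (the outcome and predictor names here are hypothetical; firthlogit must be installed from SSC first):

      Code:
      * one-time installation of firthlogit from SSC
      ssc install firthlogit

      * exact logistic regression (can be slow and memory-hungry with many IVs)
      exlogistic y x1-x15

      * penalized maximum likelihood (Firth) logistic regression
      firthlogit y x1-x15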
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      Stata Version: 17.0 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam



      • #4
        Thanks, Richard, for the quick advice!
        I tried the command, but unfortunately it stops with the error "exceeded memory limit of 10.0M bytes". I guess I'll have to try it on one of the computers at my university tomorrow, then.



        • #5
          exlogistic has a memory option. You can go as high as 2g (assuming your machine has at least 2g). See the help for exlogistic. The condvars option can also save you memory, e.g. if 13 of your vars are just there as controls and you don't care about their coefficients then specify them using condvars.
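
          For instance (hypothetical variable names, with 13 of the 15 IVs treated as nuisance controls and the memory ceiling raised):

          Code:
          * condition out the controls c1-c13 and allow up to 2 GB of memory
          exlogistic y x1 x2, condvars(c1-c13) memory(2g)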
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          Stata Version: 17.0 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam



          • #6
            I'd try -firthlogit- (SSC) first, which uses penalized likelihood to deal with this very situation.

            Another alternative is to use a robust Poisson regression or -binreg- for RRs; the maximization routines are a bit different which might help to get (different) effect size measures.
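
            A rough sketch of those two alternatives (outcome and predictor names are hypothetical):

            Code:
            * modified Poisson regression with robust standard errors, reporting risk ratios
            poisson y x1-x15, vce(robust) irr

            * binomial GLM with a log link, also reporting risk ratios
            binreg y x1-x15, rr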

            http://www.ats.ucla.edu/stat/stata/f...ative_risk.htm

            http://www.statalist.org/forums/foru...al-survey-data

            http://www.statisticalhorizons.com/w...n.StatComp.pdf
            Last edited by Andrew Lover; 11 Feb 2015, 06:39. Reason: added link.
            ____________________________________________________
            Assistant Professor, Department of Biostatistics and Epidemiology
            School of Public Health and Health Sciences
            University of Massachusetts- Amherst



            • #7
              Many thanks - firthlogit worked very well. I wonder whether I can run any postestimation commands, though. margins didn't work, and I would like to include postestimation tests such as goodness of fit and Pearson's chi-squared.



              • #8
                firthlogit wasn't written with margins in mind, and isn't designed to work with it. I think that margins relies upon the Hessian (or its negative inverse) for computing standard errors. For the same reason that coefficient standard errors should be ignored and Wald tests should be avoided after firthlogit, analogous standard errors and tests with margins after firthlogit should be avoided, even if the two commands could be made to work with each other.

                As for the Pearson goodness-of-fit after firthlogit, you can compute the analogous test statistic manually using the method described in the user's manual for the entry "estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test". See below for an example of how. You might want to examine the test's operating characteristics before relying upon it as a test of goodness-of-fit after firthlogit. (I'd recommend the same for any testing with firthlogit, before relying upon it.)

                In general, for goodness-of-fit, I recommend a calibration plot in lieu of testing. It's also illustrated below.

                Code:
                version 13.1
                
                clear *
                set more off
                set seed `=date("2015-02-12", "YMD")'
                quietly set obs 180
                
                generate byte response = runiform() > 0.5
                forvalues i = 1/15 {
                    generate byte predictor`i' = runiform() > 0.5
                }
                
                logistic response i.predictor*, nolog
                estat gof
                
                *
                * Begin here
                *
                preserve
                
                predict double ML, pr
                label variable ML "Maximum Likelihood"
                
                tempname df
                scalar define `df' = e(df_m) + 1
                
                tempfile dataset predictor_patterns
                quietly save `dataset'
                
                contract predictor*, freq(n)
                generate int pattern = _n
                quietly save `predictor_patterns'
                merge 1:m predictor* using `dataset', assert(match) nogenerate noreport
                
                bysort pattern: egen int positives = sum(response)
                bysort pattern: keep if _n == 1
                generate double xi = (positives - n * ML)^2 / n / ML / (1 - ML)
                summarize xi, meanonly
                display in smcl as text "Pearson's chi-square = " as result chi2tail(r(N) - `df', r(sum))
                
                *
                * Analogously for firthlogit
                *
                use `dataset', clear
                firthlogit response predictor*, nolog
                
                predict double FL, xb
                quietly replace FL = invlogit(FL)
                label variable FL "Penalized Maximum Likelihood"
                
                scalar define `df' = e(df_m) + 1
                quietly save `dataset', replace
                
                merge m:1 predictor* using `predictor_patterns', assert(match) nogenerate noreport
                
                bysort pattern: egen int positives = sum(response)
                bysort pattern: keep if _n == 1
                generate double xi = (positives - n * FL)^2 / n / FL / (1 - FL)
                summarize xi, meanonly
                display in smcl as text "Pearson's chi-square = " as result chi2tail(r(N) - `df', r(sum))
                
                *
                * Here's an alternative goodness-of-fit test
                *
                use `dataset', clear
                pause on
                
                lowess response ML, mcolor(black) msize(vsmall) ///
                    lineopts(lcolor(black)) ///
                    addplot((line ML ML, sort lcolor(black) lpattern(dash))) ///
                    ylabel( , angle(horizontal) nogrid) ///
                    ytitle("Observed") xtitle("Predicted") title("") legend(off)
                pause
                
                lowess response FL, mcolor(black) msize(vsmall) ///
                    lineopts(lcolor(black)) ///
                    addplot((line FL FL, sort lcolor(black) lpattern(dash))) ///
                    ylabel( , angle(horizontal) nogrid) ///
                    ytitle("Observed") xtitle("Predicted") title("") legend(off)
                pause
                
                *
                * What firthlogit's doing
                *
                local ytitle : variable label FL
                graph twoway scatter FL ML, mcolor(black) msize(vsmall) || ///
                    line ML ML, sort lcolor(black) lpattern(dash) ///
                    ytitle(`ytitle') ylabel( , angle(horizontal) nogrid) legend(off)
                
                exit



                • #9
                  Since Firth's method isn't straight maximum likelihood, options are somewhat limited - the only discussion of GOF or R2 for Firth logistic regression I've seen is here, using Tjur's R2:

                  http://www.statisticalhorizons.com/r2logistic

                  Or, if you have enough data, using a subset for model validation might be good.
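
                  For reference, Tjur's R2 is just the difference in mean predicted probability between successes and failures, so it can be computed by hand after firthlogit (a sketch with hypothetical variable names):

                  Code:
                  firthlogit y x1-x15
                  predict double xb, xb
                  generate double pr = invlogit(xb)
                  summarize pr if y == 1, meanonly
                  scalar mean1 = r(mean)
                  summarize pr if y == 0, meanonly
                  display "Tjur's R2 = " mean1 - r(mean)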
                  ____________________________________________________
                  Assistant Professor, Department of Biostatistics and Epidemiology
                  School of Public Health and Health Sciences
                  University of Massachusetts- Amherst



                  • #10
                    Thanks a lot - Tjur's R2 worked out very well and I have integrated it into my analysis. I read that higher values are better (http://www3.nd.edu/~rwilliam/stats3/L05.pdf), but could anyone provide a rough benchmark? Mine is 0.535.



                    • #11
                      Hi,
                      In the output of exlogistic, where it reports the p-value, it says 2*Pr(Suff.). Does the number 2 mean anything specific?

