  • Perfect predictors: option to drop variable instead of observation?

    Hi all,

    When running -logit- or -tpm-, is there a way to force Stata to drop the perfect predictor variable instead of dropping the observations that cause it to be a perfect predictor? Every time observations are dropped this way, I end up going back to the code and dropping the variable myself so that I can keep the observations.
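
    For concreteness, here is a minimal made-up example of the behavior (variable names and data are invented): when every observation in one "state" has dependent variable 0, -logit- drops both that state's indicator and those observations.

    Code:
    clear
    set seed 12345
    set obs 200
    gen byte state = 1 + mod(_n, 4)        // four fake "states"
    gen double x = rnormal()
    gen byte y = runiform() < invlogit(x)
    replace y = 0 if state == 2            // state 2 now predicts failure perfectly
    logit y x i.state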

    Thanks,
    Alex

  • #2
    To my knowledge, there is no way to do this. I can imagine writing a wrapper for -logit- that would accomplish it, but that raises a couple of concerns:

    1. Often the perfect predictor is just one value of a set of indicators ("dummies") for a categorical variable. Dropping just that indicator, without also dropping the corresponding observations, would lead to an ill-specified model whose results would be uninterpretable: in effect, it reclassifies the value of that categorical variable to the reference category for all of those observations (see the sketch after this list). That would only rarely be an acceptable thing to do. Yet dropping the entire categorical variable means throwing away all of the information it carries. That might be acceptable, but it is probably not something I would want to do automatically: choosing variables to include in models should be done thoughtfully and in light of the underlying science and circumstances, not algorithmically.

    2. If you are encountering this problem often enough to want to automate a solution to it, it is likely that the time it would take to program it would be better spent thinking more deeply about what variables to include in the model (or, if you have a very bizarre data set, figuring out how to improve that).
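
    To make point 1 concrete, here is a small sketch with made-up names (binary outcome y, numeric state codes with 1 = Alabama as the lowest and hence the default base, 2 = Alaska): keeping the Alaska observations while dropping only the Alaska indicator gives exactly the same design matrix as recoding Alaska to the base category.

    Code:
    recode state (2 = 1), generate(state_recoded)   // pretend Alaska is Alabama
    logit y i.state_recoded x1 x2                    // same fit as "dropping the Alaska dummy"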



    • #3
      Thanks for the response, Clyde.

      My model specifications include anywhere from 20 to 100 covariates. One example of variables that get dropped is the set of state dummies (Minnesota, Alaska, etc.). I include a dummy for every state that has an observation in my data (among other covariates). However, let's say the 12 observations from Alaska all have dependent variable = 0; then those 12 observations are automatically dropped, and I lose the other information they carry.

      Perhaps you need more information on my methods. Do you see any immediate red flags in removing the Alaska dummy to preserve the 12 dropped observations?
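
      For reference, this is roughly how I find the states that will be affected (hypothetical names: binary outcome y, state identifier state):

      Code:
      bysort state: egen byte min_y = min(y)
      bysort state: egen byte max_y = max(y)
      tab state if min_y == max_y        // states whose outcome never varies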



      • #4
        I would rather think in the direction of augmenting the data, i.e., adding some pseudo-observations with low weight to avoid perfect prediction. This is what is done during multiple imputation. But then again, MI is just a means to an end, i.e., getting good enough predictions, not actually interpreting the coefficients. Just some spontaneous thoughts; a rough sketch of the idea is below.
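
        All names here are made up; Alaska's numeric code (2) and the weight are arbitrary, and the other covariates x1 and x2 are assumed to already be in memory.

        Code:
        gen double w = 1                           // unit weight for the real data
        expand 2 if state == 2, generate(pseudo)   // duplicate the Alaska rows
        replace y = 1 - y if pseudo == 1           // flip the outcome on the copies
        replace w = 0.05 if pseudo == 1            // give the copies a tiny weight
        logit y i.state x1 x2 [iweight=w]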

        Best
        Daniel



        • #5
          "Do you see any immediate red flags in removing the Alaska dummy to preserve the 12 dropped observations?"
          Yes, I do. Some state is the reference category for the state indicators, i.e. the one that is coded zero on all of the state indicators. Let's say it's Alabama (the alphabetically first state name). By dropping the Alaska indicator and keeping those observations in the analysis, you are re-classifying Alaska as being the same as Alabama. If you can justify that on other grounds in the context of your research, then fine. But if you do it just to keep the observations in the estimation sample, you will mangle the model.

          If you are going to stick with logistic regression, I think you just need to identify which states exhibit perfect prediction (in either direction) and exclude their data from the model. The model doesn't apply to them. They are covered by simple external rules: Alaska -> dep var = 0, etc.
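
          For instance (a hedged sketch with hypothetical names: binary outcome y, numeric state codes in state, other covariates x1 x2):

          Code:
          bysort state: egen double p_state = mean(y)
          logit y x1 x2 i.state if p_state > 0 & p_state < 1   // exclude perfectly predicted states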

          Now there are some analytic alternatives. The problem of perfect prediction foils maximum likelihood estimation because the maximum likelihood estimate of the corresponding coefficient is infinite (positive or negative as the case may be). But there is also exact logistic regression, which does not use maximum likelihood and can handle cases like this. The problem is that it is intensive in both computing time and memory, and your problem may just involve too many variables and too much data to be able to use it.

          Or what about a linear probability model estimated by OLS? Another thought would be Bayesian logistic regression, though I have no direct experience with doing that. Daniel Klein's suggestion of augmenting the data set with a few observations that go in the opposite direction is another viable approach (and is conceptually somewhat related to Bayesian logistic regression). But if you are experiencing this with many of your states, then the augmented data could soon add up to an appreciable fraction of the total data, which could be a problem in its own right. Of course, if you are really encountering perfect prediction with a large number of states, it suggests that this data just isn't very suitable for logistic regression analysis. Some sketches of these alternatives follow.
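
          These are only hedged sketches with hypothetical names (binary outcome y, covariates x1 x2, numeric state codes in state), not recommendations for your particular data:

          Code:
          * exact logistic regression (can be very slow and memory hungry on large problems)
          exlogistic y x1 x2 i.state

          * linear probability model by OLS with heteroskedasticity-robust standard errors
          regress y x1 x2 i.state, vce(robust)

          * Bayesian logistic regression using the -bayes- prefix with its default priors
          bayes: logit y x1 x2 i.state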


