  • Margins "not estimable" in zero-inflated Poisson (ZIP) model

    I have run a zero-inflated Poisson model using the -zip- command as follows:

    global depvar er_visits
    global preg i.emfqhc
    global covariates i.mm_agedec male i.mm_race2 urban mh sub ///
    tob charl arv hmo i(2 3 4)b2.maxvol_08_4 imp_zip ///
    hgrad cgrad

    zip $depvar $preg $covariates, inflate($preg $covariates) irr nolog vce(robust)

    I then want to run margins to estimate the number of ER visits according to the three levels of my primary regressor ($preg):
    margins, dydx($preg)

    This returns the error message that the results are "not estimable".

    I think the problem has to do with collinearity. Two of my variables, i.emfqhc ($preg) and maxvol_08_4, are categorical variables with three levels, and the first level of each is perfectly collinear with the other because both identify the people who had no primary care visits in the last year. The variables represent different things (the type of clinic one receives care in, and the outpatient experience of the provider), but in both cases a separate category had to be made for those who had no primary care visits, because they would otherwise be unclassifiable. To get around this collinearity, I omitted level 1 of maxvol_08_4 in my code above (i(2 3 4)b2.maxvol_08_4) so that it would not be perfectly collinear with level 1 of emfqhc. I think this might be the reason for the "not estimable" error message.

    I also tried looking at the matrix, as I saw in other posts:

    matrix H = get(H)
    matrix list H

    All of the values are -1, 0, or 1.

    I have tried using the noestimcheck option:
    margins, dydx($preg) noestimcheck

    This works and does output results, but I am wondering if it is the appropriate use of this option.

    Thank you all for your help!

  • #2
    Are there empty cells in your model? That is, are there combinations of your discrete variables (emfqhc mm_agedec mm_race2 maxvol_08_4) for which there are no observations in the estimation sample? Remember that any observation with missing values on any of the variables in the model is excluded from the estimation sample, so even if every combination is instantiated in the data set as a whole, there might be none left for some combination(s) in the estimation sample. Empty cells are the commonest cause for getting "not estimable." Given that emfqhc is the variable provoking the message, I would look carefully for non-existent observations for some values of emfqhc, or for non-existent combinations of emfqhc with other variables.

    It sounds from your description of the design of the variables that this is highly likely to be the case. One obvious solution is to drop one of the offending variables from the model, or somehow re-define it. There are other ways to coax -margins- into estimating things it doesn't really want to. Read the details in the [R] manual section on -margins-. But all of them have drawbacks in that they lead to statistics that are conditional on the way you manipulate the data to make them happen.
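
    To make the point about the estimation sample concrete, here is a minimal sketch using Stata's shipped auto dataset rather than the poster's data, so the variable names (price, mpg, rep78, foreign) are stand-ins. It compares the full data set with the estimation sample and then cross-tabulates two discrete variables within that sample:

    ```stata
    * Stand-in example with the auto dataset; not the poster's data or model
    sysuse auto, clear

    * rep78 has missing values, so those observations are dropped from estimation
    regress price mpg i.rep78

    * Observations in the data set versus observations actually used
    count
    count if e(sample)

    * Cross-tab restricted to the estimation sample; zero counts are empty cells
    tabulate rep78 foreign if e(sample)
    ```

    The same pattern applies after -zip-: run the model first, then add the -if e(sample)- qualifier to any tabulation of the discrete predictors.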


    • #3
      Thanks for your suggestion. I have checked and there are no empty cells in the model. Do you think the problem has anything to do with the collinearity between the two variables that I mentioned in my initial post?


      • #4
        I don't think that's it. If you had not removed the collinearity yourself, Stata would have removed it for you by removing one of the two offending variables. There is no collinearity remaining. The effect of removing the 1 level from maxvol_08_4 is, in effect, to merge that level with the 2 level of that same variable. While that may or may not be conceptually defensible, depending on what the two levels of that variable actually mean, it shouldn't create any problems for estimability. In fact, if anything, by reducing the number of combinations of variables in the model, it reduces the chances of one being empty.

        If there are no empty cells in the estimation sample, then I don't really know what else to suggest. If you want to post back with the output of the zip command itself, perhaps there are some clues in there of something going wrong. (Please post the zip output completely and exactly as it is by copying from the Results window or your log file to the clipboard and then pasting into a code block on this forum. Please don't edit it in any way: details are important. See FAQ #12 paragraph 7 for instructions on setting up a code block if you don't know how that's done.)


        • #5
          I just tried the ZIP model with the maxvol_08_4 variable omitted and the margins command now works, suggesting to me that the problem is with that variable. Since it has no empty cells, my only logical conclusion is that it has to do with the collinear portion. Do you think the noestimcheck option is acceptable in this case?

          Also, I wasn't aware that removing the level 1 variable from maxvol_08_4 would combine it with level 2. Is there any way to completely omit level 1 without changing the overall estimation sample?


          • #6
            "I just tried the ZIP model with the maxvol_08_4 variable omitted and the margins command now works, suggesting to me that the problem is with that variable. Since it has no empty cells, my only logical conclusion is that it has to do with the collinear portion. Do you think the noestimcheck option is acceptable in this case?"
            I'm not saying your conclusion is wrong, but I don't understand the logic leading up to it. I do agree that the problem has something to do with the pairing of maxvol_08_4 and emfqhc in the model. Are you absolutely certain that there are no empty cells among all combinations of those two variables in the estimation sample? That is, if you ran -tab emfqhc maxvol_08_4 if e(sample)-, are there no zeroes in the cross-tab? I don't feel able to advise you about the noestimcheck option. It will get you a result, but what that result means I would not be able to say.
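
            If the cross-tab is large, the zero cells can also be counted rather than spotted by eye. This is only a sketch on made-up data: clinic and vol are hypothetical stand-ins for the two real variables, built so that level 1 of one coincides exactly with level 1 of the other, as in the design described in this thread.

            ```stata
            * Toy data: level 1 of clinic occurs only with level 1 of vol, and vice versa
            clear
            input byte(clinic vol)
            1 1
            1 1
            2 2
            2 3
            3 2
            3 3
            end

            * Save the cell frequencies of the two-way table in matrix F
            tabulate clinic vol, matcell(F)
            matrix list F

            * Count the cells with zero observations
            local empty = 0
            forvalues i = 1/`=rowsof(F)' {
                forvalues j = 1/`=colsof(F)' {
                    if F[`i',`j'] == 0 local ++empty
                }
            }
            display "empty cells: `empty'"
            ```

            On real results the -tabulate- line would carry the -if e(sample)- qualifier after the model has been fit, so that only the estimation sample is counted.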

            "Also, I wasn't aware that removing the level 1 variable from maxvol_08_4 would combine it with level 2. Is there any way to completely omit level 1 without changing the overall estimation sample?"
            So think about it this way. When you use i(2 3 4)b2.maxvol_08_4 in the model, what happens? You have indicator variables for 3.maxvol and 4.maxvol (I'm leaving off the _08_4 for brevity). Observations with maxvol = 2 have 0 for both 3.maxvol and 4.maxvol. What do observations with maxvol = 1 have for 3.maxvol and 4.maxvol? Also both 0. There is therefore no variable in the model that distinguishes observations with maxvol = 2 from those with maxvol = 1: you have effectively collapsed the two levels together.
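
            The argument above can be checked on a toy data set. This is a minimal sketch, assuming a hypothetical four-level variable called maxvol; the indicators ind3 and ind4 are built by hand to mimic the two indicators that, by the reasoning above, end up in the model.

            ```stata
            * Toy data: one observation per level of a hypothetical 4-level variable
            clear
            set obs 4
            generate byte maxvol = _n

            * Hand-built stand-ins for the 3.maxvol and 4.maxvol indicators
            generate byte ind3 = (maxvol == 3)
            generate byte ind4 = (maxvol == 4)

            * Levels 1 and 2 carry the same pattern, so nothing tells them apart
            list maxvol ind3 ind4, noobs
            ```

            In the listing, the rows for maxvol = 1 and maxvol = 2 are identical in ind3 and ind4, which is what "collapsed together" means here.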

            That isn't necessarily bad. It depends on what the 1 and 2 levels of maxvol mean. If, as the name sort of suggests, the variable is a categorization of some quantity, and if it respects the order of that quantity, then putting together what might be the two lowest levels would make sense when, as here, there is a reason you can't keep them separate. In fact, combining two adjacent levels of an ordinal variable is commonly done when one of the levels has too few observations. There are other circumstances where combining two levels of a categorical variable can make sense. If the variable reflects, say, religion, it may make sense to lump different Protestant sects together and just call it Protestant (depending on what is being analyzed). It would be less likely for lumping Jews and Hindus into a single category to make sense. So it all depends on the meaning.

            I can't think of any way to eliminate the level 1 of maxvol that doesn't involve either recoding it to merge with some other level (as you have done) or simply excluding from the estimation sample all observations with maxvol = 1. But the latter is what you say you want to avoid. That said, I would challenge the underlying logic of your approach. The two variables, emfqhc and maxvol are, as I understand it, both attributes of experience in the primary care setting that you are using to predict ER visits. Your level 1 in each case is used to denote the absence of any experience in the primary care setting. Perhaps it is worth doing the model just for people who have some exposure to the primary care setting, and then, perhaps, doing a separate model that omits these variables altogether, for those who have never been in the primary care setting? I don't really know enough about your setting and goals to know if that proposal makes sense or not, but my main point is to encourage you to think about alternative approaches to the modeling.
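
            For what it's worth, the two options mentioned above might look like this in code. It is a sketch on toy data only: maxvol is a hypothetical stand-in for the real variable, and maxvol_m and has_pc are names invented for the illustration.

            ```stata
            * Toy data: a hypothetical 4-level variable, two observations per level
            clear
            set obs 8
            generate byte maxvol = mod(_n - 1, 4) + 1

            * Option 1: merge level 1 into level 2 explicitly, keeping every observation
            recode maxvol (1 = 2), generate(maxvol_m)
            tabulate maxvol maxvol_m

            * Option 2: flag observations with some primary care exposure, so that
            * the model could be restricted with an -if has_pc- qualifier
            generate byte has_pc = (maxvol != 1)
            count if has_pc
            ```

            An explicit recode makes the merge visible in the data and in the output labels, rather than leaving it implicit in the factor-variable notation. Whether either option is sensible still depends on what the levels mean.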


            • #7
              Thanks for your detailed explanation! I now understand what you mean by empty cells. I had misunderstood this and interpreted it as "missing values". In doing the crosstabs, yes, there are indeed some zeros for some of the combinations (which would make sense given the collinearity). I will think about re-specifying one of these variables; maybe that will get around this issue and give estimable margins. Combining maxvol levels 1 and 2 might achieve this, though I will have to think about the interpretability. Thanks again!
