  • Multiple imputation: Restricting the range of imputed values for a count model

    Hello!

    My goal is to impute the variable deaths (a count) given the following set of other variables:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long(deaths pop deaths_state pop_state) float gap
     .  55246 50189 4833722 4907
     .  55395 50215 4849377 4724
     .  55347 51909 4858979 4906
     .  55416 52466 4863300 4723
     . 195540 50189 4833722 4907
    16 200111 50215 4849377 4724
    20 203709 51909 4858979 4906
    18 208563 52466 4863300 4723
     .  27076 50189 4833722 4907
     .  26887 50215 4849377 4724
     .  26489 51909 4858979 4906
     .  25965 52466 4863300 4723
     .  22512 50189 4833722 4907
     .  22506 50215 4849377 4724
     .  22583 51909 4858979 4906
     .  22643 52466 4863300 4723
     .  57872 50189 4833722 4907
     .  57719 50215 4849377 4724
    11  57673 51909 4858979 4906
     .  57704 52466 4863300 4723
     .  10639 50189 4833722 4907
     .  10764 50215 4849377 4724
     .  10696 51909 4858979 4906
     .  10362 52466 4863300 4723
     .  20265 50189 4833722 4907
     .  20296 50215 4849377 4724
     .  20154 51909 4858979 4906
     .  19998 52466 4863300 4723
     . 116736 50189 4833722 4907
     . 115916 50215 4849377 4724
     . 115620 51909 4858979 4906
     . 114611 52466 4863300 4723
     .  34162 50189 4833722 4907
     .  34076 50215 4849377 4724
     .  34123 51909 4858979 4906
     .  33843 52466 4863300 4723
     .  26203 50189 4833722 4907
     .  26037 50215 4849377 4724
     .  25859 51909 4858979 4906
     .  25725 52466 4863300 4723
     .  43951 50189 4833722 4907
     .  43931 50215 4849377 4724
     .  43943 51909 4858979 4906
     .  43941 52466 4863300 4723
     .  13426 50189 4833722 4907
     .  13323 50215 4849377 4724
     .  13170 51909 4858979 4906
     .  12993 52466 4863300 4723
     .  25207 50189 4833722 4907
     .  24945 50215 4849377 4724
     .  24675 51909 4858979 4906
     .  24392 52466 4863300 4723
     .  13486 50189 4833722 4907
     .  13552 50215 4849377 4724
     .  13555 51909 4858979 4906
     .  13492 52466 4863300 4723
     .  14994 50189 4833722 4907
     .  15080 50215 4849377 4724
     .  15018 51909 4858979 4906
     .  14924 52466 4863300 4723
     .  50938 50189 4833722 4907
     .  50909 50215 4849377 4724
     .  51211 51909 4858979 4906
     .  51226 52466 4863300 4723
     .  54520 50189 4833722 4907
     .  54543 50215 4849377 4724
     .  54354 51909 4858979 4906
     .  54216 52466 4863300 4723
     .  12887 50189 4833722 4907
     .  12670 50215 4849377 4724
     .  12672 51909 4858979 4906
     .  12395 52466 4863300 4723
     .  10898 50189 4833722 4907
     .  10886 50215 4849377 4724
     .  10724 51909 4858979 4906
     .  10581 52466 4863300 4723
     .  37886 50189 4833722 4907
     .  37914 50215 4849377 4724
     .  37835 51909 4858979 4906
     .  37458 52466 4863300 4723
     .  13986 50189 4833722 4907
     .  13977 50215 4849377 4724
     .  13963 51909 4858979 4906
     .  13913 52466 4863300 4723
     .  80811 50189 4833722 4907
     .  81289 50215 4849377 4724
     .  82005 51909 4858979 4906
    13  82471 52466 4863300 4723
     .  49884 50189 4833722 4907
     .  49484 50215 4849377 4724
     .  49565 51909 4858979 4906
     .  49226 52466 4863300 4723
     .  41996 50189 4833722 4907
     .  41711 50215 4849377 4724
     .  41131 51909 4858979 4906
     .  40008 52466 4863300 4723
     .  71013 50189 4833722 4907
     .  71065 50215 4849377 4724
     .  71130 51909 4858979 4906
     .  70900 52466 4863300 4723
    end
    To do so, I use:

    Code:
    mi impute poisson deaths pop deaths_state pop_state, add(1)
    Importantly, the imputed values for deaths must be integer counts in the range 1 <= deaths <= 9.

    Please advise how I can add this condition to the -mi- command above.
    Last edited by Anton Ivanov; 08 May 2018, 15:55. Reason: multiple imputation

  • #2
    Can you explain at greater length what you are trying to accomplish? First of all, it makes no sense to use multiple imputation to create a single imputed data set. -add(1)- is syntactically legal, but it defeats the whole point of -mi-, which is to create many imputed data sets so that you can estimate the sampling variation attributable to the imputation and account for it in later analysis. And what is the reason that your variable must fall in the 1 to 9 range? A restriction like that also would not make sense in the context of multiple imputation. Also, at least in your example data, the vast majority of the values of deaths are missing, so, again, it is far from clear that multiple imputation will accomplish anything useful here. Perhaps you need some other approach.
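
    For reference, a more typical workflow would create many imputations and then combine the results with -mi estimate-; something along these lines (a sketch only, with an arbitrary number of imputations and seed):

    Code:
    * sketch: create many imputations and combine estimates across them
    mi set wide
    mi register imputed deaths
    mi register regular pop deaths_state pop_state
    mi impute poisson deaths pop deaths_state pop_state, add(20) rseed(12345)
    mi estimate: poisson deaths pop deaths_state pop_state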

    If you can explain the larger setting and purpose, perhaps somebody can suggest something that would be more appropriate to your goals.



    • #3
      Clyde, thank you for your response. Sure, let me elaborate on the problem I am trying to solve.

      For the purposes of my study, I obtained yearly (2013-2016) publicly available data on opioid-related drug overuse, measured as the number of deaths at the county level. Notably, due to confidentiality constraints, these data are suppressed when vital statistics (e.g., the number or rate of deaths) are reported for fewer than ten persons. Because of this limitation, my current sample includes only 773 of the 3,007 counties in the United States. In other words, from 2013 to 2016 only 773 US counties had 10 or more opioid-related deaths. Because we know why values are suppressed, we can say definitively that the remaining counties had more than 0 but fewer than 10 deaths (zeros are reported); the exact numbers are simply unknown to us.

      To address this limitation, I tried an approach suggested in the literature -- imputing state-level age-adjusted mortality rates (Tiwari, C., Beyer, K., & Rushton, G. (2014). The impact of data suppression on local mortality rates: The case of CDC WONDER. American Journal of Public Health, 104(8), 1386-1388). However, this approach did not yield a plausible solution given such a large share of suppressed values.

      The following data are available to me: (incomplete) county deaths (N = 773); (complete) state death rates, county population, and state population (N = 3,007). In addition, I know the difference between the total number of deaths across all counties and the total across non-suppressed counties. For example, the total across all counties is 25,082 and the total across non-suppressed counties is 21,005. Therefore, I can calculate the total number of deaths associated with all suppressed counties (each of which has a value >0 and <10).
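
      (In this example, that difference is 25,082 - 21,005 = 4,077 deaths in total across the suppressed counties.)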

      Given all the data in hand, I was hoping to find an approach to impute (or perhaps intelligently guess) the number of deaths for the suppressed counties, which are known to lie strictly between 0 and 10. Any feedback would be greatly appreciated.
      Last edited by Anton Ivanov; 08 May 2018, 17:51.



      • #4
        To be honest, I can't think of anything that I would call a good solution to your problem. With so much missing data, which is definitely missing NOT at random and, in fact, clearly has a distribution whose support is disjoint from that of the observed data, I don't see any truly good way to go.

        If abandoning ship is not an option here, I would probably treat this as left-censored data. That is, I would impute a value of 10 to all of the missing observations of deaths and use analytic techniques for censored data, like -tobit-, to handle it. That is really what you have here: observations that are left-censored at 10. The only thing that distinguishes your situation from ordinary censoring is that you also know the sum of all the missing values. But I don't see a way to incorporate that information into any kind of standard analysis.
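
        In rough outline, that might look something like this (a sketch only; the right-hand-side variables are placeholders for whatever model you intend to fit):

        Code:
        * sketch: set suppressed counties to the censoring bound and fit a model
        * that treats them as left-censored at 10
        gen double deaths_cens = deaths
        replace deaths_cens = 10 if missing(deaths)
        tobit deaths_cens pop deaths_state pop_state, ll(10)
        * note: observed counties with exactly 10 deaths would also be treated
        * as censored under this coding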

        I hope others will read this, and perhaps somebody else has a better idea.



        • #5
          Dear Clyde, I am very thankful for your feedback. At this stage, the limitation of the data seems to be a "fundamental limitation" of the study. Still, I look forward to finding ways not to abandon ship. I will also post an update if I come across a plausible solution. Once again, thank you, and thanks to everybody for the feedback.



          • #6
            I spent last night experimenting with an approach to take here and came up with the following solution. Please let me know what you think of it:

            Code:
            . table year, contents(mean deaths n deaths)
            
            --------------------------------------
                 year | mean(deaths)     N(deaths)
            ----------+---------------------------
                 2013 |        40.29           500
                 2014 |  40.89401709           585
                 2015 |  45.68071313           617
                 2016 |    53.532097           701
            --------------------------------------
            
            . *** Replace missing values with "state crude rate / county population", where state crude rate = death count / state population * 100,000
            
            . replace deaths = cruderate_state_/pop if deaths == .
            variable deaths was long now double
            (10,164 real changes made)
            
            . replace deaths=round(deaths, 1.0)
            (10,164 real changes made)
            
            . sum deaths if deaths < 10, det
            
                                       deaths
            -------------------------------------------------------------
                  Percentiles      Smallest
             1%            0              0
             5%            0              0
            10%            0              0       Obs              10,164
            25%            0              0       Sum of Wgt.      10,164
            
            50%            0                      Mean           .0327627
                                    Largest       Std. Dev.      .2816312
            75%            0              8
            90%            0              9       Variance       .0793161
            95%            0              9       Skewness       19.14204
            99%            1              9       Kurtosis       515.8533
            
            . tab deaths if deaths < 10
            
                 deaths |      Freq.     Percent        Cum.
            ------------+-----------------------------------
                      0 |      9,911       97.51       97.51
                      1 |        220        2.16       99.68
                      2 |         24        0.24       99.91
                      3 |          1        0.01       99.92
                      6 |          2        0.02       99.94
                      7 |          1        0.01       99.95
                      8 |          2        0.02       99.97
                      9 |          3        0.03      100.00
            ------------+-----------------------------------
                  Total |     10,164      100.00
            
            . sum deaths, det
            
                                       deaths
            -------------------------------------------------------------
                  Percentiles      Smallest
             1%            0              0
             5%            0              0
            10%            0              0       Obs              12,567
            25%            0              0       Sum of Wgt.      12,567
            
            50%            0                      Mean           8.761996
                                    Largest       Std. Dev.      32.68049
            75%            0            545
            90%           22            549       Variance       1068.015
            95%           46            561       Skewness       8.581437
            99%          166            972       Kurtosis        126.794
            Given the major increase in the number of zeros, I can now use the -zip- (or -zinb-) estimator. Or I can simply use -xtpoisson, fe vce(robust)-, which seems to be quite robust to violations of the dispersion assumptions. Does my logic seem plausible?
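
            For concreteness, the second option might look something like this (a sketch only; the county and year panel identifiers are assumed and do not appear in the data excerpt in #1):

            Code:
            * sketch: fixed-effects Poisson with robust standard errors,
            * assuming -county- and -year- identify the panel
            xtset county year
            xtpoisson deaths pop deaths_state pop_state i.year, fe vce(robust)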
            Last edited by Anton Ivanov; 09 May 2018, 11:44.



            • #7
              I'm not crazy about it. In effect, you are building into your analysis the assumption that all counties with small numbers of deaths experienced a crude mortality rate equal to that of the state as a whole, and then, by rounding, you've jittered it with a little bit of noise. I find that assumption implausible. If it were only a handful of counties, I suppose it might make little difference, but you're making an extremely strong assumption about a large majority of your data. If you do decide to go this route, I think you need to do some very rigorous sensitivity analyses to see how much whatever conclusions you draw depend on that assumption being right.

              I still think I would treat this as censored data and use an appropriate analysis plan for that. On further thought since my original recommendation, -intreg- is actually more suitable, because we know that the number of deaths can't be negative. So really these missing observations are interval-censored between 0 and 9.



              • #8
                Clyde, thank you very much for the feedback. It is extremely useful. I will follow your advice and build from there.



                • #9
                  Please help me wrap my mind around setting up -xtintreg- (since I have panel data). As far as I understand, I need to create depvar_lower and depvar_upper. So, for the missing values I do:
                  Code:
                  gen depvar_lower = 0 if deaths == .   // suppressed counties: lower bound of the interval
                  gen depvar_upper = 9 if deaths == .   // suppressed counties: upper bound of the interval
                  But how should I code the rest of the observed values, which are >= 10? I am a little bit confused here...
                  Last edited by Anton Ivanov; 09 May 2018, 13:18.



                  • #10
                    Judging by this excerpt from Anderson-Bergman (2017), I guess you are dealing with mixed-case censored data, as opposed to a scenario in which every observation has only a range for the dependent variable. R appears to have a way to deal with this kind of censoring, but what about Stata?

                    "Two common forms of interval censored data are current status data (Hoel and Walburg Jr 1972) and mixed case censoring (Schick and Yu 2000). Current status data occurs when each subject is observed at a single time and all that is recorded is whether the event of interest has occurred or not. This results in all subjects being either left or right censored. A classic current status dataset includes mice that are sacrificed at random times and inspected for lung tumors. If tumors were detected, the mice were recorded to be left censored at time of sacrifice. If no tumors were found, they were recorded as right censored. The more general type of interval censoring, called mixed case censoring, can include left censored, right censored, uncensored and observations that are censored but neither right nor left censored." - Anderson-Bergman, C. (2017). icenReg: Regression models for interval censored data. Journal of Statistical Software, 81(12).
                    Last edited by Jasmina Tacheva; 09 May 2018, 13:30.



                    • #11
                      Code:
                      replace depvar_lower = deaths if !missing(deaths)
                      replace depvar_upper = deaths if !missing(deaths)
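
                      With this coding, the suppressed counties are interval-censored on [0, 9], while the observed counties enter with equal lower and upper bounds and are treated as exact values.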



                      • #12
                        Dear Clyde,

                        I am very thankful for your feedback. Based on your suggestions, I was able to fit a reasonable model using -xtintreg-. I then addressed endogeneity caused by omitted-variable bias by means of a control function. To estimate the first-stage residuals, I used a set of theory-based instruments, which passed a series of validity tests. The final estimated model (below) provided appropriate suggestive evidence for the impact of the focal effects (x1-x5):

                        Code:
                        xtintreg depvar_lower depvar_upper c1-c35 res1 res2 res3 res4 res5 i.year x1-x5
                        There is just one additional question I would like to double-check with you: is it appropriate to use an interval regression specification (with this outcome coding) given that the number of deaths is technically a count?



                        • #13
                          Well, it is an approximation. It is one of those models that might be described as wrong, like all models, but perhaps useful. If the model predictions make sense and are a reasonable fit to your data, I wouldn't worry about it. While one might prefer some sort of analysis that handles interval censored count variables, I'm not aware of any such analysis. You would have to describe this as a limitation of your approach.

                          That said, the boundary between count outcomes and continuous outcomes is somewhat permeable. It is often quite useful to model continuous outcomes with a Poisson regression, and modeling count outcomes with OLS regression is also common, and often leads to very useful results.
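
                          For instance, the first idea amounts to something like the following (a minimal sketch with hypothetical variable names; -y- is a nonnegative continuous outcome):

                          Code:
                          * sketch: Poisson regression applied to a nonnegative continuous outcome;
                          * vce(robust) keeps the inference valid without the Poisson variance assumption
                          poisson y x1 x2, vce(robust)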
