  • Multiple imputation: Restricting the range of imputed values for a count model

    Hello!

    My goal is to impute the variable deaths (a count) given the following set of other variables:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long(deaths pop deaths_state pop_state) float gap
     .  55246 50189 4833722 4907
     .  55395 50215 4849377 4724
     .  55347 51909 4858979 4906
     .  55416 52466 4863300 4723
     . 195540 50189 4833722 4907
    16 200111 50215 4849377 4724
    20 203709 51909 4858979 4906
    18 208563 52466 4863300 4723
     .  27076 50189 4833722 4907
     .  26887 50215 4849377 4724
     .  26489 51909 4858979 4906
     .  25965 52466 4863300 4723
     .  22512 50189 4833722 4907
     .  22506 50215 4849377 4724
     .  22583 51909 4858979 4906
     .  22643 52466 4863300 4723
     .  57872 50189 4833722 4907
     .  57719 50215 4849377 4724
    11  57673 51909 4858979 4906
     .  57704 52466 4863300 4723
     .  10639 50189 4833722 4907
     .  10764 50215 4849377 4724
     .  10696 51909 4858979 4906
     .  10362 52466 4863300 4723
     .  20265 50189 4833722 4907
     .  20296 50215 4849377 4724
     .  20154 51909 4858979 4906
     .  19998 52466 4863300 4723
     . 116736 50189 4833722 4907
     . 115916 50215 4849377 4724
     . 115620 51909 4858979 4906
     . 114611 52466 4863300 4723
     .  34162 50189 4833722 4907
     .  34076 50215 4849377 4724
     .  34123 51909 4858979 4906
     .  33843 52466 4863300 4723
     .  26203 50189 4833722 4907
     .  26037 50215 4849377 4724
     .  25859 51909 4858979 4906
     .  25725 52466 4863300 4723
     .  43951 50189 4833722 4907
     .  43931 50215 4849377 4724
     .  43943 51909 4858979 4906
     .  43941 52466 4863300 4723
     .  13426 50189 4833722 4907
     .  13323 50215 4849377 4724
     .  13170 51909 4858979 4906
     .  12993 52466 4863300 4723
     .  25207 50189 4833722 4907
     .  24945 50215 4849377 4724
     .  24675 51909 4858979 4906
     .  24392 52466 4863300 4723
     .  13486 50189 4833722 4907
     .  13552 50215 4849377 4724
     .  13555 51909 4858979 4906
     .  13492 52466 4863300 4723
     .  14994 50189 4833722 4907
     .  15080 50215 4849377 4724
     .  15018 51909 4858979 4906
     .  14924 52466 4863300 4723
     .  50938 50189 4833722 4907
     .  50909 50215 4849377 4724
     .  51211 51909 4858979 4906
     .  51226 52466 4863300 4723
     .  54520 50189 4833722 4907
     .  54543 50215 4849377 4724
     .  54354 51909 4858979 4906
     .  54216 52466 4863300 4723
     .  12887 50189 4833722 4907
     .  12670 50215 4849377 4724
     .  12672 51909 4858979 4906
     .  12395 52466 4863300 4723
     .  10898 50189 4833722 4907
     .  10886 50215 4849377 4724
     .  10724 51909 4858979 4906
     .  10581 52466 4863300 4723
     .  37886 50189 4833722 4907
     .  37914 50215 4849377 4724
     .  37835 51909 4858979 4906
     .  37458 52466 4863300 4723
     .  13986 50189 4833722 4907
     .  13977 50215 4849377 4724
     .  13963 51909 4858979 4906
     .  13913 52466 4863300 4723
     .  80811 50189 4833722 4907
     .  81289 50215 4849377 4724
     .  82005 51909 4858979 4906
    13  82471 52466 4863300 4723
     .  49884 50189 4833722 4907
     .  49484 50215 4849377 4724
     .  49565 51909 4858979 4906
     .  49226 52466 4863300 4723
     .  41996 50189 4833722 4907
     .  41711 50215 4849377 4724
     .  41131 51909 4858979 4906
     .  40008 52466 4863300 4723
     .  71013 50189 4833722 4907
     .  71065 50215 4849377 4724
     .  71130 51909 4858979 4906
     .  70900 52466 4863300 4723
    end
    To do so, I use:

    Code:
    mi impute poisson deaths pop deaths_state pop_state, add(1)
    Importantly, the imputed values for deaths must be integer counts in the range 1 <= deaths <= 9.

    Please advise how I can add this condition to the -mi- command above.
    Last edited by Anton Ivanov; 08 May 2018, 15:55. Reason: multiple imputation

  • #2
    Can you explain at greater length what you are trying to accomplish? First of all, it makes no sense to use multiple imputation to create a single imputed data set. -add(1)- is syntactically legal, but it defeats the whole point of -mi-, which is to create many imputed data sets so that you can estimate the sampling variation attributable to the imputation and account for it in later analysis. And what is the reason that your variable must fall in the 1 to 9 range? A restriction like that also would not make sense in the context of multiple imputation. Also, at least in your example data, the vast majority of the values of deaths are missing, so, again, it is far from clear that multiple imputation will accomplish anything useful here. Perhaps you need some other approach.
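
    For reference, a more typical workflow would create many imputations and then combine the results with -mi estimate-; something along these lines (a sketch only, with an arbitrary number of imputations and seed):

    Code:
    * sketch: create many imputations and combine estimates across them
    mi set wide
    mi register imputed deaths
    mi register regular pop deaths_state pop_state
    mi impute poisson deaths pop deaths_state pop_state, add(20) rseed(12345)
    mi estimate: poisson deaths pop deaths_state pop_state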

    If you can explain the larger setting and purpose, perhaps somebody can suggest something that would be more appropriate to your goals.



    • #3
      Clyde, thank you for your response. Sure, let me elaborate on the problem I am trying to solve.

      For the purposes of my study, I obtained yearly (2013-2016) publicly available data on opioid-related drug overuse, measured as the number of deaths at the county level. Notably, due to confidentiality constraints, these data are suppressed when vital statistics (e.g., the number or rate of deaths) are reported for fewer than ten persons. Because of this limitation, my current sample includes only 773 of the 3,007 counties in the United States. In other words, from 2013 to 2016 only 773 US counties had 10 or more opioid-related deaths. Because we know why values are suppressed, we can say definitively that the remaining counties had more than 0 but fewer than 10 deaths (zeros are reported); the exact numbers are simply unknown to us.

      To address this limitation, I tried an approach suggested in the literature -- imputing state-level age-adjusted mortality rates (Tiwari, C., Beyer, K., & Rushton, G. (2014). The impact of data suppression on local mortality rates: The case of CDC WONDER. American Journal of Public Health, 104(8), 1386-1388). However, this approach did not yield a plausible solution given such a large share of suppressed values.

      The following data are available to me: (incomplete) county deaths (N = 773); (complete) state death rates, county population, and state population (N = 3,007). In addition, I know the difference between the total number of deaths across all counties and the total across non-suppressed counties. For example, the total across all counties is 25,082 and the total across non-suppressed counties is 21,005. Therefore, I can calculate the total number of deaths associated with all suppressed counties (each of which has a value >0 and <10).
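
      (In this example, that difference is 25,082 - 21,005 = 4,077 deaths in total across the suppressed counties.)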

      Given all the data in hand, I was hoping to find an approach to impute (or perhaps intelligently guess) the number of deaths for the suppressed counties, which are known to lie strictly between 0 and 10. Any feedback would be greatly appreciated.
      Last edited by Anton Ivanov; 08 May 2018, 17:51.



      • #4
        To be honest, I can't think of anything that I would call a good solution to your problem. With so much missing data, which is definitely missing NOT at random and, in fact, clearly has a distribution whose support is disjoint from that of the observed data, I don't see any truly good way to go.

        If abandoning ship is not an option here, I would probably treat this as left-censored data. That is, I would impute a value of 10 to all of the missing observations of deaths and use analytic techniques for censored data, like -tobit-, to handle it. That is really what you have here: observations that are left-censored at 10. The only thing that distinguishes your situation from ordinary censoring is that you also know the sum of all the missing values. But I don't see a way to incorporate that information into any kind of standard analysis.
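
        In rough outline, that might look something like this (a sketch only; the right-hand-side variables are placeholders for whatever model you intend to fit):

        Code:
        * sketch: set suppressed counties to the censoring bound and fit a model
        * that treats them as left-censored at 10
        gen double deaths_cens = deaths
        replace deaths_cens = 10 if missing(deaths)
        tobit deaths_cens pop deaths_state pop_state, ll(10)
        * note: observed counties with exactly 10 deaths would also be treated
        * as censored under this coding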

        I hope others will read this, and perhaps somebody else has a better idea.



        • #5
          Dear Clyde, I am very thankful for your feedback. At this stage, the limitation of the data seems to be a "fundamental limitation" of the study. Still, I look forward to finding ways not to abandon ship. I will also post an update if I come across a plausible solution. Once again, thank you, and thanks to everybody for the feedback.



          • #6
            I spent last night experimenting with an approach to take here and came up with the following solution. Please let me know what you think of it:

            Code:
            . table year, contents(mean deaths n deaths)
            
            --------------------------------------
                 year | mean(deaths)     N(deaths)
            ----------+---------------------------
                 2013 |        40.29           500
                 2014 |  40.89401709           585
                 2015 |  45.68071313           617
                 2016 |    53.532097           701
            --------------------------------------
            
            . *** Replace missing values with "state crude rate / county population", where state crude rate = death count / state population * 100,000
            
            . replace deaths = cruderate_state_/pop if deaths == .
            variable deaths was long now double
            (10,164 real changes made)
            
            . replace deaths=round(deaths, 1.0)
            (10,164 real changes made)
            
            . sum deaths if deaths < 10, det
            
                                       deaths
            -------------------------------------------------------------
                  Percentiles      Smallest
             1%            0              0
             5%            0              0
            10%            0              0       Obs              10,164
            25%            0              0       Sum of Wgt.      10,164
            
            50%            0                      Mean           .0327627
                                    Largest       Std. Dev.      .2816312
            75%            0              8
            90%            0              9       Variance       .0793161
            95%            0              9       Skewness       19.14204
            99%            1              9       Kurtosis       515.8533
            
            . tab deaths if deaths < 10
            
                 deaths |      Freq.     Percent        Cum.
            ------------+-----------------------------------
                      0 |      9,911       97.51       97.51
                      1 |        220        2.16       99.68
                      2 |         24        0.24       99.91
                      3 |          1        0.01       99.92
                      6 |          2        0.02       99.94
                      7 |          1        0.01       99.95
                      8 |          2        0.02       99.97
                      9 |          3        0.03      100.00
            ------------+-----------------------------------
                  Total |     10,164      100.00
            
            . sum deaths, det
            
                                       deaths
            -------------------------------------------------------------
                  Percentiles      Smallest
             1%            0              0
             5%            0              0
            10%            0              0       Obs              12,567
            25%            0              0       Sum of Wgt.      12,567
            
            50%            0                      Mean           8.761996
                                    Largest       Std. Dev.      32.68049
            75%            0            545
            90%           22            549       Variance       1068.015
            95%           46            561       Skewness       8.581437
            99%          166            972       Kurtosis        126.794
            Given the major increase in the number of zeros, I can now use the -zip- (or -zinb-) estimator. Or I can simply use -xtpoisson, fe vce(robust)-, which seems to be quite robust to violations of the dispersion assumptions. Does my logic seem plausible?
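
            For concreteness, the second option might look something like this (a sketch only; the county and year panel identifiers are assumed and do not appear in the data excerpt in #1):

            Code:
            * sketch: fixed-effects Poisson with robust standard errors,
            * assuming -county- and -year- identify the panel
            xtset county year
            xtpoisson deaths pop deaths_state pop_state i.year, fe vce(robust)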
            Last edited by Anton Ivanov; 09 May 2018, 11:44.



            • #7
              I'm not crazy about it. In effect, you are building into your analysis the assumption that all counties with small numbers of deaths experienced a crude mortality rate equal to that of the state as a whole, and then, by rounding, you've jittered it with a little bit of noise. I find that assumption implausible. If it were only a handful of counties, I suppose it might make little difference, but you're making an extremely strong assumption about a large majority of your data. If you do decide to go this route, I think you need to do some very rigorous sensitivity analyses to see how much whatever conclusions you draw depend on that assumption being right.

              I still think I would treat this as censored data and use an appropriate analysis plan for that. On further thought since my original recommendation, -intreg- is actually more suitable, because we know that the number of deaths can't be negative. So really these missing observations are interval-censored between 0 and 9.



              • #8
                Clyde, thank you very much for the feedback. It is extremely useful. I will follow your advice and build from there.



                • #9
                  Please help me wrap my mind around setting up -xtintreg- (since I have panel data). As far as I understand, I need to create depvar_lower and depvar_upper. So, for the missing values I do:
                  Code:
                  gen depvar_lower = 0 if deaths == .   // suppressed counties: lower bound of the interval
                  gen depvar_upper = 9 if deaths == .   // suppressed counties: upper bound of the interval
                  But how should I code the rest of the observed values, which are >= 10? I am a little bit confused here...
                  Last edited by Anton Ivanov; 09 May 2018, 13:18.



                  • #10
                    Judging by this excerpt from Anderson-Bergman (2017), I guess you are dealing with mixed-case censored data, as opposed to a scenario in which every observation has only a range for the dependent variable. R appears to have a way to deal with this kind of censoring, but what about Stata?

                    "Two common forms of interval censored data are current status data (Hoel and Walburg Jr 1972) and mixed case censoring (Schick and Yu 2000). Current status data occurs when each subject is observed at a single time and all that is recorded is whether the event of interest has occurred or not. This results in all subjects being either left or right censored. A classic current status dataset includes mice that are sacrificed at random times and inspected for lung tumors. If tumors were detected, the mice were recorded to be left censored at time of sacrifice. If no tumors were found, they were recorded as right censored. The more general type of interval censoring, called mixed case censoring, can include left censored, right censored, uncensored and observations that are censored but neither right nor left censored." - Anderson-Bergman, C. (2017). icenReg: Regression models for interval censored data. Journal of Statistical Software, 81(12).
                    Last edited by Jasmina Tacheva; 09 May 2018, 13:30.



                    • #11
                      Code:
                      replace depvar_lower = deaths if !missing(deaths)
                      replace depvar_upper = deaths if !missing(deaths)
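
                      With this coding, the suppressed counties are interval-censored on [0, 9], while the observed counties enter with equal lower and upper bounds and are treated as exact values.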



                      • #12
                        Dear Clyde,

                        I am very thankful for your feedback. Based on your suggestions, I was able to fit a reasonable model using -xtintreg-. I then addressed endogeneity caused by omitted-variable bias by means of a control function. To estimate the first-stage residuals, I used a set of theory-based instruments, which passed a series of validity tests. The final estimated model (below) provided appropriate suggestive evidence for the impact of the focal effects (x1-x5):

                        Code:
                        xtintreg depvar_lower depvar_upper c1-c35 res1 res2 res3 res4 res5 i.year x1-x5
                        There is just one additional question I would like to double-check with you: is it appropriate to use an interval regression specification (with this outcome coding) given that the number of deaths is technically a count?



                        • #13
                          Well, it is an approximation. It is one of those models that might be described as wrong, like all models, but perhaps useful. If the model predictions make sense and are a reasonable fit to your data, I wouldn't worry about it. While one might prefer some sort of analysis that handles interval censored count variables, I'm not aware of any such analysis. You would have to describe this as a limitation of your approach.

                          That said, the boundary between count outcomes and continuous outcomes is somewhat permeable. It is often quite useful to model continuous outcomes with a Poisson regression, and modeling count outcomes with OLS regression is also common, and often leads to very useful results.
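
                          For instance, the first idea amounts to something like the following (a minimal sketch with hypothetical variable names; -y- is a nonnegative continuous outcome):

                          Code:
                          * sketch: Poisson regression applied to a nonnegative continuous outcome;
                          * vce(robust) keeps the inference valid without the Poisson variance assumption
                          poisson y x1 x2, vce(robust)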
