Multiple imputation?

Evangelos Anagnostou

Join Date: Feb 2019

Posts: 4
#1

18 Feb 2019, 01:39

Hello everyone,

First post here and I have to say I really enjoy my first months with Stata, even though I'm learning it pretty much under pressure of a timeline and a project.

I have a large dataset with 127,535 observations (participants). The dependent variable is death vs. injury (dichotomous). I have to deal with a missing-data issue: two of the independent variables under investigation have a large percentage of missing data, namely helmet use for bike riders (around 33%) and seatbelt use for drivers of four-wheel vehicles (around 46%). How should I proceed?

The data do not seem to be MCAR. The variable “geocode” shows that in one specific city the share of unknowns may reach 80%, while in another it may be below 10%. This is mostly because police departments around the country take different approaches to the standard RTA form.

What’s more, missingness seems to be related to the dependent variable under investigation, “outcome” (death vs. injury/no injury). For example, we have the following results on whether a bike rider wore a helmet at the time of an accident. Yes: 44%, No: 22%, Unknown (missing): 34%. However, if we stratify by severity, this information was missing for 36% of the bike riders with no or light injury and for 22% of those with severe injury or death.

Is multiple imputation a good idea, and how should I proceed with the commands?

Thank you for your time,

Evangelos
Andrea Berni

Join Date: Jun 2016

Posts: 37
#2

18 Feb 2019, 07:36

Hello.

First issue: 33% and 46% missing observations is pretty high and, in my humble opinion, undermines research results in most fields. This does not mean you cannot continue with your analysis, but you should make the shortcomings of your research very clear to your readers.

Second issue: sorry, but I cannot help with the "MCAR" problem, even though I can say that having 80% missing in one case and just 10% in another is very odd and brings me back to what I wrote above.

Third issue: missingness correlated with the Y variable might cause serious problems, even worse than missingness correlated with the X variable, such as overestimation or underestimation of effects, but this is "just" another theoretical/statistical problem which does not prevent you from running your analysis.

Finally, I must admit I am no expert in predicting missing data and filling in the formerly missing observations, but the idea does not make me particularly enthusiastic: it feels like lying with statistics to me, especially when you know your missing observations might be strongly correlated with the dependent variable. Also, I do not know exactly what kind of analysis you are running, whether regressions or comparisons of means; however, you can always use dummy variables/categories that capture missing observations in order to keep them in the analysis, even though this may not be very informative.
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

18 Feb 2019, 08:39

You have an excellent sample size. You have an interesting question. The missing pattern for this observational study is a far cry from MCAR. So far so good.

Considering you will probably proceed with the analysis (why not?), I just wish to share a few comments.

The best approach IMHO would combine multiple imputation techniques with some sort of sensitivity analysis, including a complete-case analysis.

If you have "extra" variables which you believe to be linked to the missing data pattern, try to include them in the MI model.

Should you need to demonstrate the results under the worst scenario (I mean, MNAR), you could do that as part of a sensitivity analysis as well.

In short, I think that coping with ("real world") missing data tends to be the most decent approach, rather than sticking to the (probably biased) complete-case analysis.
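As a rough illustration of that workflow, here is a minimal sketch on simulated data. Everything in it is an assumption made for the example: the simulated values, the coding of helmet as 0/1 with system missing (.) for unknown, the choice of 20 imputations, and the seeds. Only the variable names outcome, helmet and geocode come from #1. The real imputation model would need whatever auxiliary variables are believed to drive the missingness.

```stata
* Toy data, simulated for illustration only (not the real dataset)
clear
set seed 2019
set obs 2000
generate byte geocode = 1 + floor(3*runiform())   // three hypothetical cities
generate byte outcome = runiform() < 0.2          // 1 = death, 0 = injury
generate byte helmet  = runiform() < 0.6          // 1 = yes, 0 = no
* Make helmet missing far more often in city 1 than elsewhere
replace helmet = . if runiform() < cond(geocode == 1, 0.6, 0.1)

* Declare the mi structure and the role of each variable
mi set wide
mi register imputed helmet
mi register regular outcome geocode

* Impute the binary helmet variable; the outcome and the variable
* linked to missingness both enter the imputation model
mi impute logit helmet i.outcome i.geocode, add(20) rseed(12345)

* Analysis model, pooled over the imputations
mi estimate, or: logit outcome i.helmet

* Complete-case analysis on the original data, for comparison
logit outcome i.helmet
```

Comparing the pooled estimates with the complete-case fit is one simple form of the sensitivity analysis mentioned above; it says nothing about MNAR, which would need its own scenarios.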

Best regards,

Marcos
Evangelos Anagnostou

Join Date: Feb 2019

Posts: 4
#4

18 Feb 2019, 11:11

Thank you both for your replies. I think I will be proceeding with mi logit for this one and share my findings.
Evangelos Anagnostou

Join Date: Feb 2019

Posts: 4
#5

18 Feb 2019, 11:23

One more problem I'm dealing with.

I generated the helmet indicator variable from another categorical variable, safetyeq, in which helmet use has a value of 2 and non-use a value of 5. It also has a value of 9 for unknown.
gen helmet=1 if safetyeq==2 & (vehicle==4 | vehicle==5).
replace helmet=0 if safetyeq==5 & (vehicle==4 | vehicle==5)
replace helmet==2 if safetyeq==5 & (vehicle==4 | vehicle==5)
label define thehelmet 0"no" 1"yes" 2"unknown/missing"

Vehicle 4 and 5 are bicycle and motorbike. Stata gives me back missing values for helmet for the car/lorry/bus drivers/passengers as well.

As I already said, if I run tab helmet I get back
no 22.7%
yes 45%
unknown 32.3%

If I run tab helmet, m I get back
no 7%
yes 13.7%
unknown 9.8%
. 69.5%

How can I tell Stata that helmet is a variable only for specific observations, so that the output looks like
no 22.7%
yes 45%
. 32.3%

In other words, it's like having missing data in a dataset for males in the question "What's your bra size?". How can I fix this?

Thanks.
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#6

18 Feb 2019, 13:41

The best approach when you ask for a solution in Stata is to share the data (a toy example will do fine), as well as the commands and output, under code delimiters. You may use - dataex - for this. The instructions are fully available in the FAQ. Please read the FAQ.

That being said, it seems you didn't share the exact commands, for a couple of reasons: the second - replace - command has an extra "="; also, it uses the same condition as the previous - replace - command (safetyeq==5), so it would simply overwrite the values just set. Presumably the code for unknown (safetyeq==9) was intended there.
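For reference, a minimal sketch of how the helmet variable from #5 could be built so that unknown and "not applicable" are kept apart. The toy data and the vehicle codes 1 and 2 for non-riders are hypothetical; the use of .a for "not applicable" is one possible convention, not something stated in the thread.

```stata
* Toy data: vehicle 4/5 = bicycle/motorbike; 1/2 = other (hypothetical codes)
clear
input byte(vehicle safetyeq)
4 2
4 5
5 9
5 2
1 9
2 5
end

* Start from "not applicable" (.a), then open the variable up for riders
generate byte helmet = .a
replace helmet = .  if inlist(vehicle, 4, 5)                  // rider, unknown so far
replace helmet = 1  if safetyeq == 2 & inlist(vehicle, 4, 5)  // helmet worn
replace helmet = 0  if safetyeq == 5 & inlist(vehicle, 4, 5)  // no helmet

label define thehelmet 0 "no" 1 "yes" .a "not applicable"
label values helmet thehelmet

* Restricting to riders gives no / yes / . without the other road users
tabulate helmet if inlist(vehicle, 4, 5), missing
```

One reason to separate the two kinds of missing: mi treats system missing (.) as soft missing, eligible for imputation, and extended missing values (.a to .z) as hard missing, which are left alone.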

You may also wish to take a look at - mvencode - as well as - mvdecode - commands.
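A short sketch of what - mvdecode - does, on a toy variable that follows the 0/1/2 coding from #5 (the four data values are made up):

```stata
* Toy helmet variable: 0 = no, 1 = yes, 2 = unknown/missing
clear
input byte helmet
0
1
2
1
end

* Turn the numeric code 2 into system missing (.)
mvdecode helmet, mv(2)
tabulate helmet, missing
```

- mvencode helmet, mv(2) - would go the other way and turn missing values back into the code 2.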

I hope you won't take it amiss, but most of the issues raised in #5 are related to basic commands in Stata. This is to say that you probably need to make at least a "quick start" by reading, well, the "Quick start" section (for each important command you intend to use) in the Stata Manual.

If this is true, please beware that rushing into complex commands (such as mi) before having a good grasp of the basics may become a journey full of mishaps.

To end, the "if" clause is one of the best ways to tell Stata that a given command shall be applied to a specific set of values of a variable. If you wish to select specific observations, and you have the ID variable, typing "if ID ==" is a good starting point.

Hopefully that helps.

Last edited by Marcos Almeida; 18 Feb 2019, 13:46.

Best regards,

Marcos