Optimal MISSING DATA technique in large retrospective study on RTA

Evangelos Anagnostou

15 Feb 2019, 14:07

Hello everyone,

This is my first post here, and I have to say I have really enjoyed my first months with Stata, even though I am learning it under the pressure of a project deadline.

I’m performing a large observational, retrospective cohort analysis of road traffic accident (RTA) data that were collected prospectively by police. Total number of observations/participants: 127,535 (number of crashes: 58,955). The dependent variable is outcome (deaths vs injuries, binary). The main goal is to identify risk factors that are associated with (or predictive of) poor outcome, in order to identify potential injury prevention initiatives. (Later in the project I will surely need some guidance with the stepwise multivariable logistic regression that follows at the final stage.)

So far I have done the data processing, labelling, generation of new indicator variables, etc.

Before proceeding with basic descriptive statistics for my first tables, I have to deal with a missing data issue. Two of the variables under investigation have a large percentage of missing data: use of helmet for bike riders (around 33% missing) and use of seatbelt for 4-wheel vehicle drivers (around 46% missing). How should I continue?

I know one option is partial deletion. But if I drop the observations with missing data, I will lose statistical power and a huge amount of information on the other 40 variables. Moreover, will the results for these two variables be unbiased if I delete the observations with a missing value, given that the values do not appear to be missing at random?

One thing is certain: the data are not missing completely at random. The variable “geocode” shows that in one specific city the share of unknown values may reach 80%, while in another it may be below 10%. This is mostly due to police departments around the country taking different approaches to the standard RTA form.
What is more, missingness seems to be related to the dependent variable under investigation, “outcome” (death vs inj/no inj). Does that mean that the values are MNAR? For example, we have the following results on whether a bike rider wore a helmet at the time of an accident: Yes: 44%, No: 22%, Unknown (missing): 34%. However, if we stratify by severity, this information was missing for 36% of the bike riders who had no or light injury, and for 22% of those who were severely injured or died.
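
For reference, the missingness checks described above can be reproduced along the following lines. This is only a rough sketch: "outcome" and "geocode" are my actual variables, but helmet (coded 0/1, with . for unknown), seatbelt and biker (1 = bike rider) are assumed names and codings used for illustration.

```stata
* Assumed names/coding: helmet and seatbelt are 0/1 with . = unknown,
* biker is 1 for bike riders; outcome and geocode are as described above.

* Count missing values in the two problem variables
misstable summarize helmet seatbelt

* Indicator for a missing helmet value, defined for bike riders only
generate byte helmet_miss = missing(helmet) if biker == 1

* Share of missing helmet values by city and by outcome
tabulate geocode helmet_miss, row
tabulate outcome helmet_miss, row chi2

* Does outcome still predict missingness after adjusting for city?
logit helmet_miss i.outcome i.geocode
```

The last command is there because the association between missingness and outcome could partly reflect the differences between cities, so it seems worth looking at both together.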

How should I proceed with this? Do other techniques, such as multiple imputation or full information maximum likelihood estimation, have a role here?
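
To make the question concrete, the multiple imputation route I have in mind would look roughly like the sketch below, using Stata's mi commands. The names helmet (0/1, . = unknown) and biker (1 = bike rider) are again assumptions for illustration, and the number of imputations and the seed are arbitrary. As far as I understand, this approach rests on the missing at random assumption, so it would not remove bias if the values are truly MNAR.

```stata
* Sketch only, not a final analysis plan.
* Assumed names/coding: helmet is 0/1 with . = unknown, biker is 1 for bike riders.
preserve
keep if biker == 1

* Declare the mi storage style and the variable to be imputed
mi set mlong
mi register imputed helmet

* Impute helmet from outcome and city; augment guards against perfect prediction
mi impute logit helmet i.outcome i.geocode, add(20) rseed(2019) augment

* Fit the outcome model on each imputed dataset and pool the results
mi estimate, or: logit outcome i.helmet i.geocode
restore
```

I have included outcome in the imputation model because missingness appears to depend on it.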

Thank you for your time,

Evangelos