Dealing with missing data

Guest
#1

14 Dec 2017, 10:52

Hi,

I have a household data set (cross-sectional data) with over 36,000 observations. However, after running mdesc I've noticed that for most variables over 50% of the data is missing because family members did not respond to the survey question. I need to impute the missing values, as dropping them would mean losing the large majority of my sample. The variables I am looking at are a mix of categorical and numerical variables: for example, highest educational attainment (in categories) and wages (numerical). I am not sure whether I should use multiple imputation or mean replacement to deal with my missing data. Is there any useful code you could suggest?

Thank you for your time.
Tags: categorical, cross sectional, data, missing data, multiple imputation
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#2

14 Dec 2017, 11:40

Guest:
You seem to have unit rather than item nonresponse (see page 5 in: http://eu.wiley.com/WileyCDA/WileyTi...471655740.html).
As per your description, and assuming that the missingness of your data is not informative, -mi- might be an option (whereas imputing the mean is not; see https://missingdata.lshtm.ac.uk/).
However, 50% missing data for some variables would be difficult to deal with.
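
To illustrate why replacing missing values with the mean is not recommended, here is a minimal Python sketch (not Stata code). It assumes simulated, normally distributed wages with about half of the values deleted completely at random; the variable names and the numbers are made up for illustration only. Filling every gap with the observed mean leaves the mean unchanged but artificially shrinks the spread of the variable.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical "true" wages; in a real survey these are not fully observed.
wages = [random.gauss(30000, 8000) for _ in range(10000)]

# Assumption: roughly half of the values go missing completely at random.
observed = [w if random.random() > 0.5 else None for w in wages]

present = [w for w in observed if w is not None]
mean_obs = statistics.fmean(present)

# Mean replacement: every missing value receives the observed mean.
mean_filled = [mean_obs if w is None else w for w in observed]

sd_observed = statistics.stdev(present)
sd_filled = statistics.stdev(mean_filled)

print(f"SD of observed values:     {sd_observed:.0f}")
print(f"SD after mean replacement: {sd_filled:.0f}")
```

With about 50% of values filled in this way, the standard deviation falls to roughly 70% of its observed value, so standard errors and correlations computed from the filled variable would be distorted.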

Last edited by sladmin; 01 Feb 2018, 08:28. Reason: anonymize poster

Kind regards,
Carlo
(Stata 19.0)
Guest
#3

15 Dec 2017, 02:51

Thank you for your response, Carlo. I just wanted to double-check: the individuals in my dataset have chosen to respond to some questions and leave out others, rather than not completing the whole survey. Wouldn't that then qualify as item nonresponse rather than unit nonresponse?
Richard Williams

Join Date: Apr 2014

Posts: 5008
#4

15 Dec 2017, 03:16

Why are they missing? It helps to know. For example, some questions may have been inappropriate, e.g. you don't ask about college education for a 12 year old. A household member may have been selected at random to answer Qs. Or maybe only the Head of Household was. Or, they may have simply refused to answer. With 50% MD I am guessing that many people just weren't asked the question for some reason. If, on the other hand, 50% really were refusing to answer, I would wonder how good the survey was.

To take an extreme example, you don't fill in the mean # of pregnancies for men who didn't answer that question. Nor would you use multiple imputation.

On the other hand, MI might be more reasonable for some variables if questions were only asked of a 50% random subsample.
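
One way to start answering "why are they missing?" is to tabulate the share of missing values within groups that might explain the pattern. The following is a minimal Python sketch (not Stata code), assuming a hypothetical household-role variable and a handful of made-up rows; the function name `missing_share_by_group` is invented for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (role in household, highest education or None if missing)
rows = [
    ("head", "secondary"), ("head", "tertiary"), ("head", None),
    ("child", None), ("child", None), ("spouse", "primary"),
    ("spouse", None), ("child", None),
]

def missing_share_by_group(data):
    """Return {group: share of rows in that group with a missing value}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for group, value in data:
        total[group] += 1
        if value is None:
            missing[group] += 1
    return {g: missing[g] / total[g] for g in total}

shares = missing_share_by_group(rows)
for group, share in sorted(shares.items()):
    print(f"{group:<7} {share:.0%} missing")
```

If missingness is concentrated in one group (here, children), that suggests the question was not applicable or not asked, rather than refused.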

In one of my handouts, I say:

Stata has “soft” missing codes (coded as .) and “hard” missing codes (.a, .b, .c, …, .z). The former are eligible for imputation, the latter are not. This distinction can be useful when variables should not be imputed, e.g. “Number of times pregnant” is not applicable for men; either code it as zero or leave it as missing. Depending on the nature of the variable, you may need to change some soft codes to hard or hard codes to soft. Otherwise you may fail to impute values when you should or else impute values when you shouldn’t. As stated before, you need to understand why data are missing.
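
The logic of that distinction can be sketched in Python (this is an analogue, not Stata code). The sentinels `SOFT` and `HARD`, the helper `fill_soft_only`, and the records are all hypothetical, and the median fill is only a placeholder to show which cells get touched; it is not a recommended imputation method.

```python
import statistics

SOFT = "."    # stand-in for a soft missing code: eligible for imputation
HARD = ".a"   # stand-in for a hard missing code: not applicable, leave alone

# Hypothetical records: (sex, number of times pregnant)
records = [
    ("f", 2), ("f", SOFT), ("m", HARD), ("f", 0), ("m", HARD), ("f", 1),
]

def fill_soft_only(rows, fill_value):
    """Replace soft missing values; keep hard missing values untouched."""
    return [(sex, fill_value if v == SOFT else v) for sex, v in rows]

# Only genuinely observed values feed the placeholder fill value.
observed = [v for _, v in records if v not in (SOFT, HARD)]
placeholder = statistics.median(observed)

filled = fill_soft_only(records, placeholder)
print(filled)
```

Only the woman's unanswered value is filled in; the men's "not applicable" codes remain missing, which is the behaviour the handout describes.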

Without knowing the reasons data are missing I fear you'll adopt a blanket strategy that may be unwise.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#5

15 Dec 2017, 03:16

Guest:
If individuals in your dataset have simply skipped some questions, then yes, you have item rather than unit nonresponse.
Again, the main issue is whether or not the missingness is informative.

PS: Crossed in cyberspace with Richard's reply, which covers a set of relevant issues.

Last edited by sladmin; 01 Feb 2018, 08:29. Reason: anonymize poster

Kind regards,
Carlo
(Stata 19.0)
Guest
#6

16 Dec 2017, 04:59

Thank you both Carlo and Richard, you've been very helpful.