what to do about missing gender and ethnicity data?

Jennifer Carson

Join Date: Apr 2019

Posts: 21
#1

23 Feb 2021, 16:40

Hi all,

I am working on a demographic analysis of who applies for jobs and who is selected, and I am using Stata to analyze the data. I have missing data on gender and ethnicity (as some people preferred not to give this information); the share missing ranges from 10% to 20%. The conclusions from the analysis depend on who is in the unknown category, so there is interest in knowing more about the unknown.

I’ve been asked which groups (e.g. white females, etc) are more likely to be in the unknown category. I have a hard time answering this question as by its very nature the unknown category is UNKNOWN, so how can we know this? I’m wondering what the best way to resolve this is. Here are some possibilities:

1. Impute the missing data. This seems like the best approach though I am having a hard time convincing others of this.
For gender: Impute gender based on first name (I have seen some studies that do this)

For ethnicity: more of a problem. I am not sure how reliable it is to impute ethnicity based on last name. Perhaps do multiple imputation, although our data doesn’t have many additional fields to build a good model.

2. Review research on which groups are more or less likely to withhold their information in job applications. For example, maybe African American respondents are more likely to withhold their race on job applications. However, I’m not sure what to do with this information even if I could find a study on this. If it’s found in a study in a different context, why would this necessarily be true in my data set?

3. Apply the same proportion of male/female to the missing that is in the non-missing. For example, suppose that among those who did answer the question, 80% are men and 20% are women. Then we say that the missing consists of 80% men and 20% women. This seems flawed, as we would be assuming that men and women withhold their information at the same rate.

4. Apply the same proportion of male/female to the missing that is in the general population. Again, this seems flawed to me.

Any comments appreciated.

1 like
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#2

23 Feb 2021, 18:19

There are no good solutions to missing data. You try to find the least bad solution for your context.

Of your 4 choices, I lean towards 1. Numbers 3 and 4 are clearly bad. While there are some people who haphazardly withhold information on a question because they just didn't see it on the page, or because they got distracted and then resumed the survey at the next question, for the most part withholding demographic information is purposeful, and is informative about the missing values themselves.
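To make the problem with options 3 and 4 concrete, here is a minimal sketch in Python (plain arithmetic, so it would carry over to Stata directly). It takes the illustrative 80/20 split from #1 and an assumed 15% missing rate, and shows how the overall share of women moves as the share of women among the non-responders varies. The function name and all input values are illustrative assumptions, not estimates from any data.

```python
def overall_share(observed_share, missing_rate, share_among_missing):
    """Overall share of a group once the missing cases are filled in.

    observed_share: share of the group among those who answered
    missing_rate: fraction of all applicants with the item missing
    share_among_missing: ASSUMED share of the group among non-responders
    """
    for value in (observed_share, missing_rate, share_among_missing):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must be proportions in [0, 1]")
    return (1 - missing_rate) * observed_share + missing_rate * share_among_missing


observed_women = 0.20  # illustrative figure from post #1
missing_rate = 0.15    # assumed; the thread reports 10-20%

# Option 3 corresponds to share_among_missing == observed_women.
# The other rows show what happens if withholding is informative.
for assumed in (0.0, 0.20, 0.50, 1.0):
    share = overall_share(observed_women, missing_rate, assumed)
    print(f"women among missing = {assumed:.2f} -> overall share = {share:.3f}")
```

With these inputs the overall share can lie anywhere between 17% and 32%. Option 3 simply picks the one point in that range where the non-responders look exactly like the responders.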

1 will probably be reasonably accurate for gender. There are, of course, some given names that are gender neutral, and you might have to probabilistically impute a gender to those. There are, additionally, algorithms that propose to distinguish Hispanic from non-Hispanic ethnicity by analyzing the surname--I've never used one of these myself, but if you Google it you will find several options, some of them used regularly by government agencies.
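As a rough illustration of what "probabilistically impute" could look like, here is a small Python sketch. The lookup table `P_FEMALE` is entirely made up for the example; in practice the probabilities would have to come from a real name-frequency source, and names not in the table are left missing rather than guessed.

```python
import random

# HYPOTHETICAL P(female | first name); these numbers are invented for illustration.
P_FEMALE = {
    "maria": 0.99,
    "james": 0.01,
    "taylor": 0.55,  # stands in for a gender-neutral name
}


def impute_gender(first_name, rng, p_female=P_FEMALE):
    """Draw 'F' or 'M' with the name's probability; return None if the name is unknown."""
    p = p_female.get(first_name.strip().lower())
    if p is None:
        return None  # no information: leave the value missing
    return "F" if rng.random() < p else "M"


rng = random.Random(2021)  # fixed seed so the draws are reproducible
for name in ("Maria", "James", "Taylor", "Xyzzy"):
    print(name, impute_gender(name, rng))
```

Because the draw is random, a single imputed data set understates the uncertainty. Repeating the draw several times and comparing results across the copies is closer in spirit to multiple imputation.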

Race is harder. Names (given or family) are really not very informative for that (except for some distinctive African American female given names from certain birth cohorts in the 20th century). If you have information about place of residence at a fine-grained level, that can give you something that is reasonable at least for estimating proportions in each group, even if it's not terribly accurate at the individual level. Actually, even when you have no missing data, race is a difficult variable to work with. In longitudinal data sets you almost always find appreciable numbers of people who are designated as different races at different times. I've never really found a satisfactory solution to that and, whenever it is plausible to ignore race in analysis, I tend to avoid using race variables.

I'm sorry I'm not more helpful. And perhaps somebody else will have something better--I'd be happy to learn about other reasonable approaches if that's the case.
2 likes
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#3

24 Feb 2021, 05:34

I'm with Clyde here, this is a hard question. I know that if I were asked something like this, I would be cursing in private - statistics aren't magic. You could also consider reporting missing as its own category in whatever tabulations and regressions you do. That means recoding missing to a separate numeric value.
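Here is a minimal sketch of the recoding idea, written in Python only to show the logic (the same recode-then-tabulate step would be done in Stata on the real data). The records, field names, and the `"Missing"` label are all invented for illustration.

```python
from collections import Counter

MISSING_LABEL = "Missing"


def recode_missing(value):
    """Map None or blank strings to an explicit 'Missing' category."""
    if value is None or str(value).strip() == "":
        return MISSING_LABEL
    return value


def selection_rates(records, field):
    """Selection rate per category of `field`, with missing as its own category."""
    totals = Counter()
    selected = Counter()
    for rec in records:
        group = recode_missing(rec.get(field))
        totals[group] += 1
        if rec["selected"]:
            selected[group] += 1
    return {group: selected[group] / totals[group] for group in totals}


# Made-up applicant records, not real data.
applicants = [
    {"gender": "F", "selected": True},
    {"gender": "M", "selected": False},
    {"gender": None, "selected": True},
    {"gender": "", "selected": False},
    {"gender": "M", "selected": True},
]

for group, rate in selection_rates(applicants, "gender").items():
    print(f"{group}: {rate:.2f}")
```

This keeps every applicant in the tabulation, so a reader can at least see how large the unknown group is and whether its selection rate differs from the known groups.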

Here are some sources on imputing race using first and last name that I'm aware of. In the US, before 1980, the Social Security Administration collected voluntary race data using the categories White, Black, and other, and they assigned the category of missing to those who didn't respond. This information got transmitted to Medicare. One other limitation of these data is that Medicare would then impute a spouse's race to the principal beneficiary. The Research Triangle Institute developed first and last name lists for Hispanic and Asian/Pacific Islander names, and imputed race data for those two groups that way. Here is one report on their work. Here is one more recent peer-reviewed article on their work as well. I haven't tried searching for those name lists.

I'm not 100% sure how well this would translate to modern contexts, and I'm not sure that any list of names you came up with would be comprehensive for those two very diverse racial/ethnic groups. You also have the problem of intermarriage. I had a colleague, Mrs. Tan. Tan is the most common Southeast Asian Chinese surname (it's the equivalent of the Mandarin surname Chen). Only thing is, this person was White, and she married a Malaysian Chinese guy. For older adults, interracial marriage is less common, so this problem is likely ignorable (in a statistical sense, anyway).

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
2 likes
Andrew Lover

Join Date: Apr 2014

Posts: 182
#4

24 Feb 2021, 07:16

Agree with Weiwen; I think making it explicit as "Missing" and running your analyses is the best bet. Then perhaps run the imputations as a sensitivity analysis (and/or sanity check), but I'd be stuck on which to trust if you get very different conclusions.

(& it is a very interesting problem from a data analysis angle; but surely frustrating!)

__________________________________________________
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst
2 likes
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#5

24 Feb 2021, 07:57

This problem has been known in epidemiology for some time. Adding a "missing" or similar ad hoc category to a descriptive table is fine, but when estimating a model (be it for odds ratios, risk ratios, etc.) the ad hoc approach is known to be biased. One early example can be found in Vach, W.; Blettner, M. Biased estimation of the odds ratio in case-control studies due to the use of ad hoc methods of correcting for missing values for confounding variables. American Journal of Epidemiology, 1991, Vol. 134(8), pp. 895-907. DOI: 10.1093/oxfordjournals.aje.a116164. It seems W. Vach did a lot of work on this, and other authors such as Frank Harrell have also recommended against this approach in favour of either imputation or propensity score methods.
3 likes
Jennifer Carson

Join Date: Apr 2019

Posts: 21
#6

24 Feb 2021, 19:32

Thank you all for taking the time to respond - I am very appreciative! It's helpful to know more about the challenges and issues involved with this, and that there is no easy solution to this issue. It seems that with gender I would likely go ahead with using first name to impute, knowing that it is not a perfect approach but it will most likely be reasonably accurate.

A crazy thought on race/ethnicity. It seems there are lots of issues with imputing race/ethnicity based on last name (and I fear this would be a time-consuming exercise and not necessarily that reliable). Would a "less bad" approach be to randomly select say 50 people at a time from the unknown category, guess their race/ethnicity from a photo (assuming a photo could be obtained), and then use this to draw some conclusions about which race/ethnicity groups are more/less likely to be in the unknown category? Then randomly select another 50 people, do the same thing, and see if the proportions of each group in the unknown are similar as the first attempt. Then use this information when interpreting the descriptive data. I realize this approach also has lots of problems, as guessing race/ethnicity is not necessarily going to be accurate, and it won't give me race/ethnicity for everyone in my dataset. Or maybe going the last name approach is better.
daniel klein

Join Date: Mar 2014

Posts: 3859
#7

25 Feb 2021, 00:14

This thread is interesting but I feel it shifted slightly away from the original question in #1 that Jennifer has now brought back in:

Originally posted by Jennifer Carson:

which groups (e.g. white females, etc) are more likely to be in the unknown category

Obviously, using a separate missing category in the analyses will not help answer that question. However, since Leonardo has criticized this approach, I would like to mention that it does not necessarily lead to bias. If the missing category is truly missing, such as the ISCO (International Standard Classification of Occupations) code of an unemployed person, then including a missing category might be fine. More subtle examples are possible. Suppose Jennifer wanted to model the decision of an employer to hire someone based on their race and gender, and their application form does not include this information. In this situation, using a missing category might well represent the information that is available to the employer and, thus, might be more suitable for the research question than (multiple) imputation.

Regarding Jennifer's suggestion

Originally posted by Jennifer Carson:

Would a "less bad" approach be to randomly select say 50 people at a time from the unknown category, guess their race/ethnicity from a photo (assuming a photo could be obtained), and then use this to draw some conclusions about which race/ethnicity groups are more/less likely to be in the unknown category? Then randomly select another 50 people, do the same thing, and see if the proportions of each group in the unknown are similar as the first attempt.

it seems that the core problem remains the same: how accurately can we guess a person's race based on a photo? Aside from this crucial question, Jennifer's approach is based on the theory of simple random sampling. Therefore, there is no need for the second draw. We already know from statistical theory that the proportions in the (sub)sample will be an unbiased estimate of the proportion in the "population", where the population is the original sample of observations with missing values. Whether a (sub)sample size of 50 will provide reasonable confidence limits is up to Jennifer to decide.
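To help judge whether 50 is enough, here is a small Python sketch of a Wilson score interval for a proportion from a simple random sample. The choice of the Wilson interval and the example count (15 of 50) are my own assumptions for illustration. The interval ignores any finite population correction, and it says nothing about misclassification when guessing race from a photo.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical outcome: 15 of the 50 sampled people are assigned to one group.
low, high = wilson_interval(15, 50)
print(f"estimate = {15 / 50:.2f}, 95% interval = ({low:.3f}, {high:.3f})")
```

With 15 of 50, the interval runs from roughly 0.19 to 0.44, which gives a feel for how wide the limits are at this sample size. Pooling two draws of 50 into one sample of 100 narrows the interval; comparing the two draws against each other does not add information.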
1 like
