Need help combining my race variables into one categorical variable

Leslie Miller

Join Date: May 2019

Posts: 13
#1

19 Aug 2020, 09:56

Hello,
I need help combining my race variables into one categorical variable. The eight answer choices were imported into Stata from Qualtrics as separate variables (the attached PDF shows what this looks like in Stata).

In the PDF I've outlined, in order, how I've approached this in Stata and the problem with the resulting variable.

I think the race categories came in as eight separate variables because it was a multiple-response question. Please let me know if you need more information than what I have outlined in the PDF.

Any guidance is appreciated!
Attached Files

Help.pdf (89.2 KB, 1 view)
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

19 Aug 2020, 10:20

Well, race/ethnicity data is always a colossal mess, all the more so when, as in your case, US Census classification is used. (Well, you don't have exactly the US Census categories, but it's close.)

If everybody selected only a single category, the following code would work:

Code:

label define race_ethnicity 1 "Hispanic White" ///
    2 "Hispanic Black" ///
    3 "Amer Indian/Alaska" ///
    4 "Black/African Amer" ///
    9 "Asian" ///
    5 "Asian American" ///
    6 "White" ///
    7 "Native Hawaiian/Pacific" ///
    8 "Other race"

gen race_ethnicity:race_ethnicity = .
forvalues i = 1/9 {
    replace race_ethnicity = `i' if RACE_ETHNICITY_`i' == 1
}

In fact, there are usually multiracial people in any sufficiently large survey and they may well select multiple categories. That makes creating a single race/ethnicity variable rather dicey. The above code will simply select the highest-numbered choice they select. That is arbitrary and probably unwise. Nevertheless, this will typically cover a substantial majority of the cases and can serve as a starting point. You may want to supplement that code with some -replace race_ethnicity = …- commands that deal specifically with certain combinations of responses to the original variables.
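One possible way to handle multiple selections explicitly, instead of letting the highest-numbered choice win, is sketched below. It is only an illustration under assumptions: the toy data are made up, only three of the indicators are used for brevity, and the code 10 with the label "Multiple categories" is a hypothetical choice, not something in the original data.

```stata
* Sketch only: toy data with three of the indicator variables.
clear
input byte(RACE_ETHNICITY_1 RACE_ETHNICITY_2 RACE_ETHNICITY_3)
1 0 0
0 1 0
1 0 1
0 0 0
end

* Count how many categories each respondent selected
* (rowtotal treats missing indicators as 0).
egen byte n_selected = rowtotal(RACE_ETHNICITY_1-RACE_ETHNICITY_3)

* Assign a category only when exactly one was selected.
gen byte race_ethnicity = .
forvalues i = 1/3 {
    replace race_ethnicity = `i' if RACE_ETHNICITY_`i' == 1 & n_selected == 1
}

* Hypothetical extra code for respondents who selected more than one category.
replace race_ethnicity = 10 if n_selected > 1

label define race_demo 1 "Hispanic White" 2 "Hispanic Black" ///
    3 "Amer Indian/Alaska" 10 "Multiple categories"
label values race_ethnicity race_demo

* Cross-check the result against the number of selections.
tab race_ethnicity n_selected, missing
```

Respondents who selected nothing stay missing, so the cross-tabulation shows at a glance how many cases still need a decision.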

As an aside, unless you are dealing with a very broad-based survey of the United States (and perhaps even then) it is likely that, in fact, there are 4 or 5 categories that account for nearly all respondents, and the remaining categories have too few selections to provide a meaningful subset for statistical analysis. Then, unless the focus of your study is specifically around racial and ethnic differences, it probably makes sense to combine all those smaller categories into a single "Miscellaneous" category for purposes of analysis.
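As a rough sketch of that kind of collapsing, -recode- with the generate() option can build the combined variable and its value label in one step. Which categories are sparse depends entirely on your data; treating 3, 7 and 8 as the small ones here is purely an assumption for the toy example, as is the variable name race_collapsed.

```stata
* Sketch only: toy values of the combined variable.
clear
input byte race_ethnicity
1
4
6
3
7
8
6
end

* Assumption: categories 3, 7 and 8 are the sparse ones in this example.
recode race_ethnicity (3 7 8 = 8 "Miscellaneous") ///
    (1 = 1 "Hispanic White")                      ///
    (4 = 4 "Black/African Amer")                  ///
    (6 = 6 "White"),                              ///
    generate(race_collapsed)

tab race_collapsed
```

Keeping the original variable and generating a new one makes it easy to undo the grouping later if the analysis calls for finer categories.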

Finally, you may want to think about how this race/ethnicity data was collected. Since it's from a Qualtrics survey, the most likely thing is that it is self-reported, which is considered the best practice. But if the data was extracted from, say, medical records, these classifications may have been off-the-cuff impressions by care providers or staff. In that situation the quality of the data is typically poor, and you should probably avoid using the variable altogether unless absolutely necessary.

I should note that with the US Census category system, even self-reported data tends to be low-quality. The problem is that the system treats ethnicity and race as orthogonal dimensions. But most people don't think of it that way, so they find the categories confusing and often pick the wrong ones.

And I can also tell you from experience with longitudinal data sets that even with self-reporting, consistency of race/ethnicity classification of the same individual over time is mediocre to poor.

Although I am something of an outlier, I regard this kind of data with extreme suspicion. Because it is conventional to report frequency distributions of race and ethnicity, I do that. But I don't take those numbers very seriously. And I do not use these variables in analysis unless it is absolutely essential for achieving the research goals--the quality of this data is usually just too poor.
Leslie Miller

Join Date: May 2019

Posts: 13
#3

19 Aug 2020, 10:54

Thank you very much for your response! I will try this and yeah, I'd have to do the replace coding, makes sense. Thank you!
Leslie Miller

Join Date: May 2019

Posts: 13
#4

19 Aug 2020, 11:10

Excellent. This worked so well. I can't explain how grateful I am for your help!
Ingrid Zambrano

Join Date: Sep 2021

Posts: 12
#5

21 Sep 2021, 17:02

Good afternoon,

I have a similar question to the one above, but the variables are related to insurance, and participants could choose multiple answers. The variables are: private, medicare, medicaid, military, ihs, incarcerated, other, uninsured, unknown. Some participants chose, for instance, private and medicare. How can I combine these variables into one called insurance and identify not only the number of people who chose more than one category, but also which categories they chose?

I would truly appreciate any assistance! Thank you in advance!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

21 Sep 2021, 18:08

-help egen- and look at the -group()- function.
Ingrid Zambrano

Join Date: Sep 2021

Posts: 12
#7

22 Sep 2021, 14:38

Thank you so much, Dr. Schechter. I just tried what you mentioned, but I'm still not quite sure what I should do. I searched for the -group()- function and this is what I'm seeing.
Attached Files
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#8

22 Sep 2021, 14:56

Clyde's recommendation to consult -help egen- apparently slipped by here. The relevant -group()- is a function of the -egen- command, so you need to look at the help for -egen-. You don't need or want -search- here. Enter -help egen- (without the hyphens) in the Stata command window, then scroll through that help window to find group(varlist).
Ingrid Zambrano

Join Date: Sep 2021

Posts: 12
#9

22 Sep 2021, 17:06

Thank you, Dr. Lacy, for the clarification. I have done what you mentioned and tried the syntax below; however, I keep getting an error that reads "command group is unrecognized".
Would you mind pointing me in the right direction, please?

Attached Files
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#10

22 Sep 2021, 17:16

Code:

egen wanted_variable = group(private medicare medicaid military ihs prison othinsur noinsur unkinsur), missing label lname(insurance) truncate(3)

Note also the absence of square brackets around the options.
Ingrid Zambrano

Join Date: Sep 2021

Posts: 12
#11

22 Sep 2021, 18:16

Thank you very very much! This was extremely helpful! I just have one last question, do you have any recommendations on how I can label the categories other than manually labeling them one by one? For instance, I know that the last row has 30 participants who had a combination of private, medicare, & medicaid, but is there a way I can label all the categories at once?
Attached Files

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

#12

22 Sep 2021, 18:36

Since this requires more than just a one-line answer, I've made a toy data set to illustrate the code.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(private medicare medicaid military ihs prison otherinsur noinsur unkinsur)
0 0 0 0 0 1 0 0 0
1 1 0 0 0 0 1 0 0
0 0 0 1 0 0 0 0 0
0 0 0 0 1 0 1 0 0
1 0 0 0 0 1 0 0 1
0 0 1 1 0 0 0 0 0
0 0 0 1 0 0 1 0 0
0 0 0 0 1 0 0 0 0
1 0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
1 0 0 0 0 1 0 0 0
0 0 1 0 0 0 1 1 0
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1 1
0 0 0 1 1 0 0 0 0
1 0 1 0 0 0 1 0 0
1 1 0 0 0 0 0 0 1
1 0 1 0 0 1 1 0 0
0 0 0 0 1 1 0 1 0
1 0 0 0 0 1 0 0 0
0 0 0 1 0 1 0 0 0
0 1 0 0 0 0 1 0 1
0 0 0 0 0 1 0 0 0
1 1 0 1 1 0 0 0 0
0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0
1 1 0 1 0 0 1 1 0
1 1 1 0 0 0 1 1 0
0 0 0 1 0 0 0 0 0
end

foreach v of varlist private-unkinsur {
    label define `v'    0   "" 1 `"`=substr(`"`v'"', 1, 3)'"'
    label values `v' `v'
    decode `v', gen(_`v')
    
}

egen insurance = group(_private-_unkinsur), label missing
drop _*

tab insurance

It's a bit of a kludge. The problem is that the -label- option uses the values of the variables, which, for your original variables are numbers. So I get around that by creating a new string variable that contains the first three letters of the variable name when the original variable is 1, and is missing otherwise. Then I apply -egen, group()- to those variables. It's not a perfect solution: medicare and medicaid both have the same three letters. I could overcome that by going to 7 letters, but then the labels would be too long to display properly. You might consider renaming the variables medicare and medicaid to care and caid, respectively so that three letters will be distinctive. I've seen that done elsewhere.
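For what it's worth, the renaming idea might look like the sketch below. It reuses the same loop on a made-up three-variable toy data set; the new names care and caid are just the ones suggested above, and nothing here has been checked against your actual variables.

```stata
* Sketch only: three of the insurance indicators, toy data.
clear
input byte(private medicare medicaid)
1 1 0
1 0 1
0 1 0
0 0 1
1 0 0
end

* Rename so that the first three letters of each name are distinct.
rename (medicare medicaid) (care caid)

* Same approach as above: label 1 with the first three letters of the name,
* decode to a string, then group on the strings.
foreach v of varlist private-caid {
    label define `v' 0 "" 1 `"`=substr(`"`v'"', 1, 3)'"'
    label values `v' `v'
    decode `v', gen(_`v')
}

egen insurance = group(_private-_caid), label missing
drop _*

tab insurance
```

With the rename in place, the two Medicare/Medicaid combinations get the distinct tags "car" and "cai" in the group labels.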


Ingrid Zambrano

Join Date: Sep 2021

Posts: 12
#13

22 Sep 2021, 19:35

Wow! Thank you so much! I just tried it. I kept getting a "variable already defined" error, so I added a "1" at the end of every variable name in the parentheses after the input float command, and it worked. I double-checked the numbers and they seem correct! Yay!
Attached Files
Ingrid Zambrano

Join Date: Sep 2021

Posts: 12
#14

29 Sep 2021, 18:53

Good evening,

I have a follow-up question and was wondering if I could please get some guidance. I would like to recode the insurance variable above into 3 categories: 1) Has Insurance, 2) No Insurance, 3) Unknown. The "Has Insurance" category would include private, medicare, medicaid, military, ihs, prison, and otherinsur. Additionally, the "Unknown" category would include the missing values. Any assistance would be extremely helpful! Thank you in advance!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#15

30 Sep 2021, 08:39

From a coding perspective this is simple, but from the perspective of actually defining these three categories, it is very complicated. So, the basic code would be

Code:

egen picked_insurance = rowtotal(private-otherinsur), miss

label define insurance_class 1 "Has Insurance" ///
    2 "No Insurance" ///
    3 "Unknown"

gen byte insurance_class:insurance_class = 1 if inrange(picked_insurance, 1, .)
replace insurance_class = 2 if noinsur == 1 & inlist(picked_insurance, 0, .)
replace insurance_class = 3 if unkinsur == 1

The problem with this is that in most real world situations, there will be plenty of observations with contradictory information. For example, the person may have selected one of the named insurances (private, medicare, medicaid, military, ihs, prison, otherinsur) but also checked noinsur or unkinsur. Or there may be people who skipped all those items so everything is missing values. Just how you want to classify those situations is up to you and depends on how you plan to use this variable later. I'll leave it to you to puzzle out this substantive question--modifying the code to accommodate those details shouldn't be hard.
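Before deciding how to classify those cases, it may help to see how many there are. The sketch below flags them on a made-up toy data set with only two named insurances; the flag names contradictory and all_skipped are hypothetical, and how you then resolve the flagged observations remains a substantive choice.

```stata
* Sketch only: two named insurances plus the "none" and "unknown" items.
clear
input byte(private medicare noinsur unkinsur)
1 0 0 0
0 0 1 0
1 0 1 0
0 0 0 1
. . . .
end

* Number of named insurances picked; missing if all of them are missing.
egen picked_insurance = rowtotal(private-medicare), missing

* Picked at least one named insurance but also checked "none" or "unknown".
gen byte contradictory = inrange(picked_insurance, 1, .) & ///
    (noinsur == 1 | unkinsur == 1)

* Skipped every item.
gen byte all_skipped = missing(picked_insurance) & ///
    missing(noinsur) & missing(unkinsur)

* Inspect the cases that need a decision.
list if contradictory | all_skipped
```

Once you have settled on rules for these cases, they can be added as further -replace insurance_class = ...- lines after the basic code.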
