Converting Composite Categorical Variables to Binary Variables

Ben Kaplow

Join Date: Oct 2021

Posts: 2
#1

02 Oct 2021, 12:18

Hi all,

I'm new to Stata and am running into trouble with setting up a dataset for use.

The issue is that I have a composite categorical variable, with multiple entries per observation that are separated by commas (e.g., "A,B,C"). Each entry is a single word, but as answers were written in, there are a great many different entries. My goal is to convert this variable into a series of binary variables, each taking the name of an entry, and taking a value of 1 if the particular word was present in the original variable.

I have tried following the advice given for a similar topic here, but have been unable to successfully adapt it to my case.

Any advice would be greatly appreciated!

Best,
Ben
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

02 Oct 2021, 12:30

If you know in advance that (e.g.) A B C D E are the possible answers then

Code:

foreach v in A B C D E {
    gen `v' = strpos(whatever, "`v'") > 0
}

yields 0, 1 indicator variables for the occurrence of each of those answers. More complicated real cases can be ... more complicated, so if that recipe doesn't work, we will need a realistic data example that shows the complications.
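One complication worth flagging: strpos() looks for a substring anywhere in the string, so a short answer can match inside a longer one. A minimal sketch of one way around that, assuming entries are separated by commas with no spaces, is to pad both the variable and the search term with commas. The toy data and the entry "neuroanatomy" are made up for illustration; whatever is the placeholder variable name from the loop above.

```stata
* Illustrative toy data, not from a real dataset
clear
input str60 whatever
"anatomy,physiology"
"neuroanatomy,cells"
end

foreach v in anatomy cells {
    * Pad with commas so that only whole entries match:
    * ",anatomy," is found in ",anatomy,physiology," but not in ",neuroanatomy,cells,"
    gen byte `v' = strpos("," + whatever + ",", ",`v',") > 0
}

list
```

Without the padding, the second observation would be flagged for anatomy as well.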
Ben Kaplow

Join Date: Oct 2021

Posts: 2
#3

03 Oct 2021, 08:24

Thanks for your advice! Unfortunately, there are too many possible answers for me to implement that strategy, as the answers were inductively coded, and this will probably be more complicated... I've copied below a few instances of the variable in question, scicon. To simplify the process, I've removed all white space in the dataset, so all instances should be single words and delimited simply by commas.

Code:

ID  Scicon
2   "arthritis,fibromyalgia,anatomy,physiology,pathology"
3   "cells,digestion,bacteria,lymphaticsystem"
4   "fascia,organs,anatomy,physiology,molecularstructure"
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#4

03 Oct 2021, 08:39

Whitespace is the least of your problems as this comes down to which categories you care about and how they are written in plain text.

Are your keywords clean enough that you don't have to worry about alternative spellings, typos, synonyms, or further aggregating/grouping of like terms? This is your most manual step because you need to make these changes ad hoc.
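One way to judge how clean the keywords are is to list every distinct entry with its frequency and read through the result for near-duplicates. The following is only a sketch under the assumptions stated in #3 (single-word entries, comma-delimited, no white space); the variable names kw, pos and n are chosen here for illustration.

```stata
* The three example observations from #3
clear
input byte id str80 scicon
2 "arthritis,fibromyalgia,anatomy,physiology,pathology"
3 "cells,digestion,bacteria,lymphaticsystem"
4 "fascia,organs,anatomy,physiology,molecularstructure"
end

* One new string variable per entry: kw1, kw2, ...
split scicon, parse(",") gen(kw)
keep id kw*

* Long layout: one row per (observation, entry)
reshape long kw, i(id) j(pos)
drop if kw == ""

* Collapse to one row per distinct keyword, with its frequency in n
contract kw, freq(n)
gsort -n kw
list, noobs
```

This replaces the data in memory with the keyword list, so on a real dataset you would probably want to wrap the steps after the data are loaded in preserve and restore.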

For flagging keywords, there are minor variations on the helpful strategy suggested by Nick, but they would still need to be applied one keyword at a time. Are you really concerned with all keywords, or just some subset that you can pre-specify? If you can pre-specify this list (e.g., in code or in an Excel file), you need only do it once, and it then becomes feasible to write a program or loop that flags each one.

This is just a long way of trying to get you to narrow your question down further, if possible, in the hope of suggesting more refined coding strategies.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#5

03 Oct 2021, 09:18

Leonardo Guizzetti says most of what I would have said, and more beyond.

See also split and indeed tabsplit within tab_chi at SSC.
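For what it is worth, here is a rough sketch of how split could be combined with levelsof to create an indicator for every distinct keyword without typing the list by hand. It assumes every keyword is a legal Stata variable name (starts with a letter or underscore, at most 32 characters, not a reserved word) and does not clash with an existing variable such as id or scicon; real write-in answers may well break those assumptions and would need cleaning first. The variable names kw and pos are illustrative.

```stata
* The three example observations from #3
clear
input byte id str80 scicon
2 "arthritis,fibromyalgia,anatomy,physiology,pathology"
3 "cells,digestion,bacteria,lymphaticsystem"
4 "fascia,organs,anatomy,physiology,molecularstructure"
end

* Step 1: collect the distinct keywords in a local macro
preserve
split scicon, parse(",") gen(kw)
keep id kw*
reshape long kw, i(id) j(pos)
drop if kw == ""
levelsof kw, local(keywords) clean
restore

* Step 2: one 0/1 indicator per keyword, matching whole entries only
foreach k of local keywords {
    gen byte `k' = strpos("," + scicon + ",", ",`k',") > 0
}

list id anatomy cells organs, noobs
```

The local macro survives restore, so the original data come back unchanged before the indicators are added. With a very large number of distinct keywords this yields a very wide dataset, which is another reason to settle on the keywords of interest first.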