String variable with multiple substrings -> Dichotomous variables for each unique substring

Jonathan Afilalo

Join Date: Nov 2016

Posts: 42
#1

27 Oct 2022, 14:52

Hi, I have two string variables in my dataset that each contain multiple ICD codes delimited by ";". I'd like to:

1. Amalgamate a list of all unique ICD codes across these two string variables for all observations.

2. Generate a dichotomous variable for each unique ICD code.

3. Replace the dichotomous variable with 1 if the ICD is present in the parent string or 0 if it is absent.

For example, assuming this is my dataset:

id  icd_primary  icd_secondaries
1   I4890        J841;C3430;J90;J920
2   M4802        K100;J920

This is what I'd like to get to:

id  icd_I4890  icd_J841  icd_C3430  icd_J90  icd_J920  icd_M4802  icd_K100
1   1          1         1          1        1         0          0
2   0          0         0          0        1         1          1

Thanks!
Jonathan

Last edited by Jonathan Afilalo; 27 Oct 2022, 15:46.
Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#2

27 Oct 2022, 17:28

Let me start by recommending you not do that. There is almost no analysis in Stata for which you will need all these indicator variables. Most analyses where you would think to use them allow you to have just a single ICD code variable and use factor-variable notation to create virtual indicators ("dummies") on the fly. So, for most purposes, what you would actually be best off with is:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte id str6 icd_primary str19 icd_secondaries
1 "I4890" "J841;C3430;J90;J920"
2 "M4802" "K100;J920"
end

egen icds = concat(icd_primary icd_secondaries), punct(";")
drop icd_primary icd_secondaries
split icds, gen(icd) parse(";")
drop icds
reshape long icd, i(id) j(seq)
drop if missing(icd)
// IT IS PROBABLY BEST TO STOP HERE

But, if you really are in a situation where the layout you asked for would be better, then you can follow up the above code with:

Code:

levelsof icd, local(icds)
foreach i of local icds {
    by id (seq): egen icd_`i' = max(icd == "`i'")
}
drop icd seq
by id: keep if _n == 1

to get there.
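One caveat with the loop above: each code is pasted directly into a variable name. That works for the codes in this example, but it would fail for codes containing characters that are not legal in Stata names (a period, for instance, as in dotted ICD-10 codes). What follows is only a sketch of a more defensive variant, run end to end on the example data from #1. The call to strtoname(), the variable labels, and the closing -assert- checks are additions for illustration and are not part of the code above.

```stata
* Example data from #1
clear
input byte id str6 icd_primary str19 icd_secondaries
1 "I4890" "J841;C3430;J90;J920"
2 "M4802" "K100;J920"
end

* Long layout: one observation per id and ICD code
egen icds = concat(icd_primary icd_secondaries), punct(";")
drop icd_primary icd_secondaries
split icds, gen(icd) parse(";")
drop icds
reshape long icd, i(id) j(seq)
drop if missing(icd)

* One indicator per unique code
levelsof icd, local(icds)
foreach i of local icds {
    * strtoname() converts the string into a legal variable name
    * (illegal characters become underscores)
    local v = strtoname("icd_`i'")
    bysort id (seq): egen byte `v' = max(icd == "`i'")
    label variable `v' "ICD code `i' present"
}

* Back to one observation per id
drop icd seq
bysort id: keep if _n == 1

* Check the result against the table requested in #1
assert icd_J920 == 1
assert icd_I4890 == (id == 1)
assert icd_K100 == (id == 2)
```

Storing the indicators as byte keeps the dataset small when there are many distinct codes. The variable labels preserve the original code text in case strtoname() had to alter it.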

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
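As a brief illustration of the request above, here is a sketch of how the example in #1 could be produced with -dataex-. The variable list and the "in 1/2" range are assumptions chosen for this toy dataset; with real data you would list whichever variables matter to the question and pick a representative set of observations.

```stata
* Recreate the two-observation example from #1
clear
input byte id str6 icd_primary str19 icd_secondaries
1 "I4890" "J841;C3430;J90;J920"
2 "M4802" "K100;J920"
end

* Export only the relevant variables for the first two observations;
* the output in the Results window can be copied straight into a forum post
dataex id icd_primary icd_secondaries in 1/2
```

The output is itself a block of Stata commands (clear, input, the data lines, end), which is what lets others recreate the example exactly, including storage types such as str19.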