  • String variable with multiple substrings -> Dichotomous variables for each unique substring

    Hi, I have two string variables in my dataset that each contain multiple ICD codes delimited by ";". I'd like to:
    1. Amalgamate a list of all unique ICD codes across these two string variables for all observations.
    2. Generate a dichotomous variable for each unique ICD code.
    3. Replace the dichotomous variable with 1 if the ICD is present in the parent string or 0 if it is absent.
    For example, assuming this is my dataset:
    id icd_primary icd_secondaries
    1 I4890 J841;C3430;J90;J920
    2 M4802 K100;J920

    This is what I'd like to get to:
    id icd_I4890 icd_J841 icd_C3430 icd_J90 icd_J920 icd_M4802 icd_K100
    1 1 1 1 1 1 0 0
    2 0 0 0 0 1 1 1

    Thanks!
    Jonathan
    Last edited by Jonathan Afilalo; 27 Oct 2022, 15:46.

  • #2
    Let me start by recommending you not do that. There is almost no analysis in Stata for which you will need all these indicator variables. Most analyses where you would think to use them allow you to have just a single icd code variable and use factor-variable notation to create virtual indicators ("dummies") on the fly. So, for most purposes, what you would actually be best off with is:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte id str6 icd_primary str19 icd_secondaries
    1 "I4890" "J841;C3430;J90;J920"
    2 "M4802" "K100;J920"          
    end
    
    egen icds = concat(icd_primary icd_secondaries), punct(";")
    drop icd_primary icd_secondaries
    split icds, gen(icd) parse(";")
    drop icds
    reshape long icd, i(id) j(seq)
    drop if missing(icd) // IT IS PROBABLY BEST TO STOP HERE
    But, if you really are in a situation where the layout you asked for would be better, then you can follow up the above code with:
    Code:
    levelsof icd, local(icds)
    foreach i of local icds {
        by id (seq): egen icd_`i' = max(icd == "`i'")
    }
    drop icd seq
    by id: keep if _n == 1
    to get there.
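
    As a rough illustration of the factor-variable route recommended above, the following sketch rebuilds the long layout from the first code block and then lets Stata create the indicators on the fly. The variable name icd_code is my own choice, not something from the original post, and -summarize- merely stands in for whatever command your analysis actually needs.
    Code:
    * Sketch only: same example data and long layout as the first code block
    clear
    input byte id str6 icd_primary str19 icd_secondaries
    1 "I4890" "J841;C3430;J90;J920"
    2 "M4802" "K100;J920"
    end

    egen icds = concat(icd_primary icd_secondaries), punct(";")
    drop icd_primary icd_secondaries
    split icds, gen(icd) parse(";")
    drop icds
    reshape long icd, i(id) j(seq)
    drop if missing(icd)

    * Factor-variable notation requires a numeric variable, so encode the
    * string codes first (icd_code is a hypothetical name)
    encode icd, generate(icd_code)

    * i.icd_code expands to virtual indicators; no icd_* variables are stored.
    * The base level is omitted by default.
    summarize i.icd_code
    Nothing is added to the dataset beyond icd_code itself, which is the main attraction when there are many distinct ICD codes.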

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
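
    For instance, a minimal sketch of how that might look for the data in this thread, assuming the original dataset is in memory and the variables are named as in post #1:
    Code:
    * Only needed if -dataex- is not already part of your installation
    ssc install dataex

    * Produces a block of -input- code that can be pasted into a forum post
    dataex id icd_primary icd_secondaries
    The output is what appears at the top of the first code block above.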
