Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two datasets where "master" variable is within a range in "using" variable - ICD-O3

    Good morning!

    I have a dataset of cancer patients with 100,000 observations. The Master dataset has an ICD-O3 code for the type of cancer. The Using dataset (from SEER) has the Site codes and the Site Description. I want to merge the datasets so that the Site Description appears in the Master dataset. The challenge is that the Site Codes in the SEER data take the following form for several observations: “C000-C006,C008-C009”. I did split and reshape to end up with “C000-C006” and “C008-C009”, but can't figure out the next step.

    I have looked at using joinby, cross, inrange but can’t find a solution yet. I'm using Stata/SE16.1 Examples of Master, Using and Results are below. Any suggestions will be appreciated.


    MASTER dataset

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte PatientID str4 Cancer
     1 "C001"
     2 "C099"
     3 "C001"
     4 "C154"
     5 "C220"
     6 "C171"
     7 "C059"
     8 "C081"
     9 "C130"
    10 "C001"
    end
    USING dataset
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str49 Siterecode str34 SiteDescription
    "C000-C006,C008-C009"                               "LIP"                               
    "C019"                                              "BASE OF TONGUE"                    
    "C020-C024,C028-C029"                               "TONGUE (EXCL BASE OF TONGUE)"      
    "C030-C031,C039-C041,C048-C052,C058-C062,C068-C069" "GUM, FLOOR OF MOUTH, & OTHER MOUTH"
    "C079-C081,C088-C089"                               "SALIVARY GLAND"                    
    "C090-C091,C098-C104,C108-C109"                     "OROPHARNYX"                        
    "C110,C112-C113,C118-C119"                          "NASOPHARYNX (EXCL POSTERIOR WALL)" 
    "C111"                                              "POSTERIOR WALL OF NASOPHARYNX"     
    "C129-C132,C138-C139"                               "HYPOPHARYNX"                       
    "C140,C142,C148"                                    "PHARYNX"                           
    "C150-C155,C158-C159"                               "ESOPHAGUS"                         
    "C160-C166,C168-C169"                               "STOMACH"                           
    "C170-C173,C178-C179"                               "SMALL INTESTINE"                   
    "C180, C182-C189, C199"                             "LARGE INTESTINE, (EXCL. APPENDIX)" 
    "C181"                                              "APPENDIX"                          
    "C209"                                              "RECTUM"                            
    "C210-C212,C218"                                    "ANAL CANAL & ANUS"                 
    "C220"                                              "LIVER"                             
    "C221"                                              "INTRAHEPATIC BILE DUCTS"           
    end

    RESULT
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte PatientID str4 Site str34 SiteDescription
     1 "C001" "LIP"                               
     2 "C099" "OROPHARNYX"                        
     3 "C001" "LIP"                               
     4 "C154" "ESOPHAGUS"                         
     5 "C220" "LIVER"                             
     6 "C171" "SMALL INTESTINE"                   
     7 "C059" "GUM, FLOOR OF MOUTH, & OTHER MOUTH"
     8 "C081" "SALIVARY GLAND"                    
     9 "C130" "HYPOPHARYNX"                       
    10 "C001" "LIP"                               
    end




  • #2
    Kudos for looking at -joinby- and -cross-, which so many users aren't even aware of. But it's a bit more complicated than that and requires a lot of additional data management. The simplest approach is to use the -rangejoin- command, by Robert Picard, available from SSC. To use it, you will also need the -rangestat- command, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC. I will not use your nomenclature for the data sets because, for -rangejoin-, the one you call "using" actually has to play the role of the "master."

    The whole script looks like this:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str49 Siterecode str34 SiteDescription
    "C000-C006,C008-C009"                               "LIP"                               
    "C019"                                              "BASE OF TONGUE"                    
    "C020-C024,C028-C029"                               "TONGUE (EXCL BASE OF TONGUE)"      
    "C030-C031,C039-C041,C048-C052,C058-C062,C068-C069" "GUM, FLOOR OF MOUTH, & OTHER MOUTH"
    "C079-C081,C088-C089"                               "SALIVARY GLAND"                    
    "C090-C091,C098-C104,C108-C109"                     "OROPHARNYX"                        
    "C110,C112-C113,C118-C119"                          "NASOPHARYNX (EXCL POSTERIOR WALL)"
    "C111"                                              "POSTERIOR WALL OF NASOPHARYNX"     
    "C129-C132,C138-C139"                               "HYPOPHARYNX"                       
    "C140,C142,C148"                                    "PHARYNX"                           
    "C150-C155,C158-C159"                               "ESOPHAGUS"                         
    "C160-C166,C168-C169"                               "STOMACH"                           
    "C170-C173,C178-C179"                               "SMALL INTESTINE"                   
    "C180, C182-C189, C199"                             "LARGE INTESTINE, (EXCL. APPENDIX)"
    "C181"                                              "APPENDIX"                          
    "C209"                                              "RECTUM"                            
    "C210-C212,C218"                                    "ANAL CANAL & ANUS"                 
    "C220"                                              "LIVER"                             
    "C221"                                              "INTRAHEPATIC BILE DUCTS"           
    end
    tempfile site_codes
    save `site_codes'
    
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte PatientID str4 Cancer
     1 "C001"
     2 "C099"
     3 "C001"
     4 "C154"
     5 "C220"
     6 "C171"
     7 "C059"
     8 "C081"
     9 "C130"
    10 "C001"
    end
    tempfile patient_data
    save `patient_data'
    
    use `patient_data'
    destring Cancer, gen(cancer_code) ignore("C")
    tempfile holding
    save `holding'
    
    use `site_codes', clear
    split Siterecode, parse(",") gen(site_codes)
    gen `c(obs_t)' obs_no = _n
    reshape long site_codes, i(obs_no)
    drop if missing(site_codes)
    drop _j
    split site_codes, parse("-") gen(bound)
    drop site_codes
    destring bound*, ignore("C") replace
    replace bound2 = bound1 if missing(bound2)
    
    rangejoin cancer_code bound1 bound2 using `holding'
    drop if missing(PatientID)
    keep PatientID Cancer SiteDescription

    Comment


    • #3
      Thank you so much Clyde Schechter. This worked brilliantly!

      Comment

      Working...
      X