  • Help with fuzzy matching

    Hi all,

    I have a dataset of electronic medical records, and I am trying to categorize diagnoses ("dx1") by body system ("system").
    The problem is that there are misspellings and variants of certain words.
    I am looking for a way to assign all variants of a string to one category, e.g., if the string contains some variant of "mammary" (e.g., "mammery", "mamery", "mamary"), assign it a system value of 8 (based on the defined labels below).
    Right now I am doing it manually, listing every known variant (e.g., replace system=8 if strpos(dx1, "mammary")|strpos(dx1, "mammery")|strpos(dx1, "mamary") ), but this is taking far too long and it is a very large dataset!
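
    For reference, the manual approach above can be condensed somewhat with a regular expression. The following is only a sketch: the five rows are an illustrative subset (the "mamery mass" row is made up), and the pattern covers only the four spellings named above, so it does not solve the general problem of unanticipated misspellings.

    ```stata
    * Sketch only: a small illustrative subset, not the full dataset
    clear
    input str40 dx1
    "adenocarcinoma mammary gland"
    "adenocarcinoma mammery gland"
    "bilateral high-grade mamary gland"
    "mamery mass"
    "mesenteric/jejunal lymphadenopathy"
    end
    generate byte system = .

    * One pattern covers mammary, mammery, mamary, and mamery:
    * "m+" matches one or more m's, "[ae]" matches either vowel
    replace system = 8 if regexm(dx1, "mam+[ae]ry")

    list, clean noobs
    ```

    Each new misspelling still has to be anticipated in the pattern, which is why a similarity-based method is attractive here.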

    I've tried using matchit, but I can only find documentation on using that to merge two datasets. Here, my strings are all part of one variable within one dataset.



    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str517 dx1 float system
    "abdominal masses cranial to right nipple (13.4x14.8mm, 7.6x8.1mm, 8.8x8.8mm) - r/o fibroadenoma, fibrosarcoma, lipoma, mammary carcinoma"                                .
    "adenocarcinoma mammary gland"                                                                                                                                            .
    "adenocarcinoma mammary gland e"                                                                                                                                          .
    "adenocarcinoma mammery gland"                                                                                                                                            .
    "adenocarnicoma of the mammary glands"                                                                                                                                    .
    "adenoma mammary gland benign"                                                                                                                                            .
    "adenoma mammary gland benign (multiple)"                                                                                                                                 .
    "adenoma mammary gland benign (removed 5/15/13)"                                                                                                                          .
    "adenosquamous carcinoma mammary gland"                                                                                                                                   .
    "adenosquamous carcinoma, left fourth mammary gland (surgical excision 1/24/14, scar revision 2/7/14)"                                                                    .
    "benign adenoma--mammary gland"                                                                                                                                           .
    "benign mammary masses (adenoma)-surgically excised on 2/9/15"                                                                                                            .
    "benign mammary tumors (adenomas)"                                                                                                                                        .
    "bilateral high grade mammary adenocarcinoma"                                                                                                                             .
    "bilateral high-grade mamary gland adenocarcinoma"                                                                                                                       .
    "bleeding from mass suspected to be mammary gland adenocarcinoma"                                                                                                         .
    "complex adenoma with atypia - right caudal/inguinal mammary gland"                                                                                                       .
    "incision recheck - mammary mass excision l3, low-grade adenocarcinoma completely excised"                                                                                .
    "inflammatory mammery adenosquamous carcinoma (l4, excised)"                                                                                                              .
    "inguinal mass - r/o mammary adenocarcinoma"                                                                                                                              .
    "intraductal mammary papillary adenocarcinoma (r4)- excised 10/16/13"                                                                                                     .
    "mammary adenocarcinoma (right 5th gland) with lymph node involvement"                                                                                                    .
    "mammary adenocarcinoma (right fifth mammary gland)"                                                                                                                      .
    "mammary adenocarcinoma - right axillary gland (excised 11/2010)"                                                                                                         .
    "mammary adenocarcinoma, grade 2"                                                                                                                                         .
    "mammary adenocarinoma with metastasis to axillary lymph node"                                                                                                            .
    "mammary adenoma (excised)"                                                                                                                                               .
    "mammary adenosquamous carcinoma (l4, excised), with recurrence mammary nodules in r2,3,4 and l2"                                                                         .
    "mammary cystadenocarcinoma"                                                                                                                                              .
    "mammary gland adenocarcinoma - surgically excised 3/2012"                                                                                                                .
    "mammary gland adenoma - left 4th gland, surgically removed 02/21/17"                                                                                                     .
    "mammary gland fibroadenoma - right caudal mammary chain"                                                                                                                 .
    "mammary glands enlargement - likely mammary fibroadenomatous hyperplasia"                                                                                                .
    "mammary intraductal papillary adenoma (multiple) - l4 mammary gland - regional mastectomy 09/03/2014"                                                                    .
    "mammary mass (likely adenocarcinoma)"                                                                                                                                    .
    "mammary masses -- suspect adenocarcinoma"                                                                                                                                .
    "mammary masses-suspect mammary adenocarcinoma recurrence"                                                                                                                .
    "mammary tumor: adenoma with an area with more malignant appearance.  clean margins."                                                                                     .
    "mammary tumors- adenomas and carcinoma"                                                                                                                                  .
    "mass along right caudal mammary chain (excised - adenoma)"                                                                                                               .
    "mesenteric/jejunal lymphadenopathy"                                                                                                                                      .
    "multiple mammary adenomas - completely excised 12/1"                                                                                                                     .
    "papillary mammary adenocarcinoma, glands 3 and 4 on the right side"                                                                                                      .
    "post-op mammary adenoma and anal sac adenocarcinoma removal (3/14/12)"                                                                                                   .
    "primary mammary adenocarcinoma"                                                                                                                                          .
    "right maxillary mammary adenoma - removed 10/23/13"                                                                                                                      .
    "simple mammary adenocarcinoma"                                                                                                                                           .
    end
    ------------------ copy up to and including the previous line ------------------

    label define System 1 "Oropharyngeal/nasal" 2 "Ocular" 3 "Aural" 4 "Respiratory" 5 "Cardiovascular/Hemotological" 6 "Gastrointestinal" 7 "Hepatobiliary" 8 "Urogenital" 9 "Musculoskeletal" 10 "Integument" 11 "Neurological" 12 "Behavior" 13 "Other" 14 "Healthy"




  • #2
    Here is sample code showing how I would begin an attempt to use the excellent matchit command from Julio Raffo to accomplish what you want. For my work to present properly on Statalist, I shortened the dx1 values to a maximum of 60 characters; you would not do that in your work.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte syst str30 keyword
     8 "mammary"         
    42 "spare tire"
    end
    generate idkey = _n
    tempfile codes
    save `codes'
    
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte system str136 dx1
    . "abdominal masses cranial to right nipple (13.4x14.8mm, 7.6x8"
    . "adenocarcinoma mammary gland"                                
    . "adenocarcinoma mammary gland e"                              
    . "adenocarcinoma mammery gland"                                
    . "adenocarnicoma of the mammary glands"                        
    . "adenoma mammary gland benign"                                
    . "adenoma mammary gland benign (multiple)"                     
    . "adenoma mammary gland benign (removed 5/15/13)"              
    . "adenosquamous carcinoma mammary gland"                       
    . "adenosquamous carcinoma, left fourth mammary gland (surgical"
    . "benign adenoma--mammary gland"                               
    . "benign mammary masses (adenoma)-surgically excised on 2/9/15"
    . "benign mammary tumors (adenomas)"                            
    . "bilateral high grade mammary adenocarcinoma"                 
    . "bilateral high-grade mamary gland adenocarcinoma"            
    . "bleeding from mass suspected to be mammary gland adenocarcin"
    . "complex adenoma with atypia - right caudal/inguinal mammary "
    . "incision recheck - mammary mass excision l3, low-grade adeno"
    . "inflammatory mammery adenosquamous carcinoma (l4, excised)"  
    . "inguinal mass - r/o mammary adenocarcinoma"                  
    . "intraductal mammary papillary adenocarcinoma (r4)- excised 1"
    . "mammary adenocarcinoma (right 5th gland) with lymph node inv"
    . "mammery adenocarcinoma (right fifth mammary gland)"          
    . "mamary adenocarcinoma - right axillary gland (excised 11/201"
    . "mammary adenocarcinoma, grade 2"                             
    . "mammary adenocarinoma with metastasis to axillary lymph node"
    . "mammary adenoma (excised)"                                   
    . "mammary adenosquamous carcinoma (l4, excised), with recurren"
    . "mammary cystadenocarcinoma"                                  
    . "mammary gland adenocarcinoma - surgically excised 3/2012"    
    . "mammary gland adenoma - left 4th gland, surgically removed 0"
    . "mammary gland fibroadenoma - right caudal mammary chain"     
    . "mammary glands enlargement - likely mammary fibroadenomatous"
    . "mammary intraductal papillary adenoma (multiple) - l4 mammar"
    . "mammary mass (likely adenocarcinoma)"                        
    . "mammary masses -- suspect adenocarcinoma"                    
    . "mammary masses-suspect mammary adenocarcinoma recurrence"    
    . "mammary tumor: adenoma with an area with more malignant appe"
    . "mammary tumors- adenomas and carcinoma"                      
    . "mass along right caudal mammary chain (excised - adenoma)"   
    . "mesenteric/jejunal lymphadenopathy"                          
    . "multiple mammary adenomas - completely excised 12/1"         
    . "papillary mammary adenocarcinoma, glands 3 and 4 on the righ"
    . "post-op mammary adenoma and anal sac adenocarcinoma removal "
    . "primary mammary adenocarcinoma"                              
    . "right maxillary mammary adenoma - removed 10/23/13"          
    . "simple mammary adenocarcinoma"                               
    end
    
    generate id = _n
    tempfile master
    save `master'
    
    matchit id dx1 using `codes', idusing(idkey) txtusing(keyword) score(minsimple)
    drop dx1 keyword
    merge m:1 idkey using `codes'
    drop if _merge==2
    drop _merge
    merge m:1 id using `master'
    format dx1 %-60s
    format similscore %9.4f
    gsort id -similscore
    list id similscore syst dx1, clean noobs
    Code:
    . list id similscore syst dx1, clean noobs
    
        id   simils~e   syst   dx1                                                           
         1          .      .   abdominal masses cranial to right nipple (13.4x14.8mm, 7.6x8  
         2     1.0000      8   adenocarcinoma mammary gland                                  
         3     1.0000      8   adenocarcinoma mammary gland e                                
         4     1.0000      8   adenocarcinoma mammery gland                                  
         5     1.0000      8   adenocarnicoma of the mammary glands                          
         6     1.0000      8   adenoma mammary gland benign                                  
         7     1.0000      8   adenoma mammary gland benign (multiple)                       
         8     1.0000      8   adenoma mammary gland benign (removed 5/15/13)                
         9     1.0000      8   adenosquamous carcinoma mammary gland                         
        10     1.0000      8   adenosquamous carcinoma, left fourth mammary gland (surgical  
        11     1.0000      8   benign adenoma--mammary gland                                 
        12     1.0000      8   benign mammary masses (adenoma)-surgically excised on 2/9/15  
        13     1.0000      8   benign mammary tumors (adenomas)                              
        14     1.0000      8   bilateral high grade mammary adenocarcinoma                   
        15     1.0000      8   bilateral high-grade mamary gland adenocarcinoma              
        16     1.0000      8   bleeding from mass suspected to be mammary gland adenocarcin  
        17     1.0000      8   complex adenoma with atypia - right caudal/inguinal mammary   
        18     1.0000      8   incision recheck - mammary mass excision l3, low-grade adeno  
        19     1.0000      8   inflammatory mammery adenosquamous carcinoma (l4, excised)    
        20     1.0000      8   inguinal mass - r/o mammary adenocarcinoma                    
        21     1.0000      8   intraductal mammary papillary adenocarcinoma (r4)- excised 1  
        22     1.0000      8   mammary adenocarcinoma (right 5th gland) with lymph node inv  
        23     1.0000      8   mammery adenocarcinoma (right fifth mammary gland)            
        24     1.0000      8   mamary adenocarcinoma - right axillary gland (excised 11/201  
        25     1.0000      8   mammary adenocarcinoma, grade 2                               
        26     1.0000      8   mammary adenocarinoma with metastasis to axillary lymph node  
        27     1.0000      8   mammary adenoma (excised)                                     
        28     1.0000      8   mammary adenosquamous carcinoma (l4, excised), with recurren  
        28     0.5455     42   mammary adenosquamous carcinoma (l4, excised), with recurren  
        29     1.0000      8   mammary cystadenocarcinoma                                    
        30     1.0000      8   mammary gland adenocarcinoma - surgically excised 3/2012      
        31     1.0000      8   mammary gland adenoma - left 4th gland, surgically removed 0  
        32     1.0000      8   mammary gland fibroadenoma - right caudal mammary chain       
        33     1.0000      8   mammary glands enlargement - likely mammary fibroadenomatous  
        34     1.0000      8   mammary intraductal papillary adenoma (multiple) - l4 mammar  
        35     1.0000      8   mammary mass (likely adenocarcinoma)                          
        36     1.0000      8   mammary masses -- suspect adenocarcinoma                      
        37     1.0000      8   mammary masses-suspect mammary adenocarcinoma recurrence      
        37     0.7273     42   mammary masses-suspect mammary adenocarcinoma recurrence      
        38     1.0000      8   mammary tumor: adenoma with an area with more malignant appe  
        38     0.7273     42   mammary tumor: adenoma with an area with more malignant appe  
        39     1.0000      8   mammary tumors- adenomas and carcinoma                        
        40     1.0000      8   mass along right caudal mammary chain (excised - adenoma)     
        41          .      .   mesenteric/jejunal lymphadenopathy                            
        42     1.0000      8   multiple mammary adenomas - completely excised 12/1           
        43     1.0000      8   papillary mammary adenocarcinoma, glands 3 and 4 on the righ  
        43     0.5455     42   papillary mammary adenocarcinoma, glands 3 and 4 on the righ  
        44     1.0000      8   post-op mammary adenoma and anal sac adenocarcinoma removal   
        45     1.0000      8   primary mammary adenocarcinoma                                
        46     1.0000      8   right maxillary mammary adenoma - removed 10/23/13            
        47     1.0000      8   simple mammary adenocarcinoma
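
    A possible next step, sketched under assumptions, is to keep only the best-scoring keyword for each id and accept its code when the score is high enough. The data below are a tiny stand-in for the listing above, and the cutoff of 0.9 is hypothetical; you would choose it after inspecting the scores in your own data.

    ```stata
    * Sketch only: a tiny stand-in for the listing above
    clear
    input int id double similscore byte syst
     1 .      .
    28 1      8
    28 0.5455 42
    37 1      8
    37 0.7273 42
    41 .      .
    end

    * Hypothetical cutoff; choose it after inspecting your own scores
    local cutoff 0.9

    * Keep the best-scoring candidate per id. Sorting on the negated
    * score puts the highest score first; missing values sort last.
    generate double negscore = -similscore
    bysort id (negscore): keep if _n == 1
    drop negscore

    * Accept the candidate's code only if its score clears the cutoff.
    * The missing() check matters because missing counts as larger than any number.
    generate byte system = syst if similscore >= `cutoff' & !missing(similscore)
    list, clean noobs
    ```

    Observations left with a missing system are those with no sufficiently close keyword, and they are the ones worth reviewing by hand.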



    • #3
      Let me complement William Lisowski's suggestion with an unfortunately well-hidden trick from the -matchit- family. The -freqindex- command, which -matchit- requires when you employ weights (among other things), can also help you build the keyword list in Bill's `codes' tempfile.

      For instance, using the example from your post, you can easily see the frequency of terms in your data and select which terms should be associated with each system:

      Code:
      freqindex dx1
      gsort -freq
      list in 1/5, noobs sep(0)
      /*
        +-----------------------+
        |          grams   freq |
        |-----------------------|
        |        mammary     48 |
        | adenocarcinoma     18 |
        |          gland     17 |
        |              -     14 |
        |        adenoma     10 |
        +-----------------------+
      
      */
      The previous example used the similarity function token, which is the default for -freqindex- (equivalent to specifying the option sim(token)). There is also the cotoken option, which can be particularly useful for this kind of task.

      Code:
      freqindex dx1, sim(cotoken)
      gsort -freq
      list in 1/5, noobs sep(0)
      /*
        +-------------------------------+
        |                  grams   freq |
        |-------------------------------|
        |          gland mammary     13 |
        | adenocarcinoma mammary     10 |
        |        adenoma mammary      6 |
        |           benign gland      3 |
        |                - right      3 |
        +-------------------------------+
      */
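
      As a rough sketch of how such a frequency list might be turned into the keyword file used in #2: the rows below are a stand-in for the token frequencies listed above, and the assignment of "mammary" to system 8 is the only coding taken from this thread. Any further terms would be your own judgment calls.

      ```stata
      * Sketch only: stand-in for the token frequencies listed above
      clear
      input str30 grams int freq
      "mammary"        48
      "adenocarcinoma" 18
      "gland"          17
      "-"              14
      "adenoma"        10
      end

      * Assign a system code to each term that identifies a body system;
      * here only "mammary" does, the other terms stay uncoded
      generate byte syst = .
      replace syst = 8 if grams == "mammary"

      * Keep the coded terms and shape them like the keyword file in #2
      keep if !missing(syst)
      rename grams keyword
      generate idkey = _n
      keep syst keyword idkey
      list, clean noobs
      ```

      The resulting dataset has the same three variables (syst, keyword, idkey) as the codes file saved in #2, so it could be saved to a tempfile and passed to matchit in the same way.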
