  • Unique text string


    I have a dataset, approximately 7500 observations, where clinicians were requested to enter a treatment plan:

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str66 q10
    "upper arch treatment only"                     
    "upper arch treatment only,extraction"          
    "upper arch treatment only"                     
    "upper and lower arch treatment,extraction"     
    "upper arch treatment only,non-extraction"      
    "upper arch treatment only,extraction"          
    "upper arch treatment only"                     
    "upper and lower arch treatment,extraction"     
    "upper and lower arch treatment,extraction,other"
    "upper arch treatment only"                     
    end
    ------------------ copy up to and including the previous line ------------------

    I would like to extract specific treatments such as 'upper and lower arch treatment,extraction'. I have used moss (SSC) and regex:
    . li if regexm(q10,"upper and lower arch treatment,extraction")

       +-------------------------------------------------+
       | q10                                             |
       |-------------------------------------------------|
    4. | upper and lower arch treatment,extraction       |
    8. | upper and lower arch treatment,extraction       |
    9. | upper and lower arch treatment,extraction,other |
       +-------------------------------------------------+


    I only need results 4 and 8; the real dataset has several additional treatments.

    How can I find just the 'upper and lower arch treatment,extraction' entries?

    Thank you,
    Martyn

  • #2
    Martyn, I don't know if this will help (I'm not sure what else you are trying to exclude after "upper and lower arch treatment,extraction"), but I added a couple of data points and then tried a few options using regexm() and strpos():

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str66 q10
    "upper and lower arch treatment,extraction"                        
    "upper and lower arch treatment,extraction"                        
    "upper and lower arch treatment,extraction, other"                
    "upper and lower arch treatment,extraction, plus some other things"
    "upper and lower arch treatment,extraction,other"                  
    "upper and lower arch treatment,extraction,other, plus more"      
    "upper arch treatment only"                                        
    "upper arch treatment only"                                        
    "upper arch treatment only"                                        
    "upper arch treatment only"                                        
    "upper arch treatment only,extraction"                            
    "upper arch treatment only,extraction"                            
    "upper arch treatment only,non-extraction"                        
    end
    Code:
    sort q10
    gen w1 = 0  // I could do these on 1 line, but they're easier to read if on two
    replace w1 = 1 if strpos(q10, "upper and lower arch treatment,extraction") >0 & strpos(q10, "other")==0  // exclude if contains word "other"
    gen w2 = 0
    replace w2=1 if regexm(q10, "upper and lower arch treatment,extraction$")  // string must be at the end of q10
    gen w3 = 0
    replace w3 = 1 if strpos(q10, "upper and lower arch treatment,extraction") >0 & strpos(q10, "upper and lower arch treatment,extraction,")==0
    // above excludes if there is a comma after matching string
    
    . list
    
         +----------------------------------------------------------------------------------+
         |                                                               q10   w1   w2   w3 |
         |----------------------------------------------------------------------------------|
      1. |                         upper and lower arch treatment,extraction    1    1    1 |
      2. |                         upper and lower arch treatment,extraction    1    1    1 |
      3. |                  upper and lower arch treatment,extraction, other    0    0    0 |
      4. | upper and lower arch treatment,extraction, plus some other things    0    0    0 |
      5. |                   upper and lower arch treatment,extraction,other    0    0    0 |
         |----------------------------------------------------------------------------------|
      6. |        upper and lower arch treatment,extraction,other, plus more    0    0    0 |
      7. |                                         upper arch treatment only    0    0    0 |
      8. |                                         upper arch treatment only    0    0    0 |
      9. |                                         upper arch treatment only    0    0    0 |
     10. |                                         upper arch treatment only    0    0    0 |
         |----------------------------------------------------------------------------------|
     11. |                              upper arch treatment only,extraction    0    0    0 |
     12. |                              upper arch treatment only,extraction    0    0    0 |
     13. |                          upper arch treatment only,non-extraction    0    0    0 |
         +----------------------------------------------------------------------------------+
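
    If the goal is only the observations whose entire entry is that one combination, a plain equality test may be enough. The following is a minimal sketch on the example data from post #1; the variable names exact and exact_re are made up for illustration, and it assumes the entries carry no stray spaces or spelling variants.

    ```stata
    * Sketch: flag observations whose whole entry equals the target string
    gen byte exact = q10 == "upper and lower arch treatment,extraction"

    * Same idea as a regular expression, anchored at both start (^) and end ($)
    gen byte exact_re = regexm(q10, "^upper and lower arch treatment,extraction$")

    list q10 if exact
    ```

    Anchoring at the start as well as the end also stops a longer entry that merely ends with the target string from being matched.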



    • #3
      Martyn,

      it would help to describe where the data come from, so that we know what problems to anticipate. Just by looking at the code posted by David, I see two potential problems.
      Consider these two cases:
      Code:
      "upper and lower arch treatment, extraction"                        
      "did not recommend upper and lower arch treatment,extraction"
      The first one realistically should be matched, since the extra space does not change the meaning. The second conceptually should not, since it is entirely opposite in meaning.

      Code:
           +----------------------------------------------------------------------------------+
           |                                                               q10   w1   w2   w3 |
           |----------------------------------------------------------------------------------|
        1. |       did not recommend upper and lower arch treatment,extraction    1    1    1 |
        2. |                        upper and lower arch treatment, extraction    0    0    0 |
        3. |                         upper and lower arch treatment,extraction    1    1    1 |
           +----------------------------------------------------------------------------------+
      If the data come from a system that has the codes (but for some reason exported the data to you with labels), it is best to go back to the source and request the codes instead. If, on the other hand, the values were typed in by hand, then you will have all sorts of misspellings and variations to deal with.
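
      One way to handle the spacing case while still rejecting the negated one is to normalise the string first and then test for equality. This is only a sketch: q10n and w4 are hypothetical names, strtrim() and stritrim() assume Stata 14 or later (trim() and itrim() in older versions), and it does nothing about misspellings.

      ```stata
      * Sketch: normalise case and spacing, then match the whole entry
      gen q10n = lower(strtrim(stritrim(q10)))      // trim ends, collapse repeated blanks
      replace q10n = subinstr(q10n, ", ", ",", .)   // "a, b" becomes "a,b"
      replace q10n = subinstr(q10n, " ,", ",", .)   // "a ,b" becomes "a,b"

      * Equality on the cleaned string: prefixes such as "did not recommend" fail
      gen byte w4 = q10n == "upper and lower arch treatment,extraction"
      list q10 w4
      ```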

      Best, Sergiy Radyakin



      • #4
        David, Sergiy,

        Thank you both.

        The data originated from 'Excel' and, as far as I know, there were a limited number of defined treatment options, but the treatments (as text) are concatenated in a single cell.

        There could be multiple treatments for each patient, but only certain combinations are required.

        Thus there could be multiple exclusions, and these would vary with each 'prime' chosen treatment. In this case the prime treatment is 'upper and lower arch treatment', with 'extraction' as the secondary.
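
        With a fixed set of options concatenated by commas, one possible approach is to build an indicator per treatment and count the treatments in each cell, so that any combination can be selected without listing every exclusion. The sketch below assumes the comma is the only delimiter and never occurs inside a treatment name; q10c, ul_arch, extract and ntreat are hypothetical names.

        ```stata
        * Sketch: wrap the cell in commas so each treatment is delimited on both sides
        gen q10c = "," + subinstr(q10, ", ", ",", .) + ","

        * One indicator per treatment; ",extraction," cannot match ",non-extraction,"
        gen byte ul_arch = strpos(q10c, ",upper and lower arch treatment,") > 0
        gen byte extract = strpos(q10c, ",extraction,") > 0

        * Number of treatments in the cell = number of commas + 1
        gen ntreat = length(q10) - length(subinstr(q10, ",", "", .)) + 1

        * Exactly the prime and secondary treatment, nothing else
        list q10 if ul_arch & extract & ntreat == 2
        ```

        Changing the indicators or the value of ntreat selects a different combination, which may be easier to maintain than a separate regular expression for each 'prime' treatment.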

        Cheers,
        Martyn

