
  • Text Matching Pattern

    Dear Stata Experts,

    I have lab data consisting of two string variables, test name and test result. Each variable holds a comma-separated list, and the entries appear in no fixed order. For example, test name may contain (al khurma pcr,chikungunya pcr,dengue igg,dengue igm,dengue ns1,dengue pcr,rift valley fever PCR) and test result may contain (not required,not required,not done,positive,not done,detected,not required).

    I am interested in three tests (dengue igm, dengue ns1, dengue PCR) and in whether their results are positive or detected.

    Therefore I used the commands below:

    replace testresult = lower(testresult ) // convert everything to lowercase to be safe
    replace testname = lower(testname)


    gen check = ustrregexm(testresult, "positive| detected")
    gen new_test = ustrregexm(testname , "pcr| ns1 |igm")
    ******************
    gen text = ustrregexs(0) if ustrregexm(testname, "dengue igm| dengue ns1|dengue pcr")

    The problem is that I can't pair each test name (dengue igm, dengue ns1, dengue PCR) with its own test result, because the tests appear at varying positions within the testname variable.


    I hope I have explained my issue clearly. Any assistance would be appreciated.

    Meshal






  • #2
    Insert word boundaries.

    Code:
    clear
    input strL(testname testresult)
    "al khurma pcr,chikungunya pcr,dengue igg,dengue" "not required,not required,not done,positive"
    "igm,dengue ns1,dengue pcr,rift valley fever PCR" "not done, not required"
    "igm,dengue ns1,dengue pcr,rift valley fever PCR" "positive,not done,detected,not required"
    "al khurma pcr,chikungunya pcr,dengue igg,dengue" "not required,not required,not done"
    end
    
    gen found= ustrregexm(" " + lower(testname) + " ", "\b(pcr|ns1|igm)\b") & ustrregexm(" " + lower(testresult) + " ", "\b(positive|detected)\b")
    Res.:

    Code:
    . gen found= ustrregexm(" " + lower(testname) + " ", "\b(pcr|ns1|igm)\b") & ustrregexm(" " + lower(testresult) + "
    >  ", "\b(positive|detected)\b")
    
    . l
    
         +-------------------------------------------------------------------------------------------------------+
         |                                        testname                                    testresult   found |
         |-------------------------------------------------------------------------------------------------------|
      1. | al khurma pcr,chikungunya pcr,dengue igg,dengue   not required,not required,not done,positive       1 |
      2. | igm,dengue ns1,dengue pcr,rift valley fever PCR                        not done, not required       0 |
      3. | igm,dengue ns1,dengue pcr,rift valley fever PCR       positive,not done,detected,not required       1 |
      4. | al khurma pcr,chikungunya pcr,dengue igg,dengue            not required,not required,not done       0 |
         +-------------------------------------------------------------------------------------------------------+
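    To illustrate what the word boundaries in #2 do, here is a minimal sketch using literal strings rather than the poster's data. The example strings are made up for illustration; `ustrregexm()` and the `\b` anchor are the same ones used above.

```stata
* \b matches only at a word boundary, so a token is found as a whole word
display ustrregexm("dengue igm", "\b(pcr|ns1|igm)\b")    // "igm" is a whole word
display ustrregexm("dengue igmx", "\b(pcr|ns1|igm)\b")   // "igm" is embedded in "igmx"
```

    The first call should display 1 and the second 0, since "igm" inside "igmx" is not followed by a boundary.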



    • #3
      Thank you Andrew,

      But as you can see below, found==1 even though the test result is neither positive nor detected for the "pcr|ns1|igm" tests:




      testname testresult found
      (al khurma pcr,chikungunya pcr,dengue igg,dengue igm,dengue ns1,dengue pcr,rift valley fever pcr) (not detected,not detected,low,negative,negative,not detected,not detected) 1



      • #4
        You need to restructure your data first to match the test and result, then apply #2.

        Code:
        clear
        input strL(testname testresult)
        "al khurma pcr,chikungunya pcr,dengue igg,dengue" "not required,not required,not done,positive"
        "igm,dengue ns1,dengue pcr,rift valley fever PCR" "not done, not required"
        "igm,dengue ns1,dengue pcr,rift valley fever PCR" "positive,not done,detected,not required"
        "al khurma pcr,chikungunya pcr,dengue igg,dengue" "not required,not required,not done"
        end
        
        gen long which=_n
        gen count= length(testname)- length(subinstr(testname, ",", "", .)) + 1
        expand count
        bysort which: gen name= subinstr(word(subinstr(subinstr(trim(itrim(testname)), " ", "_", .), ",", " ", .), _n), "_", " ", .)
        bysort which: gen result= subinstr(word(subinstr(subinstr(trim(itrim(testresult)), " ", "_", .), ",", " ", .), _n), "_", " ", .)
        keep which name result
        Res.:

        Code:
        . l, sepby(which)
        
             +-----------------------------------------------+
             | which                    name          result |
             |-----------------------------------------------|
          1. |     1           al khurma pcr    not required |
          2. |     1         chikungunya pcr    not required |
          3. |     1              dengue igg        not done |
          4. |     1                  dengue        positive |
             |-----------------------------------------------|
          5. |     2                     igm        not done |
          6. |     2              dengue ns1    not required |
          7. |     2              dengue pcr                 |
          8. |     2   rift valley fever PCR                 |
             |-----------------------------------------------|
          9. |     3                     igm        positive |
         10. |     3              dengue ns1        not done |
         11. |     3              dengue pcr        detected |
         12. |     3   rift valley fever PCR    not required |
             |-----------------------------------------------|
         13. |     4           al khurma pcr    not required |
         14. |     4         chikungunya pcr    not required |
         15. |     4              dengue igg        not done |
         16. |     4                  dengue                 |
             +-----------------------------------------------+
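        Once the data are in this one-test-per-row layout, the final matching step might look like the sketch below. It assumes the variable names `which`, `name`, and `result` from the code above, uses a small made-up dataset so that it runs on its own, and swaps the regex of #2 for an exact comparison with `inlist()`, since each row now holds a single test and a single result. The names `hit` and `found` are hypothetical.

```stata
clear
input long which str30 name str20 result
1 "dengue igg" "not done"
1 "dengue pcr" "detected"
2 "dengue igm" "not done"
2 "rift valley fever PCR" "positive"
end

* Flag rows where a dengue test of interest has a positive/detected result
gen byte hit = inlist(lower(trim(name)), "dengue igm", "dengue ns1", "dengue pcr") & inlist(lower(trim(result)), "positive", "detected")

* Carry the row-level flag up to the original observation
bysort which: egen byte found = max(hit)
list, sepby(which)
```

        Here observation 1 is flagged because its dengue pcr row is "detected", while observation 2 is not, because its only positive result belongs to a non-dengue test.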
        Last edited by Andrew Musau; 08 Jul 2024, 00:56.



        • #5
          Originally posted by Meshal AlQhtani:
          Thank you Andrew,

          But as you can see below, found==1 even though the test result is neither positive nor detected for the "pcr|ns1|igm" tests:




          testname testresult found
          (al khurma pcr,chikungunya pcr,dengue igg,dengue igm,dengue ns1,dengue pcr,rift valley fever pcr) (not detected,not detected,low,negative,negative,not detected,not detected) 1
          Re-reading #3, if there is no one-to-one matching needed, then I see your point. The issue is that "detected" is contained within "not detected". The solution would be to eliminate spaces, so that "not detected" becomes "notdetected" and thus is a distinct "word".

          Code:
          clear
          input strL(testname testresult)
          "al khurma pcr,chikungunya pcr,dengue igg,dengue igm,dengue ns1,dengue pcr,rift valley fever pcr" "not detected,not detected,low,negative,negative,not detected,not detected"
          "al khurma pcr,chikungunya pcr,dengue igg,dengue igm,dengue ns1,dengue pcr,rift valley fever pcr" "detected,low,negative,negative"
          end
          
          gen found= ustrregexm(" " + lower(testname) + " ", "\b(pcr|ns1|igm)\b") & ustrregexm(" " + lower(subinstr(testresult, " ", "", .)) + " ", "\b(positive|detected)\b")
          Res.:

          Code:
          . l testresult found
          
               +-----------------------------------------------------------------------------------+
               |                                                                testresult   found |
               |-----------------------------------------------------------------------------------|
            1. | not detected,not detected,low,negative,negative,not detected,not detected       0 |
            2. |                                            detected,low,negative,negative       1 |
               +-----------------------------------------------------------------------------------+
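          The effect of removing spaces can be checked on literal strings. This is a minimal sketch with made-up inputs, using the same pattern as above:

```stata
* With the space kept, "detected" is still a whole word inside "not detected"
display ustrregexm("not detected,low", "\b(positive|detected)\b")

* With spaces removed, "notdetected" is one word and no longer matches
display ustrregexm(subinstr("not detected,low", " ", "", .), "\b(positive|detected)\b")

* A genuine "detected" still matches after spaces are removed
display ustrregexm(subinstr("detected,low", " ", "", .), "\b(positive|detected)\b")
```

          These should display 1, 0, and 1 respectively. Note that this approach still tests the two lists independently, so it flags an observation whenever any listed name matches and any listed result is positive or detected; pairing each test with its own result requires the restructuring in #4.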



          • #6
            Consider this:

            Code:
            // CREATE TOY EXAMPLE
            clear
            input byte id strL(testname testresult)
            1 "al khurma pcr,chikungunya pcr,dengue igg,dengue igm,dengue ns1,dengue pcr,rift valley fever pcr" "not detected,not detected,low,negative,negative,not detected,not detected"
            2 "al khurma pcr,chikungunya pcr,dengue igg,dengue" "not required,not required,not done,positive"
            3 "igm,dengue ns1,dengue pcr,rift valley fever PCR" "positive,not done,detected,not required"
            end
            
            // SOLUTION STARTS HERE
            
            split testname, p(,)
            split testresult, p(,)
            
            rename (testname testresult) orig_=
            
            reshape long testname testresult, i(id) j(testnum)
            drop if testname==""
            
            replace testname = lower(trim(testname))
            replace testresult = lower(trim(testresult))
            
            gen byte is_dengue = inlist(testname, "dengue igm", "dengue ns1", "dengue pcr")
            gen byte is_positive = inlist(testresult, "positive", "detected")
            
            gen byte is_dengue_positive = is_dengue * is_positive
            egen wanted = max(is_dengue_positive), by(id)
            
            drop testname testresult is_* testnum
            duplicates drop
            rename orig_* *
            which produces:
            Code:
            . list, noobs
            
              +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
              | id                                                                                          testname                                                                  testresult   wanted |
              |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
              |  1   al khurma pcr,chikungunya pcr,dengue igg,dengue igm,dengue ns1,dengue pcr,rift valley fever pcr   not detected,not detected,low,negative,negative,not detected,not detected        0 |
              |  2                                                   al khurma pcr,chikungunya pcr,dengue igg,dengue                                 not required,not required,not done,positive        0 |
              |  3                                                   igm,dengue ns1,dengue pcr,rift valley fever PCR                                     positive,not done,detected,not required        1 |
              +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
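            Because this approach pairs the k-th test name with the k-th result, it may be worth checking first that both lists have the same number of entries. The sketch below is one possible check, reusing the comma-counting idea from #4 on a small made-up dataset; the variable names `n_names`, `n_results`, and `mismatch` are hypothetical.

```stata
clear
input byte id strL(testname testresult)
1 "dengue igm,dengue pcr" "positive,not done"
2 "dengue igm,dengue ns1,dengue pcr" "not done,detected"
end

* Number of comma-separated entries = number of commas + 1
gen n_names   = length(testname)   - length(subinstr(testname, ",", "", .))   + 1
gen n_results = length(testresult) - length(subinstr(testresult, ",", "", .)) + 1

* Flag observations whose lists cannot be paired one-to-one
gen byte mismatch = n_names != n_results
list id n_names n_results mismatch if mismatch
```

            In this toy data only id 2 is listed, since it has three test names but two results. How to handle such observations is a judgment call that depends on how the lab data were recorded.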



            • #7
              Many thanks to Mr. Andrew Musau and Hemanshu Kumar!

              Both solutions worked very well, especially the last one.

              Best regards,
              Meshal
