  • Strings. Selecting the first occurrence of specific words in a string then ordering the selected words in a consistent way.

    Hello Statalist
    I have a data set with a variable haart1 (combination antiretroviral therapy) as shown below. I would like to extract the individual molecules, ordered in a consistent way.
    For example, the following values should be coded in the same way
    Code:
    “Lopinavirlamivudinetenofovir disoproxilLopinavirtenofovir disoproxillamivudine”
    “tenofovir disoproxilLopinavirlamivudinetenofovir disoproxilLopinavirlamivudine”
    Should be “lamivudine lopinavir tenofovir”
    I have tried using strpos() to identify the known common combinations manually, but I cannot enumerate every possible combination accurately, given that I have about 24 names that can be combined in triads or quartets.
    Code:
    replace haart1 = "FTD_TDF_EFV" if strpos(haart1, "tenofovir") > 0 & strpos(haart1, "emtricitabine") > 0 & strpos(haart1, "efavirenz") > 0
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str168 haart1
    "Lopinavirlamivudinezidovudine"             
    "lamivudinezidovudineLopinavir"             
    "Lopinavirlamivudinezidovudine"             
    "lamivudineLopinavirzidovudine"             
    "lamivudineLopinavirzidovudine"             
    "zidovudineLopinavirlamivudine"             
    "Lopinavir"                                                                 
    "Lopinavir"                                 
    "lamivudinetenofovir disoproxilLopinavir"   
    "lamivudineLopinavirtenofovir disoproxil"   
    "lamivudineLopinavirtenofovir disoproxil"   
    "lamivudinetenofovir disoproxilLopinavir"   
    "Lopinavirtenofovir disoproxillamivudine"   
    "lamivudineLopinavirtenofovir disoproxil"
    end
    P.S. The order of the words is not important, so AABBCC, CCCBBAAA, and CCAABBB should all become ABC (where each letter represents a word in the string).

    Thanks in advance

    Vitalis

  • #2
    I can't easily follow #1. It seems that sometimes drug names are separated by spaces and sometimes not. Also, you are referring to three-letter abbreviations (TLAs!) as if we knew what they are or should be.

    Disclaimer: I know nothing specialised about pharmacology, just enough to think you are talking about drugs.

    I think I can help some of the way. If you had words (in Stata's sense, separated by spaces) then you could split (an official command), sort within rows and then concatenate. Here is a silly example.

    Code:
    clear
    input str14 whatever 
    "cat dog"
    "dog cat" 
    "fox emu" 
    "emu dog"
    "frog toad newt"
    "newt frog toad"
    end 
    
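    * split into one variable per space-separated word: beast1, beast2, ...
    split whatever, gen(beast)
    unab vars : beast*
    local nvars : word count `vars'
    * build a matching list of new variable names: new1, new2, ...
    forval j = 1/`nvars' {
        local new `new' new`j'
    }

    * sort the words within each row (rowsort from SJ 9-1, pr0046)
    rowsort beast*, gen(`new')

    * join the sorted words back into one string
    egen sorted = concat(new*) , p(" ")

    list whatever sorted, sep(0)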
    
         +---------------------------------+
         |       whatever           sorted |
         |---------------------------------|
      1. |        cat dog          cat dog |
      2. |        dog cat          cat dog |
      3. |        fox emu          emu fox |
      4. |        emu dog          dog emu |
      5. | frog toad newt   frog newt toad |
      6. | newt frog toad   frog newt toad |
         +---------------------------------+

    To do that, you need rowsort, which should be downloaded from the Stata Journal site:

    Code:
    . search rowsort, sj 
    
    SJ-9-1  pr0046  . . . . . . . . . . . . . . . . . . .  Speaking Stata: Rowwise
            (help rowsort, rowranks if installed) . . . . . . . . . . .  N. J. Cox
            Q1/09   SJ 9(1):137--157
            shows how to exploit functions, egen functions, and Mata
            for working rowwise; rowsort and rowranks are introduced
    NB: the earlier version of rowsort on SSC cannot help you. It is restricted to integer variables.

    However, even your example is messier, with at least three more issues:

    * removing duplicates

    * inserting spaces

    * correcting for inconsistencies of case (you don't mention this)

    I fear also that in any real large dataset you will also see spelling mistakes or other inconsistencies.

    I think this works with your example. To get rid of the duplicates, we need to reshape temporarily. But then rowsort is no longer needed.

    For the full dataset, you will need to do more. What you feed to foreach may need to be a longer list of drug names.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str168 haart1
    "Lopinavirlamivudinezidovudine"             
    "lamivudinezidovudineLopinavir"             
    "Lopinavirlamivudinezidovudine"             
    "lamivudineLopinavirzidovudine"             
    "lamivudineLopinavirzidovudine"             
    "zidovudineLopinavirlamivudine"             
    "Lopinavir"                                                                 
    "Lopinavir"                                 
    "lamivudinetenofovir disoproxilLopinavir"   
    "lamivudineLopinavirtenofovir disoproxil"   
    "lamivudineLopinavirtenofovir disoproxil"   
    "lamivudinetenofovir disoproxilLopinavir"   
    "Lopinavirtenofovir disoproxillamivudine"   
    "lamivudineLopinavirtenofovir disoproxil"
    end
    
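    * work on a lower-case copy so that case differences do not matter
    gen work = lower(haart1)

    * put spaces around each known drug name so that split can separate them
    foreach drug in lopinavir lamivudine zidovudine {
        replace work = subinstr(work, "`drug'", " `drug' ", .)
    }

    * keep the original data safe, then go long: one observation per word
    gen long id = _n
    save safecopy, replace
    keep id work
    split work, gen(drug)
    reshape long drug, i(id) j(which)
    * drop repeated names within each id, then renumber in alphabetical order
    duplicates drop id drug, force
    bysort id (drug) : replace which = _n

    * back to wide, join the sorted names and restore the other variables
    reshape wide drug, i(id) j(which)
    egen all = concat(drug*), p(" ")
    drop drug*
    merge 1:1 id using safecopy

    list all, sep(0)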
    
         +-------------------------------------------+
         |                                       all |
         |-------------------------------------------|
      1. |           lamivudine lopinavir zidovudine |
      2. |           lamivudine lopinavir zidovudine |
      3. |           lamivudine lopinavir zidovudine |
      4. |           lamivudine lopinavir zidovudine |
      5. |           lamivudine lopinavir zidovudine |
      6. |           lamivudine lopinavir zidovudine |
      7. |                                 lopinavir |
      8. |                                 lopinavir |
      9. | disoproxil lamivudine lopinavir tenofovir |
     10. | disoproxil lamivudine lopinavir tenofovir |
     11. | disoproxil lamivudine lopinavir tenofovir |
     12. | disoproxil lamivudine lopinavir tenofovir |
     13. | disoproxil lamivudine lopinavir tenofovir |
     14. | disoproxil lamivudine lopinavir tenofovir |
         +-------------------------------------------+
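
    Note that in rows 9 to 14 "tenofovir disoproxil" comes out as two separate words, whereas #1 asked for "lamivudine lopinavir tenofovir". If the full list of drug names is known in advance, a shorter route may be possible; this is only a sketch under assumptions, not part of the code above. It loops over the names in the order wanted, appends each name that occurs anywhere in the string, and keeps whatever text matches no name so that it can be inspected. The variable names lowered, regimen and leftover are made up for this sketch, and the four-name list is an assumption that would need extending to all of the roughly 24 names.

    Code:
    clear
    input str168 haart1
    "Lopinavirlamivudinezidovudine"
    "lamivudinetenofovir disoproxilLopinavir"
    "tenofovir disoproxilLopinavirlamivudinetenofovir disoproxilLopinavirlamivudine"
    "Lopinavir"
    end

    * lower-case copy, so that Lopinavir and lopinavir match alike
    gen lowered = lower(haart1)

    * regimen and leftover are hypothetical names used only in this sketch
    gen regimen = ""
    gen leftover = lowered

    * list the names in the order wanted in the result (alphabetical here)
    * this short list is an assumption and must be extended for real data
    foreach drug in lamivudine lopinavir tenofovir zidovudine {
        * append each name once if it occurs anywhere in the string
        replace regimen = regimen + " `drug'" if strpos(lowered, "`drug'") > 0
        * strip every occurrence, so that unrecognised text is left behind
        replace leftover = subinstr(leftover, "`drug'", "", .)
    }

    * strtrim() needs Stata 14 or later; trim() does the same job in older versions
    replace regimen = strtrim(regimen)
    replace leftover = strtrim(leftover)

    list regimen leftover, sep(0)

    On these four rows regimen should be "lamivudine lopinavir zidovudine", "lamivudine lopinavir tenofovir" (twice) and "lopinavir". The leftover variable is empty for rows 1 and 4 and contains only "disoproxil" text for rows 2 and 3, which flags that "tenofovir disoproxil" is a two-word name. Misspelled names would show up in leftover in the same way. One caveat: if any drug name is contained within another, strpos() will match both, so such names would need separate handling.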




    • #3
      Dear Nick,
      Thank you. Apologies for the assumptions. I was able to proceed with your advice.

      Vitalis
