Strings. Selecting the first occurrence of specific words in a string then ordering the selected words in a consistent way.

Vitalis Feteh

Join Date: Jun 2019
Posts: 6

04 Jul 2019, 00:24

Hello Statalist
I have a data set with a string variable haart1 (combination antiretroviral), as shown below. I would like to extract the individual molecules and order them in a consistent way.
For example, the following two values should be coded in the same way:

Code:

“Lopinavirlamivudinetenofovir disoproxilLopinavirtenofovir disoproxillamivudine”
“tenofovir disoproxilLopinavirlamivudinetenofovir disoproxilLopinavirlamivudine”

Both should become “lamivudine lopinavir tenofovir”.
I have tried using strpos to identify the common known combinations manually, but I cannot accurately generate all possible combinations, given that I have about 24 names that are combined in triads or quartets.

Code:

replace haart1 = "FTD_TDF_EFV" if strpos(haart1, "tenofovir") > 0 & strpos(haart1, "emtricitabine") > 0 & strpos(haart1, "efavirenz") > 0

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str168 haart1
"Lopinavirlamivudinezidovudine"             
"lamivudinezidovudineLopinavir"             
"Lopinavirlamivudinezidovudine"             
"lamivudineLopinavirzidovudine"             
"lamivudineLopinavirzidovudine"             
"zidovudineLopinavirlamivudine"             
"Lopinavir"                                                                 
"Lopinavir"                                 
"lamivudinetenofovir disoproxilLopinavir"   
"lamivudineLopinavirtenofovir disoproxil"   
"lamivudineLopinavirtenofovir disoproxil"   
"lamivudinetenofovir disoproxilLopinavir"   
"Lopinavirtenofovir disoproxillamivudine"   
"lamivudineLopinavirtenofovir disoproxil"
end

P.S. The order of the words in the original string is not important. So AABBCC, CCCBBAAA and CCAABBB should all become ABC (where each letter represents a word in the string).

Thanks in advance

Vitalis

Nick Cox

Join Date: Mar 2014
Posts: 35671

04 Jul 2019, 03:04

I can't easily follow #1. It seems that sometimes drug names are separated by spaces and sometimes not. Also, you are referring to three letter abbreviations (TLAs!) as if we knew what they are or should be.

Disclaimer: I know nothing specialised about pharmacology, just enough to think you are talking about drugs.

I think I can help some of the way. If you had words (in Stata's sense, separated by spaces) then you could split (an official command), sort within rows and then concatenate. Here is a silly example.

Code:

clear
input str14 whatever 
"cat dog"
"dog cat" 
"fox emu" 
"emu dog"
"frog toad newt"
"newt frog toad"
end 

split whatever, gen(beast)  
unab vars : beast*
local nvars : word count `vars' 
forval j = 1/`nvars' { 
    local new `new' new`j'
}
    
rowsort beast*, gen(`new') 

egen sorted = concat(new*) , p(" ") 

list whatever sorted, sep(0) 

     +---------------------------------+
     |       whatever           sorted |
     |---------------------------------|
  1. |        cat dog          cat dog |
  2. |        dog cat          cat dog |
  3. |        fox emu          emu fox |
  4. |        emu dog          dog emu |
  5. | frog toad newt   frog newt toad |
  6. | newt frog toad   frog newt toad |
     +---------------------------------+

To do that, you need rowsort which should be downloaded from the Stata Journal site:

Code:

. search rowsort, sj 

SJ-9-1  pr0046  . . . . . . . . . . . . . . . . . . .  Speaking Stata: Rowwise
        (help rowsort, rowranks if installed) . . . . . . . . . . .  N. J. Cox
        Q1/09   SJ 9(1):137--157
        shows how to exploit functions, egen functions, and Mata
        for working rowwise; rowsort and rowranks are introduced

NB: the earlier version of rowsort on SSC cannot help you. It is restricted to integer variables.

However, even your example is messier, with at least three more issues:

* removing duplicates

* inserting spaces

* correcting for inconsistencies of case (you don't mention this)

I fear also that in any real large dataset you will also see spelling mistakes or other inconsistencies.

I think this works with your example. To get rid of the duplicates, we need to reshape temporarily. But then rowsort is no longer needed.

For the full dataset, you will need to do more. What you feed to foreach may need to be a longer list of drug names.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str168 haart1
"Lopinavirlamivudinezidovudine"             
"lamivudinezidovudineLopinavir"             
"Lopinavirlamivudinezidovudine"             
"lamivudineLopinavirzidovudine"             
"lamivudineLopinavirzidovudine"             
"zidovudineLopinavirlamivudine"             
"Lopinavir"                                                                 
"Lopinavir"                                 
"lamivudinetenofovir disoproxilLopinavir"   
"lamivudineLopinavirtenofovir disoproxil"   
"lamivudineLopinavirtenofovir disoproxil"   
"lamivudinetenofovir disoproxilLopinavir"   
"Lopinavirtenofovir disoproxillamivudine"   
"lamivudineLopinavirtenofovir disoproxil"
end

gen work = lower(haart1) 

foreach drug in lopinavir lamivudine zidovudine { 
    replace work = subinstr(work, "`drug'", " `drug' ", .) 
} 

gen long id = _n 
save safecopy, replace 
keep id work 
split work, gen(drug) 
reshape long drug, i(id) j(which) 
duplicates drop id drug, force  
bysort id (drug) : replace which = _n 

reshape wide drug, i(id) j(which) 
egen all = concat(drug*), p(" ") 
drop drug* 
merge 1:1 id using safecopy 

list all, sep(0) 

     +-------------------------------------------+
     |                                       all |
     |-------------------------------------------|
  1. |           lamivudine lopinavir zidovudine |
  2. |           lamivudine lopinavir zidovudine |
  3. |           lamivudine lopinavir zidovudine |
  4. |           lamivudine lopinavir zidovudine |
  5. |           lamivudine lopinavir zidovudine |
  6. |           lamivudine lopinavir zidovudine |
  7. |                                 lopinavir |
  8. |                                 lopinavir |
  9. | disoproxil lamivudine lopinavir tenofovir |
 10. | disoproxil lamivudine lopinavir tenofovir |
 11. | disoproxil lamivudine lopinavir tenofovir |
 12. | disoproxil lamivudine lopinavir tenofovir |
 13. | disoproxil lamivudine lopinavir tenofovir |
 14. | disoproxil lamivudine lopinavir tenofovir |
     +-------------------------------------------+
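
One way to keep "tenofovir disoproxil" from contributing a separate word, sketched here under assumptions rather than tested on the full data: loop over the drug names in alphabetical order and append each name once if strpos() finds it anywhere in the string. The variable name all2 and the four-name list are illustrative assumptions; the real list would need all of the roughly 24 names, typed in the order wanted in the result.

```stata
* Sketch only: small illustrative subset of the example data
clear
input str168 haart1
"Lopinavirlamivudinezidovudine"
"lamivudinetenofovir disoproxilLopinavir"
"tenofovir disoproxilLopinavirlamivudinetenofovir disoproxilLopinavirlamivudine"
"Lopinavir"
end

* Work in lower case so that "Lopinavir" and "lopinavir" match
gen work = lower(haart1)

* all2 is a hypothetical name for the result variable
gen all2 = ""

* Assumed list: extend as needed, keeping names in alphabetical order
foreach drug in lamivudine lopinavir tenofovir zidovudine {
    * Append the name once if it occurs anywhere in the string
    replace all2 = all2 + " `drug'" if strpos(work, "`drug'") > 0
}

* Remove the leading space left by the first appended name
replace all2 = trim(all2)

list haart1 all2, sep(0)
```

Because each name is tested only once, duplicates never arise and no reshape is needed. The approach assumes that no drug name is a substring of another and that spellings are consistent, which may not hold in a real dataset.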

Vitalis Feteh

Join Date: Jun 2019

Posts: 6
#3

04 Jul 2019, 19:08

Dear Nick,
Thank you. Apologies for the assumptions. I was able to proceed with your advice.

Vitalis
