Concatenating distinct values contained within a string.

Craig Knott

Join Date: Jul 2014
Posts: 52

16 Aug 2018, 04:39

I currently have a string variable (analysis_group) that contains a list of drugs prescribed at each line of treatment. I'm looking to create a new string that contains only the unique drugs from each line. In the example below, for patient 6, treatment line 3 would be "CARBOPLATIN + ETOPOSIDE".

Any ideas?

Code:

clear
input float(makeid line) str44 analysis_group
 1 1 "CARBOPLATIN + ETOPOSIDE"                    
 1 2 "CYCLOPHOSPHAMIDE + DOXORUBICIN + VINCRISTINE"
 2 1 "CARBOPLATIN + ETOPOSIDE"                    
 3 1 "CARBOPLATIN + ETOPOSIDE"                    
 4 1 "CARBOPLATIN + ETOPOSIDE"                    
 5 1 "CARBOPLATIN + ETOPOSIDE"                    
 5 1 "CARBOPLATIN + ETOPOSIDE"                    
 6 1 "CARBOPLATIN + ETOPOSIDE"                    
 6 2 "PAZOPANIB"                                  
 6 3 "CARBOPLATIN + ETOPOSIDE"                    
 6 3 "CARBOPLATIN"                                
 7 1 "CARBOPLATIN + ETOPOSIDE"                    
 8 1 "CISPLATIN + ETOPOSIDE"                      
 9 1 "CARBOPLATIN + ETOPOSIDE"                    
10 1 "CISPLATIN + ETOPOSIDE"                      
11 1 "CARBOPLATIN + ETOPOSIDE"                    
12 1 "CARBOPLATIN + ETOPOSIDE"                    
13 1 "CARBOPLATIN + ETOPOSIDE"                    
14 1 "CARBOPLATIN + ETOPOSIDE"                    
15 1 "CARBOPLATIN + ETOPOSIDE"                    
16 1 "CARBOPLATIN + ETOPOSIDE"                    
end

I'm trying to tweak some code found elsewhere, but so far without success:

Code:

egen newid=group(tumourid line)
foreach n in newid {
local t `"`=analysis_group[`n']'"'
local t2 : list uniq t
replace analysis_group = `"`: list uniq t'"' in `n'
         }

Last edited by Craig Knott; 16 Aug 2018, 05:22.

Mike Lacy

Join Date: Apr 2014

Posts: 2411
#2

16 Aug 2018, 09:05

I don't understand your description of what you want. My interpretation of "string that contains only the unique drugs from each line" would be that the line 1 for makeid = 6 does contain a list of the "unique" (actually "distinct") drugs at that line, with no duplications. It's hard to infer your meaning here from your example data, given that I gather that only makeid = 6 is an instance of interest to your problem. So, perhaps an augmented example along with a different description would help.

I also don't see the relevance of the code you report from elsewhere, but my guess is that it's unlikely to be useful to any purpose I can think of.

Finally, I'd have some suspicion that the data format you think you want might not be the best for your ultimate analytic purpose. So, I'd also suggest that you might describe what kind of question you want to answer with your data, as a different data format might serve you better than what you're thinking of at the moment.
Craig Knott

Join Date: Jul 2014

Posts: 52
#3

17 Aug 2018, 02:05

Morning, Mike.

My issue concerns the fact that lines of treatment can include a number of regimens (each row is one regimen) with a slight change in treatment. Thus, if I only keep the first row of a line, the analysis_group field concerning the drugs administered during the line of treatment will be inaccurate. So what I'm aiming to achieve is to pull into the first row of each line all of the distinct drugs administered throughout the duration of each line.

I hope that makes more sense!

Edit: In case it's helpful to anyone else, I've obtained an answer elsewhere that appears to solve the query, though I've yet to test it on a particularly large dataset.

Code:

split analysis_group, parse("+")
drop analysis_group
gen n = _n
reshape long analysis_group, i(n) j(drug_idx)
replace analysis_group = trim(analysis_group)
drop n drug_idx
duplicates drop
drop if missing(analysis_group)
gen n = _n
drop n
bys makeid line: gen n = _n
reshape wide analysis_group, i(makeid line) j(n)
gen analysis_group = analysis_group1 ///
    + cond(!missing(analysis_group2), " + " + analysis_group2, "") ///
    + cond(!missing(analysis_group3), " + " + analysis_group3, "")
drop analysis_group1 analysis_group2 analysis_group3
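
The final concatenation in that code assumes at most three distinct drugs per line. A variant that avoids hard-coding that number might look like the sketch below. It is only a sketch and has not been tested on a large dataset. The names obs, k, drug and all_drugs are illustrative and do not appear in the code above, and the input is just the makeid = 6 rows from the original example. For makeid 6, line 3, the intended result is "CARBOPLATIN + ETOPOSIDE".

```stata
* Sketch only: distinct drugs per makeid/line, without assuming a maximum of three
clear
input float(makeid line) str44 analysis_group
6 1 "CARBOPLATIN + ETOPOSIDE"
6 2 "PAZOPANIB"
6 3 "CARBOPLATIN + ETOPOSIDE"
6 3 "CARBOPLATIN"
end

gen long obs = _n                            // remember the original row order
split analysis_group, parse("+") gen(drug)   // drug1, drug2, ...
reshape long drug, i(obs) j(k)               // one row per drug mention
replace drug = trim(drug)
drop if missing(drug)

* keep the first mention of each drug within a line of treatment
bysort makeid line drug (obs k): keep if _n == 1

* rebuild the combined string in order of first appearance
sort makeid line obs k
by makeid line: gen all_drugs = drug if _n == 1
by makeid line: replace all_drugs = all_drugs[_n-1] + " + " + drug if _n > 1
by makeid line: keep if _n == _N             // last row holds the full list

keep makeid line all_drugs
list, sepby(makeid)
```

This leaves one row per makeid and line, so any other variables needed from the first row of each line would have to be merged back in afterwards.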

Last edited by Craig Knott; 17 Aug 2018, 02:24.