Concatenating strings across observations based on ID + additional variable

Theresa Pfu

Join Date: Nov 2023

Posts: 8
#1

Concatenating strings across observations based on ID + additional variable

20 Aug 2024, 07:04

Hi everyone,

I'm working with a WGS dataset that is long format with multiple observations per participant. Ultimately I need to reshape the dataset to a wide format so that I can append it with the clinical data of the participants. Before that I want to make some modifications and restructure the data.

For example I want to work with 3 variables: generic_id (participant identifier), gene (specifies the gene located for each observation in the long data), and drug (specifies the antibiotic that the gene mutation conferred a resistance to), all 3 are strings.
Based on the generic_id an the drug I would like to concatenate the genes (multiple genes can confer resistanceto the same drug, likewise one gene can confer resistance to multiple drugs)
This way I'm hoping to have a variable that is the same for each drug within a particpant listing all genes responsible for conferring resistance to this drug.

I have read this article: https://journals.sagepub.com/doi/pdf...36867X20909698
and tried the following:

Code:

bysort generic_id(drug): generate gene2= gene[1] by generic_id: replace gene2= gene2[_n-1] + gene if _n>1 by generic_id: replace gene2=gene[_N]

But this does not work in my case. The number of observations for each generic_id is not constant in my dataset. Would appreciate any nudge or ideas of how to do this. Thanks!
Tags: None

Ken Chui

Join Date: Aug 2014
Posts: 1058

20 Aug 2024, 07:52

Hello!

Here is a sample data I created for you:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10(generic_id drug gene)
"0001" "a" "g01"
"0001" "a" "g02"
"0001" "b" "g01"
"0001" "c" "g03"
"0001" "d" "g08"
"0002" "a" "g01"
"0002" "b" "g02"
"0002" "b" "g03"
"0002" "f" "g08"
"0002" "f" "g09"
end

Do you mean you want something like this?

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10(generic_id drug gene_new)
"0001" "a" "g01 g02"
"0001" "b" "g01"    
"0001" "c" "g03"    
"0001" "d" "g08"    
"0002" "a" "g01"    
"0002" "b" "g02 g03"
"0002" "f" "g08 g09"
end

If this is the case, try this:

Code:

bysort generic_id drug: gen seq = _n

reshape wide gene, i(generic_id drug) j(seq)

egen gene_new = concat(gene1-gene2), punct(" ")
replace gene_new = trim(gene_new)

If not, please post what you envision the end product would be.

Last edited by Ken Chui; 20 Aug 2024, 08:05.

Comment

Theresa Pfu

Join Date: Nov 2023
Posts: 8

21 Aug 2024, 08:30

Hi Ken Chui,

Thanks a lot for your suggestion. Apologies for not being clear enough, using the data example you provided

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str10(generic_id drug gene)
"0001" "a" "g01"
"0001" "a" "g02"
"0001" "b" "g01"
"0001" "c" "g03"
"0001" "d" "g08"
"0002" "a" "g01"
"0002" "b" "g02"
"0002" "b" "g03"
"0002" "f" "g08"
"0002" "f" "g09"
end

what I would actually want as an end product after reshaping is the following:

Code:

clear
input str10(generic_id genes_druga genes_drugb genes_drugc genes_drugd genes_drugf)
"0001" "g01 g02" "g01" "go3" "g08" "NA"
"0002" "g01" "g02 g03" "NA" "NA" "g02 g03"
end

Thanks! Theresa

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35651

21 Aug 2024, 08:42

Code:

* Example generated by -dataex-. For more info, type help dataex 
clear 
input str10(generic_id drug gene)
"0001" "a" "g01"
"0001" "a" "g02"
"0001" "b" "g01"
"0001" "c" "g03"
"0001" "d" "g08"
"0002" "a" "g01"
"0002" "b" "g02"
"0002" "b" "g03"
"0002" "f" "g08"
"0002" "f" "g09"
end

bysort generic_id drug (gene) : gen genes_drug = gene if _n == 1 
by generic_id drug : replace genes_drug = genes_drug[_n-1] + " " + gene if _n > 1 
by generic_id drug : keep if _n == _N 
drop gene 
reshape wide genes_drug, i(generic_id) j(drug) string 

foreach v of var genes_drug* {
    replace `v' = "NA" if missing(`v')
}

list

Stata ascribes no special meaning to NA. If you need it for some other purpose, that's fine, but in Stata it is sometimes just a nuisance.

Comment

Theresa Pfu

Join Date: Nov 2023

Posts: 8
#5

22 Aug 2024, 05:18

Amazing this works great thanks a lot!
Comment
Theresa Pfu

Join Date: Nov 2023

Posts: 8
#6

30 Aug 2024, 06:30

In an addition to what I described above, the code you provided worked fine Nick. However I realized that this way the seperate strings that are concatenated in the new variable might repeat themselves if e.g. multiple observations in the long format dataset had the same gene and same drug. See an example below:
if my initial long dataset looks like this

Code:

clear input str10(generic_id drug gene) "0001" "a" "g01" "0001" "a" "g02" "0001" "a" "g01" "0001" "b" "g03" "0001" "b" "g03" "0002" "b" "g01" "0002" "c" "g02" "0002" "c" "g02" "0002" "c" "g08" "0002" "c" "g08" end

and I then use Nick's code to create new variables and reshape

Code:

bysort generic_id drug (gene) : gen genes_drug = gene if _n == 1 by generic_id drug : replace genes_drug = genes_drug[_n-1] + " " + gene if _n > 1 by generic_id drug : keep if _n == _N drop gene reshape wide genes_drug, i(generic_id) j(drug) string foreach v of var genes_drug* { replace `v' = "NA" if missing(`v') } list

for example my variable genes_druga for participant 1 will name g01 twice - how could I avoid those duplicate strings?

Thanks!
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35651

30 Aug 2024, 11:27

Several ways. You could use duplicates to get rid of duplicates temporarily and merge back at the end.

Here's one way to do it otherwise.

Code:

clear
input str10(generic_id drug gene)
"0001" "a" "g01"
"0001" "a" "g02"
"0001" "a" "g01"
"0001" "b" "g03"
"0001" "b" "g03"
"0002" "b" "g01"
"0002" "c" "g02"
"0002" "c" "g02"
"0002" "c" "g08"
"0002" "c" "g08"
end

gen long seq = _n 

bysort generic_id drug gene: gen first = _n == 1

bysort first generic_id drug (gene): gen wanted = gene if first & _n == 1 

by first generic_id drug: replace wanted = wanted[_n-1] + " " + gene if first & _n > 1 

by first generic_id drug: replace wanted = trim(wanted[_N])

bysort generic_id drug (wanted) : replace wanted = wanted[_N]

drop first 

sort seq 

list, sepby(generic_id drug)

     +----------------------------------------+
     | generi~d   drug   gene   seq    wanted |
     |----------------------------------------|
  1. |     0001      a    g01     1   g01 g02 |
  2. |     0001      a    g02     2   g01 g02 |
  3. |     0001      a    g01     3   g01 g02 |
     |----------------------------------------|
  4. |     0001      b    g03     4       g03 |
  5. |     0001      b    g03     5       g03 |
     |----------------------------------------|
  6. |     0002      b    g01     6       g01 |
     |----------------------------------------|
  7. |     0002      c    g02     7   g02 g08 |
  8. |     0002      c    g02     8   g02 g08 |
  9. |     0002      c    g08     9   g02 g08 |
 10. |     0002      c    g08    10   g02 g08 |
     +----------------------------------------+

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35651

30 Aug 2024, 14:35

Here's another way to do it.

Code:

clear
input str10(generic_id drug gene)
"0001" "a" "g01"
"0001" "a" "g02"
"0001" "a" "g01"
"0001" "b" "g03"
"0001" "b" "g03"
"0002" "b" "g01"
"0002" "c" "g02"
"0002" "c" "g02"
"0002" "c" "g08"
"0002" "c" "g08"
end

bysort generic_id drug : gen wanted = gene if _n == 1 
by generic_id drug : replace wanted = wanted[_n-1] + cond(strpos(wanted[_n-1], gene) == 0, " " + gene, "") if _n > 1  
by generic_id drug : replace wanted = wanted[_N]

tabdisp generic_id drug, c(wanted)

-------------------------------------
generic_i |           drug           
d         |       a        b        c
----------+--------------------------
     0001 | g01 g02      g03         
     0002 |              g01  g02 g08
-------------------------------------

Comment

Theresa Pfu

Join Date: Nov 2023

Posts: 8
#9

04 Sep 2024, 08:23

Thanks Nick. I've used your first option that worked well - in order to reshape I had to drop the observations

Code:

first==0
Comment

Announcement