Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Concatenating strings across observations based on ID + additional variable

    Hi everyone,

    I'm working with a WGS dataset that is long format with multiple observations per participant. Ultimately I need to reshape the dataset to a wide format so that I can append it with the clinical data of the participants. Before that I want to make some modifications and restructure the data.

    For example I want to work with 3 variables: generic_id (participant identifier), gene (specifies the gene located for each observation in the long data), and drug (specifies the antibiotic that the gene mutation conferred a resistance to), all 3 are strings.
    Based on the generic_id an the drug I would like to concatenate the genes (multiple genes can confer resistanceto the same drug, likewise one gene can confer resistance to multiple drugs)
    This way I'm hoping to have a variable that is the same for each drug within a particpant listing all genes responsible for conferring resistance to this drug.

    I have read this article: https://journals.sagepub.com/doi/pdf...36867X20909698
    and tried the following:

    Code:
    bysort generic_id(drug): generate gene2= gene[1]
    by generic_id: replace gene2= gene2[_n-1] + gene if _n>1
    by generic_id: replace gene2=gene[_N]
    But this does not work in my case. The number of observations for each generic_id is not constant in my dataset. Would appreciate any nudge or ideas of how to do this. Thanks!

  • #2
    Hello!

    Here is a sample data I created for you:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10(generic_id drug gene)
    "0001" "a" "g01"
    "0001" "a" "g02"
    "0001" "b" "g01"
    "0001" "c" "g03"
    "0001" "d" "g08"
    "0002" "a" "g01"
    "0002" "b" "g02"
    "0002" "b" "g03"
    "0002" "f" "g08"
    "0002" "f" "g09"
    end
    Do you mean you want something like this?

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10(generic_id drug gene_new)
    "0001" "a" "g01 g02"
    "0001" "b" "g01"    
    "0001" "c" "g03"    
    "0001" "d" "g08"    
    "0002" "a" "g01"    
    "0002" "b" "g02 g03"
    "0002" "f" "g08 g09"
    end
    If this is the case, try this:

    Code:
    bysort generic_id drug: gen seq = _n
    
    reshape wide gene, i(generic_id drug) j(seq)
    
    egen gene_new = concat(gene1-gene2), punct(" ")
    replace gene_new = trim(gene_new)
    If not, please post what you envision the end product would be.
    Last edited by Ken Chui; 20 Aug 2024, 08:05.

    Comment


    • #3
      Hi Ken Chui,

      Thanks a lot for your suggestion. Apologies for not being clear enough, using the data example you provided
      Code:
      * Example generated by -dataex-. For more info, type help dataex clear input str10(generic_id drug gene)
      "0001" "a" "g01"
      "0001" "a" "g02"
      "0001" "b" "g01"
      "0001" "c" "g03"
      "0001" "d" "g08"
      "0002" "a" "g01"
      "0002" "b" "g02"
      "0002" "b" "g03"
      "0002" "f" "g08"
      "0002" "f" "g09"
      end
      what I would actually want as an end product after reshaping is the following:

      Code:
      clear
      input str10(generic_id genes_druga genes_drugb genes_drugc genes_drugd genes_drugf)
      "0001" "g01 g02" "g01" "go3" "g08" "NA"
      "0002" "g01" "g02 g03" "NA" "NA" "g02 g03"
      end
      Thanks! Theresa

      Comment


      • #4
        Code:
        * Example generated by -dataex-. For more info, type help dataex 
        clear 
        input str10(generic_id drug gene)
        "0001" "a" "g01"
        "0001" "a" "g02"
        "0001" "b" "g01"
        "0001" "c" "g03"
        "0001" "d" "g08"
        "0002" "a" "g01"
        "0002" "b" "g02"
        "0002" "b" "g03"
        "0002" "f" "g08"
        "0002" "f" "g09"
        end
        
        bysort generic_id drug (gene) : gen genes_drug = gene if _n == 1 
        by generic_id drug : replace genes_drug = genes_drug[_n-1] + " " + gene if _n > 1 
        by generic_id drug : keep if _n == _N 
        drop gene 
        reshape wide genes_drug, i(generic_id) j(drug) string 
        
        foreach v of var genes_drug* {
            replace `v' = "NA" if missing(`v')
        }
        
        list
        Stata ascribes no special meaning to NA. If you need it for some other purpose, that's fine, but in Stata it is sometimes just a nuisance.

        Comment


        • #5
          Amazing this works great thanks a lot!

          Comment


          • #6
            In an addition to what I described above, the code you provided worked fine Nick. However I realized that this way the seperate strings that are concatenated in the new variable might repeat themselves if e.g. multiple observations in the long format dataset had the same gene and same drug. See an example below:
            if my initial long dataset looks like this

            Code:
            clear
            input str10(generic_id drug gene)
            "0001" "a" "g01"
            "0001" "a" "g02"
            "0001" "a" "g01"
            "0001" "b" "g03"
            "0001" "b" "g03"
            "0002" "b" "g01"
            "0002" "c" "g02"
            "0002" "c" "g02"
            "0002" "c" "g08"
            "0002" "c" "g08"
            end
            and I then use Nick's code to create new variables and reshape

            Code:
             
             bysort generic_id drug (gene) : gen genes_drug = gene if _n == 1  by generic_id drug : replace genes_drug = genes_drug[_n-1] + " " + gene if _n > 1  by generic_id drug : keep if _n == _N  drop gene  reshape wide genes_drug, i(generic_id) j(drug) string   foreach v of var genes_drug* {     replace `v' = "NA" if missing(`v') }  list
            for example my variable genes_druga for participant 1 will name g01 twice - how could I avoid those duplicate strings?

            Thanks!

            Comment


            • #7
              Several ways. You could use duplicates to get rid of duplicates temporarily and merge back at the end.

              Here's one way to do it otherwise.

              Code:
              clear
              input str10(generic_id drug gene)
              "0001" "a" "g01"
              "0001" "a" "g02"
              "0001" "a" "g01"
              "0001" "b" "g03"
              "0001" "b" "g03"
              "0002" "b" "g01"
              "0002" "c" "g02"
              "0002" "c" "g02"
              "0002" "c" "g08"
              "0002" "c" "g08"
              end
              
              gen long seq = _n 
              
              bysort generic_id drug gene: gen first = _n == 1
              
              bysort first generic_id drug (gene): gen wanted = gene if first & _n == 1 
              
              by first generic_id drug: replace wanted = wanted[_n-1] + " " + gene if first & _n > 1 
              
              by first generic_id drug: replace wanted = trim(wanted[_N])
              
              bysort generic_id drug (wanted) : replace wanted = wanted[_N]
              
              drop first 
              
              sort seq 
              
              list, sepby(generic_id drug)
              
                   +----------------------------------------+
                   | generi~d   drug   gene   seq    wanted |
                   |----------------------------------------|
                1. |     0001      a    g01     1   g01 g02 |
                2. |     0001      a    g02     2   g01 g02 |
                3. |     0001      a    g01     3   g01 g02 |
                   |----------------------------------------|
                4. |     0001      b    g03     4       g03 |
                5. |     0001      b    g03     5       g03 |
                   |----------------------------------------|
                6. |     0002      b    g01     6       g01 |
                   |----------------------------------------|
                7. |     0002      c    g02     7   g02 g08 |
                8. |     0002      c    g02     8   g02 g08 |
                9. |     0002      c    g08     9   g02 g08 |
               10. |     0002      c    g08    10   g02 g08 |
                   +----------------------------------------+

              Comment


              • #8
                Here's another way to do it.

                Code:
                clear
                input str10(generic_id drug gene)
                "0001" "a" "g01"
                "0001" "a" "g02"
                "0001" "a" "g01"
                "0001" "b" "g03"
                "0001" "b" "g03"
                "0002" "b" "g01"
                "0002" "c" "g02"
                "0002" "c" "g02"
                "0002" "c" "g08"
                "0002" "c" "g08"
                end
                
                bysort generic_id drug : gen wanted = gene if _n == 1 
                by generic_id drug : replace wanted = wanted[_n-1] + cond(strpos(wanted[_n-1], gene) == 0, " " + gene, "") if _n > 1  
                by generic_id drug : replace wanted = wanted[_N]
                
                tabdisp generic_id drug, c(wanted)
                
                -------------------------------------
                generic_i |           drug           
                d         |       a        b        c
                ----------+--------------------------
                     0001 | g01 g02      g03         
                     0002 |              g01  g02 g08
                -------------------------------------

                Comment


                • #9
                  Thanks Nick. I've used your first option that worked well - in order to reshape I had to drop the observations
                  Code:
                  first==0

                  Comment

                  Working...
                  X