  • Creating Dummy Variables from List of Drug Names

    I have a list of over 500 drug names, and I'm trying to write a loop to create a dummy variable for each individual drug. My data looks like this:

    Drugname
    brentuximab
    tadalafil
    riociguat
    riociguat
    iobenguane
    iobenguane
    everolimus
    ...
    I will then use those variables to search study abstracts with the strpos() function, flagging which studies evaluated which drugs. The current code is:

    gen brentuximab=0
    replace brentuximab = 1 if strpos(Abstract, "brentuximab")

    I would rather not have to write 500+ lines of code and switch out the drug name for each, but I can't figure out how to create a loop to do this. I'm definitely open to other ways of doing it if you have any suggestions.

    Thanks in advance!
    Last edited by Madison Smith; 10 Feb 2020, 09:41.

  • #2
    You could put your drug names in a local and loop over them:
    Code:
    local drug_names "brentuximab tadalafil riociguat iobenguane etc"
    foreach drug in `drug_names' {
        gen `drug' = strpos(Abstract, "`drug'") > 0
    }



    • #3
      Thank you, that's exactly what I want it to do! Is there a way to assign the values to a local without having to type them all out in quotes? It gets complicated when there are drugs with symbols or more than one word. Plus, a list of all the drug names takes up 70 lines.

      I was trying to play around with the levelsof command but can't quite get it.

      levelsof drugname, local(levels)
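
      For reference, a minimal sketch of how the levelsof result can be looped over, assuming a string variable named drugname. The toy data and the loop variable d are made up for illustration. levelsof stores each distinct string value in compound double quotes, so looping with "foreach ... of local" keeps a multi-word name together as one element.

      ```stata
      clear
      input str11 drugname
      "brentuximab"
      "tadalafil"
      "riociguat"
      "riociguat"
      "two words"
      end

      * Store the distinct drug names in the local macro levels
      levelsof drugname, local(levels)

      * Each pass through the loop sees one complete drug name
      foreach d of local levels {
          display `"`d'"'
      }
      ```

      Inside the loop, `d' holds the drug name as text. Whether it can also serve as a variable name is a separate question.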



      • #4
        Let me point out that when the drug name has more than one word or symbols, the drug name will not be suitable for use as a variable name.
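
        One possible workaround, sketched here under assumptions: keep the original name for searching and derive a legal variable name from it with ustrtoname() (Stata 14 or later; strtoname() is the older counterpart). The Abstract data and the locals drug and v are made up for illustration.

        ```stata
        clear
        input str47 Abstract
        "first abstract with two words in it"
        "second abstract with nothing in it"
        end

        * The text to search for, which is not a legal variable name
        local drug "two words"

        * Derive a legal variable name from it
        local v = ustrtoname("`drug'", 1)

        * Indicator is 1 when the drug name appears in the abstract
        generate byte `v' = strpos(Abstract, "`drug'") > 0
        label variable `v' "`drug'"
        list, clean
        ```

        Two different drug names could map to the same variable name, so that is worth checking before running this over a long list.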



        • #5
          Oh true. Let's assume that I simplify my drug list to one word names. Any suggestions on how to put it in a local then?



          • #6
            Your approach commits you to 1000 lines of code, although

            Code:
            gen brentuximab= strpos(Abstract, "brentuximab") > 0
            indicates a way to cut that in half.

            More crucially,
            Code:
            tabulate, generate()
            will do this without a loop.

            That said, I would back up and tell us why you think you need this. Factor variable notation should help.
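
            A minimal sketch of the tabulate route, assuming a string variable drugname. The stub drug_ is an arbitrary choice. Note that this creates one indicator per distinct value of drugname in the current dataset. It does not search the abstracts.

            ```stata
            clear
            input str11 drugname
            "brentuximab"
            "tadalafil"
            "riociguat"
            "riociguat"
            "iobenguane"
            end

            * One indicator per distinct drug name: drug_1, drug_2, ...
            tabulate drugname, generate(drug_)

            * The variable labels record which drug each indicator stands for
            describe drug_*
            ```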



            • #7
              Now I have this code, but it stops after the 16th drug and says that too many variables are specified.

              Code:
              levelsof drugname, local(drug_names)
              foreach drugname in `drug_names' {
                  gen `drugname' = 0
                  replace `drugname' =1 if strpos(Abstract, "`drugname'")
              }
              My coworker suggested I do it this way, but I imagine that there must be a way to skip creating these dummy variables altogether. Basically, what I'm trying to do is search study abstracts for certain drugs. Then, I want to pull the PubMed ID that corresponds to each abstract that mentioned the drug and create a list of relevant PubMed IDs for each drug. My final data set will look something like this:
              drugname PMID1 PMID2 PMID3 etc
              brentuximab 1234567 2039475 2814578 ...
              tadalafil
              riociguat
              iobenguane
              everolimus
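
              For what it's worth, a layout like that might be reachable without any dummy variables by pairing every drug with every abstract and reshaping. The following is only a sketch under assumptions: the toy abstracts are made up, and the variable names PubMedID and Abstract are taken from elsewhere in this thread.

              ```stata
              * Toy abstracts (made up for illustration)
              clear
              input long PubMedID str40 Abstract
              1234567 "brentuximab compared with riociguat"
              2039475 "a trial of brentuximab"
              2814578 "nothing relevant here"
              end
              tempfile abstracts
              save `abstracts'

              * Drug list, one row per distinct drug
              clear
              input str11 drugname
              "brentuximab"
              "riociguat"
              "tadalafil"
              end
              duplicates drop drugname, force

              * Pair every drug with every abstract, keep only the matches
              cross using `abstracts'
              keep if strpos(Abstract, drugname) > 0
              keep drugname PubMedID

              * Number the matches within drug, then go wide
              bysort drugname (PubMedID): generate j = _n
              reshape wide PubMedID, i(drugname) j(j)
              list, clean
              ```

              The wide variables come out as PubMedID1, PubMedID2, and so on rather than PMID1, PMID2. Drugs with no matching abstract drop out at the keep step, and cross can get large with 500 drugs and many abstracts.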



              • #8
                #7 has already been answered. Drug names with embedded spaces won't qualify as legal variable names. You don't need to write a loop here. You need one statement, not two, even if you are determined to write a loop. Please study all the answers in the thread.



                • #9
                  I'm sorry, I don't quite understand what the code in your earlier comment does. It seems to create the binary variables I want, but I'm trying to avoid typing out all the drug names by hand.

                  I've simplified the drug names so they are only one word. My current code works, so it's just a matter of transforming the list of IDs into the PMID1 PMID2 PMID3 etc. variables by drug name.


                  Code:
                  list PubMedID if brentuximab==1



                  • #10
                    It is completely unclear to me what your target is. It appears to me you have
                    • a dataset with 500 drug names in it
                    • a dataset with abstracts in it
                    • a wish to add 500 indicator variables to the abstract dataset, each indicating whether the corresponding drug appears in the abstract
                    If I got that right, here's an approach that uses my favorite programming technique: having the program write the code it needs to execute.

                    Code:
                    clear all
                    cls
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input str11 drugname
                    "brentuximab"
                    "tadalafil"  
                    "riociguat"  
                    "riociguat"  
                    "iobenguane"
                    "iobenguane"
                    "everolimus"
                    "two words"
                    "@symbol"
                    end
                    
                    duplicates drop drugname, force
                    generate varname = ustrtoname(drugname,1)
                    // confirm that two drugnames don't yield the same varname
                    bysort varname (drugname): assert _N==1
                    
                    tempfile to_include
                    file open code using "`to_include'", write
                    forvalues d = 1/`c(N)' {
                        local drugname = drugname[`d']
                        local varname = varname[`d']
                        file write code `"generate `varname' = strpos(Abstract,`"`drugname'"')>0"' _newline
                        file write code `"label variable `varname' `"`drugname'"'"' _newline
                        }
                    file close code
                    type "`to_include'"
                    
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input str47 Abstract
                    "first abstract with @symbol and riociguat in it"
                    "second abstract with nothing in it"            
                    end
                    
                    include "`to_include'"
                    describe
                    list _symbol-two_words, clean abbreviate(12)
                    Here are selected bits of the output.
                    Code:
                    . type "`to_include'"
                    generate _symbol = strpos(Abstract,`"@symbol"')>0
                    label variable _symbol `"@symbol"'
                    generate brentuximab = strpos(Abstract,`"brentuximab"')>0
                    label variable brentuximab `"brentuximab"'
                    generate everolimus = strpos(Abstract,`"everolimus"')>0
                    label variable everolimus `"everolimus"'
                    generate iobenguane = strpos(Abstract,`"iobenguane"')>0
                    label variable iobenguane `"iobenguane"'
                    generate riociguat = strpos(Abstract,`"riociguat"')>0
                    label variable riociguat `"riociguat"'
                    generate tadalafil = strpos(Abstract,`"tadalafil"')>0
                    label variable tadalafil `"tadalafil"'
                    generate two_words = strpos(Abstract,`"two words"')>0
                    label variable two_words `"two words"'
                    Code:
                    . describe
                    
                    Contains data
                      obs:             2                          
                     vars:             8                          
                    ------------------------------------------------------------------------------------------------
                                  storage   display    value
                    variable name   type    format     label      variable label
                    ------------------------------------------------------------------------------------------------
                    Abstract        str47   %47s                  
                    _symbol         float   %9.0g                 @symbol
                    brentuximab     float   %9.0g                 brentuximab
                    everolimus      float   %9.0g                 everolimus
                    iobenguane      float   %9.0g                 iobenguane
                    riociguat       float   %9.0g                 riociguat
                    tadalafil       float   %9.0g                 tadalafil
                    two_words       float   %9.0g                 two words
                    ------------------------------------------------------------------------------------------------
                    Sorted by:
                         Note: Dataset has changed since last saved.
                    
                    . list _symbol-two_words, clean abbreviate(12)
                    
                           _symbol   brentuximab   everolimus   iobenguane   riociguat   tadalafil   two_words  
                      1.         1             0            0            0           1           0           0  
                      2.         0             0            0            0           0           0           0
                    Last edited by William Lisowski; 10 Feb 2020, 13:17.
