Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two datasets in which the first dataset has the data as variables (numeric) and the other has it as observations

    Hi

    I am struggling to merge some costs to a dataset - I'm sure the answer is very easy, but I can't quite worked it out.

    I have a dataset where each observation is an individual person & the variables show whether that person has received a specific antibiotic (yes=1, no=2). There is another variable for "other" with a string response in name_other.

    In a separate table, I have the costs of the antibiotics, which vary by age (ages 0-5).

    What I want to do is, for each antibiotic in the original database, to generate a new column which contains the cost of the antibiotic (appropriate for the age of the child) if they received the antibiotic (& missing otherwise).

    I've tried reshaping the data & generating a string variable with the name of the antibiotic for matching with the cost data. But then I get lost again reshaping back if there are individuals who did not receive any antibiotics as these get dropped.

    Do you have any advice on how to proceed? I've pasted some dataex examples below of how the 2 datasets currently look & how I would like the final dataset to look.

    Thank you for any help!
    Bryony



    This is an example of the original dataset to add costs to:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(id age amoxi ceftri ampicillin ery cotri metro other) str10 name_other
     1 2 2 1 2 1 1 1 1 "doxy" 
     2 2 2 2 2 2 2 2 2 ""     
     3 1 2 1 1 1 2 1 2 ""     
     4 0 1 1 2 2 1 1 2 ""     
     5 2 2 2 2 2 1 1 2 ""     
     6 0 2 1 2 1 2 1 1 "cipro"
     7 2 2 1 2 2 1 1 2 ""     
     8 4 1 1 2 1 2 2 2 ""     
     9 1 1 2 1 2 1 2 2 ""     
    10 4 1 2 2 2 2 2 1 "cipro"
    11 2 1 1 2 2 2 2 1 "doxy" 
    12 3 1 1 1 2 2 2 2 ""     
    13 0 1 1 1 1 1 2 1 "doxy" 
    14 5 2 2 1 1 1 1 2 ""     
    15 5 1 2 2 1 2 2 1 "doxy" 
    end
    This is how the cost data looks:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10 name byte age int cost
    "amoxi"      0   30
    "amoxi"      1   30
    "amoxi"      2   60
    "amoxi"      3   60
    "amoxi"      4   60
    "amoxi"      5   60
    "ampicillin" 0   50
    "ampicillin" 1   50
    "ampicillin" 2   70
    "ampicillin" 3   70
    "ampicillin" 4   70
    "ampicillin" 5   70
    "ceftri"     0  100
    "ceftri"     1  100
    "ceftri"     2  100
    "ceftri"     3  100
    "ceftri"     4  100
    "ceftri"     5  100
    "cotri"      0   50
    "cotri"      1   50
    "cotri"      2   50
    "cotri"      3  100
    "cotri"      4  100
    "cotri"      5  100
    "ery"        0  200
    "ery"        1  200
    "ery"        2  200
    "ery"        3  200
    "ery"        4  250
    "ery"        5  250
    "metro"      0  660
    "metro"      1  660
    "metro"      2  660
    "metro"      3  660
    "metro"      4 1320
    "metro"      5 1320
    "cipro"      0  405
    "cipro"      1  405
    "cipro"      2  510
    "cipro"      3  510
    "cipro"      4  510
    "cipro"      5  510
    "doxy"       0   30
    "doxy"       1   30
    "doxy"       2   30
    "doxy"       3   50
    "doxy"       4   50
    "doxy"       5   50
    end

    & this is how I want the final database to look:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(id age amoxi) int amoxi_cost float ceftri int ceftri_cost float ampicillin int ampicillin_cost float ery int ery_cost float cotri int cotri_cost float metro int metro_cost float other str5 name_other int other_cost
     1 2 2  . 1 100 2  . 1 200 1  50 1  660 2 ""        .
     2 2 2  . 2   . 2  . 2   . 2   . 2    . 2 ""        .
     3 1 2  . 1 100 1 50 1 200 2   . 1  660 2 ""        .
     4 0 1 30 1 100 2  . 2   . 1  50 1  660 2 ""        .
     5 2 2  . 2   . 2  . 2   . 1  50 1  660 2 ""        .
     6 0 2  . 1 100 2  . 1 200 2   . 1  660 1 "cipro" 405
     7 2 2  . 1 100 2  . 2   . 1  50 1  660 2 ""        .
     8 4 1 60 1 100 2  . 1 250 2   . 2    . 2 ""        .
     9 1 1 30 2   . 1 50 2   . 1  50 2    . 2 ""        .
    10 4 1 60 2   . 2  . 2   . 2   . 2    . 1 "cipro" 510
    11 2 1 60 1 100 2  . 2   . 2   . 2    . 1 "doxy"   30
    12 3 1 60 1 100 1 70 2   . 2   . 2    . 2 ""        .
    13 0 1 30 1 100 1 50 1 200 1  50 2    . 1 "doxy"   30
    14 5 2  . 2   . 1 70 1 250 1 100 1 1320 2 ""        .
    15 5 1 60 2   . 2  . 1 250 2   . 2    . 1 "doxy"   50
    end

  • #2
    First I would make the top table longer so each individual has a different observation (row) for each drug they took, this will facilitate matching on one drug at a time between the two tables. That will look something like

    Code:
    renvars amoxi-other, prefix("drug_")
    
    reshape long drug_, i(id) j(drug)
    
    drop if drug_ == 2
    
    replace name_other = "" if drug != "other"
    Then I would match the tables on drug and age, and reshape it to be wide.

    Comment


    • #3
      This problem is difficult for two reasons. On the one hand, you are working with the patient data in wide layout. That seldom works well in Stata. And the other difficulty arises from the "sporadic" drugs that are lumped as "other." If we create new variables for each sporadic drug, and then go to long layout in the patient data, everything after that is straightforward:

      Code:
       * Example generated by -dataex-. For more info, type help dataex
      clear
      input str10 name byte age int cost
      "amoxi"      0   30
      "amoxi"      1   30
      "amoxi"      2   60
      "amoxi"      3   60
      "amoxi"      4   60
      "amoxi"      5   60
      "ampicillin" 0   50
      "ampicillin" 1   50
      "ampicillin" 2   70
      "ampicillin" 3   70
      "ampicillin" 4   70
      "ampicillin" 5   70
      "ceftri"     0  100
      "ceftri"     1  100
      "ceftri"     2  100
      "ceftri"     3  100
      "ceftri"     4  100
      "ceftri"     5  100
      "cotri"      0   50
      "cotri"      1   50
      "cotri"      2   50
      "cotri"      3  100
      "cotri"      4  100
      "cotri"      5  100
      "ery"        0  200
      "ery"        1  200
      "ery"        2  200
      "ery"        3  200
      "ery"        4  250
      "ery"        5  250
      "metro"      0  660
      "metro"      1  660
      "metro"      2  660
      "metro"      3  660
      "metro"      4 1320
      "metro"      5 1320
      "cipro"      0  405
      "cipro"      1  405
      "cipro"      2  510
      "cipro"      3  510
      "cipro"      4  510
      "cipro"      5  510
      "doxy"       0   30
      "doxy"       1   30
      "doxy"       2   30
      "doxy"       3   50
      "doxy"       4   50
      "doxy"       5   50
      end
      tempfile drug_costs
      save `drug_costs'
      
      
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float(id age amoxi ceftri ampicillin ery cotri metro other) str10 name_other
       1 2 2 1 2 1 1 1 1 "doxy"
       2 2 2 2 2 2 2 2 2 ""    
       3 1 2 1 1 1 2 1 2 ""    
       4 0 1 1 2 2 1 1 2 ""    
       5 2 2 2 2 2 1 1 2 ""    
       6 0 2 1 2 1 2 1 1 "cipro"
       7 2 2 1 2 2 1 1 2 ""    
       8 4 1 1 2 1 2 2 2 ""    
       9 1 1 2 1 2 1 2 2 ""    
      10 4 1 2 2 2 2 2 1 "cipro"
      11 2 1 1 2 2 2 2 1 "doxy"
      12 3 1 1 1 2 2 2 2 ""    
      13 0 1 1 1 1 1 2 1 "doxy"
      14 5 2 2 1 1 1 1 2 ""    
      15 5 1 2 2 1 2 2 1 "doxy"
      end
      
      //  CREATE NEW VARIABLES FOR THE "OTHER" DRUGS
      levelsof name_other, local(misc)
      foreach m of local misc {
          gen byte `m' = 1 if name_other == `"`m'"'
          replace `m' = 2 if missing(`m')
      }
      drop other name_other
      ds id age, not
      local drugs `r(varlist)'
      rename (`drugs') got_it=
      reshape long got_it, i(id) j(name) string
      merge m:1 name age using `drug_costs', assert(match using) nogenerate
      replace cost = . if got_it == 2
      
      //  IF YOU HAVE A CONVINCING REASON TO REVERT TO WIDE LAYOUT, THEN DO THE FOLLOWING:
      reshape wide got_it @cost, i(id) j(name) string
      rename got_it* *
      rename *cost *_cost
      As hinted above, there are few situations in Stata that work well with wide data. Unless you know for a fact that you will be doing one of those few things next, I advise you to leave the data in long layout. However, if you do have good reason to go back to a wide data set the last several lines of code above show you how to do it. But, again, I recommend against that.

      One other thing about your data. Although it is perfectly legal to code yes/no as 1/2, it is not convenient to work with that in Stata. I have not changed that in the code above. But I urge you to consider recoding those variables to 1/0. There are many situations in Stata where the use of 1/0 for yes/no will enable you to write cleaner, more compact code. For example, in the code above, if those variables were coded 1/0 we could simplify
      Code:
          gen byte `m' = 1 if name_other == `"`m'"'
          replace `m' = 2 if missing(`m')
      to the shorter and clearer
      Code:
          gen byte `m' = (name_other == `"`m'"')
      Added: Crossed with #2.

      Comment


      • #4
        This is brilliantly helpful - thank you both for taking your time to explain

        Comment

        Working...
        X