Merging two datasets in which the first dataset has the data as variables (numeric) and the other has it as observations

Bryony Simmons

Join Date: Jan 2018
Posts: 37

Merging two datasets in which the first dataset has the data as variables (numeric) and the other has it as observations

22 Sep 2021, 11:29

Hi

I am struggling to merge some costs to a dataset - I'm sure the answer is very easy, but I can't quite worked it out.

I have a dataset where each observation is an individual person & the variables show whether that person has received a specific antibiotic (yes=1, no=2). There is another variable for "other" with a string response in name_other.

In a separate table, I have the costs of the antibiotics, which vary by age (ages 0-5).

What I want to do is, for each antibiotic in the original database, to generate a new column which contains the cost of the antibiotic (appropriate for the age of the child) if they received the antibiotic (& missing otherwise).

I've tried reshaping the data & generating a string variable with the name of the antibiotic for matching with the cost data. But then I get lost again reshaping back if there are individuals who did not receive any antibiotics as these get dropped.

Do you have any advice on how to proceed? I've pasted some dataex examples below of how the 2 datasets currently look & how I would like the final dataset to look.

Thank you for any help!
Bryony

This is an example of the original dataset to add costs to:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(id age amoxi ceftri ampicillin ery cotri metro other) str10 name_other
 1 2 2 1 2 1 1 1 1 "doxy" 
 2 2 2 2 2 2 2 2 2 ""     
 3 1 2 1 1 1 2 1 2 ""     
 4 0 1 1 2 2 1 1 2 ""     
 5 2 2 2 2 2 1 1 2 ""     
 6 0 2 1 2 1 2 1 1 "cipro"
 7 2 2 1 2 2 1 1 2 ""     
 8 4 1 1 2 1 2 2 2 ""     
 9 1 1 2 1 2 1 2 2 ""     
10 4 1 2 2 2 2 2 1 "cipro"
11 2 1 1 2 2 2 2 1 "doxy" 
12 3 1 1 1 2 2 2 2 ""     
13 0 1 1 1 1 1 2 1 "doxy" 
14 5 2 2 1 1 1 1 2 ""     
15 5 1 2 2 1 2 2 1 "doxy" 
end

This is how the cost data looks:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 name byte age int cost
"amoxi"      0   30
"amoxi"      1   30
"amoxi"      2   60
"amoxi"      3   60
"amoxi"      4   60
"amoxi"      5   60
"ampicillin" 0   50
"ampicillin" 1   50
"ampicillin" 2   70
"ampicillin" 3   70
"ampicillin" 4   70
"ampicillin" 5   70
"ceftri"     0  100
"ceftri"     1  100
"ceftri"     2  100
"ceftri"     3  100
"ceftri"     4  100
"ceftri"     5  100
"cotri"      0   50
"cotri"      1   50
"cotri"      2   50
"cotri"      3  100
"cotri"      4  100
"cotri"      5  100
"ery"        0  200
"ery"        1  200
"ery"        2  200
"ery"        3  200
"ery"        4  250
"ery"        5  250
"metro"      0  660
"metro"      1  660
"metro"      2  660
"metro"      3  660
"metro"      4 1320
"metro"      5 1320
"cipro"      0  405
"cipro"      1  405
"cipro"      2  510
"cipro"      3  510
"cipro"      4  510
"cipro"      5  510
"doxy"       0   30
"doxy"       1   30
"doxy"       2   30
"doxy"       3   50
"doxy"       4   50
"doxy"       5   50
end

& this is how I want the final database to look:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(id age amoxi) int amoxi_cost float ceftri int ceftri_cost float ampicillin int ampicillin_cost float ery int ery_cost float cotri int cotri_cost float metro int metro_cost float other str5 name_other int other_cost
 1 2 2  . 1 100 2  . 1 200 1  50 1  660 2 ""        .
 2 2 2  . 2   . 2  . 2   . 2   . 2    . 2 ""        .
 3 1 2  . 1 100 1 50 1 200 2   . 1  660 2 ""        .
 4 0 1 30 1 100 2  . 2   . 1  50 1  660 2 ""        .
 5 2 2  . 2   . 2  . 2   . 1  50 1  660 2 ""        .
 6 0 2  . 1 100 2  . 1 200 2   . 1  660 1 "cipro" 405
 7 2 2  . 1 100 2  . 2   . 1  50 1  660 2 ""        .
 8 4 1 60 1 100 2  . 1 250 2   . 2    . 2 ""        .
 9 1 1 30 2   . 1 50 2   . 1  50 2    . 2 ""        .
10 4 1 60 2   . 2  . 2   . 2   . 2    . 1 "cipro" 510
11 2 1 60 1 100 2  . 2   . 2   . 2    . 1 "doxy"   30
12 3 1 60 1 100 1 70 2   . 2   . 2    . 2 ""        .
13 0 1 30 1 100 1 50 1 200 1  50 2    . 1 "doxy"   30
14 5 2  . 2   . 1 70 1 250 1 100 1 1320 2 ""        .
15 5 1 60 2   . 2  . 1 250 2   . 2    . 1 "doxy"   50
end

Tags: datasets, merge, observation, variable

Jackson Monroe

Join Date: Jul 2019

Posts: 60
#2

22 Sep 2021, 12:07

First I would make the top table longer so each individual has a different observation (row) for each drug they took, this will facilitate matching on one drug at a time between the two tables. That will look something like

Code:

renvars amoxi-other, prefix("drug_") reshape long drug_, i(id) j(drug) drop if drug_ == 2 replace name_other = "" if drug != "other"

Then I would match the tables on drug and age, and reshape it to be wide.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30050

22 Sep 2021, 12:28

This problem is difficult for two reasons. On the one hand, you are working with the patient data in wide layout. That seldom works well in Stata. And the other difficulty arises from the "sporadic" drugs that are lumped as "other." If we create new variables for each sporadic drug, and then go to long layout in the patient data, everything after that is straightforward:

Code:

 * Example generated by -dataex-. For more info, type help dataex
clear
input str10 name byte age int cost
"amoxi"      0   30
"amoxi"      1   30
"amoxi"      2   60
"amoxi"      3   60
"amoxi"      4   60
"amoxi"      5   60
"ampicillin" 0   50
"ampicillin" 1   50
"ampicillin" 2   70
"ampicillin" 3   70
"ampicillin" 4   70
"ampicillin" 5   70
"ceftri"     0  100
"ceftri"     1  100
"ceftri"     2  100
"ceftri"     3  100
"ceftri"     4  100
"ceftri"     5  100
"cotri"      0   50
"cotri"      1   50
"cotri"      2   50
"cotri"      3  100
"cotri"      4  100
"cotri"      5  100
"ery"        0  200
"ery"        1  200
"ery"        2  200
"ery"        3  200
"ery"        4  250
"ery"        5  250
"metro"      0  660
"metro"      1  660
"metro"      2  660
"metro"      3  660
"metro"      4 1320
"metro"      5 1320
"cipro"      0  405
"cipro"      1  405
"cipro"      2  510
"cipro"      3  510
"cipro"      4  510
"cipro"      5  510
"doxy"       0   30
"doxy"       1   30
"doxy"       2   30
"doxy"       3   50
"doxy"       4   50
"doxy"       5   50
end
tempfile drug_costs
save `drug_costs'


* Example generated by -dataex-. For more info, type help dataex
clear
input float(id age amoxi ceftri ampicillin ery cotri metro other) str10 name_other
 1 2 2 1 2 1 1 1 1 "doxy"
 2 2 2 2 2 2 2 2 2 ""    
 3 1 2 1 1 1 2 1 2 ""    
 4 0 1 1 2 2 1 1 2 ""    
 5 2 2 2 2 2 1 1 2 ""    
 6 0 2 1 2 1 2 1 1 "cipro"
 7 2 2 1 2 2 1 1 2 ""    
 8 4 1 1 2 1 2 2 2 ""    
 9 1 1 2 1 2 1 2 2 ""    
10 4 1 2 2 2 2 2 1 "cipro"
11 2 1 1 2 2 2 2 1 "doxy"
12 3 1 1 1 2 2 2 2 ""    
13 0 1 1 1 1 1 2 1 "doxy"
14 5 2 2 1 1 1 1 2 ""    
15 5 1 2 2 1 2 2 1 "doxy"
end

//  CREATE NEW VARIABLES FOR THE "OTHER" DRUGS
levelsof name_other, local(misc)
foreach m of local misc {
    gen byte `m' = 1 if name_other == `"`m'"'
    replace `m' = 2 if missing(`m')
}
drop other name_other
ds id age, not
local drugs `r(varlist)'
rename (`drugs') got_it=
reshape long got_it, i(id) j(name) string
merge m:1 name age using `drug_costs', assert(match using) nogenerate
replace cost = . if got_it == 2

//  IF YOU HAVE A CONVINCING REASON TO REVERT TO WIDE LAYOUT, THEN DO THE FOLLOWING:
reshape wide got_it @cost, i(id) j(name) string
rename got_it* *
rename *cost *_cost

As hinted above, there are few situations in Stata that work well with wide data. Unless you know for a fact that you will be doing one of those few things next, I advise you to leave the data in long layout. However, if you do have good reason to go back to a wide data set the last several lines of code above show you how to do it. But, again, I recommend against that.

One other thing about your data. Although it is perfectly legal to code yes/no as 1/2, it is not convenient to work with that in Stata. I have not changed that in the code above. But I urge you to consider recoding those variables to 1/0. There are many situations in Stata where the use of 1/0 for yes/no will enable you to write cleaner, more compact code. For example, in the code above, if those variables were coded 1/0 we could simplify

Code:

    gen byte `m' = 1 if name_other == `"`m'"'
    replace `m' = 2 if missing(`m')

to the shorter and clearer

Code:

    gen byte `m' = (name_other == `"`m'"')

Added: Crossed with #2.

Comment

Bryony Simmons

Join Date: Jan 2018

Posts: 37
#4

24 Sep 2021, 08:16

This is brilliantly helpful - thank you both for taking your time to explain
Comment

Announcement

Merging two datasets in which the first dataset has the data as variables (numeric) and the other has it as observations

Comment

Comment

Comment