I have data two datasets that have a common variable. I would like to merge the two datasets using this common variable and only keep observations included in the first dataset.
The issue is that in none of the datasets the common variable identifies the rows. As I know that merge m:m is not recommended, I usually reshape one of the datasets in a way that I can use a m:1 or 1:m merge. I show this below.
My issue now is that my data is massive which makes the code below unfeasible. Are there any alternatives to this that would be feasible with large datasets?
The issue is that in none of the datasets the common variable identifies the rows. As I know that merge m:m is not recommended, I usually reshape one of the datasets in a way that I can use a m:1 or 1:m merge. I show this below.
My issue now is that my data is massive which makes the code below unfeasible. Are there any alternatives to this that would be feasible with large datasets?
Code:
clear input id hospid hospid_peer 1 1000 101 1 1000 102 1 1001 101 2 1002 102 end tempfile data1 save `data1', replace clear input hospid_peer peerid 101 3 101 4 102 3 1000 1 1001 1 1002 2 2000 7 2003 8 end tempfile data2 save `data2', replace use `data1', clear preserve use `data2', clear bys hospid_peer: gen n = _n reshape wide peerid, i(hospid_peer) j(n) tempfile data1_temp save `data1_temp' restore merge m:1 hospid_peer using `data1_temp', keep(master match) nogenerate
Comment