I have data two datasets that have a common variable. I would like to merge the two datasets using this common variable and only keep observations included in the first dataset.
The issue is that in none of the datasets the common variable identifies the rows. As I know that merge m:m is not recommended, I usually reshape one of the datasets in a way that I can use a m:1 or 1:m merge. I show this below.
My issue now is that my data is massive which makes the code below unfeasible. Are there any alternatives to this that would be feasible with large datasets?
The issue is that in none of the datasets the common variable identifies the rows. As I know that merge m:m is not recommended, I usually reshape one of the datasets in a way that I can use a m:1 or 1:m merge. I show this below.
My issue now is that my data is massive which makes the code below unfeasible. Are there any alternatives to this that would be feasible with large datasets?
Code:
clear
input id hospid hospid_peer
1 1000 101
1 1000 102
1 1001 101
2 1002 102
end
tempfile data1
save `data1', replace
clear
input hospid_peer peerid
101 3
101 4
102 3
1000 1
1001 1
1002 2
2000 7
2003 8
end
tempfile data2
save `data2', replace
use `data1', clear
preserve
use `data2', clear
bys hospid_peer: gen n = _n
reshape wide peerid, i(hospid_peer) j(n)
tempfile data1_temp
save `data1_temp'
restore
merge m:1 hospid_peer using `data1_temp', keep(master match) nogenerate

Comment