merge with variables which can vary a little bite

kirikakou

Join Date: Sep 2014

Posts: 1
#1

merge with variables which can vary a little bite

11 Sep 2014, 03:33

Hi,
I'm looking how to merge 2 files with common variables which can vary (but only as i define the variation)

Ex: the birth weight can be accepted with a variation of 100g or an age with a variation of 1 year.

In advance, Thank you for your help
Tags: None
David Poensgen

Join Date: Jun 2014

Posts: 27
#2

11 Sep 2014, 05:49

One approach would be to merge the datasets "unconditionally", i.e. generate a dummy with value 1 in both, and then use joinby to "merge" on that dummy:

in both datasets:
gen dummy = 1

in file1.dta:
joinby dummy using file2.dta

Now you have all possible combinations of observations between the datasets (the result has n*m observations - might be problematic if both files have many observations to begin with. I have no clue how this solutions performs with large files).

Then check the difference between the variables you want the merge to be conditional on (say age1 is age in the first, age2 age in the second dataset).
Now you drop all combinations where the absolute value of that difference exceeds your tolerance limit:

drop if abs(age1-age2)>1

/Edit:

Oh, I should add that you should make sure that there are no identically named variables in the datasets, as otherwise the command will just keep the the one from file1. You can use ren before creating the dummy in file2 to ensure this doesn't happen, for example:

ren * *_2

(then go on as above)

Last edited by David Poensgen; 11 Sep 2014, 05:57.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

11 Sep 2014, 08:25

I'm not sure this calls for a joinby. If the age or birth weight can vary a bit, then presumably they refer to the same person. And presumably the datasets contains identifiers variables that uniquely identify each person. If that's the case, then all that is needed is to rename variables and merge both version. You can then do variable-to-variable comparisons. There are some user-written programs on SSC that will handle the merge but a simple example can show how to do this from first principles:

Code:

sysuse auto, clear

* make sure you have a variable that uniquely idenfies observations
isid make, sort

* make a few changes
replace price = price + _n if mod(_n,2)
replace trunk = trunk + _n if mod(_n,2)

* change the number of observations
drop if price > 10000

* rename variables that may be common in both datasets (help rename group)
rename price-foreign =_

* save the mock data with differences
tempfile auto2
save "`auto2'"

* go back to where we started and compare both datasets
sysuse auto, clear
merge 1:1 make using "`auto2'"

* Do a variable-to-variable comparison and decide what to do
list price price_ if abs(price - price_) / price > .01 & _merge == 3
list trunk trunk_ if abs(trunk - trunk_) / trunk > 5 & _merge == 3

Comment

Phil Clayton

Join Date: Mar 2014

Posts: 50
#4

11 Sep 2014, 18:49

Note also that rather than using joinby with a dummy variable set to 1, you could instead use cross
Comment

Announcement