Is there a way to efficiently perform a "fuzzy" merge between two large datasets? Emphasis on large datasets (see below). For example, I have two datasets like this
and I want to consider the following criteria, in order:
Does Stata have any support for this at all? Or is Stata simply not the tool for data work like this?
Code:
* Dataset 1 clear input str3 mortgage_id long(balance zip_code) byte open_date "AA1" 200000 12345 1 "AA2" 150000 68593 4 "AA3" 98000 39825 2 "AA4" 176400 57924 6 end
Code:
* Dataset 2 clear input str3 mortgage_id long(balance zip_code) byte open_date double interest_rate "XX1" 200000 12345 1 7.6 "YY1" 151000 68593 3 8.1 "YY2" 149000 68593 4 9.2 "YY3" 345600 20886 7 4.3 "ZZ1" 175900 57924 6 11.1 "ZZ2" 176900 57924 6 5.5 end
- A mortgage in the first dataset is considered a match to a mortgage in the second dataset if their balances are within $1,000 of each other. In my example, mortgage AA1 matches precisely one mortgage in the second dataset: XX1.
- If a mortgage in dataset 1 matches multiple mortgages in dataset 2 based on their balances, e.g. AA2 matches both YY1 and YY2, then the open_date and zip_code should be used to break ties.
- Whether or not ties exist, stop. In this example, AA4 matches both ZZ1 and ZZ2, but that's ok. Consider AA4 matched to both of these mortgages.
Does Stata have any support for this at all? Or is Stata simply not the tool for data work like this?
Comment