Fuzzy merge on several numeric variables

paulvonhippel

Join Date: Apr 2014

Posts: 502
#1

07 Jan 2021, 12:09

I would like to do a 1:1 merge using two lists of 80 schools. There are about 10 variables to merge on, all numeric, such as number of students, % of students who are black, etc. The merge variables do not match perfectly, so it is a fuzzy merge problem.

One possible solution is to find the merge that, across matched pairs, minimizes the sum of the Mahalanobis distances between the merging variables. Is there a Stata command that implements this or something similar?
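
To make the idea concrete, here is a minimal sketch in Python rather than Stata. It is not an existing Stata command. It assumes a simplified distance (each variable divided by its pooled standard deviation, which is the Mahalanobis distance with the covariances between variables set to zero) and uses the Hungarian algorithm to find the 1:1 pairing with the smallest total distance. The function names and the toy data are hypothetical.

```python
import math
import statistics


def scaled_cost_matrix(a_rows, b_rows):
    """Distance from every A row to every B row after dividing each
    variable by its pooled SD (assumption: covariances are ignored)."""
    k = len(a_rows[0])
    sds = [statistics.pstdev([r[c] for r in a_rows + b_rows]) or 1.0
           for c in range(k)]
    return [[math.sqrt(sum(((a[c] - b[c]) / sds[c]) ** 2 for c in range(k)))
             for b in b_rows] for a in a_rows]


def optimal_assignment(cost):
    """Hungarian algorithm. Returns match, where match[i] is the B row
    paired with A row i, minimizing the sum of cost[i][match[i]].
    Requires len(cost) <= len(cost[0])."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (m + 1)          # column potentials
    p = [0] * (m + 1)            # p[j] = row currently assigned to column j
    way = [0] * (m + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                # flip assignments along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match


# Toy data (made up): columns are enrollment and % Black.
list_a = [[500, 40.0], [300, 10.0]]
list_b = [[310, 12.0], [495, 41.0]]
pairs = optimal_assignment(scaled_cost_matrix(list_a, list_b))
```

With 80 schools per list the cost matrix is only 80 x 80, so an exact solution of this kind is cheap. A full Mahalanobis distance would additionally need the inverse covariance matrix of the merge variables.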

The Stata commands that I know for fuzzy merging are designed for different problems and would not work for mine (I think):
-matchit- and -reclink- merge on strings, but I want to match on numeric variables.
-nearmrg- and -rangejoin- merge on a numeric variable, but only one. I want to merge on several numeric variables.
Note that close numeric matches are not necessarily close string matches, or vice versa. For example 59 and 60 are similar numbers but not similar strings.
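
A small illustration of that last point, sketched in Python. It uses `difflib.SequenceMatcher` purely as a stand-in for a string-similarity score; it is not the metric that -matchit- or -reclink- use.

```python
from difflib import SequenceMatcher


def string_similarity(x, y):
    """Similarity between the text forms of two numbers, from 0 to 1."""
    return SequenceMatcher(None, str(x), str(y)).ratio()


# 59 and 60 differ by 1 but share no characters.
close_numbers = string_similarity(59, 60)
# 59 and 590 differ by 531 but share the characters "59".
distant_numbers = string_similarity(59, 590)
```

Here the numerically close pair gets the lower string score, which is why a string matcher applied to numbers tends to pick the wrong partner.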

Many thanks if you can alert me to a command that I have not found yet.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

07 Jan 2021, 12:37

I'd look at -ultimatch- or -calipmatch-, both available at -ssc-. I've used the second one for matching and found it easy to use, but -ultimatch- (which I haven't tried) looks to have more capabilities. I have not tried a matching program for a purpose like yours, but I should think they would work.
paulvonhippel

Join Date: Apr 2014

Posts: 502
#3

07 Jan 2021, 12:58

Interesting suggestion, Mike Lacy. Let me make sure I understand. Instead of doing a 1:1 merge between datasets A and B, I would append B to A, define cases from A as the "treatment" group and cases from B as the "control" group, and then use -ultimatch- to define matched pairs between treated units and control units. Then I could reshape the data to put each matched pair on a single line.
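
The data handling in that workflow (append, flag, then reshape to one line per pair) can be sketched as follows. This is Python, not Stata, and it assumes the matched pairs have already been produced by some matching step; the variable names `treated` and `source_row` and the `_a`/`_b` suffixes are hypothetical.

```python
def stack(a_rows, b_rows):
    """Append B to A, flagging A rows as treated (1) and B rows as control (0)."""
    stacked = [dict(row, treated=1, source_row=i) for i, row in enumerate(a_rows)]
    stacked += [dict(row, treated=0, source_row=j) for j, row in enumerate(b_rows)]
    return stacked


def reshape_wide(stacked, pairs):
    """Put each matched pair on one line. pairs is a list of
    (A row index, B row index) tuples from the matching step."""
    a = {r["source_row"]: r for r in stacked if r["treated"] == 1}
    b = {r["source_row"]: r for r in stacked if r["treated"] == 0}
    bookkeeping = {"treated", "source_row"}
    wide = []
    for i, j in pairs:
        line = {f"{k}_a": v for k, v in a[i].items() if k not in bookkeeping}
        line.update({f"{k}_b": v for k, v in b[j].items() if k not in bookkeeping})
        wide.append(line)
    return wide


# Made-up example: one school from each list, already matched to each other.
stacked = stack([{"enroll": 500}], [{"enroll": 495}])
wide = reshape_wide(stacked, [(0, 0)])
```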

Is that what you're thinking? It just might work....

Last edited by paulvonhippel; 07 Jan 2021, 13:53.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#4

07 Jan 2021, 14:12

I can't say that I worked out the details here, and I haven't used -ultimatch-. My idea was just that those programs permit finding the "nearest neighbors" of each observation in two groups, using a without-replacement option. There are probably various ways to use the resulting matching, but what you describe sounds like what I was thinking. I suppose that not all matches would be perfect or completely distinct, but I would think that all but a few would be so. Maybe it would be better to try with-replacement as well and compare the results, as a means to help find possibly difficult cases.
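
A rough sketch of that comparison, in Python rather than Stata. It assumes a precomputed distance matrix (rows are A, columns are B) and a simple greedy rule for the without-replacement case; the Stata commands may process observations in a different order or use a different algorithm. All names and numbers are hypothetical.

```python
def nearest_with_replacement(cost):
    """Each A row takes its closest B row, even if another A row took it too."""
    return [min(range(len(row)), key=row.__getitem__) for row in cost]


def greedy_without_replacement(cost):
    """A rows pick in order; a B row can be used only once.
    Requires at least as many B rows as A rows."""
    taken, match = set(), []
    for row in cost:
        j = min((j for j in range(len(row)) if j not in taken),
                key=row.__getitem__)
        taken.add(j)
        match.append(j)
    return match


def difficult_cases(cost):
    """A rows whose partner changes between the two schemes."""
    with_r = nearest_with_replacement(cost)
    without_r = greedy_without_replacement(cost)
    return [i for i in range(len(cost)) if with_r[i] != without_r[i]]


# Made-up distances: A rows 0 and 1 both prefer B row 0.
cost = [[1.0, 2.0, 9.0],
        [1.5, 3.0, 9.5],
        [7.0, 6.0, 0.5]]
flagged = difficult_cases(cost)
```

Rows flagged this way are the ones worth checking by hand, since their assigned partner depends on which other schools were matched first.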
paulvonhippel

Join Date: Apr 2014

Posts: 502
#5

08 Jan 2021, 11:32

OK, it worked. Surprisingly, I found the Euclidean distance gave more intuitive matches than the Mahalanobis distance....

Thanks! I've been wrestling with this question for a long time.
paulvonhippel

Join Date: Apr 2014

Posts: 502
#6

08 Jan 2021, 11:34

Bonus question: What to do if you want to match on both string and numeric variables? It seems there are tools for one or the other but not both.

Again, treating numbers as strings does not produce good matches. E.g., 59.75 and 60 are similar numbers but not similar strings.
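
One way to picture a combined approach, sketched in Python: compute a string dissimilarity for the text field and a scaled absolute difference for the numeric field, then add them. This is only an illustration under stated assumptions, not a description of any existing Stata command. The `numeric_scale` and `string_weight` parameters and the school names are hypothetical, and `difflib.SequenceMatcher` is a stand-in for whatever string metric one prefers.

```python
from difflib import SequenceMatcher


def mixed_distance(a, b, numeric_scale, string_weight=1.0):
    """a and b are (name, value) tuples. The string part is
    1 - similarity (0 = identical); the numeric part is the absolute
    difference divided by numeric_scale (for example, the variable's SD)."""
    name_a, value_a = a
    name_b, value_b = b
    string_part = 1.0 - SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    numeric_part = abs(value_a - value_b) / numeric_scale
    return string_weight * string_part + numeric_part


# 59.75 and 60 are treated as close because they are compared as numbers.
same_school = mixed_distance(("Lincoln Elementary", 59.75),
                             ("Lincoln Elem.", 60.0), numeric_scale=10.0)
other_school = mixed_distance(("Lincoln Elementary", 59.75),
                              ("Washington High", 59.75), numeric_scale=10.0)
```

The resulting distances could then be fed to any of the pairing schemes discussed earlier in the thread. How to weight the string part against the numeric part is a judgment call.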
Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#7

08 Jan 2021, 11:38

Another option that does allow string matches is the user-written -vmatch- (on SSC). I use this one a lot, as it allows numeric matching (with a caliper), string matching, and matching with replacement. Since you don't show any code, I can't comment on what you have done, but the logic seems right to me.
Anders Alexandersson

Join Date: Apr 2014

Posts: 203
#8

08 Jan 2021, 12:55

In general, I still recommend the R package fastLink for a "fuzzy merge" (a.k.a. a probabilistic record linkage). For example, see my Statalist post here. Please provide more details if you need more help.