Find matching observations based on euclidean distance

Roberto Liebscher

Join Date: Mar 2014

Posts: 92
#1

14 Jul 2014, 04:47

Dear Statalisters,

I am stuck with the following problem. I have data of the following form:

Code:

input id group xvar1 xvar2
1  1 0    1
2  1 0.5  1.2
3  0 0    1.9
4  0 0.25 1.3
5  0 0.15 1.1
6  0 0.1  0.7
7  0 0.6  1.7
8  0 0.8  0.5
9  0 0.5  0.8
10 0 0.8  1
end

where group is a dummy indicating which group an observation belongs to, and xvar1 and xvar2 are continuous measures. Now, for each observation in Group 1, I would like to find the three observations in Group 0 that are closest in terms of the Euclidean distance computed over xvar1 and xvar2.

A short R script tells me that for observation 1 these are observations 5, 6 and 4 (in order of increasing distance).

Code:

x <- matrix(c(0,    1,
              0.5,  1.2,
              0,    1.9,
              0.25, 1.3,
              0.15, 1.1,
              0.1,  0.7,
              0.6,  1.7,
              0.8,  0.5,
              0.5,  0.8,
              0.8,  1),
            ncol=2, byrow=TRUE)
dist <- dist(x, method = "euclidean", diag = TRUE, upper = FALSE, p = 2)
dist

Now, I would like to tell Stata to keep only those observations of Group 0 that are matched with Group 1 observations. Does anyone have an idea how I can achieve this in Stata? I used teffects nnmatch but this does not work for more than the maximum of observations in Group 1 and also estimates ATE or ATT which my exercise isn't about -- I only want my sample to be constrained to matching observations before I do further analyses. In addition to my question, is there also a chance to control for replacement/no replacement?

Any help is highly appreciated!
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#2

14 Jul 2014, 09:22

I think this works, with one caveat: joinby temporarily expands the dataset to all pairwise combinations of observations, so it may or may not be feasible depending on the size of your original dataset.

Code:

clear all
set more off

*----- example data -----

input ///
id group xvar1 xvar2
1  1 0    1
2  1 0.5  1.2
3  0 0    1.9
4  0 0.25 1.3
5  0 0.15 1.1
6  0 0.1  0.7
7  0 0.6  1.7
8  0 0.8  0.5
9  0 0.5  0.8
10 0 0.8  1
end

list

gen byte i = 1
tempfile orig
save "`orig'"

*----- all pairwise -----

rename (id group xvar*) =0
joinby i using "`orig'"
drop if group >= group0
drop i

*----- compute distance -----

gen eucld = ((xvar10 - xvar1)^2 + (xvar20 - xvar2)^2) ^ (1/2)
bysort id0 (eucld): gen nearest = _n <= 3

list, sepby(id0)
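The code above only flags the three nearest Group 0 observations for each Group 1 observation. A possible way to get from there to the restricted sample asked for in #1 is sketched below. It assumes the lines are appended to the same do-file, so that nearest, id and the tempfile `orig' still exist; the tempfile name matched is a hypothetical helper introduced here. Because a Group 0 observation may be flagged for more than one Group 1 observation, this corresponds to matching with replacement.

```stata
* appended to the do-file above: nearest, id and `orig' come from that code
keep if nearest
keep id

* a Group 0 observation can be among the nearest three of several
* Group 1 observations, so keep each id only once
duplicates drop

tempfile matched
save "`matched'"

* go back to the original data and keep Group 1 plus the matched Group 0
use "`orig'", clear
merge 1:1 id using "`matched'"

* _merge == 3 marks Group 0 observations that were matched at least once
keep if group == 1 | _merge == 3
drop _merge i
sort id

list
```

The merge on id is used here instead of carrying the renamed variables along, so that the final dataset has exactly the original variables and one row per observation.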

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#3

14 Jul 2014, 09:55

I think looping through observations is another strategy that could work. No problem with the size of the data in that case.
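A minimal sketch of that looping idea, assuming the example data from #1 (id, group, xvar1, xvar2, with group coded 0/1) are in memory and that there are at least three Group 0 observations. The variables matched and d and the local withrepl are hypothetical helpers introduced here, not anything built into Stata. With withrepl set to 0 the loop does a simple greedy match without replacement, whose result depends on the order in which Group 1 observations are processed.

```stata
* hypothetical switch: 1 = with replacement, 0 = greedy without replacement
local withrepl 1

* flag for selected Group 0 observations, and a working distance variable
gen byte matched = 0
gen double d = .

levelsof id if group == 1, local(treated)

foreach t of local treated {

    * coordinates of the current Group 1 observation
    summarize xvar1 if id == `t', meanonly
    local x1 = r(mean)
    summarize xvar2 if id == `t', meanonly
    local x2 = r(mean)

    * Euclidean distance from every Group 0 observation to it
    replace d = sqrt((xvar1 - `x1')^2 + (xvar2 - `x2')^2) if group == 0

    * without replacement: observations already used are no longer eligible
    if `withrepl' == 0 {
        replace d = . if matched == 1
    }

    * Group 0 sorts first, closest on top; id breaks ties reproducibly
    sort group d id
    replace matched = 1 if !missing(d) in 1/3
}

* restrict the sample to Group 1 and the matched Group 0 observations
keep if group == 1 | matched == 1
drop d matched
sort id

list
```

Only one extra distance variable is held in memory at a time, so the dataset never grows beyond its original number of observations; the price is one pass over the data per Group 1 observation.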

Roberto Liebscher

Join Date: Mar 2014

Posts: 92
#4

23 Jul 2014, 11:09

Thanks, Roberto. Using the joinby command is a nice solution I have not thought of. Thanks for your help!