Identifying 'nearest neighbours' without using teffects

Andrew Wade

Join Date: Aug 2017

Posts: 28
#1

26 Nov 2019, 17:58

Hi,

I have data for almost 7,000 organisations.

I want to use, say, 6 variables to identify the 60 most 'similar' organisations for each of the 7,000 organisations.

So each organisation is assigned 60 nearest neighbors.

Is there a way to do this outside of teffects?

Or is the 'best' way to use teffects in a loop, with 1 treatment observation and 6,999 'controls', so that the neighbours are identified one organisation at a time?

This would probably require a lot of computer time. However, I could speed it up by breaking the job up and running in parallel.

Regards,

Andrew
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#2

26 Nov 2019, 18:24

Well, you don't say what you mean by "most similar." Since you have 6 variables, the organization that is closest on one variable could be far away on another. There are a variety of metrics for combining these things. You need to choose one that you can calculate.

The following shows the general approach. As you do not provide example data, I have illustrated the technique using the grunfeld.dta file from StataCorp's website. It is much smaller than yours, so I only match the 6 nearest instead of the 60 nearest. Modify the code accordingly, and also put in the real variables and expression for calculating similarity.

Code:

webuse grunfeld, clear
keep if year == 1954

// DEFINE SIMILARITY BASED ON A LINEAR COMBINATION OF INVEST MVALUE AND STOCK
gen index = .6*invest + .5*mvalue + .4*kstock // SUBSTITUTE CODE FOR YOUR INDEX

preserve
keep company index invest mvalue kstock
rename * *_m
drop if missing(index_m) // AFTER THE RENAME THE VARIABLE IS index_m
save match_file, replace
restore

capture program drop one_company
program define one_company
    cross using match_file
    drop if company == company_m
    gen delta = abs(index - index_m)
    sort delta
    keep in 1/6 // MAKE THIS 1/60 TO GET 60 MATCHES
    drop delta
    exit
end

runby one_company, by(company) status

You can delete the file match_file when you're done unless you can think of some further use for it.
-runby- is written by Robert Picard and me, and is available from SSC.
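If similarity is better expressed as a distance over several variables than as a single index, the same pattern still applies. The following is only a sketch under assumptions that are not part of the code above: it standardizes each variable and uses Euclidean distance, which is one choice of metric among many, and the program name one_company_dist and the z_ variable names are made up for illustration.

```stata
* ssc install runby   // needed once; -runby- is not part of official Stata
webuse grunfeld, clear
keep if year == 1954

// Standardize each variable (mean 0, sd 1) so none dominates the distance
foreach v of varlist invest mvalue kstock {
    egen z_`v' = std(`v')
}

preserve
keep company z_invest z_mvalue z_kstock
rename * *_m
save match_file, replace
restore

capture program drop one_company_dist
program define one_company_dist
    cross using match_file
    drop if company == company_m
    // Euclidean distance on the standardized variables (assumed metric)
    gen delta = sqrt((z_invest - z_invest_m)^2 ///
        + (z_mvalue - z_mvalue_m)^2 ///
        + (z_kstock - z_kstock_m)^2)
    sort delta company_m   // company_m breaks ties reproducibly
    keep in 1/6            // 1/60 for 60 matches
end

runby one_company_dist, by(company) status
```

Here delta is kept in the result so that you can see how close each match is; drop it if you do not need it.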

This can also be done using only official StataCorp commands. However, because your dataset contains 7,000 organizations, doing it that way will require, temporarily at least, a data set containing 49,000,000 observations (7,000 × 7,000). That may be a problem for your setup and will, in any case, take a very long time to create, time during which you may worry that your machine has hung. The -runby- approach is much faster and is not as demanding of memory: you will never need a larger data set than the final result, which contains 420,000 observations (7,000 organizations × 60 matches each). It also provides you with periodic progress updates as you go. Even though -runby- is very fast compared to other approaches, this is still a fairly large job, so be patient.
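For comparison, here is a rough sketch of what the official-commands-only route might look like on the same small example. It is not tested at the scale of 7,000 organizations. It forms every pair in one -cross- and then keeps the closest 6 within each company.

```stata
webuse grunfeld, clear
keep if year == 1954
gen index = .6*invest + .5*mvalue + .4*kstock

// Save a renamed copy of the data to pair against
preserve
keep company index
rename * *_m
tempfile match_file
save `match_file'
restore

cross using `match_file'        // all N x N pairs are in memory at once
drop if company == company_m    // a company cannot match itself
gen delta = abs(index - index_m)

// Within each company, sort by distance and keep the 6 nearest
bysort company (delta company_m): keep if _n <= 6   // <= 60 for 60 matches
drop delta
```

The -cross- step is the one that creates the very large intermediate data set in the full-size problem.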
David Radwin

Join Date: Mar 2014

Posts: 368
#3

27 Nov 2019, 15:36

You could try one of the other user-written matching programs, like psmatch2 (Leuven and Sianesi, SSC). Use the option neighbor(60) and probably also the option noreplacement.

David Radwin
Senior Researcher, California Competes
californiacompetes.org
Pronouns: He/Him