  • Identifying 'nearest neighbours' without using teffects

    Hi,

    I have data for almost 7,000 organisations.

    I want to use, say, 6 variables to identify the 60 most 'similar' organisations for each of the 7,000 organisations.

    So each organisation is assigned 60 nearest neighbors.

    Is there a way to do this outside of teffects?

    Or is the 'best' way to use teffects in a loop where I have 1 treatment observation and 6,999 'controls', with the neighbors identified one organisation at a time?

    This would probably require a lot of computer time. However, I could speed it up by breaking the job up and running in parallel.

    Regards,

    Andrew

  • #2
    Well, you don't say what you mean by "most similar." Since you have 6 variables, the organization that is closest on one variable could be far away on another. There are a variety of metrics for combining these things. You need to choose one that you can calculate.
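
    As one illustration of such a metric (an assumption for the sake of example, not something specified in the question), a common choice is Euclidean distance on standardized variables, so that no single variable's scale dominates. The sketch below uses the same grunfeld.dta example data and ranks every company by its distance from company 1; the variable names z_* and dist are made up for this example.

    ```stata
    webuse grunfeld, clear
    keep if year == 1954

    * Standardize each variable (mean 0, sd 1) so scales are comparable
    foreach v of varlist invest mvalue kstock {
        egen z_`v' = std(`v')
    }

    * Squared differences from company 1, summed over the variables
    gen double dist = 0
    foreach v of varlist z_invest z_mvalue z_kstock {
        summarize `v' if company == 1, meanonly
        replace dist = dist + (`v' - r(mean))^2
    }
    replace dist = sqrt(dist)

    * Company 1 itself sorts first (distance 0); the next 6 are its nearest
    sort dist
    list company dist in 2/7
    ```

    Other metrics (for example, Mahalanobis distance or a weighted index) would slot into the same place; which one is appropriate depends on what "similar" should mean for these organisations.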

    The following shows the general approach. As you do not provide example data, I have illustrated the technique using the grunfeld.dta file from StataCorp's website. It is much smaller than yours, so I only match the 6 nearest instead of the 60 nearest. Modify the code accordingly, and also put in the real variables and expression for calculating similarity.

    Code:
    webuse grunfeld, clear
    keep if year == 1954
    
    //  DEFINE SIMILARITY BASED ON A LINEAR COMBINATION OF INVEST, MVALUE, AND KSTOCK
    
    gen index = .6*invest + .5*mvalue + .4*kstock   // SUBSTITUTE CODE FOR YOUR INDEX
    
    preserve
    keep company index invest mvalue kstock
    rename * *_m
    drop if missing(index)
    save match_file, replace
    restore
    
    capture program drop one_company
    program define one_company
        cross using match_file
        drop if company == company_m
        gen delta = abs(index - index_m)
        sort delta
        keep in 1/6 // MAKE THIS 1/60 TO GET 60 MATCHES
        drop delta
        exit
    end
    
    runby one_company, by(company) status

    You can delete the file match_file.dta when you're done unless you can think of some further use for it.
    -runby- is written by Robert Picard and me, and is available from SSC.

    This can also be done using only official Stata commands. But because your dataset contains 7,000 organizations, doing it that way will require, temporarily at least, a data set containing 49,000,000 observations (7,000 x 7,000). That may be a problem for your setup and will, in any case, take a very long time to create, during which you may worry that your machine has hung. The -runby- approach is much faster and less demanding of memory: you will never need a data set larger than the final result (42,000 observations with 6 matches per organization, or 420,000 with 60), and it gives you periodic progress updates as you go. Even though -runby- is very fast compared to other approaches, this is still a fairly large job, so be patient.
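
    For comparison, here is a minimal sketch of what the official-commands-only route could look like on the small grunfeld example, assuming the same illustrative index as above. It forms every pairwise combination with -cross- in a single step, which is exactly what becomes expensive at 7,000 organizations. The tempfile name is arbitrary.

    ```stata
    webuse grunfeld, clear
    keep if year == 1954
    gen index = .6*invest + .5*mvalue + .4*kstock   // same illustrative index
    keep company index
    drop if missing(index)

    * Save a renamed copy to pair against (a tempfile is erased automatically)
    tempfile match_file
    preserve
    rename * *_m
    save `match_file'
    restore

    * All pairwise combinations: N^2 observations (100 here)
    cross using `match_file'
    drop if company == company_m
    gen delta = abs(index - index_m)

    * Keep the 6 closest matches per company
    bysort company (delta): keep if _n <= 6
    drop delta
    ```

    On the example data this leaves 6 matches for each of the 10 companies. Ties in delta are broken arbitrarily by the sort.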


    • #3
      You could try one of the other user-written matching programs, like psmatch2 (Leuven and Sianesi, SSC). Use the option neighbor(60) and probably also the option noreplacement.
      David Radwin
      Senior Researcher, California Competes
      californiacompetes.org
      Pronouns: He/Him
