Nearest Neighbour matching with exact matches and without replacement

Ferdinand Elfers

Join Date: Dec 2018

Posts: 11
#1

Nearest Neighbour matching with exact matches and without replacement

11 Dec 2018, 09:21

Dear all,

without actually being interested in the estimation of a treatment effect, I want to find a (replicable) way of creating (from a large overall sample with a binary treatment variable) two equal-sized treatment and control samples that are matched on industry (exact) and size (nearest neighbor) - the equal size condition makes it necessary that there is no replacement, and it should be possible to influence the order in which matches are assigned (like the descending option in psmatch2). I have looked for a while, but among the more popular options it seems that
psmatch2 doesn't allow exact matches

teffects nnmatch always uses replacement

calipmatch only randomly matches within the caliper, but has no nearest neighbor functionality

Is there an appropriate Stata command that I have overlooked, or do I have to do it manually (ca. 50k observations, so computing time is a problem for joinby etc. procedures...)

Best, Ferdinand
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

11 Dec 2018, 13:14

To my knowledge there is no Stata command available that satisfies all of your requirements. A match of that kind will require hand-coding the matching algorithm. I should also point out that it is not always possible to satisfy all of your requirements: in some data sets there will be observations that find no match that meets all your restrictions, and that is even more likely to occur when replacement is not permitted.

I should add that all of the commands you refer to draw on -joinby- or -merge- or similar commands anyway, so computing time would not be improved by using them compared to a hand-coded matching program. In fact, a hand-coded program might actually be more efficient because you might be able to draw on newer commands like -rangejoin- that are faster than what was available to the authors of the ones you mentioned. I would also add that in my experience a data set of 50k observations will not require unreasonbly long times to run commands of the kind you mention anyway. It's not until you get to millions of observations that things get really slow.
Comment

Ferdinand Elfers

Join Date: Dec 2018
Posts: 11

13 Dec 2018, 06:52

Thank you very much Clyde for your help!

I have tried to create what I want (exact matches on industry, no replacement, nearest neighbor on size) with the following code. It generally does what I want (exact matching order can be adjusted later), but obviously is very inefficient as a nested loop. Moreover, rangejoin fails here for because of too little memory ("op. sys. refuses to provide memory") for the whole sample with ca. 15.000 treatment firms and ca. 30.000 potential controls.

Code:

use "treatment.dta" , clear
gen high = totalassets*1.20
gen low = totalassets*0.80
rangejoin totalassets low high using "controls.dta" , by(sic1)  keepusing(ID totalassets a b c) suffix(_control)
drop if missing(totalassets_control)
gen sizediff = 100*(abs(totalassets-totalassets_control))/totalassets)

*    Give priority to observations with the best matches
*    Sort according to difference, take the first occurrence of a treatment firm, then delete all other matches for (a) that treatment firm (b) that matched firm

sort sizediff

local N = _N
forvalues i = 1/`N' {
    local N2 = _N
    local j = `i'+1
    while `j' <= `N2' {
        if ID[`j'] == ID[`i'] {
            drop in `j'
            local j = `j' - 1
        }
        else if ID_control[`j'] == ID_control[`i'] {
            drop in `j'
            local j = `j' - 1
        }
        local j = `j'+1
    }
}

Do you think there is an alternative to rangejoin that doesn't require as much memory, and an alternative to the loops that runs faster?

It would be great if maybe you could give (again) advice on this, thanks a lot!

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#4

13 Dec 2018, 08:47

It is not -rangejoin- itself that is eating up memory, rather it is the very large number of potential pairings that -rangejoin- has to consider here. Since you are restricting to an exact match on industry, you can reduce the burden on memory by breaking up the data into separate data sets, one for each industry, and then use this code inside a loop over industry, finally appending the results all together as you go. I have occasionally had to do this myself in other contexts where the memory required to process the entire data set becomes excessive.

That will make it possible to run it within your existing memory resources. It will not, however, speed things up. In fact, it will probably be slower overall because there will be more disk thrashing involved even though the computations will be much faster.

As for the two loops, I believe you can accomplish the same thing with the following simpler code:

Code:

local N = _N forvalues i = 1/`N' { local ip1 = `i' + 1 drop in `ip1'/L if ID == ID[`i'] drop in `ip1'/L if conrol == control[`i'] }

I'm not sure whether it will run much faster, but I think it will.
Comment
Ferdinand Elfers

Join Date: Dec 2018

Posts: 11
#5

13 Dec 2018, 13:36

Yes, much faster!

Thanks!
Comment

Maarten Cerpentier

Join Date: Jun 2019
Posts: 5

29 Aug 2019, 01:56

Hi all,

I try to do something similar (nn matching after exact match, without replacement)
What I try to do in my do file, is matching 2012 to 2014 observations (which I call 'control') exact on nuts and nace_b.
Afterwards, I add another criterion: maximum relative difference of X percent (let's say 5) on turn12 and emp12.
As a last step, I only want to keep the closest 2014 'control' observation for each 2012 observation. (in line with both your suggestions for the loop). This is where I run into trouble.
After sorting total relative difference low->high this means keeping the first occurence of a 2012 observation, deleting all next cases containing this 2012 observation, as well as keeping the matched control, and deleting all next cases that contain this 2014 'control' in the sorted list.
I tried both your loops, but they resulted in an error.

As I am quite new in Stata, can anyone help me with the correct specification of my loop? It has to work on approximately a dataset of 18000 observations where the loop has to filter out the more 'distant' matches of 2012 to a 2014 'control'.

Kind regards, Maarten

CODE:

Code:

// An id variable is needed
gen int ID = _n
//
// A file with just year 2014
preserve
keep if year == 2014
rename (turn12 emp12 ID) (turn12_control emp12_control ID_control)
keep(turn12_control emp12_control ID_control nuts nace_b)
tempfile temp14
save `temp14'
restore
//
// Just year 2012
keep if year == 2012
rename ID ID_12
//
joinby nuts nace_b using `temp14'
//
local closepct = 5
gen turnpct = 100 * abs(turn12-turn12_control)/turn12
gen emppct = 100 * abs(emp12-emp12_control)/emp12
gen match = (turnpct < `closepct') & (emppct < `closepct')
drop if match==0
//
//Compute relative difference taking into account both non-exact matched variables
gen absdiffemp = abs(emp12-emp12_control)
gen reldiffemp = (absdiffemp/emp12)*100
drop absdiffemp
//
gen absdiffturn = abs(turn12-turn12_control)
gen reldiffturn =(absdiffturn/turn12)*100
drop absdiffturn
//
gen reldiff = reldiffemp + reldiffturn
//
//Give priority to observations with the best matches on emp12 and turn12 together (small score on reldiff)
//Sort according to relative difference (reldiff), take the first occurrence of a 2012 firm ('treatment firm'), then delete all other matches for (a) that treatment firm (b) that matched firm 'control'
//This satisfies the requirements:
*Exact match on nuts and nace_b.
*Keep only 2014 'control' observations with a max X % deviation on emp12 and turn12.
*Take closest 2014 'control' observation.
*And make sure every 2014 observations is matched only once (best, no replacement) to a 2012 observation.

sort reldiff

local N = _N
forvalues i = 1/`N' {
    local ip1 = `i' + 1
    drop in `ip1'/L if ID_12 == ID_12[`i']
    drop in `ip1'/L if ID_control == ID_control[`i']
}

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 nuts str5 nace_b double turn12 float(emp12 year)
"HU" "31"    16979712.259999998   5659 2014
"HU" "23"              39522475  13174 2012
"NO" "58"              11520000   3840 2014
"ES" "52"              28719661   9573 2012
"ES" "46"              56909291  18969 2012
"DE" "22"              13906200   4635 2014
"ES" "59-60"         1224437.48    408 2014
"ES" "10-12"           16174637   5391 2012
"DE" "28"               5540000   1846 2012
"NO" "26"                209000     69 2014
"ES" "73"               8702138   2900 2012
"HR" "10-12"          232659000  77553 2014
"PT" "49"             1700603.8    566 2014
"ES" "62-63"            1163809    387 2012
"DE" "26"              94100000  31366 2014
"ES" "62-63"            7486122   2495 2014
"ES" "23"    47452171.339999996  15817 2014
"ES" "46"             178943368  59647 2014
"BG" "46"                729113    243 2012
"BG" "14-15"           17036.32      5 2014
"DE" "25"               3650000   1216 2014
"CZ" "13"              30904529  10301 2012
"CZ" "31"             931757.54    310 2014
"EE" "23"               1690129    563 2012
"PT" "13"               6110728   2036 2012
"ES" "62-63"            1627808    542 2012
"CZ" "46"              15372063   5124 2014
"DE" "22"              14190000   4730 2012
"CZ" "24"             165548690  55182 2012
"DE" "37-39"            2583280    861 2014
"BG" "16"    315175.83999999997    105 2014
"DE" "79"              10645000   3548 2012
"NO" "62-63"            5090000   1696 2012
"DE" "53"                341040    113 2014
"DE" "64"            1.8332e+09 611066 2014
"DE" "71"                663000    221 2012
"DE" "27"              52000000  17333 2012
"ES" "28"           17050399.66   5683 2014
"CZ" "17"              30007714  10002 2012
"BG" "25"                432560    144 2012
"ES" "46"              20227923   6742 2012
"BG" "14-15"            4449330   1483 2014
"ES" "18"               1735012    578 2012
"DE" "37-39"            1862000    620 2014
"DE" "29"               1086000    362 2012
"ES" "71"                277066     92 2014
"CZ" "62-63"            2137898    712 2014
"LT" "69-75"              44732     14 2012
"ES" "19-20"            6101640   2033 2014
"BG" "71"                736783    245 2012
"DE" "29"               1064280    354 2014
"ES" "42"              71051622  23683 2014
"NO" "30"              29731000   9910 2012
"ES" "21"              39389172  13129 2012
"HU" "49"               6668902   2222 2014
"ES" "31"            9495946.18   3165 2014
"ES" "5-9"              2012564    670 2012
"HU" "52"               3192252   1064 2012
"CZ" "58"            3606504.86   1202 2014
"DE" "81"              26698000   8899 2012
"ES" "72+75"             152397     50 2014
"ES" "10-12"           12804166   4268 2014
"CZ" "26"               1946519    648 2012
"PT" "73"               5490983   1830 2012
"DE" "17"              25382000   8460 2014
"CZ" "64"             366761899 122253 2012
"ES" "23"           11951247.98   3983 2014
"DE" "61"               1176000    392 2014
"ES" "62-63"         1595251.84    531 2014
"DE" "19-20"            9310000   3103 2014
"DE" "25"              33781000  11260 2012
"PT" "62-63"        61362084.56  20454 2014
"DE" "62-63"             650000    216 2012
"LT" "58-63"             513704    171 2012
"DE" "37-39"             900000    300 2014
"DE" "22"               2100000    700 2014
"NO" "26"                209000     69 2012
"ES" "10-12"            2454043    818 2012
"LT" "13-15"            2075714    691 2014
"EE" "10-12"        39682392.26  13227 2014
"DE" "25"               1000000    333 2014
"BG" "14-15"            1866244    622 2012
"ES" "28"              11253027   3751 2012
"CZ" "46"               8739672   2913 2012
"BG" "46"               6931690   2310 2012
"HU" "28"          415090643.38 138363 2014
"LT" "24-25"          406238.42    135 2014
"LT" "31-32"           39420181  13140 2012
"CZ" "17"              30007714  10002 2014
"NO" "71"              20237980   6745 2014
"DE" "64"            1.8332e+09 611066 2012
"NO" "33"               5890780   1963 2014
"DE" "23"               1645420    548 2014
"ES" "72+75"          3817565.5   1272 2014
"ES" "74"               7394730   2464 2012
"ES" "71"               8414352   2804 2012
"HU" "62-63"           32595865  10865 2014
"ES" "87"               1603969    534 2012
"DE" "28"               5429200   1809 2014
"ES" "71"               2360006    786 2012
end

Last edited by Maarten Cerpentier; 29 Aug 2019, 02:01.

Comment

Sanjay Sharma

Join Date: Apr 2020

Posts: 19
#7

09 Jun 2021, 12:56

Hi Ferdinand,
I am currently using 1:1 nnm matching with replacement using teffects nnmatch command. However I am getting equal number of treatment and control samples after matching leading me to believe there has been no replacement. You have written that teffects nnmatch always uses replacement. Can you provide any reference for that? Thanks in advance!

Originally posted by Ferdinand Elfers View Post

Dear all,

without actually being interested in the estimation of a treatment effect, I want to find a (replicable) way of creating (from a large overall sample with a binary treatment variable) two equal-sized treatment and control samples that are matched on industry (exact) and size (nearest neighbor) - the equal size condition makes it necessary that there is no replacement, and it should be possible to influence the order in which matches are assigned (like the descending option in psmatch2). I have looked for a while, but among the more popular options it seems that
psmatch2 doesn't allow exact matches

teffects nnmatch always uses replacement

calipmatch only randomly matches within the caliper, but has no nearest neighbor functionality

Is there an appropriate Stata command that I have overlooked, or do I have to do it manually (ca. 50k observations, so computing time is a problem for joinby etc. procedures...)

Best, Ferdinand
Comment

Announcement