The matched sample of teffects nnmatch

Tunga Kantarci

Join Date: Oct 2015

Posts: 90
#1

09 Aug 2020, 09:55

I am interested in obtaining the matched sample (observations in the treatment group that are matched with observations in the control group) by teffects nnmatch. I want this sample because I want to run a regression on this sample. So I am not interested in the ATE or ATT estimates of teffects nnmatch, but only the matched sample it uses to produce these estimates. Is there a way to obtain the matched sample of teffects nnmatch?

Mike Lacy

Join Date: Apr 2014
Posts: 2411

09 Aug 2020, 14:40

Here's something that appears to work. The basic idea is to use the -generate()- option of -teffects nnmatch- to store the observation numbers of each observation's matches in new variables. One then stacks and de-duplicates those observation numbers in a temporary file and uses them to tag the relevant observations in the original file.

Code:

// Simulate some data
clear
set seed 1447
set obs 1000
forval i = 1/5 {
   gen x`i' = rnormal()
}
gen y  = runiform()
gen byte tx = _n <= 100
quiet count if tx
local ntx = r(N)
tab tx
// I'll assume the general case of multiple controls.
teffects nnmatch (y x*) (tx), nneighbor(2) control(0) generate(naynum)
// end simulate
//
// Real stuff starts.
// You need to make an observation number variable in the original file.
gen obsnum = _n
preserve // original data
// Make a dataset of observations that served as neighbors.
keep naynum*
stack naynum*, into(obsnum) clear
drop _stack
bysort obsnum: keep if (_n == 1)
gen byte in_sample = 1
tempfile matchsample
save `matchsample'
//
// Tag original data observations that were in the matched sample
restore // original
merge 1:1 obsnum using `matchsample'
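// verify
// (in_sample is missing for non-neighbors, and a missing value counts as true in -if-, so test == 1 explicitly)
tab tx if in_sample == 1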


Tunga Kantarci

Join Date: Oct 2015

Posts: 90
#3

12 Aug 2020, 08:19

Thanks Mike. This could not be more helpful. I have a follow-up question. Using the code you provided, I identified the matched observations, and I will estimate a DiD regression on the matched sample. The thing is, some matched observations are matched multiple times (ties or draws). I do not want to drop the ties: teffects does not drop them when estimating the ATT or ATE. In the DiD regression I therefore want to weight the observations that are matched multiple times, so that they count more in the matched sample. For this, I can create a weight variable and add [fweight = weight] to the DiD regression command.

As I see it, the weight variable is straightforward: it contains the number of matches for each observation. For example, take a small sample of 6 persons with ID numbers T1 and T2 in the treatment group and C1, C2, C3, and C4 in the control group. Suppose T1 is matched with C1 and C2, and T2 is matched with C2, C3, and C4, so that C2 is matched twice. Then the frequency weight variable (W) would look like this:

Code:

ID   W
--   -
T1   2
T2   3
C1   1
C2   2
C3   1
C4   1

Do you see anything controversial in this approach?
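
For concreteness, the weight table above can be built mechanically from stored neighbor lists. The following is only a minimal sketch on the six-person toy example: the variable names m1-m3, times_used, n_matches, and W are hypothetical, and the neighbor lists are typed in by hand rather than produced by -teffects-. It shows how the counts are obtained; it does not settle whether such frequency weights are appropriate for a DiD regression.

```stata
// Toy data: T1, T2 treated; C1-C4 controls (observation numbers 3-6).
// m1-m3 hold the observation numbers of each treated unit's matches.
clear
input str2 id byte(treated m1 m2 m3)
"T1" 1 3 4 .
"T2" 1 5 6 4
"C1" 0 . . .
"C2" 0 . . .
"C3" 0 . . .
"C4" 0 . . .
end
gen long obsnum = _n

// Number of matches each treated observation received
egen byte n_matches = rownonmiss(m1 m2 m3)

// Number of times each observation was used as a match
preserve
keep if treated == 1
keep m1 m2 m3
stack m1 m2 m3, into(obsnum) clear
drop if missing(obsnum)
contract obsnum, freq(times_used)
tempfile used
save `used'
restore
merge 1:1 obsnum using `used', nogenerate

// Treated: number of matches received; controls: times used as a match
gen W = cond(treated == 1, n_matches, times_used)
list id W, noobs
```

With real -teffects nnmatch- output, the hand-typed m1-m3 would be replaced by the variables created by -generate()-.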
Mike Lacy

Join Date: Apr 2014

Posts: 2411
#4

12 Aug 2020, 10:20

Sorry, I don't know whether your approach is valid or not. In case someone else wants to answer here, though, a clarification would be helpful: When you say "some matched observations are matched multiple times," if I understand correctly, you would mean: "Some of the control observations are the nearest neighbor to more than one treatment observation." So, if the observation with ID = 99 is the nearest neighbor for the observations corresponding to ID = 66, 77 and 88, you want the observation with ID = 99 to be weighted by 3 in your sample. (Your use of the terms "draws" and "ties" is confusing in this regard. The issue appears rather to be that the matching is done "with replacement.") Presuming I am correct in my clarification, perhaps this will help someone respond who is more knowledgeable about analysis when matching is with replacement.
Tunga Kantarci

Join Date: Oct 2015

Posts: 90
#5

12 Aug 2020, 12:17

Thanks for pointing out that a clarification may help. I try to clarify my example with a simple schema below. T1 is matched with C1 and C2. T2 is matched with C3, C4 and C2. C1 and C2 are ties for observation T1 (C1 and C2 are both nearest neighbors to T1). C3, C4 and C2 are ties for observation T2. T1 and T2 are ties for C2. I hope this clarifies my example. I consider that matching is with replacement.

Code:

T1   C1 C2
T2   C3 C4 C2

Last edited by Tunga Kantarci; 12 Aug 2020, 12:19.
Tunga Kantarci

Join Date: Oct 2015

Posts: 90
#6

24 Aug 2020, 04:06

Hi Mike. I studied your code, but I wonder whether it generates a sample that contains the intended observations. Let me go through an example. In the schema below, the first column shows the original observations in the sample, and the second column shows the matches. M1 denotes Match 1. Suppose we search for a single nearest neighbor. The nearest neighbor to each original observation is shown under M1. If I understand correctly, your code collects these neighbors and removes duplicates. This results in a sample that consists only of the neighbors C1 and T1. This sample corresponds to the Omega set indicated on page 273 of the teffects manual here: https://www.stata.com/manuals/te.pdf. Your code omits T2 and C2 from the sample. This is fine as long as the aim is to find the nearest neighbors. However, a sample consisting only of nearest neighbors is not the sample teffects uses. The y_hat formula on page 274 of the teffects manual suggests that the original treatment and control observations, for which nearest neighbors are found, are also part of the sample used to calculate a treatment effect (ATE or ATT). Therefore, I think the sample of interest should include not only T1 and C1, but also T2 and C2. So the question is: why does your code exclude T2 and C2 from the sample to be used to estimate a treatment effect, or to estimate another regression, which is what I am trying to do? Should we not also include the original observations T2 and C2 in the sample?

Code:

     M1
C1   T1
C2   T1
T1   C1
T2   C1
Mike Lacy

Join Date: Apr 2014

Posts: 2411
#7

24 Aug 2020, 17:48

Thanks, first of all, for forcing me to read the manual more carefully. <grin>
Unfortunately, though, I don't understand your follow-up. It would have helped me if you had posted data in Stata format, with actual variables, using -dataex- as described in the Statalist FAQ, or, better yet, described the problem in terms of my example data above, which is available in Stata format to anyone.

Because I can't link your example to actual variables and observation numbers, I have trouble understanding you. I'd encourage you to discuss the presentation of your problem with a friend or colleague in hopes of making it clearer. I am unable to respond in terms of your notation of M1, T1, and C1.

Here's my understanding of what my code does, which I hope will help bring our minds together. I also might have misunderstood previously what my code did, so perhaps my description below will help both of us.

1) My code creates a file of all the observation numbers of any observation that was assigned as a nearest neighbor to any other observation in the -teffects- analysis. Note that not all observations have that role. In my example, it happens that N = 286 observations happened to be a nearest neighbor to one or more other observations.

2) Using the -merge- command, each observation in the original file that served as a neighbor is marked with in_sample == 1 to show that it served as a neighbor for *some* other subject or subjects.

So, here are some example observations drawn from the data set created by my code:

Code:

list tx-in_sample in 1, ab(12)

     +---------------------------------------------+
     | tx   naynum1   naynum2   obsnum   in_sample |
     |---------------------------------------------|
  1. |  1       269       638        1           1 |
     +---------------------------------------------+

That shows that the observation with obsnum == 1 was a tx == 1 subject and served as a nearest neighbor for one or more other subjects in the analysis, as indicated by in_sample. I also show the observation numbers of the neighbors assigned to it, but we already knew that.

By contrast,

Code:

list tx-in_sample in 106/107, ab(12)

     +---------------------------------------------+
     | tx   naynum1   naynum2   obsnum   in_sample |
     |---------------------------------------------|
106. |  0        60        70      106           1 |
107. |  0        26        32      107           . |
     +---------------------------------------------+

shows that the observation with obsnum == 106 was observed as tx == 0 and served as a nearest neighbor for one or more other observations. Observation 107, though, was observed as tx == 0 and did not happen to serve as a nearest neighbor to any other observation. This can be verified by:

Code:

count if inlist(107, naynum1, naynum2)
  0

You say you want "only the matched sample it [-teffects-] uses to produce these estimates." My understanding--perhaps wrong--is that every subject in the original file that received a list of neighbors (!missing(naynum1, naynum2)) would be used here, along with every subject that served as a neighbor (in_sample ==1). (In this file, every original subject did receive a list of neighbors, but that might not be true if the -caliper()- or -ematch()- options were used.)
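
The two roles described above (receiving a list of neighbors, and serving as a neighbor) can be flagged separately and then combined. What follows is a minimal sketch under that reading, re-creating the simulated data so that it runs on its own; the names received, served, times_served, and in_ate_sample are hypothetical and not part of -teffects-.

```stata
// Re-create the simulated example (ATE, two neighbors per observation)
clear
set seed 1447
set obs 1000
forvalues i = 1/5 {
    gen x`i' = rnormal()
}
gen y = runiform()
gen byte tx = _n <= 100
teffects nnmatch (y x*) (tx), nneighbor(2) generate(naynum)
gen long obsnum = _n

// Role 1: the observation received at least one neighbor
egen byte n_nb = rownonmiss(naynum*)
gen byte received = n_nb > 0

// Role 2: the observation appears in some other observation's neighbor list
preserve
keep naynum*
stack naynum*, into(obsnum) clear
drop if missing(obsnum)
contract obsnum, freq(times_served)
tempfile served
save `served'
restore
merge 1:1 obsnum using `served', nogenerate
gen byte served = !missing(times_served)

// Union of the two roles
gen byte in_ate_sample = received | served
tab tx in_ate_sample
```

Because every observation in this simulated file receives neighbors, the union should cover the whole file here; it would only become selective if options such as -caliper()- left some observations without neighbors.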

I understand that my response does not directly connect with what you tried to describe in your most recent followup, but perhaps it will prove helpful by providing a concrete context for discussion using actual data available to anyone who might respond here.

To go further, I would need you to describe what you want in terms of the concrete example I have provided here. If you can do that, I might be able to help. If you can't provide such a description, I'll have to bow out of further conversation here. I would note, by the way, that the mathematical description in the -teffects- manual to which you refer is beyond my easy understanding, so while I acknowledge the expertise of those to whom that description makes sense, it happens not to work well for me.
Tunga Kantarci

Join Date: Oct 2015

Posts: 90
#8

27 Aug 2020, 05:15

Thanks for your elaborate reply. I will use your example to explain my question. I should have stuck to your example. My bad.

In the output below, observations 26 and 32 from the treatment group are found to be nearest neighbors to observation 107 in the control group. Observations 26 and 32 are included in the sample your code generates. Observation 107 is not included in the sample because observation 107 was not found to be a neighbor to an original observation in the treatment group. But I think observation 107 should be included in the sample.

Code:

list tx-in_sample in 106/107, ab(12)

     +---------------------------------------------+
     | tx   naynum1   naynum2   obsnum   in_sample |
     |---------------------------------------------|
106. |  0        60        70      106           1 |
107. |  0        26        32      107           . |
     +---------------------------------------------+

So your code creates a sample. It keeps an observation if it is a nearest neighbor to an original observation. It drops an observation if it is not a nearest neighbor to an original observation, even though other observations are nearest neighbors to it. I think it should not drop this observation. In your words, "every subject in the original file that received a list of neighbors would be used here, along with every subject that served as a neighbor".

I further demonstrate what your code does with an example. To keep this example simple, I assume that we search for one nearest neighbor. In the picture below, black dots are observations in the control group, and red dots are observations in the treatment group. Distances between the dots are intentional.

The nearest neighbor of 1 in the control group is 3 in the treatment group.
The nearest neighbor of 2 in the control group is 3 in the treatment group.
The nearest neighbor of 3 in the treatment group is 1 in the control group.
The nearest neighbor of 4 in the treatment group is 1 in the control group.

These mean that 1 and 3 are nearest neighbors to original observations, and your code creates a sample which includes 1 and 3. 2 and 4 are not nearest neighbors to original observations. Therefore they are not included in your sample. But other observations (1 and 3) are nearest neighbors to 2 and 4, and therefore, I think, 2 and 4 should be included in your sample. But your code excludes them.
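
The four-point example can be checked by hand without -teffects-. The sketch below is illustrative only: the coordinates in x are hypothetical, chosen solely so that the four nearest-neighbor relations listed above hold, and the names nn and served are made up for the illustration.

```stata
// Observations 1 and 2 are controls, 3 and 4 are treated.
// x is a made-up position on a line.
clear
input byte(id treated) double x
1 0 3
2 0 0
3 1 2
4 1 6
end

// For each observation, find the nearest observation in the opposite group
gen byte nn = .
forvalues i = 1/4 {
    local best = .
    forvalues j = 1/4 {
        if treated[`i'] != treated[`j'] {
            local d = abs(x[`i'] - x[`j'])
            if `d' < `best' {
                local best = `d'
                quietly replace nn = id[`j'] in `i'
            }
        }
    }
}

// Flag observations that serve as a nearest neighbor to some other observation
gen byte served = 0
forvalues i = 1/4 {
    quietly replace served = 1 if id == nn[`i']
}
list id treated x nn served, noobs
```

In this toy setup, served marks the observations that a neighbor-only sample keeps, while every observation has a nonmissing nn, which is the distinction at issue.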

When calculating the ATE, teffects considers 1 and 3 in the matched sample, but also 2 and 4. The ATE estimator in the middle of page 274 of the teffects manual shows this; I do not elaborate, since you have reservations about the manual. My aim is to find the matched sample teffects uses, not to estimate the ATE but to run a custom regression (DiD) on the matched sample. Your code does not generate the matched sample of teffects. It correctly and neatly finds the nearest neighbors, which are 1 and 3. So I think we should adjust your code so that it also includes 2 and 4.
Mike Lacy

Join Date: Apr 2014

Posts: 2411
#9

29 Aug 2020, 06:27

Your questions go to the more theoretical aspects of -teffects- with the -nn- approach, about which I'm not expert. What my code did was only to identify which observations served as nearest neighbors; what you want to do with that, i.e., which observations you want to select is up to you.

1) "It keeps an observation if it is a nearest neighbor to an original observation. It drops an observation"
I don't think this is correct, as no observations are "dropped."

2) "Observation 107 is not included in the sample because observation 107 was not found to be a neighbor to an original observation in the treatment group. But I think observation 107 should be included in the sample."
I believe this is different than what you said you wanted. "(observations in the treatment group that are matched with observations in the control group)" 107 is not a treatment subject, and it was not a nearest neighbor to any treatment subject, so it sounds to me like it's not of interest.

I'm sorry, but I've lost interest in continuing with this thread, as I have too much difficulty understanding what you want, and my knowledge of nearest neighbor procedures is apparently lacking.
Tunga Kantarci

Join Date: Oct 2015

Posts: 90
#10

30 Aug 2020, 07:35

"Your questions go to the more theoretical aspects of -teffects- with the -nn- approach, about which I'm not expert."

Right.

"What my code did was only to identify which observations served as nearest neighbors; what you want to do with that, i.e., which observations you want to select is up to you."

Right.

"I don't think this is correct, as no observations are "dropped.""

It is correct that your code is not "dropping" observations. Let me rephrase my claim: your code does not include the observations in question (original observations for which nearest neighbors are found) in the sample, given my view of how the sample should be constructed.

"I believe this is different than what you said you wanted. "(observations in the treatment group that are matched with observations in the control group.)""

It is not different, though I admit that this sentence in my original post needs clarification. The clarification is as follows. I want a sample that includes observations that are nearest neighbors to original observations, together with the original observations for which nearest neighbors are found. This sample definition is not arbitrary. It is the sample teffects uses to estimate the ATE (for the ATT, only treatment observations and the observations that are nearest neighbors to them are considered, since the ATT is the average treatment effect for the treated). A caveat is that, when predicting the missing potential outcome for an original observation observed in the control (treatment) group, its nearest neighbors from the treatment (control) group are weighted using a frequency weight; the discussion of missing potential outcomes in the teffects manual helps to understand this. When constructing the sample I want, I abstract from weighting observations, because what I want to do is estimate a DiD regression on a sample in which observations from the treatment and control groups are similar to each other. Since my aim is not to predict a missing counterfactual outcome, weighting does not play a role in what I want to achieve. At least this is what I think, and I am trying to verify it with someone who I hope can comment.

"107 is not a treatment subject, and it was not a nearest neighbor to any treatment subject, so it sounds to me like it's not of interest."

107 is a control observation. It is not a nearest neighbor to any treatment observation. But other observations from the treatment group are nearest neighbours to 107. So it is of interest from my perspective of how I should construct the sample I want.

"I'm sorry, but I've lost interest in continuing with this thread, as I have too much difficulty understanding what you want, and my knowledge of nearest neighbor procedures is apparently lacking."

Your code has been helpful in finding the nearest neighbors that teffects considers. Thanks once again. All I have to do is to also include the original observations for which nearest neighbors are found. Then I want to estimate a DiD regression on this sample. So, at least conceptually, I believe I did my best to make it easy to understand what I want. I hope someone who has experience with this can confirm my approach, and I would appreciate it if someone could pick up my question.
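
For the ATT-style sample described in this post (treated observations that received neighbors, plus the controls that serve as neighbors to them), a minimal sketch on the simulated data might look as follows. The names times_served and in_att_sample are hypothetical, and the final -regress- is only a placeholder for whatever regression is run on the flagged sample, since the simulated data have no pre/post structure for a DiD.

```stata
// Re-create the simulated example, now with the -atet- option
clear
set seed 1447
set obs 1000
forvalues i = 1/5 {
    gen x`i' = rnormal()
}
gen y = runiform()
gen byte tx = _n <= 100
teffects nnmatch (y x*) (tx), atet nneighbor(2) generate(naynum)
gen long obsnum = _n

// Controls that appear in the neighbor list of some treated observation
preserve
keep if tx == 1
keep naynum*
stack naynum*, into(obsnum) clear
drop if missing(obsnum)
contract obsnum, freq(times_served)
tempfile served
save `served'
restore
merge 1:1 obsnum using `served', nogenerate

// Treated with at least one neighbor, plus controls that serve the treated
egen byte n_nb = rownonmiss(naynum*)
gen byte in_att_sample = (tx == 1 & n_nb > 0) | (tx == 0 & !missing(times_served))
tab tx in_att_sample

// Placeholder for the regression of interest on the flagged sample
regress y tx x1-x5 if in_att_sample == 1
```

Here times_served records how often a control was used, which is the quantity the earlier weighting question was about; the sketch leaves it unused in the regression.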

Sanjay Sharma

Join Date: Apr 2020
Posts: 19

#11

05 Jun 2021, 06:18

Originally posted by Mike Lacy (the code from his reply of 09 Aug 2020, 14:40, above)

Hello Mike, I am currently trying to obtain the matched sample (observations in the treatment group that are matched with observations in the control group) using teffects nnmatch with a caliper of 0.5. I want to do one-to-one matching with replacement, i.e., one control can be matched to more than one treatment observation. In my case I have 100 treatment and 350 control observations. I want a one-to-one matching with replacement that yields at most 100 treated observations (i.e., if no treatment observations are dropped for lack of a suitable match within the specified caliper) and fewer than 100 control observations (assuming some controls are matched to more than one treated observation, so that there are fewer matched controls than treated observations).

I want to see difference in "OutcomeY" between treatment groups (TreatmentX = 0 Vs 1) after matching based on following variables: age, education, male, farmsize, employment, hh_type.
My sample dataset with 100 observations is as follows, where TreatmentX (1)= 29 and TreatmentX (0)= 71.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(age education male farmsize employment hh_type OutcomeY) byte TreatmentX
52  5 0  .625 1 1        0 0
45 10 1  1.25 1 1     3000 0
73  4 1  .002 1 1     -200 0
60  0 0   .75 1 1        0 0
59  3 0  .625 1 1  56646.2 0
53  5 0   .25 1 1     2500 0
52  3 0   1.5 1 0 99967.39 0
48  3 0 1.875 1 1        0 0
69  3 1  1.25 1 1   297640 0
67  0 0     1 1 1        0 0
54  7 1  .125 1 1        0 0
45  0 0  .875 1 1        0 0
75  0 1   1.5 1 1        0 0
70  0 1  1.25 1 1   -20700 1
30 15 0    .1 1 0     9930 1
82  0 1  .625 1 1    -1590 1
58  0 0     1 1 1     4200 1
40  5 0  .125 1 1     8500 1
50  7 0  .125 1 0    11350 1
65 10 1 1.375 1 1     2500 1
42  3 0 .0256 1 1     3350 1
60  0 0    .7 1 1   -27290 1
67  8 1  .875 1 1   -27000 1
49  5 1    .3 1 1        0 0
55  5 1   .75 1 1      200 0
56  0 1    .5 1 1    10800 0
50 12 1   .75 1 0        0 0
55  0 0   .75 0 1        0 0
25  8 0  .125 1 1        0 0
46 12 1     1 1 1     5250 0
45  0 0  .375 1 1        0 0
68  0 0 1.125 1 0     3000 0
65  0 0 1.375 0 1        0 0
43  5 0  .625 1 1        0 0
58  5 1   .25 1 1        0 0
62  5 1   .75 1 1        0 0
65  0 0    .5 1 1        0 0
27 10 0  .125 1 0  63796.2 0
68  5 1  1.75 1 1        0 0
64  0 0  .125 1 1        0 0
57  8 1    .5 1 1        0 0
84  0 0   .75 1 1     1000 0
48  0 0  .125 1 0        0 0
72  0 0   1.5 1 0        0 0
70  0 1   2.5 0 1        0 0
85  3 1  .375 1 1        0 0
47  7 1     2 0 1        0 0
48  7 0  .125 1 0        0 0
51  0 0   .75 0 1        0 0
80  0 0   .25 1 0        0 0
40 10 0     0 1 1        0 0
72  0 0 1.875 1 0        0 0
37 10 0  .375 1 0    54300 0
50 10 1 1.125 0 1     7250 1
51 10 1   .05 1 1    -7740 1
68  9 1    .6 1 1   -21000 1
54 10 1  .625 0 1   239900 1
63 10 1    .7 1 1     7805 1
32  6 0  .375 1 0     4100 1
61 10 1  .875 1 1     -300 1
46  8 1  .375 1 1     -300 1
51  2 0   .45 1 1     2875 1
61 11 1   2.5 1 1   296640 0
56  3 1  .375 1 1        0 0
60  0 0   1.5 1 0        0 0
53 12 1  1.75 1 1        0 0
37  8 0  .875 1 1    -2600 0
64  0 1   .15 1 0        0 0
73  0 1  .625 1 1        0 0
65  0 0  1.25 1 0  45116.3 1
29 12 1   .75 1 1  12299.9 1
38 12 1   1.5 1 1    90250 1
64  0 0   2.5 1 1   -23600 1
66  0 1  .375 1 1   -20200 1
32  6 0 .4375 1 1    42300 1
68 10 1    .4 1 1   -18600 1
54  5 1 1.875 1 1     -750 1
55 10 1   .75 0 1   5933.3 1
49  0 0   .75 1 1     5400 1
56  0 0  .875 1 1        0 0
46 12 1  1.25 1 0    13500 0
52  0 0  .125 1 1        0 0
34 10 0   .25 1 0        0 0
68  0 0  .625 1 0        0 0
24  7 0   .25 1 0        0 0
80  8 1  .875 1 1        0 0
85  0 0    .5 1 0        0 0
53  5 1  .375 1 1        0 0
30  5 0  .125 1 0    -2900 0
44  8 1  .002 0 1        0 0
35  8 0    .5 1 0        0 0
36  7 0   .25 1 1    -4258 0
58  3 0    .5 1 1        0 0
43  0 0   .25 0 1        0 0
21  5 0  .125 1 0        0 0
53  5 1  .125 1 1        0 0
55  6 1    .4 1 1        0 0
70  0 1  .625 1 1        0 0
58  0 1    .5 1 1        0 0
42  9 1  .125 1 1        0 0
end

I used the following code:

Code:

gen rownum = _n
teffects nnmatch (OutcomeY age education male farmsize employment hh_type) (TreatmentX), ///
    atet biasadj(age education farmsize hh_type) ematch(male) dmvariables ///
    nneighbor(1) caliper(0.5) gen(nnid)

save complete.dta, replace
keep if TreatmentX == 1 & !mi(nnid1)
save treated.dta, replace

use complete.dta
keep if TreatmentX == 0 & !mi(nnid1)
save control.dta, replace

drop nnid1
rename rownum nnid1
save control.dta, replace

merge 1:m nnid1 using treated.dta, keepusing(nnid1)
keep if _merge == 3
save Matched.dta, replace

append using treated.dta

save Final_Matched.dta, replace

The above code gave me 29 treatment observations and 29 control observations, implying that there has been one-to-one matching WITHOUT replacement. I would like to do 1:1 matching WITH replacement, which means I am likely to get fewer controls than treated observations in the final matched sample.
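
One possible reading of the 29/29 result: -merge 1:m- returns one row per treated observation, so a control that is matched to several treated observations appears once for each of them, and the number of control rows equals the number of treated rows whether or not controls are reused. The sketch below, on simulated data rather than the data posted above, counts distinct controls instead and keeps each one once with a usage count. The variable names treat, x1, x2, rownum, and times_used are hypothetical, and the -caliper()-, -ematch()-, and -biasadj()- options are left out for simplicity.

```stata
// Simulated stand-in: 100 treated, 350 controls
clear
set seed 12345
set obs 450
gen byte treat = _n <= 100
gen x1 = rnormal()
gen x2 = rnormal()
gen y = x1 + x2 + treat + rnormal()
teffects nnmatch (y x1 x2) (treat), atet nneighbor(1) generate(nnid)
gen long rownum = _n

// One row per distinct control used by a treated observation,
// with the number of treated observations it was matched to
preserve
keep if treat == 1 & !missing(nnid1)
contract nnid1, freq(times_used)
rename nnid1 rownum
tempfile used
save `used'
restore
merge 1:1 rownum using `used', keep(master match) nogenerate

// Matched sample: matched treated, plus each used control exactly once
keep if (treat == 1 & !missing(nnid1)) | (treat == 0 & !missing(times_used))
tab treat
tab times_used if treat == 0
```

If some controls are reused, the control count in the first table should fall below the treated count, and the second table shows how often each control was used.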

Can you provide me any suggestion on this?

Last edited by Sanjay Sharma; 05 Jun 2021, 06:47.
