How to run Nearest Neighbor Matching (nnmatch) on panel data?

Tahseen Hasan

Join Date: Feb 2018

Posts: 33
#1

05 Jan 2019, 10:26

I am trying to run a Nearest Neighbor Matching in order to run a DID inference on it.

Background:

Stata 13

This dataset contains panel data from Compustat merged with Thomson Reuters M&A dataset.

What I need to do:

Due to the nature of this data set, I have to match firms according to their size (ln(Assets)) and industry code (SIC).

In order to match the firms by Asset and SIC, I will use Nearest Neighbor Matching (nnmatch).

I have to match year and SIC code as exact (ematch)

(Later on) After matching is done, I need to perform a DID estimation to see if there is a causal effect.

However, when I use the teffects nnmatch code using ematch, I get an error.

Code:

gen treatment = 0
replace treatment = 1 if merger==1
teffects nnmatch (income industry_sic assets firm_year) (treatment), biasadj(assets) ematch(firm_year industry_sic) vce(robust) dmvariables

Error: "12 observations have no exact matches"

It runs fine if I don't use ematch but I need it in order to compare apples to apples. What do I do to deal with this problem?

Dataset:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(id firm_year) byte industry_sic int(assets income merger_year) byte merger
111 2000 22  10  10    . .
111 2001 22  12  20 2002 1
111 2002 22  30 400    . .
111 2003 22  50 470    . .
111 2004 22  60 490    . .
333 2000 22  15  10 2001 1
333 2001 22  40 100 2002 1
333 2002 22  70 200    . .
333 2003 22  80 260    . .
333 2004 22  85 270 2007 1
333 2005 22  90 280    . .
333 2006 22  95 290    . .
333 2007 22 120 700    . .
555 2000 37  40  10 2001 1
555 2001 37  60  50    . .
555 2002 37  70  70    . .
end

Last edited by Tahseen Hasan; 05 Jan 2019, 10:29.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

05 Jan 2019, 11:15

Caveat: I have rarely used -teffects-, and I have never used it with nearest-neighbor matching.

But it seems as if Stata is simply telling you that there are twelve firms for which no exact match on year and sic code exists in your data. If that is the case, then it seems to me you have the following possible workarounds:

1. Get some additional data that will match the currently unmatchable firms.
2. Omit the 12 unmatchable firms from your analysis.
3. Loosen the requirement for an exact match on year. For example, settle for a match within 2 or 5 years or something like that.
4. Loosen the requirement for an exact match on sic code. Accept a match between one industry and a closely related one (I don't know enough about SIC codes, nor, for that matter about industries, to suggest how you might implement "closely related.")
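
For option 2, here is a minimal, untested sketch of how the unmatchable observations could be flagged outside -teffects-. It assumes the variable names from #1 and that treatment is coded 0/1 with no missing values. It also assumes that, under the default ATE estimand, every observation (treated or control) needs at least one exact match in the opposite group. The names n_treated, n_control and no_exact are hypothetical.

```stata
* Count treated and control observations in each year-by-SIC cell
bysort firm_year industry_sic: egen n_treated = total(treatment == 1)
bysort firm_year industry_sic: egen n_control = total(treatment == 0)

* An observation has no exact match if its cell lacks the opposite group
gen byte no_exact = (n_treated == 0) | (n_control == 0)
tabulate no_exact

* Restrict the estimation sample instead of dropping observations
teffects nnmatch (income industry_sic assets firm_year) (treatment) if !no_exact, ///
    biasadj(assets) ematch(firm_year industry_sic) vce(robust) dmvariables
```

On the 16-row example in #1, with treatment defined as merger==1, only the 2000 and 2004 cells for SIC 22 contain both groups, so this flags 12 observations. The four that remain are far too few for a meaningful estimate, so the sketch only shows the mechanics.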
Tahseen Hasan

Join Date: Feb 2018

Posts: 33
#3

05 Jan 2019, 12:35

Thank you so much for the advice Clyde, I really appreciate it. In my case, point #2 is what I have to find a solution to. The other options won't work in my case because I don't have access to any more data. For my DID to have a valid control group, even if I could loosen up Industry matching, I won't be able to do the same for year. Even if I run ematch individually on year or SIC, I still have unmatched observations.

The only solution that I'm seeing is figuring out a way to omit the unmatchable firms from the analysis, as you suggested. However, I have over 5K unmatched observations in my full dataset so I need to figure out how to exclude those observations in order to make this work.

I have looked through all the relevant forum posts and people often suggest using the -osample()- option. However, even when I include it in my code I still face the same error so I don't know what else to do.

Code:

gen treatment = 0
replace treatment = 1 if merger==1
teffects nnmatch (income industry_sic assets firm_year) (treatment), biasadj(assets) ematch(firm_year industry_sic) osample(Unobserved) vce(robust) dmvariables

These are the two relevant threads that relate to my problem, but unfortunately I didn't find a solution for Nearest Neighbor matching in either thread:

https://www.statalist.org/forums/for...ffects-nnmatch

https://www.statalist.org/forums/for...ffects-nnmatch

I really appreciate your help Clyde. I've been struggling with this for days now and I have not been able to find a solution.

Last edited by Tahseen Hasan; 05 Jan 2019, 12:44.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

05 Jan 2019, 13:04

What about -drop if Unobserved- and then trying the -teffects nnmatch- again? Does that go through?
Tahseen Hasan

Join Date: Feb 2018

Posts: 33
#5

05 Jan 2019, 13:15

Still no luck because the -osample- option creates a new variable "Unobserved" =1 if observation does not have an exact match.

However, my code does not run at all for the Unobserved variable to be created in the first place.

These are the codes I tried but I got the same error.

Code:

drop if Unobserved==1
teffects nnmatch (income industry_sic assets firm_year) (treatment), biasadj(assets) ematch(firm_year industry_sic) osample(Unobserved) vce(robust) dmvariables

Unobserved not found

Code:

teffects nnmatch (income industry_sic assets firm_year) (treatment), biasadj(assets) ematch(firm_year industry_sic) osample(Unobserved) vce(robust) dmvariables
drop if Unobserved==1

12 observations have no exact matches
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

05 Jan 2019, 13:38

OK. I guess I don't properly understand how the -osample()- option works, or something like that.

Well, we can go back to basics and identify the unmatchable observations outside of -teffects- and then remove them before calling -teffects-. I assume that treatment is the name of the variable that distinguishes treatments from controls, and that it is 1 in the treatment group and 0 in the controls.

Code:

gen long uid = _n
preserve
keep if treatment == 0
tempfile controls
save `controls'
restore, preserve
keep if treatment == 1
keep uid industry_sic firm_year
joinby industry_sic firm_year using `controls'
keep if _merge == 1 // THESE ARE THE UNMATCHABLES
drop _merge
tempfile unmatchable
save `unmatchable'
restore
merge 1:1 uid using `unmatchable', keep(master) nogenerate

At this point the data in memory will look like the original data set, except that it has a new variable uid (which you can now drop if you like) and it excludes those treatment = 1 observations that have no exact control that agrees with them on industry_sic and firm_year. If you now run -teffects-, I think it will go through.

Note: I have not tested this code, so it may contain errors, but this is the gist of the approach.
Tahseen Hasan

Join Date: Feb 2018

Posts: 33
#7

05 Jan 2019, 14:38

Thanks again for the help Clyde.

Following your code, when I run up to the following part, the dataset looks as follows:

Code:

joinby industry_sic firm_year using `controls'

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int firm_year byte industry_sic long uid int(id assets income merger_year) byte merger float treatment
2000 22  6 111 10  10 . . 0
2004 22 10 111 60 490 . . 0
end

I am unsure if the data is supposed to look like that after rejoining it?

Right after that if I run the code:

Code:

keep if _merge == 1 // THESE ARE THE UNMATCHABLES
drop _merge
tempfile unmatchable
save `unmatchable'

It gives me an error that _merge is not found.

I tried repeating with _merger instead of _merge but I get the same error.

I know I'm probably messing up somewhere by not tailoring the code to my dataset correctly, but I'm unable to identify where my mistake is, especially with regard to why _merger is not working.

Last edited by Tahseen Hasan; 05 Jan 2019, 14:55.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#8

05 Jan 2019, 14:54

Yes, sorry. I forgot that -joinby- only generates a _merge variable if the -unmatched()- option is also specified. So change the -joinby- command to

Code:

joinby industry_sic firm_year using `controls', unmatched(both) _merge(_merge)

You cannot use -merge- instead of -joinby- here: the whole idea is to pair up each treatment observation with every control that agrees with it on industry_sic and firm_year. -merge- does not do that.
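
To illustrate the pairing, here is a toy sketch on made-up data (not the dataset from this thread; control_uid is a hypothetical name, and the code is untested). It has two treated observations, one of which shares its year and SIC code with two controls.

```stata
* Toy data: uid 1 is treated and shares its cell with controls 2 and 3;
* uid 4 is treated and has no control in its cell
clear
input long uid int firm_year byte(industry_sic treatment)
1 2000 22 1
2 2000 22 0
3 2000 22 0
4 2001 22 1
end

* Save the controls, renaming uid so it does not clash with the treated uid
preserve
keep if treatment == 0
rename uid control_uid
drop treatment
tempfile controls
save `controls'
restore

* Form every treated-control pair within the same year and SIC code
keep if treatment == 1
keep uid firm_year industry_sic
joinby industry_sic firm_year using `controls', unmatched(both) _merge(_merge)
list uid control_uid firm_year _merge, sepby(uid)
```

In the listing, uid 1 should appear once per matching control, while uid 4 should appear once with _merge equal to 1, which is what marks it as unmatchable.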

Tahseen Hasan

Join Date: Feb 2018
Posts: 33
#9

05 Jan 2019, 15:18

Hey Clyde thanks for all your patience with me.

These are my results so far. This is the full code I have used:

Code:

drop _all
clear
use "C:\Users\Tahseen\Desktop\Temp\Diff.dta",clear
cd "C:\Users\Tahseen\Desktop\Temp"
gen treatment = 0
replace treatment = 1 if merger==1

gen long uid = _n
preserve

keep if treatment == 0
tempfile controls
save `controls'
restore, preserve

keep if treatment==1
keep uid industry_sic firm_year
joinby industry_sic firm_year using `controls', unmatched(both) _merge(_merge)

keep if _merge == 1
drop _merge
tempfile unmatchable
save `unmatchable'

restore
merge 1:1 uid using `unmatchable', keep(master) nogenerate

It ran without any errors. Once I ran it, the data looks like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(id firm_year) byte industry_sic int(assets income merger_year) byte merger float treatment long uid
111 2000 22  10  10    . . 0  1
111 2002 22  30 400    . . 0  3
111 2003 22  50 470    . . 0  4
111 2004 22  60 490    . . 0  5
333 2000 22  15  10 2001 1 1  6
333 2002 22  70 200    . . 0  8
333 2003 22  80 260    . . 0  9
333 2004 22  85 270 2007 1 1 10
333 2005 22  90 280    . . 0 11
333 2006 22  95 290    . . 0 12
333 2007 22 120 700    . . 0 13
555 2001 37  60  50    . . 0 15
555 2002 37  70  70    . . 0 16
end

Once I run the teffects code then I end up getting the same error except it says "9 observations have no exact matches".

Code:

teffects nnmatch (income industry_sic assets firm_year) (treatment), osample(Unobserved) biasadj(assets) ematch(firm_year industry_sic) vce(robust) dmvariables

I am wondering if this is a legitimate bug in Stata 13? I don't see why there wouldn't be a standard option to simply ignore the unmatched observations in nearest neighbor matching. For large panel datasets (in my case I have over 500K observations) it seems nearly impossible to run exact matching.


Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#10

05 Jan 2019, 15:56

I really don't know what to say here. As I indicated, I'm an infrequent user of -teffects- and have never used it with -nnmatch-, so it may be that there is something that we are both missing here. I just don't know. Sorry I can't be more helpful here.
Tahseen Hasan

Join Date: Feb 2018

Posts: 33
#11

05 Jan 2019, 16:38

Please don't feel bad about that at all! I scoured through probably 30+ threads on this forum and on Stack Exchange on matching topics. It seems like nearest neighbor matching is not a popular command at all because most of these threads either don't have a solution or the questions go unanswered. I am truly truly grateful for all the help you've provided and the patience you had through this and if anything I learned a new way of database management from your codes which I wasn't aware of before. Thank you professor Clyde!

My next step is to see if I can do something similar with Propensity Score Matching (psmatch2) and try to get the same results because that code is slightly more popular and has more support. Nnmatch would have been ideal for me for the purposes of my paper but I will try to see if it will be sufficient to use psmatch2 instead. I will update this thread if I find useful results.

Last edited by Tahseen Hasan; 05 Jan 2019, 16:57.
River Huang

Join Date: Mar 2016

Posts: 1908
#12

06 Jan 2019, 01:22

Dear Tahseen, As far as I know, most (or all) of the existing matching approaches do not respect the structure of panel data. As such, you might want to try entropy balancing (search ebalance) or coarsened exact matching (search cem).

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#13

07 Jan 2019, 03:10

Just FYI, despite the name, psmatch2 also allows for nearest neighbour matching if you use the mahalanobis option. I would also suggest reshaping your data into a cross section before starting your matching. I.e. there should not be a year variable anymore, instead you should have assets2000 assets2002 etc. As River Huang mentioned, the matching modules don't support panel structures at all.
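
A minimal sketch of that reshape, assuming the variable names from the example in #1 and assuming industry_sic is constant within each firm (if it is not, -reshape- will stop with an error). It has not been tested on the real data.

```stata
* Keep the identifiers plus the time-varying variables to spread across columns
keep id firm_year industry_sic assets income

* One row per firm, with assets2000, assets2001, ... and income2000, ... as columns
reshape wide assets income, i(id) j(firm_year)
```

Variables that vary within a firm, such as merger and merger_year, are dropped here for simplicity. To keep them, they would have to be listed in the -reshape wide- varlist as well.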

Once you have reshaped your data, you no longer need to exact match on firm_year I think? (I did not read the first posts) There is indeed no way to ignore errors from ematch (or caliper for that matter). All you can do is use osample(newvar), which will create a new variable identifying the problem cases, and then run the command again with osample == 0 (I think 0, maybe it's 1). Then you have to hope the omitted observations weren't used to match any other observations, although I think that is more an issue with caliper than ematch.

Alternatively, you could do something along the following lines (code may need adjustment, didn't test)

Code:

* Generate sample used for estimation
teffects nnmatch ... [do not use ematch option]
gen esample = e(sample)

* Group observations by FY/SIC
egen group_FY_SIC = group(firm_year industry_sic) if esample == 1

* Count how many levels of the treatment value are present in each group
unique treatment, by(group_FY_SIC) gen(treatment_levels)

* Generate new sample restriction (only keep groups with multiple treatment levels)
gen esample_em = esample if treatment_levels > 1

teffects nnmatch ... if esample_em, ematch(...)

I am working on a wrapper for teffects that essentially changes these (to us) inconvenient design decisions. If I ever find the time, I'll add an option that fixes these ematch-issues.

Tahseen Hasan

Join Date: Feb 2018
Posts: 33

#14

07 Jan 2019, 12:39

Thank you River and Jesse for letting me know. I really really appreciate the help.

I am not able to reshape my data to wide because Stata is giving me an error that my "Firm_Year values within ID are not unique". I could not find a solution for that even though I don't see how that is the case with my data. I will provide a screenshot of the data at the bottom.

Because of that issue I proceeded onwards with my long dataset.

This is the code I have used (closely following yours). My total sample size is 50,434.

Code:

teffects nnmatch (Dependent sic2 fyear X1 X2) (Treatment), biasadj(X1 X2) osample(match1) dmvariables vce(robust)        // No ematch here
gen esample = e(sample)

egen group_FY_SIC = group(fyear sic2) if esample == 1

unique Treatment, by(group_FY_SIC) gen(treatment_levels)

gen esample_em = esample if treatment_levels > 1

*This has worked fine so far

_____________________________

teffects nnmatch (Dependent sic2 fyear X1 X2) (Treatment) if esample_em==1, ematch(sic2 fyear) biasadj(X1 X2) osample(match2) dmvariables vce(robust)

* Error: 19332 observations have no exact matches

teffects nnmatch (Dependent sic2 fyear X1 X2) (Treatment) if esample_em, ematch(sic2 fyear) biasadj(size_w bm_lag1) osample(match2) dmvariables vce(robust)

*Error: 20083 observations have no exact matches

drop if esample_em != 1

teffects nnmatch (Dependent sic2 fyear X1 X2) (Treatment), ematch(sic2 fyear) biasadj(size_w bm_lag1) osample(match2) dmvariables vce(robust)

*Error: 20083 observations have no exact matches

______________________________

Just when I get to the final teffects estimation, I am getting the same error again about observations not having exact matches. When I run it with 'if esample_em==1' the number of unmatched observations decreases to 19,332.

I have also tried the following where I tried to only match when osample(match1)==0 but I am getting the same error.

Code:

teffects nnmatch (Dependent sic2 fyear X1 X2) (Treatment), biasadj(X1 X2) osample(match1) dmvariables vce(robust)        // No ematch here, osample = match1
*Using osample, where assigned observations are 0 or missing
teffects nnmatch (Dependent sic2 fyear X1 X2) (Treatment) if match1==0, ematch(sic2 fyear) biasadj(X1 X2) osample(match2) dmvariables vce(robust)

*Error: 20083 observations have no exact matches

I am willing to drop all the "non-exact matches" observations and proceed with the analysis if that is what's needed.

If possible could you see where I'm going wrong here? Is this error occurring specifically because my data is not in wide format?

Thank you so much again Jesse I really appreciate this. This is the rough format of my dataset.

ID	Year	Income	Treat
111	2000	10	1
111	2001	40	0
111	2002	90	0
111	2003	100	1
111	2004	120	0
111	2005	190	0
333	2000	10	1
333	2001	45	1
333	2002	90	0
333	2003	110	1
333	2004	160	0
333	2005	240	1
333	2006	290	0
333	2007	380	0
555	2000	10	0
555	2001	20	1
555	2002	85	0
555	2003	195	0
555	2004	215	0

Last edited by Tahseen Hasan; 07 Jan 2019, 13:33.


Jesse Wursten

Join Date: Jan 2016

Posts: 915
#15

08 Jan 2019, 03:02

Do you have any missings in year or id? Those might be holding back your reshape. Try duplicates tag fyear id, gen(dubs) and see what you get.
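
A short sketch of those checks, assuming the identifier variables are named id and fyear as in #14 (untested on the real data):

```stata
* Missing identifiers can break the i()/j() uniqueness that -reshape- needs
count if missing(id) | missing(fyear)

* Summarize how many id-year combinations occur more than once
duplicates report id fyear

* Tag the surplus copies and inspect the offending rows
duplicates tag id fyear, gen(dubs)
list id fyear if dubs > 0, sepby(id)
```

Any row with dubs greater than 0 shares its id and fyear with at least one other row and would need to be resolved before -reshape wide- can run.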