Matching firm algorithm

MIchael Jefferson

Join Date: Feb 2019

Posts: 36
#1

11 Feb 2019, 08:15

I am fairly new to Stata and, for my PhD, would like to reproduce previous research.

I have two datasets: dataset 1 contains 500 IPOs from 1988 to 1997, and dataset 2 contains firm data on over 4,000 firms at three specific dates (31/12/1988, 31/12/1993 and 31/12/1997).

All IPO firms from Dataset 1 need to be matched to firms from dataset 2, according to the following criteria:

- IPO firms from 1988-92 must be matched to the firm with the same SIC code and the closest market value on 31-Dec-1988

- IPO firms from 1993-95 must be matched to the firm with the same SIC code and the closest market value on 31-Dec-1993

- IPO firms from 1996-97 must be matched to the firm with the same SIC code and the closest market value on 31-Dec-1997

- A firm from dataset 2 can be matched only once every 3 years

- If no matching firm in the same industry (based on SIC code) is available, a small firm from another industry must be chosen

Any help with the matching code would be really appreciated, as I’ve spent tens of hours on this already and still cannot find a solution.

Kind regards,

Last edited by MIchael Jefferson; 11 Feb 2019, 08:23. Reason: Matching
Tags: algorithm, firm, IPO matching, matching, Matching algorithm
Mike Lacy

Join Date: Apr 2014

Posts: 2423
#2

11 Feb 2019, 09:52

First, I'd have one question: Is it OK in your situation to have one of your control firms (non-IPO) be matched to more than one IPO case? If not, you won't likely be able to get the "closest" match for each firm since two firms might share the same nearest neighbor. Here are two possibilities, described schematically, possibly with bugs in the code, since I didn't want to create example data for testing.

1) Pair each IPO firm with its closest match, possibly shared with another firm.

Code:

use dataset2
rename id id2
keep id2 sic market88
rename market88 market88_2
save controls.dta
//
clear
use dataset1
keep if inrange(ipoyear, 1988, 1992)
rename id id1
rename market88 market88_1
// Make a data set in which each IPO firm is paired with all
// controls within the same sic code
joinby sic using controls.dta
//
// Within each collection of pairs for a particular IPO firm
// keep the one with the smallest difference in 1988 market values.
gen diff = abs(market88_1 - market88_2)
bysort id1 (diff): keep if (_n == 1)
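The code above covers only the 1988-92 cohort. A minimal sketch of how the same idea might be extended to all three periods at once is below. It assumes, hypothetically, that dataset 2 holds the three market values in variables named market88, market93 and market97, and that each IPO firm carries its own market value in a variable called market_1. The toy data and all of these names are illustrative only. Like the code above, it allows a control to be shared by several IPO firms.

```stata
// Toy controls in wide form (hypothetical names and values)
clear
input id2 sic market88 market93 market97
101 3571  50  80 120
102 3571 200 210 190
103 2834  40  45  60
end
// One row per control and reference year
reshape long market, i(id2) j(refyear)
replace refyear = 1900 + refyear      // 88/93/97 -> 1988/1993/1997
rename market market_2
tempfile controls
save `controls'

// Toy IPO firms (hypothetical names and values)
clear
input id1 sic ipoyear market_1
1 3571 1989  60
2 3571 1994 205
3 2834 1997  55
end
// Map each IPO year to the reference date of its period
gen refyear = cond(ipoyear <= 1992, 1988, cond(ipoyear <= 1995, 1993, 1997))

// Pair each IPO firm with all same-SIC controls at its reference date,
// then keep the pair with the smallest market value difference
joinby sic refyear using `controls'
gen diff = abs(market_1 - market_2)
bysort id1 (diff id2): keep if _n == 1
list id1 ipoyear refyear id2 diff, noobs
```

Run as a single do-file so that the temporary file name stays defined. Ties in diff are broken by the lower id2, which is an arbitrary choice.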

2) Use the user-contributed command -calipmatch-, which will match each IPO firm to one or more control firms that fall within some specified range of closeness (caliper width) on the market88 value, but not necessarily the closest. The match will be "greedy," with controls taken without replacement. See -ssc describe calipmatch-.

Code:

use dataset2
keep id sic market88
gen case = 0
save controls.dta
//
clear
use dataset1
gen case = 1
append using controls.dta
// Match market value within 1000, for example. Only one control per case, but you might want more.
// The -if- keeps all controls (case == 0) plus the 1988-92 IPO firms.
calipmatch if case == 0 | inrange(ipoyear, 1988, 1992), gen(pairid) casevar(case) maxmatches(1) ///
    calipermatch(market88) caliperwidth(1000) exactmatch(sic)

Last edited by Mike Lacy; 11 Feb 2019, 10:29.
MIchael Jefferson

Join Date: Feb 2019

Posts: 36
#3

13 Feb 2019, 04:20

Thank you for your response!

Excuse me for not being clear enough. It is not okay to use firms from dataset 2 (Non-IPO firms) twice within 3 years.

Example: After matching an IPO firm of 1-Jan-1988 (Dataset1) to a non-IPO firm of 31-dec-1988 (Dataset 2), this non-IPO firm cannot be matched again with an IPO firm of 31-Dec-1990. However, this non-IPO firm can be matched with an IPO firm from 1-Jan-1991 onwards.

I think this problem could be overcome by creating a certain loop. However, I do not know how to create this.
Mike Lacy

Join Date: Apr 2014

Posts: 2423
#4

13 Feb 2019, 09:00

I'm not sure I understand exactly your rules for re-using controls, but as near as I understand them, I'm not thinking of any easy way to implement them, although I'd presume some reasonable method exists. I think there was a thread on Statalist a few years ago, in which I participated, about how to do matching without replacement, so you might try searching for that. Keywords might be "cases, controls, without replacement, match." If I recall correctly, I found a solution in which, after picking a control for a case, the program deleted that control from the pairs pertaining to all other cases, a brute-force approach.
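As a rough illustration of the brute-force idea just described, adapted to the three-year rule from #3, the sketch below processes IPO firms in date order, gives each one its closest still-available control, and then blocks that control for three years. The toy pairs and the variable names (id1, id2, ipodate, diff, lastuse) are assumptions for illustration. In practice the pairs would come from a -joinby- on SIC code as in #2. It is a sketch rather than a vetted implementation, and because it is greedy it does not guarantee the best overall set of matches.

```stata
// Toy file of candidate pairs: IPO firm (id1), control (id2), and the
// absolute market value difference (diff). All values are made up.
clear
input id1 str9 ipo_s id2 diff
1 "01jan1988" 101 5
1 "01jan1988" 102 9
2 "31dec1990" 101 1
2 "31dec1990" 102 7
3 "01jan1991" 101 2
3 "01jan1991" 102 8
end
gen long ipodate = date(ipo_s, "DMY")
format ipodate %td
drop ipo_s

// Number the IPO firms in date order; within each, best pair first
egen long caseno = group(ipodate id1)
sort caseno diff id2
gen long obsno = _n
gen byte chosen = 0
gen long lastuse = .        // IPO date at which the control was last used
format lastuse %td

summarize caseno, meanonly
local ncase = r(max)
forvalues c = 1/`ncase' {
    // A pair is usable if its control was never used, or was last used
    // at least three calendar years before this IPO date.
    // Caveat: a control last used on 29 Feb gets a missing release date
    // from mdy() and therefore stays blocked; handle that separately.
    quietly gen byte ok = (caseno == `c') & (missing(lastuse) | ///
        ipodate >= mdy(month(lastuse), day(lastuse), year(lastuse) + 3))
    summarize obsno if ok, meanonly
    if r(N) > 0 {
        local pick = r(min)             // first usable row = smallest diff
        local ctl  = id2[`pick']
        local when = ipodate[`pick']
        quietly replace chosen = 1 in `pick'
        quietly replace lastuse = `when' if id2 == `ctl'
    }
    drop ok
}
keep if chosen
list id1 ipodate id2 diff, noobs
```

With these toy pairs, control 101 is taken by the 1-Jan-1988 IPO, so the 31-Dec-1990 IPO falls back to control 102, and control 101 becomes available again for the 1-Jan-1991 IPO, mirroring the example in #3.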

Matching without replacement is generally difficult. Although I personally find "without replacement" methods more intuitive, I believe (?) that the matching estimators in the built-in command -teffects- use matching *with* replacement, so I'd wonder if that might be preferable for your ultimate analytic goals. My impression is that -teffects- implements quite up to date methods.
MIchael Jefferson

Join Date: Feb 2019

Posts: 36
#5

17 Feb 2019, 04:57

Thank you for your response!

Basically, I cannot re-use the controls for 3 years after using them.

I looked for the thread you are mentioning. Unfortunately, I could not find it.

Anyone else with suggestions?

David Benson

Join Date: Oct 2018

Posts: 489
#6

17 Feb 2019, 22:17

It looks to me like the Statalist post Mike Lacy mentioned is this one: Question on matching in a nested case control study

For other Statalist posts on matching firms without replacement (usually (a) have to be in same SIC, then (b) find closest in size), see here, here, and here

Originally posted by Mike Lacy

Setting aside considerations of whether sampling with or without replacement is preferable, I have some code that I think does incidence density sampling without replacement. The overall strategy is to start with a file of both the cases and controls, and use it to make a file of all possible pairs of cases and controls. Then, only the pairs with a case that matches on the covariates are retained, and then the pairs that don't meet the risk set condition are dropped. At this point, using a loop over all sets of case-control pairs, a sort of greedy sampling is performed: The controls for each case are picked, then any other pairs involving those controls are deleted. I'm not certain that what I have done is right, or that it is the most efficient approach, but I think it's close and fast enough.

Code:

// Matched case-control sampling using incidence density sampling, with no replacement.
//
// Create example files of cases and controls to work with.
// Example conditions
clear
set seed 33245
local matchvars = "x y z"
local maxtime = 50
local pcase = 0.05 // proportion of case events among all observations
local ControlsPerCase = 3
set obs 100000 // total number of persons, cases and potential controls
//
//
gen long id = _n // long, because int cannot hold values above 32,740
gen int evtime = ceil(runiform() * `maxtime') // time of disease event
replace evtime = . if runiform() > `pcase' // Many persons never have the disease event
// Create variables beside event time on which cases and controls would be matched.
foreach v of local matchvars {
    gen `v' = ceil(3*runiform()) // 3 values for each match variable
}
// End of preparing example data
// *********************************************
//
// Within this file of cases and controls, everyone is a potential control to start with,
// so save everyone in this file as a source of controls.
compress // important to save memory
tempfile filecase filectl
rename id idctl
rename evtime evtimectl
// randomize the order of the controls
gen rand = runiform()
sort rand, stable
drop rand
save `filectl' // file of controls
rename idctl idcase
rename evtimectl evtimecase
//
//
// Strip the current file down to just the cases.
drop if missing(evtimecase)
qui count
di r(N) " event cases in file"
//
// Pair up each case with each of the potential controls that match on the matching variables.
// We will worry about time of event, risk set, etc. later.
joinby `matchvars' using `filectl'
//
//
// Drop impossible pairs
drop if (idcase == idctl) // self pairs
drop if (evtimecase >= evtimectl) // control member is not in risk set
//
// A few details before we start incidence sampling
gen rand = runiform()
sort idcase rand, stable // randomize the order of the cc pairs
drop rand evtime* x y z // don't need these anymore
by idcase: gen byte first = (_n ==1) // just to count cases
qui count if first ==1
di r(N) " event cases that have a potential match after considering event time"
//
//
// Keep the desired number of controls for each case. For each case,
// remove her/his controls from all other case-control pairs so
// as to give sampling w/o replacement.
// I use a loop here over all case-control pairs, generally not Stata-ish,
// but it seems like a good approach here.
by idcase: gen int seqnum = _n // sequence number of the c-c pair for each case
qui levelsof idcase, local(caselist) // a list of all the cases
gen byte casedone = 0 // to mark each case as we process it.
foreach c of local caselist {
    // Keep desired number of c-c pairs for this case
    qui drop if (idcase == `c') & (seqnum > `ControlsPerCase')
    qui replace casedone = 1 if (idcase == `c')
    //
    // Make a list of the controls just used. I used a clumsy approach with preserve/restore.
    preserve
    qui keep if (idcase == `c') // current case/control pairs
    local used "" // will hold the list of controls just used
    forval i = 1/`ControlsPerCase' {
        local used = "`used' " + string(idctl[`i'])
    }
    restore
    //
    // Drop all remaining unexamined pairs that involve the controls just used
    local used = subinstr(ltrim("`used'"), " " , ",", .)
    qui drop if (casedone == 0 ) & inlist(idctl, `used')
}
// Report on number of cases and controls matched.
by idcase: gen NCtl = _N
tab NCtl if first, missing

Regards, Mike

Last edited by David Benson; 17 Feb 2019, 22:28.

David Benson

Join Date: Oct 2018

Posts: 489
#7

17 Feb 2019, 22:39

Or, Mike could've been thinking of this post: Matching participants per cases and controls

For a few more posts on matching without replacement (usually firms, but sometimes people), see here, here, here, and here
MIchael Jefferson

Join Date: Feb 2019

Posts: 36
#8

18 Feb 2019, 08:37

Thanks for your response David! I will look at the mentioned links.