Matching sample - Statalist

David Tranv

Join Date: Jun 2016

Posts: 47
#31

24 Jun 2016, 19:52

thank Clyde and Robert for your answers!
Clyde Schechter : There is no match because I just post only 30 first rows. I think if I use longer string, it could have matched.
Robert Picard : Very impressive your code as well as your rangestat program. As you said, it is complicated rangestat.

I still have some more problems:

1. Just alternative code to see whether I understand "rangestat".

Code:

* find the index of the matching private firm replace gta_low = 0 if mi(gta_low) rangestat (max) id_low = id2use, interval(gta2use gta_low gta_low) by(yq) ***** *--> I think we can do this code directly rangestat (max) id_low2 = index, interval(gta2use gta_low gta_low) by(yq) assert id_low==id_low2 ***** * the matching private firm for that quarter with the nearest higher gta_w value gen double rssd_low = rssd9001[id_low] ***** *--> Or alternatively, rangestat (max) rssd_low2=rssd9001,interval(gta2use gta_low gta_low) by(yq) assert rssd_low==rssd_low2 *****

2. If I understand, the code you provide allow me to match the first quarter by each id ("rssd9001") because I saw that the quarter ("yq") is always 1990q1.
So with this code, we can identify the private firm that have the closest size ("gta_w") with public firm at this first quarter.
So how I can index that this private firm is matched? For example, in the "combo.dta", the private bank with rssd9001==699751 is matched with public bank with rssd9001==46 (row2), how I can create dummy or index in row 5886 that could me that rssd9001==699751 is matched?

3. In the paper I send, authors say that:

In "Corporate Investment and Stock Market Listing: A Puzzle?" (p.350)
Once a match is formed, it is kept in subsequent years to ensure the panel structure remains intact.
This allows us to estimate within-firm investment regressions.

So what I understand is once I identified the matched private firm, we use the matched private firm for the subsequent quarters. So there are 2 possibilities: (let's take the same example above):
- One is when the numbers of quarters of public firm are smaller than those of matched private firms? For example: public firm (rssd9001==46) have 3 more quarters (1990q2,1990q3,1990q4), whereas matched private firm (rssd9001==699751) have 5 more quarters. How I can tell Stata that I just use 3 more quarters of matched private firm?

- Second is (I think more complicated) when the numbers of quarters of public firm are larger than those of matched private firm? For example: public firm (rssd9001==46) have 3 more quarters (1990q2,1990q3,1990q4), whereas matched private firm (rssd9001==699751) have 2 more quarters. So I have to re-match the third quarter of public firm. How can I do that?

Thanks for your help!
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#32

25 Jun 2016, 09:46

Good work on figuring out how rangestat works. With respect to 1., if a private firm has a gta_w value that is exactly the same as the one of the current public firm, your first code may pick-up the public firm's rssd9001. That's why I define id2use as 0 for public firms, the (max) will always pick-up a private firm. You are correct that you can skip the index step altogether; the following does the same thing:

Code:

use "combo.dta", clear

* an overall observation identifier
gen long index = _n

* ------- find nearest but larger private firm in the quarter ------------------
* target larger firms, no need to look beyond gta_w * 2, that would exceed
* an assert ratio of 2. Use values that will match nothing for private obs.
gen low = cond(public, gta_w, c(minfloat))
gen high = cond(public, gta_w * 2, c(minfloat))

* we are looking for private gta_w values only
gen gta2use = cond(public, c(maxfloat), gta_w)

* find the smallest gta_w value of private firms in the quarter
rangestat (min) gta_high = gta_w, interval(gta2use low high) by(yq)

* find the rssd9001 code of the matching private firm
replace gta_high = 0 if mi(gta_high)
gen double rssd2use = cond(public, c(mindouble), rssd9001)
rangestat (max) rssd_high = rssd2use, interval(gta2use gta_high gta_high) by(yq)


* ------- find nearest but smaller private firm in the quarter -----------------
* ignore firms half the size, they are outside the desired asset ratio of 2
drop low high
gen low = cond(public, gta_w / 2, c(minfloat))
gen high = cond(public, gta_w, c(minfloat))

* find the largest gta_w value of private firms in the quarter
rangestat (max) gta_low = gta_w, interval(gta2use low high) by(yq)

* find the rssd9001 code of the matching private firm
replace gta_low = 0 if mi(gta_low)
rangestat (max) rssd_low = rssd2use, interval(gta2use gta_low gta_low) by(yq)


* --------- select the nearest neighbor using the size difference --------------
gen dhigher = gta_high - gta_w
gen dlower  = gta_w - gta_low
gen double rssd_near = cond(dlower < dhigher, rssd_low, rssd_high)

keep if public
isid rssd9001 yq, sort
save "nearest_by_yq_b.dta", replace

On your 2. comment, by specifying a by(yq) option, rangestat will only look for a match in the same quarter. If you want to go back to the "combo.dta" data and flag private firms that have matched, you simply load the matched results, reduce to one observation per rssd_near and do a 1:m merge with "combo.dta" (after adjusting the variable's name to match). I think you need to address your third point before doing this.

On your 3. comment, my code example showed how to replicate the initial match to later quarters. As you reasoned, there's an issue if the private firm exits before the public firm. As I pointed out, Asker et al. simply say:

If a matched private firm exits the panel, a new match is spliced in.

I'm not sure I understand the logic of switching private firms mid-course and what that does to the panel, the new private firm could be quite different in other ways than gta_w. They don't say how often this was required. I can also imagine that you could have to do this more than once for the same public firm if the alternate private firm exit as well. What about missing data? what if the private firm is missing data for one or more quarters?

The good news is that you have a best match for each quarter already computed. So if a private firm exits, you just switch to the best match for the quarter instead of using the previously matched private firm. How to do this is a bit tricky. I would iterate to substitute until you find no more cases of matched private firms with no data. This would address both the case of private firms that exit altogether and cases where there are missing quarters. It would go something like this:

Code:

* save the sample of private firms with data
use "combo.dta", clear
keep if public == 0
keep rssd9001 yq
rename rssd9001 final_near
save "private_sample.dta", replace

* load the match results and drop initial quarters with no match
use "nearest_by_yq.dta", clear

* the data is already sorted but double check
isid rssd9001 yq, sort

* order matches (do not count missing values)
by rssd9001: gen match_i = sum(!mi(rssd_near))

* drop quarters until the first match
drop if match_i == 0

* the overall matched firm is the first found
by rssd9001: gen final_near = rssd_near[1]

* iterate until all quarters have a matched private firm with data
local more 1
while `more' {
    merge m:1 final_near yq using "private_sample.dta", keep(master match)
    bysort rssd9001 (yq): gen has_exit = sum(_merge != 3)
    replace has_exit = 1 if has_exit > 1
    count if has_exit
    if r(N) {
        bysort rssd9001 has_exit (yq): replace final_near = rssd_near[1] if has_exit
    }
    else local more 0
    drop _merge has_exit
}

* count the number of private firms per public firm
bysort rssd9001 final_near (yq): gen first_quarter = _n == 1
bysort rssd9001: egen nprivate = total(first_quarter)
bysort rssd9001: gen public1 = _n == 1

tab nprivate public1

As you can see from the final tabulate results, up to 7 private firms are required:

Code:

. tab nprivate public1

           |        public1
  nprivate |         0          1 |     Total
-----------+----------------------+----------
         1 |    58,537      3,205 |    61,742 
         2 |    29,662        840 |    30,502 
         3 |    10,871        238 |    11,109 
         4 |     3,554         58 |     3,612 
         5 |     1,463         19 |     1,482 
         6 |       280          3 |       283 
         7 |        95          1 |        96 
-----------+----------------------+----------
     Total |   104,462      4,364 |   108,826

Comment

David Tranv

Join Date: Jun 2016
Posts: 47

#33

25 Jun 2016, 22:35

Many thanks for your powerful code. it works perfect even take long time to be able understand it.
I still have some problems.

1. To see whether it is correct when I create a separated "matched_private_firm"

Code:


    use "combo.dta",clear
    keep if listed2 == 0      //108826 dropped -> 738111 obs
    rename rssd9001 final_near
    save "private_sample.dta", replace

    use "nearest_by_yq_matched.dta",clear
    keep final_near quarter   //final_near=rssd matched private
    save "private_sample_matched.dta",replace   //108826 obs
    merge m:1 final_near quarter using "private_sample.dta"
    keep if _merge==3    //662416 dropped, 108826 obs
    drop _merge
    rename final_near rssd9001
    save "private_sample_matched.dta",replace
    isid final_near quarter

2. When I use -isid final_near quarter- as the code above, the data is not unique.
How can I refer that a private firm is already being matched?
And then, how can I use a private firm for match just one time (without replacement)?

Thanks for your help!

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#34

26 Jun 2016, 07:38

As I mentioned in #30, Asker et al. does the matching with replacement and that's what I implemented. To do it without replacement is a completely different problem. At the moment, I can't think of a way to do it without looping. The details depend on what you want in the end, in particular if you need to continue with the matched private firm in later quarters. That would require finding unique matches in the first quarter, then moving on to the second quarter, and so on.
Comment
David Tranv

Join Date: Jun 2016

Posts: 47
#35

26 Jun 2016, 19:38

yes, you're right.
it seems more complicated for without replacement.
anyway, many thanks for your help!
have a nice week!
Comment
Erik Olson

Join Date: Aug 2016

Posts: 3
#36

20 Aug 2016, 14:05

Hello all,

I'm working on a first year summer paper and am having trouble creating a matched sample. I am using data from the COMPUSTAT database. I want to look at US firms who have recently moved their headquarters, and created an indicator variable for these firms by "gvkey" (a unique firm identifier) called "usmoveind". This value is equal to one for firms who have moved their headquarters during the time period I wish to observe. I have also created an indicator for the 1-3 years prior to the year in which the firm moved it's headquarters called "apremoveind". It should be noted that "apremoveind" is a function of the individual firm (captured by the "gvkey" variable) and that firm's specific year of interest (variable name "fyear").

For example, here is how I created an indicator for one of the gvkeys to indicate that I wished to look at the 3 years before and after the move which was completed in the year 2001:

Code:
gen usmoveind=1 if gvkey==5959 & fyear>1997 & fyear<2005

And now, this is how I created a unique identifier for the 3 years before the move for this particular firm:

gen apremoveind=1 if gvkey==5959 & fyear>1997 & fyear<2001

I also created a variable to identify the industry in which the firm operates called "sic2". This variable was applied to the entire dataset, not just the US firms who moved. Finally, I created a measure for GAAP effective tax rates, called "gaap_etr" which is my primary variable of interest..

What I would like to do is identify a matched sample of firms (by "gvkey") with respect to "gaap_etr" in the "apremoveind" period for firms who have the "usmoveind" ==1. It is absolutely critical that the matched sample have the exact same same value for "sic2". Thus, for my US move firms, I would like to find comparable firms (in the same industry) who have a "gaap_etr" which is within a certain margin (say, +/- 1%) of the "gaap_etr" for my US move firms DURING the US firms "apremoveind" period (which ranges from 1-3 years and is a function of the firm year, variable name "fyear", as well as the unique firm identifier, variable name "gvkey"). Ideally, I would need the 3 year period for the matched firm to be the same 3 year period as the US Move firm, thus, the "fyear" variable should cover the same 3 year period.

I would also ideally like to match on two other variables, total assets (variable name "at") and revenues (variable name "revt"). Trying to match exactly on "sic2" and find a comfortable range for "gaap_etr" (most important matching characteristic), "at", and "revt" all in the same 3 years which cover the "apremoveind" for my US Move firms is giving me a bit of a headache. I would like to get a list of these control groups, create an indicator for them (call it "controlind"), and run some analyses to see how they compare to my "usmoveind" firms during the 3 year period before the headquarters move.

Any help on this matter would be greatly appreciated! I hope this is clear, and would be happy to clarify!

Thanks,

Erik
Comment
Florian Seliger

Join Date: Apr 2014

Posts: 41
#37

21 Nov 2016, 03:53

Originally posted by Clyde Schechter View Post

Your question is unclear. In the sample you describe in your second paragraph, you state there are 39000 listed companies outside of USA. Does that sample show their country code, sic code, and market value? And are these variables available in your initial sample of 130 cases? Let me also assume that in both files each company has some kind of identifier, such as a name or a code number, which I will call company_id

If so, matching on country code and sic code should be simple. As for market value, I would imagine that an exact match will be rare, if it even occurs at all. You must have some acceptable range of closeness of market value in mind. What would that be?

Assuming the variables I mention above are all there, and calling your acceptable limit of difference in market value 1000 for the sake of illustration:

Code:

use sample_of_130_companies, clear rename company_id case_company_id rename market_value case_market_value joinby sic_code country_code using sample_of_39000_companies gen delta = abs(market_value - case_market_value) drop if delta >= 1000

At this point, all pairs of the 130 and 39000 that agree on sic code and country code and whose market values differ by less than 1000 are now paired up in memory. Next step is to select the best match based on closeness of market value for each of the 130

Code:

// GENERATE SOME RANDOM NUMBERS TO RANDOMLY BREAK ANY TIES // ON CLOSENESS OF MATCH OF MARKET VALUE set seed your_lucky_number_here gen double shuffle1 = runiform() gen double shuffle2 = runiform() /* IF YOU WILL ONLY ALLOW A GIVEN COMPANY FROM THE 39000 TO BE USED WITH A SINGLE ONE FROM THE 130 THEN INCLUDE THIS: by company_id (delta shuffle1 shuffle2), sort: keep if _n == 1 */ // NOW FOR EACH COMPANY FROM THE 390 CHOOSE THE BEST // MATCHING CASE by case_company_id (delta shuffle1 shuffle2), sort: keep if _n == 1 drop delta shuffle1 shuffle2

At that point the data in memory will contain some or all of your original 130 companies, each paired with a match from among the 39,000. It is possible, of course, that some of the 130 will find no acceptable match among the 39,000. That is one of the limitations of a matched-pair analysis.

Dear Clyde,

I know this is an old thread, but I was searching for a solution to my matching problem that is basically the same as Marko's.

I implemented your code, but I do not understand why you generate two more variables, shuffle1 and shuffle2. What's the idea behind this?

Thank you,

Florian
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#38

21 Nov 2016, 18:17

Not clear if you're asking why these variables are needed at all, or if you are specifically wondering why I used two of them.

The general need for them is as follows. There may be (probably are) multiple controls that match some or all of the case companies. Since only a single matched control was desired in the end, it is necessary to select one from among the possible matches for each case company. The variables shuffle1 and shuffle2 accomplish that. The -by case_company_id (delta shuffle1 shuffle2), sort:...- command sorts the matched pairs with the closest matches (i.e. smallest value of delta) first, and if there is more than one such, in random order. The first of those is then retained as the match.

The reason for using two random variables rather than just one is that the data set here was quite large, and a single random variable shuffle1 might have resulted in ties, so that the random selection would not be reproducible if the code were re-run. By using two random variables, the probability of a tie on both shuffle1 and shuffle2 is so low that we can presume it won't happen in a data set this size.
Comment
Florian Seliger

Join Date: Apr 2014

Posts: 41
#39

22 Nov 2016, 01:41

Thank you Clyde.
Ok, if I understood it correctly, the reason for including them is sorting matched pairs in random order (if there are multiple matches).

I generated a second delta (delta1, delta2) referring to a second variable on which I would like to match-

Could you show me how you would include a second numerical variable?

Would you just insert it into
by company_id (delta shuffle1 shuffle2)? Best, Florian
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#40

22 Nov 2016, 08:38

Well, it gets more complicated to require a "closest match" on two deltas. What do you want to do if you have a match that's closer on delta1 than any other, but there is another possible match that's closer on delta2? Can you prioritize one of the deltas? If so, and if, say, delta1 takes priority, then you just replace -delta- by -delta1 delta2- in that -by- prefix. But if there is no priority, then you need to come up with some formula that captures the tradeoff between them. Perhaps you want the match with the smallest value of delta1+delta2; that's a way of saying the closeness on delta1 and delta2 are equally important. But there are endless other possibilities. The choice among them derives from the content and subject matter of what you're doing and is not a statistical issue per se.
Comment
Florian Seliger

Join Date: Apr 2014

Posts: 41
#41

22 Nov 2016, 10:32

Originally posted by Clyde Schechter View Post

Well, it gets more complicated to require a "closest match" on two deltas. What do you want to do if you have a match that's closer on delta1 than any other, but there is another possible match that's closer on delta2? Can you prioritize one of the deltas? If so, and if, say, delta1 takes priority, then you just replace -delta- by -delta1 delta2- in that -by- prefix. But if there is no priority, then you need to come up with some formula that captures the tradeoff between them. Perhaps you want the match with the smallest value of delta1+delta2; that's a way of saying the closeness on delta1 and delta2 are equally important. But there are endless other possibilities. The choice among them derives from the content and subject matter of what you're doing and is not a statistical issue per se.

thanks a lot!
Comment
Charlotte Smith

Join Date: Feb 2017

Posts: 27
#42

05 May 2017, 05:03

Hello,

I am also using matching for my master thesis, and while reading this, I was wondering why you calculate the ratio as shown below. Can you also calculate it as case_market_value/market_value? In my sample it yields to very different results and I do not quite understand why!

I hope someone is willing to clear this up.

Originally posted by Clyde Schechter View Post

Code:

gen ratio = market_value/case_market_value drop if !inrange(ratio, 1/1.3, 1.3)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#43

05 May 2017, 08:14

I was wondering why you calculate the ratio as shown below. Can you also calculate it as case_market_value/market_value?

No particular reason. I would expect things to work out either way. If one of the market_value variables has values close to zero, then putting that one in the numerator would lead to more tractable numbers and stable calculations. But case_market_value/market_value and market_value/case_market_value are reciprocals of each other and the -inrange()- arguments are symmetrical around 1 in a multiplicative sense..

In my sample it yields to very different results and I do not quite understand why!

The two lines of code you quote in #42 should result in the same sample being retained at the end. If ratio calculated one way is between 1/1.3 and 1.3, then its reciprocal (calculated the other way) is necessarily also between 1.3 and 1/1.3. So I suspect that there is something else going on here that cannot be discerned from the information you have so far provided.
Comment
Charlotte Smith

Join Date: Feb 2017

Posts: 27
#44

05 May 2017, 08:39

Hello,
This is the first one i tried:

Code:

set seed 16 joinby sic fiscalyear using `holding', unmatched(master) gen ratio = control_lagsize/lagsize drop if !inrange(ratio, 1/1.3, 1.3) set seed 17 gen double shuffle1 = runiform() gen double shuffle2 = runiform() by control_gvkey (ratio shuffle1 shuffle2), sort : keep if _n==1 by case_gvkey (ratio shuffle1 shuffle2), sort : keep if _n==1 drop ratio shuffle1 shuffle2

Because this led to a rather "strange" sample, being:

| Y
X | 0 1 | Total
-----------+----------------------+----------
0 | 464 15 | 479
1 | 464 15 | 479
-----------+----------------------+----------
Total | 928 30 | 958

That's why i assumed something went wrong and tried the following:

Code:

set seed 16 joinby sic fiscalyear using `holding', unmatched(master) gen ratio = lagsize/control_lagsize drop if !inrange(ratio, 1/1.3, 1.3) set seed 17 gen double shuffle1 = runiform() gen double shuffle2 = runiform() by control_gvkey (ratio shuffle1 shuffle2), sort : keep if _n==1 by case_gvkey (ratio shuffle1 shuffle2), sort : keep if _n==1 drop ratio shuffle1 shuffle2

This led to a better looking sample, but I indeed did not change anything to the 1/1.3 and 1.3 because that did lead to no matches. I probably did something wrong here.
I did however manually check some of the observations for both the first and the second code, and it does look like the difference is range is 30% for both codes..
More specifically, these are the first three observations:
- first code
lagsize control_lagsize
7.203108 5.622001
10.28004 8.313141
10.58946 8.162308

-second code
lagsize control_lagsize
7.475793 8.335548
8.672144 10.863
7.304893 9.221478

Am I totally missing something here? I still don't understand the difference!!
Comment
Bo Wang

Join Date: Feb 2016

Posts: 12
#45

18 Oct 2017, 05:31

Hi Clyde,
@Clyde Schechter
Sorry to bother you with this old thread. I have a similar problem when matching firms. What I would like to do is to match an M&A acquirer with a control firm (not an acquirer) which has the sample probability of engaging in acquisitions and has the same leverage deficit in the same industry. I calculated the probability first and get the variable "Probability". Then I need to match based on "Probability" and "Leverage deficit", both are continuous factors. I have two datasets, one is the acquirers and the other is non-acquirers. And what I understand is I need to combine these two datasets by using joinby industry_code, and I can do it if I just match on one variable. Your reply in #5 says

It gets more complicated if you add another continuous factor to match on. The reason is that, in general, the closest match on one of the continuous variables will not be the same record that is the closest match on the other. So in this situation, you need first to specify some rule for trading off the closeness of match on one with the closeness of match on the other. Then, the code I wrote would require additional work to do implement that.

I am not sure what do you mean the rules for trading off the closeness of match? What would the code be if I want to match on two continuous variables? Does this mean I need to combine the dataset twice, once for "Probability" within the band, say (-0.1, +0.1). Then combine the datasets again and find the closest "leverage deficit"?

Thanks. Your reply will be highly appreciated.

Bo

Last edited by Bo Wang; 18 Oct 2017, 05:54.
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment