Matching sample

Marko Pucelj

Join Date: May 2015

Posts: 15
#1

Matching sample

28 May 2015, 14:34

Hello,

I am doing my master thesis and I am rather lost, so I could use some help.
I have a sample of 130 companies for which i need to find a matching control sample. I would like to match the companies based on country code, sic code and market value. I would need at least 1 matching company for every company in my original sample. I am looking at how the liquidity of the companies (bid-ask spread, turnover...) changes after the company delists.

How can I find a matching sample? Can I use the propensity matching? My sample from which I will try to find the matching companies consists of all the listed companies outside of USA (39000), so I cannot get all the data for them (bid ask spreads, turnover, returns...), as the data is quite hard to find for all of them. Therefore I would need to get out an ACTUAL LIST WITH NAMES of matching companies, so that I could then download the data for them and get the bid ask spread,turnover... and then do a difference in difference test
Is there any way stata could give me an actual list of companies, or possibly some other software?

Thank you in advance for any help and guidance.

Marko
Tags: match, propensity, propensityscore, sample
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

28 May 2015, 15:20

Your question is unclear. In the sample you describe in your second paragraph, you state there are 39000 listed companies outside of USA. Does that sample show their country code, sic code, and market value? And are these variables available in your initial sample of 130 cases? Let me also assume that in both files each company has some kind of identifier, such as a name or a code number, which I will call company_id

If so, matching on country code and sic code should be simple. As for market value, I would imagine that an exact match will be rare, if it even occurs at all. You must have some acceptable range of closeness of market value in mind. What would that be?

Assuming the variables I mention above are all there, and calling your acceptable limit of difference in market value 1000 for the sake of illustration:

Code:

use sample_of_130_companies, clear rename company_id case_company_id rename market_value case_market_value joinby sic_code country_code using sample_of_39000_companies gen delta = abs(market_value - case_market_value) drop if delta >= 1000

At this point, all pairs of the 130 and 39000 that agree on sic code and country code and whose market values differ by less than 1000 are now paired up in memory. Next step is to select the best match based on closeness of market value for each of the 130

Code:

// GENERATE SOME RANDOM NUMBERS TO RANDOMLY BREAK ANY TIES // ON CLOSENESS OF MATCH OF MARKET VALUE set seed your_lucky_number_here gen double shuffle1 = runiform() gen double shuffle2 = runiform() /* IF YOU WILL ONLY ALLOW A GIVEN COMPANY FROM THE 39000 TO BE USED WITH A SINGLE ONE FROM THE 130 THEN INCLUDE THIS: by company_id (delta shuffle1 shuffle2), sort: keep if _n == 1 */ // NOW FOR EACH COMPANY FROM THE 390 CHOOSE THE BEST // MATCHING CASE by case_company_id (delta shuffle1 shuffle2), sort: keep if _n == 1 drop delta shuffle1 shuffle2

At that point the data in memory will contain some or all of your original 130 companies, each paired with a match from among the 39,000. It is possible, of course, that some of the 130 will find no acceptable match among the 39,000. That is one of the limitations of a matched-pair analysis.
Comment
Marko Pucelj

Join Date: May 2015

Posts: 15
#3

29 May 2015, 03:32

Thank you for the reply Mr. Schechter.

To clarify some things I did not explain well. In my original sample of 130 companies, I have the SIC codes, country codes and also the market values.
For the 39000 companies (the number of listed companies globally, excluding USA), I have also the SIC, country and market values.
There is also a common identifier for both samples, the ISIN codes.
My output variables for which I want to do a difference in difference test are bid ask spread, turnover, volume.

For the matching process. I would need an exact match on the SIC code (first three numbers, to be exact) and an exact match on country code. I guess that would get me a sample of a thousand or more potential matches. For the market value, I dont need an exact match, but a company that is similar. It could be either the nearest neighbor, or propensity score.
Example I have a company in my original sample that is from Ireland, sic code 342 and market value $5billion. I would need to find a matching company out of the 39000, that is from ireland and has sic code 342 and the market value as close as possible.

How would the code change, if instead of delta, I would use nearest neighbor, or propensity score. Which would be better in this case? Also if I wanted to match on another factor, such as book to market value, I would just repeat the code for that factor?

After running the codes, I would get a matching sample for some of the original companies as you said. Would my matching sample, be bigger than the original sample? Is there a way to include multiple matches, if they are good fits? So for example in the example of the company from ireland with sic code 342, if stata finds multiple good matches, it would take all of them or just one? Any way to specify this?
After doing all of this, would stata return me a list of matching companies with an unique identifier? (In this case the ISIN code), so that I could then download data (bid-ask spread, turnover, volume) for them, as I need to do a difference in difference test.

Thank you again for the help.

Marko
Comment
Marko Pucelj

Join Date: May 2015

Posts: 15
#4

29 May 2015, 03:39

I forgot to ask something else. Would be easy to implement that I would find a match for companies at different dates, or does that take a lot of time? For example I have company 1 that I would like to match at the end of year 2003, company 2 at the end of 2005, company 3 middle of 2006....
If that would be hard to implement, I could just match at the same time for all of the companies, it would still be ok.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#5

29 May 2015, 08:41

How would the code change, if instead of delta, I would use nearest neighbor, or propensity score. Which would be better in this case? Also if I wanted to match on another factor, such as book to market value, I would just repeat the code for that factor?

Propensity score matching is not appropriate for your requirements. Propensity score matching will not guarantee an exact match on country and SIC codes: a propensity score is, in effect, a weighted average of the matching variables.
The code I showed you is, in fact, similar to a nearest-neighbor match, only better: it guarantees an exact match on country and SIC codes.

Also if I wanted to match on another factor, such as book to market value, I would just repeat the code for that factor?

No. It gets more complicated if you add another continuous factor to match on. The reason is that, in general, the closest match on one of the continuous variables will not be the same record that is the closest match on the other. So in this situation, you need first to specify some rule for trading off the closeness of match on one with the closeness of match on the other. Then, the code I wrote would require additional work to do implement that.

Would my matching sample, be bigger than the original sample? Is there a way to include multiple matches, if they are good fits? So for example in the example of the company from ireland with sic code 342, if stata finds multiple good matches, it would take all of them or just one?

Yes, it is likely (but not guaranteed) that the matching sample would be bigger than the original sample--it depends on the availability of suitable matches in the data. You could have a complicated situation where some of the 130 have no match at all and others have even hundreds of matches--it really depends on the data. In any case, you can keep as many of the matches as you are prepared to work with in analysis. Some analytic procedures are designed for only 1:1 matches. Some are designed for 1:n matches, but where n must be the same for each case. Some are designed for 1:n matches where n can vary. So it all depends on how you plan to analyze the matched data.

Would be easy to implement that I would find a match for companies at different dates, or does that take a lot of time? For example I have company 1 that I would like to match at the end of year 2003, company 2 at the end of 2005, company 3 middle of 2006....

Assuming you are talking about an exact date match, that is very simple. You would just add the date to the list of variables in the -joinby- command. If, however, you are talking about close, but not necessarily exact, date matching, then you are in the same situation as with book-to-market value discussed above.
Comment
Marko Pucelj

Join Date: May 2015

Posts: 15
#6

29 May 2015, 15:35

Thank you so much for the explanations and guidance.

I processed the code and it gave me about 500 matches, which is great!
I just have a couple of questions. Based on some of the similar studies I read, I would be better if instead of a numeric delta for market value, I would use a range. For example companies that are 30% smaller or 30% bigger. How would that code look like?

The code line:
set seed your_lucky_number_here

gives me an error. Do I need to change something? I just pasted it in.

If I were to match by date, it would be exact date. Would my code look like this then? I currently have 15 yearly observations, so company 1 has value for market value in 2001,2002,2003... how would I code that I need to match for company 1 in 2003 and not in any other year and for company 2 in 2005 and not in any other year and so on.

Also If I run the last line of the code:
by case_company_id (delta shuffle1 shuffle2), sort: keep if _n == 1 drop delta shuffle1 shuffle2
Then it gives me back as match just the original companies. So a company from my original sample is a match with itself, there are no new companies. Can I just delete this line in the code, or did I do something wrong? Thank you again

Last edited by Marko Pucelj; 29 May 2015, 15:40.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#7

29 May 2015, 16:23

Based on some of the similar studies I read, I would be better if instead of a numeric delta for market value, I would use a range. For example companies that are 30% smaller or 30% bigger. How would that code look like?

Code:

REPLACE gen delta = abs(market_value - case_market_value) drop if delta >= 1000 WITH gen ratio = market_value/case_market_value drop if !inrange(ratio, 1/1.3, 1.3)

If I were to match by date, it would be exact date. Would my code look like this then?

I assume you have a variable, call it year, which contains the year you want to match on.

Code:

REPLACE joinby sic_code country_code using sample_of_39000_companies WITH joinby sic_code country_code year using sample_of_39000_companies

The code line:
set seed your_lucky_number_here

gives me an error. Do I need to change something? I just pasted it in.

I had intended the italics and the phrase your_lucky_number_here to convey that you would substitute a number for "your_lucky_number_here". Sorry that wasn't clear. Any positive integer will do. The purpose is to set the random number generator seed so as to assure that each time you run the code you will get exactly the same results.

Also If I run the last line of the code:
by case_company_id (delta shuffle1 shuffle2), sort: keep if _n == 1
...Then it gives me back as match just the original companies. So a company from my original sample is a match with itself

Yes, if the file of potential matches also contains the original companies you are trying to match, then, of course, the best match to a company will be itself! I had So you need to purge those 130 companies from the file of 39000 potential matches. I would do that as follows (before running the match):

Code:

use sample_of_39000_companies, clear merge 1:1 company_id using sample_of_130_companies, keep(master) nogenerate keepusing(company_id) save purged_sample_of_39000_companies, replace

Then in the matching code, replace sample_of_39000_companies by purged_sample_of_39000_companies everywhere.

Finally, the purpose of that final line of code -by case_company_id (delta shuffle1 shuffle2), sort: keep if _n == 1- was, in any case, to get you down to one match for each of the original 130. So if you intend to keep all acceptable matches, then just remove that line of code. If you want to use, say, 3 matches per company, then keep that line, but change _n == 1 to _n == 3, etc. Also, replace delta by ratio in light of your changed criterion for matching on market value.
Comment
Marko Pucelj

Join Date: May 2015

Posts: 15
#8

30 May 2015, 14:39

I am running the code on a subsample, but it has all the same variables as the big sample, it just smaller so everything runs faster. It has 11 companies in it and the sample from where I am looking for matches has 3500 companies.
After running the code, just without the final line of the code : -by case_company_id (delta shuffle1 shuffle2), sort: keep if _n == 1-
I get 380 matches and matches for all my 11 companies. But after running the final line in the code, I get matches for 4 of my companies and that is it. So I have 4 companies and 1 match for each companies, so 4 matches overall. That does not change, even if I set the number in that line to 3 or 5 or 10.
How could the final line be fixed, so that it would return a bit less matches, because 380 matches for 11 companies is too much, but more than the 4 that I get by running the final line.

Thank you for the help
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#9

30 May 2015, 15:13

I think you need to paste the exact code you ran and the output you got from Stata (don't retype anything, paste it) into a code block here so we can see what happened. What you say happened does not seem possible. That final line of code -by case_company_id (delta shuffle1 shuffle2), sort: keep if _n == 1- will keep exactly one observation (the first) for each case_company_id in the data at the time it was run. So if you had 11 companies before that line, you will necessarily still have 11 companies afterward, only with just one match for each.

So either you didn't use the line exactly as given, or something else had already gone wrong before Stata got there. You need to show us exactly what happened.
Comment
Marko Pucelj

Join Date: May 2015

Posts: 15
#10

30 May 2015, 15:20

Also I have data for every year from 2000-2014. So i get a match for a company in 2000, 2003, 2008... on multiple years.
I guess there wouldnt be a way to automatize this, because stata wouldnt know for which year to match every single company. But is there a way to get out just the match for the year i need it for. So for company 1 in 2005, company 2 in 2008, company 3 in 2011.
Even putting in the code 130 times, would still be faster than filtering it out in excel.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#11

30 May 2015, 15:38

Also I have data for every year from 2000-2014. So i get a match for a company in 2000, 2003, 2008... on multiple years.
I guess there wouldnt be a way to automatize this, because stata wouldnt know for which year to match every single company. But is there a way to get out just the match for the year i need it for. So for company 1 in 2005, company 2 in 2008, company 3 in 2011

Unless I have misunderstood you, I answered this in #7 earlier. If that isn't what you need, please either show some hand-worked examples or explain differently what the requirements are.
Comment

Marko Pucelj

Join Date: May 2015
Posts: 15

#12

30 May 2015, 16:47

Here is the code that I ran.

Code:

use /Applications/Stata/final_names_matching_COMPUSTAT/my_sample_canada.dta

. rename company_name case_company_name

. rename assets case_assets

. 
. joinby sic_3 country year using /Applications/Stata/final_names_matching_COMPUST
> AT/canada_whole.dta

. gen ratio = assets/case_assets
(326 missing values generated)

. drop if !inrange(ratio, 1/1.3, 1.3)
(6601 observations deleted)

. 
. set seed 24

. gen double shuffle1 = runiform()

. gen double shuffle2 = runiform()

. by company_name (ratio shuffle1 shuffle2), sort: keep if _n == 1
(344 observations deleted)

. by case_company_name (ratio shuffle1 shuffle2), sort: keep if _n == 5
(154 observations deleted)

. drop ratio shuffle1 shuffle2

Comment

Marko Pucelj

Join Date: May 2015

Posts: 15
#13

30 May 2015, 16:54

Instead of market value I matched on assets. Before running the last line I had 158 matches and at least 1 match for all of my 11 companies. After that I get 4, even though I set the n==5. This does not change either if I set it to 2 or 1 or 5. I always get only 4 in the end.

As for the date. I guess the problem is my input before it gets into Stata. For example Company 1 has observation for every year from 2000-2014. Therefore it can get a match in year 2000, 2004, 2007... But I only want to match it in year 2007. I guess I would need to create a new variable matching_year in my input and therefore it could only be matched in 2007 or something similar.
Currently company 1 gets a match in 2005 with company A, in 2007 with compnay B, in 2010 with company C. This is before running the last line of code, where I have 158 matching companies.
Is there any easy way to implement this in stata, or do I need to do this before I get the data into stata?

Last edited by Marko Pucelj; 30 May 2015, 17:08.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#14

30 May 2015, 17:34

OK, my mistake. Sorry about that.

Code:

by case_company_name (ratio shuffle1 shuffle2), sort: keep if _n == 5 // SHOULD BE by case_company_name (ratio shuffle1 shuffle2), sort: keep if _n <= 5

The first version seeks to keep the 5th match for each case_company_name. But some of the case_company_names will have only 4 or fewer matches, and so they will disappear entirely. That is why you go down rom 11 cases to 4. If you change that == to <=, then it will keep the first through fifth matches for each case_company_name. Now, not every case_company_name will have five matches, but all of them have at least 1 or they wouldn't be there at that point in the first place. So you will be left with 11 case_company_names, and between 1 and 5 matches for each of them.

With regard to the selection of year to match on, you say that company 1 has observations for every year from 2000-2014 but you only want to match it in 2007. Where does 2007 come from? Why 2007 and not 2000, or 2006, or 2008, or 2014, or any of the other years? If you can specify a rule for which year you want to use as your match year, there is probably a way to program it. If there you can't articulate a rule, then you will just need to hand enter a matching_year variable into your data and then incorporate that into the matching code. (Actually, I think the simplest way to incorporate it into the matching code would be to just insert -drop if year != matching_year- into the code just before the -joinby- command.)
Comment

Marko Pucelj

Join Date: May 2015
Posts: 15

#15

30 May 2015, 17:41

Also while trying to get the purged sample when I run the code

Code:

use /Applications/Stata/final_names_matching_COMPUSTAT/compustat_global_sample.d
> ta, clear

. merge 1:1 company_name using /Applications/Stata/final_names_matching_COMPUSTAT/
> my_sample_global.dta, keep(master) nogenerate keepusing(company_name)
variable company_name does not uniquely identify observations in the master data
r(459);

I get this, how would I make the company_name an unique identifier?

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment