
  • Matching Firms Based on Industry and Total Assets

    Hi all,

    Apologies in advance if this is not quite the place for this question, but I seem to have exhausted all other avenues. For context, I am replicating a published paper for my thesis which gathers key data from firms listed on the US and Canadian stock exchanges and runs a regression to show that, per the fundamentals, the US firms are valued higher. To start, the paper pulls a large amount of data from Compustat and forms an initial sample, from which it matches US firms to Canadian firms based on industry and firm size (please see the methodology in the attached image). My hope was to do this for a dozen countries; however, I have been unable to match firms as in the paper. Having spoken to some people, they suggested the 'range', 'merge' and 'psmatch2' commands, but I get a range of error messages, from 'command not recognised' to 'incorrect data type'.

    I am able to run a regression and test for basic things such as autocorrelation and heteroskedasticity, but this seems to be a little (a lot) outside my skill set, so any advice on how to match the data and replicate the paper would be hugely appreciated. I cannot seem to share the dataset I am currently experimenting with, perhaps because it has nearly 1M observations?

    Thank you in advance!


  • #2
    Take a look at #4 in https://www.statalist.org/forums/for...-two-variables. Your situation is similar to that one. Evidently, you will have to change the variable names in the code to match your data set. Similarly, the variables named *_3 and *_1 will instead be named in ways that reflect the way you code US vs. Canadian firms in your data. You will also need to make a couple of other changes:

    Code:
    CHANGE
    local allocation_ratio 3
    TO
    local allocation_ratio 1
    
    CHANGE
    gen delta = abs(age_3 - age_1)
    // ENFORCE AGE CALIPER OF 0.5 YR
    local caliper 0.5
    drop if delta > `caliper'
    TO
    gen delta = age_3/age_1
    keep if inrange(delta, 0.75, 1.25)
    These changes will give you a 1:1 match (without replacement) instead of the 3:1 in the other thread, and will restrict size matches based on a ratio between 0.75 and 1.25, rather than an age difference of < 0.5 years. I think that will do the trick.
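
    As a toy illustration (separate from the matching code itself), the ratio caliper keeps a candidate pair only when the ratio of the two firms' sizes falls inside [0.75, 1.25]:

    Code:
    display inrange(120/100, 0.75, 1.25)   // 1: assets of 120 vs. 100 is an acceptable match
    display inrange(200/100, 0.75, 1.25)   // 0: a firm twice the size of the other is rejected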

    If you have difficulty implementing this in your data set, when you post back please include example data that exhibits whatever difficulties you encounter. You need not, and should not, send your entire data set. Just a reasonable example, that includes some possible matches, and that goes wrong in the same way as the problem you have with your full data set. Also show the exact code you are trying, and show any output with error messages, or, if there are no error messages, explain why the results aren't what you want.

    And bear in mind that you must, must, must use the -dataex- command to show the example, so that I can then use your data to troubleshoot what you are doing. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
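
    For example (the variable names here are placeholders; substitute the ones in your data set), you can limit -dataex- to the variables and the number of observations that matter for the question:

    Code:
    * hypothetical variable names; replace with your own
    dataex firm_id assets_total stock_exchange industry_code, count(30)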



    • #3
      Hi Clyde,

      Thank you for the swift response; it is hugely appreciated. Apologies for the delay in getting back to you: I realised my Canadian data was not right, so I had to pull Indian data to test this out. I looked at the code in the thread you linked and tried to edit it accordingly, incorporating the changes you suggested in this thread, which produced the code below:

      Code:
      // CREATE DEMONSTRATION DATA SET
      // Total observations in dataset
      set obs 11913
      // Observation identifier that increases by one for each subsequent observation
      gen `c(obs_t)' id = _n
      // Assigns values based on stock market identifier, i.e., first 6161 observations are 11 as that is NYSE, the remaining 120 for Bombay SE
      gen treatment = cond(_n <= 6161, 11, 120)
      set seed 1234
      //Assigns stock exchange of 11 or 120 randomly (as these are the two SECodes in the dataset)
      gen stockexchange = 11 + 109*runiformint(0, 1)

      // SEPARATE INTO TWO DATA SETS
      preserve
      keep if stockexchange == 11
      rename (GlobalCompanyKey AssetsTotal) =_1
      drop treatment
      tempfile treatment1
      save `SE11'.dta

      restore
      keep if stockexchange == 120
      rename (GlobalCompanyKey AssetsTotal) =_1
      drop treatment

      // COMBINE POTENTIAL MATCHES & SELECT BEST 3, BREAKING TIES AT RANDOM
      joinby stockexchange using `SE11.dta'
      gen double shuffle = runiform()
      gen delta = AssetsTotal_120 / AssetsTotal_11
      keep if inrange(delta, 0.75, 1.25)
      by id_3 (delta shuffle), sort: keep if _n <= 1

      Currently, the two error messages I get are "nothing to restore r(622)" and "AssetsTotal_120 not found r(111)". I know I have not defined that variable, but (unless I am misinterpreting the code in this and the previous thread) "age_3" was not defined either, unless of course it was defined here: rename (id age) =_3. If so, then I am unsure how to tweak this part of the code so that I can define the total assets for all firms on the 120 stock exchange.

      When trying -dataex- I initially saw an "input statement exceeds linesize limit. Try specifying fewer variables" error message, so I cut down to the four key variables: GlobalCompanyKey, AssetsTotal, StockExchangeCode, and NorthAmericanIndustryClassifi. The goal is to match, e.g., firm ABC123 on stock exchange 11 with a similar firm (i.e., one with the same NorthAmericanIndustryClassifi and similar AssetsTotal, per the parameters set out in the methodology screenshot) that is listed on the other stock exchange (120). This produced the -dataex- output below; I hope this is what you would expect to see. If not, please do let me know and I can try to get Stata to produce a shorter output.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str6 GlobalCompanyKey double AssetsTotal int StockExchangeCode byte NorthAmericanIndustryClassifi
      "315823"   36.612 120 11
      "359170"   37.382 120 11
      "315823"   39.029 120 11
      "336909"   44.204 120 11
      "315823"   52.597 120 11
      "359170"   61.391 120 11
      "312273"  128.965 120 11
      "337840"  135.602 120 11
      "312273"  140.109 120 11
      "312273"  192.641 120 11
      "337840"  207.307 120 11
      "337840"  208.339 120 11
      "039783"  231.785  11 11
      "359170"  249.555 120 11
      "337840"  252.113 120 11
      "315145"  256.796 120 11
      "359170"  257.762 120 11
      "324024"  257.829 120 11
      "315145"  262.422 120 11
      "324024"  263.752 120 11
      "312273"   266.85 120 11
      "039783"  274.764  11 11
      "312273"  291.175 120 11
      "324024"  305.591 120 11
      "327179"  346.968 120 11
      "277487"  373.095  11 11
      "037856"  381.752  11 11
      "324024"  402.285 120 11
      "327179"  437.073 120 11
      "321289"  481.888 120 11
      "039783"   500.92  11 11
      "039783"  534.897  11 11
      "010390"  536.349  11 11
      "321289"  539.008 120 11
      "010390"  546.036  11 11
      "327179"  556.164 120 11
      "010390"  566.791  11 11
      "010390"  577.523  11 11
      "321298"   592.28 120 11
      "321298"  602.892 120 11
      "316385"  616.847 120 11
      "277487"  639.466  11 11
      "316385"   662.58 120 11
      "321289"  673.725 120 11
      "321298"  675.601 120 11
      "277487"  683.457  11 11
      "327179"  694.653 120 11
      "324024"  705.641 120 11
      "321298"  725.084 120 11
      "277487"  727.562  11 11
      "343733"  736.218 120 11
      "343733"  746.273 120 11
      "321289"  780.618 120 11
      "343733"  795.228 120 11
      "327179"  844.563 120 11
      "321298"  919.237 120 11
      "343733"  935.544 120 11
      "353338" 1325.106 120 11
      "321289" 1353.371 120 11
      "336609" 1621.476 120 11
      "353338"  1728.34 120 11
      "353338" 1747.541 120 11
      "336609" 1907.871 120 11
      "316260" 2583.817 120 11
      "316260" 3051.458 120 11
      "316260" 3083.969 120 11
      "030443"   3184.1  11 11
      "316260" 3309.858 120 11
      "030443"   3343.3  11 11
      "030443"   3398.1  11 11
      "316260" 3448.546 120 11
      "030443"   3458.9  11 11
      "035168"    42344  11 11
      "035168"    42618  11 11
      "035168"    42649  11 11
      "035168"    42996  11 11
      "311952"   25.148 120 21
      "329883"   61.187 120 21
      "329883"   62.747 120 21
      "329883"   63.515 120 21
      "329883"   64.713 120 21
      "343346"   70.086 120 21
      "302888"   73.017 120 21
      "302888"   75.666 120 21
      "329883"   79.513 120 21
      "316518"   82.188 120 21
      "290979"   87.618 120 21
      "027199"  141.232  11 21
      "302888"  142.238 120 21
      "019433"  152.502  11 21
      "316518"  166.253 120 21
      "019433"  166.508  11 21
      "107245"  210.592  11 21
      "316518"  218.198 120 21
      "314717"  228.174 120 21
      "031845"    240.6  11 21
      "176899"  260.832  11 21
      "034034"  261.841  11 21
      "027199"   263.09  11 21
      "302927"  268.924 120 21
      end
      Apologies for how long this post has turned out to be; many thanks for your help so far,
      Alex



      • #4
        I'd like to help you fix up your code, but I don't quite understand some of what you're doing in it.

        In #1 you wanted to match Canadian to American firms. But none of the variables in your code or example data seems to capture this distinction. You have created two variables, StockExchangeCode and NorthAmericanIndustryClassifi, both of which take the values 11 or 120, and these, it seems, correspond to the Bombay and NY stock exchanges. But why does the North American Industry Classification also take on these two values? I would think it is a different kind of code altogether, taking on more than just two values. And I think there is some confusion in your code between which of these variables plays the role of a grouping variable (Canadian vs American) and which is a matching variable (industry code).

        Also, your -dataex- example is good for giving me a feel for what some real data might look like, but when you subsequently -set obs 11913-, the newly created additional observations have only missing values in them, so in the end this is not directly usable.

        What I show below is my best guess at what you actually intend to do. You will still need to carefully replace the variable names I am using with the names of the corresponding variables in your real data set. I think my variable names are explicit enough and close enough to the real data that you will be able to do that easily. I don't know what the industry codes in your real data set look like; I have just used integers from 1 through 25 for demonstration purposes.

        Code:
        clear
        // Total observations in dataset
        set obs 11913
        // Observation identifier that increases by one for each subsequent observation
        gen `c(obs_t)' id = _n
        // Assigns values based on stock market identifier, i.e., first 6161 observations are 11 as that is NYSE, the remaining 120 for Bombay SE
        gen stock_exchange = cond(_n <= 6161, 11, 120)
        set seed 1234
        gen assets = rgamma(2, 750)
        gen industry_code = runiformint(1, 25)
        
        // SEPARATE INTO TWO DATA SETS
        preserve
        keep if stock_exchange == 11
        rename (id assets) =_11 // SUFFIX IN RENAME MUST MATCH STOCK EXCHANGE
        drop stock_exchange
        tempfile SE11 // DECLARATION IN TEMPFILE MUST MATCH SUBSEQUENT USE OF THE FILE
        save `SE11' // N.B.  NO .dta
        
        restore
        keep if stock_exchange == 120
        rename (id assets) =_120
        drop stock_exchange
        
        // COMBINE POTENTIAL MATCHES & SELECT BEST 3, BREAKING TIES AT RANDOM
        joinby industry_code using `SE11'
        gen double shuffle = runiform()
        gen delta = assets_120/assets_11
        keep if inrange(delta, 0.75, 1.25)
        replace delta = abs(log(delta))
        
        // MATCHING WITH REPLACEMENT
        by id_11 (delta shuffle), sort: keep if _n <= 1
        
        //    MATCHING WITHOUT REPLACEMENT
        local allocation_ratio 1
        local current 1
        
        sort id_11 (delta shuffle)
        while `current' < _N {
            local end_current = `current' + `allocation_ratio' - 1
            while id_11[`end_current'] != id_11[`current'] {
                local end_current = `end_current' - 1
            }
            //    KEEP REQUIRED # OF MATCHES FOR THE CURRENT CASE
            drop if id_11 == id_11[`current'] in `=`end_current'+1'/L    
            //    REMOVE THE SELECTED MATCHES FROM FURTHER CONSIDERATION
            forvalues i = 0/`=`allocation_ratio'-1' {
                drop if id_120 == id_120[`current'+`i'] & _n > `end_current'
            }
            local current = `end_current' + 1
        }
        Note that the analogue of your line -by id_3 (delta shuffle), sort: keep if _n <= 1- appears here only under the MATCHING WITH REPLACEMENT heading. That command produces matching with replacement, contradicting your expressed goal of matching without replacement, so for your purposes you should use the longer code segment under MATCHING WITHOUT REPLACEMENT instead.
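
        To see the difference on a toy example (invented data, with the shuffle tie-breaker omitted for brevity): with the -by ... keep if _n <= 1- approach, the same firm from the other exchange can end up being kept as the best match for several different firms.

        Code:
        clear
        input int id_11 int id_120 float delta
        1 7 .01
        1 8 .20
        2 7 .02
        2 9 .30
        end
        // keep each id_11 firm's closest candidate: matching WITH replacement
        by id_11 (delta), sort: keep if _n == 1
        list, noobs    // firm 7 is kept as the match for both firm 1 and firm 2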



        • #5
          Thank you again for the quick turnaround.

          Having run your code, the first error I get is "invalid file specification" when trying to run "save `SE11'". If I try to bypass this by adding ".abc" to the end of the command, the file is saved. However, the next command, "restore", returns "nothing to restore r(622)". If I skip this command, the variable stock_exchange is not restored, and thus "keep if stock_exchange == 120" inevitably leads to an error message of "stock_exchange not found r(111)". In turn, this means that I cannot form the assets_120 variable, which prevents the generation of the delta.

          However, if I initially skip from the "restore" command down to the "drop stock_exchange" line inclusive and jump to the "joinby" command, it pulls in data from the SE11.abc temp file. From there, the "restore" command again produces an error message, but the next two lines work, though over 925K observations are deleted. Next, I am able to generate the delta, log it, and then match with replacement (though I am unsure how to view the matches).

          The matching without replacement generates the error "<_N invalid name r(198)". I hope this summary is clear; please see below my current do-file.

          Code:
          clear
          // Total observations in dataset
          set obs 11837

          // Observation identifier that increases by one for each subsequent observation
          gen `c(obs_t)' id = _n

          // Assigns values based on stock market identifier, i.e., first 6161 observations are 11 as that is NYSE, the remaining 120 for Bombay SE
          gen stock_exchange = cond(_n <= 6140, 11, 120)
          set seed 1234
          gen assets = rgamma(2,750)
          gen industry_code = runiformint(21, 99)

          // SEPARATE INTO TWO DATA SETS
          preserve
          keep if stock_exchange == 11
          rename (id assets) =_11 // SUFFIX IN RENAME MUST MATCH STOCK EXCHANGE
          drop stock_exchange
          tempfile SE11 // DECLARATION IN TEMPFILE MUST MATCH SUBSEQUENT USE OF THE FILE
          save `SE11'.abc // N.B. NO .dta

          // COMBINE POTENTIAL MATCHES & SELECT BEST 3, BREAKING TIES AT RANDOM
          joinby industry_code using `SE11'.abc
          restore // produces error message r(622)
          keep if stock_exchange == 120
          rename (id assets) =_120
          drop stock_exchange

          gen double shuffle = runiform()
          gen delta = assets_120/assets_11
          keep if inrange(delta, 0.75, 1.25)
          replace delta = abs(log(delta))

          // MATCHING WITH REPLACEMENT
          by id_11 (delta shuffle), sort: keep if _n <= 1

          // MATCHING WITHOUT REPLACEMENT
          local allocation_ratio 1
          local current 1

          sort id_11 (delta shuffle)
          while `current' < _n {
          local end_current = `current' + `allocation_ratio' - 1
          while id_11[`end_current'] != id_11[`current'] {
          local end_current = `end_current' - 1
          }
          // KEEP REQUIRED # OF MATCHES FOR THE CURRENT CASE
          drop if id_11 == id_11[`current'] in `=`end_current'+1'/L
          // REMOVE THE SELECTED MATCHES FROM FURTHER CONSIDERATION
          forvalues i = 0/`=`allocation_ratio'-1' {
          drop if id_120 == id_120[`current'+`i'] & _n > `end_current'
          }
          local current = `end_current' + 1
          }


          Secondly, apologies for the confusion: this data is for the US and India (as my Canadian data was wrong), so the goal is to match US firms to Indian firms where the firms are in the same industry (i.e., have the same NAICS code) and have similar total assets.

          Also, yes, you are right that the North American Industry Classification System (NAICS) codes should be different from the stock exchange codes; this is now reflected in the -dataex- below. They range from 21 to 99, though the range is not continuous; for instance, there is no code 25, 34, etc.

          I have also included the date and year variables to show that the dataset covers many firms across multiple years – it was an oversight on my part not to include this initially, as I imagine it is useful information to know.

          Note, this -dataex- was the first command entered after importing the data, so there should be no newly created observations. However, it now includes only stock exchange code 11 (the NYSE), despite the fact that the 120 code is still in the dataset, meaning the -dataex- output shows only US firms, and I am unsure why.

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str6 GlobalCompanyKey int(DataDate DataYearFiscal) double AssetsTotal int StockExchangeCode byte NAICS
          "001004" 22066 2019      2079 11 42
          "001004" 22431 2020    1539.7 11 42
          "001004" 22796 2021    1573.9 11 42
          "001004" 23161 2022    1833.1 11 42
          "001004" 23527 2023      2770 11 42
          "001075" 22280 2020 20020.421 11 22
          "001075" 22645 2021 22003.222 11 22
          "001075" 23010 2022 22723.405 11 22
          "001075" 23375 2023 24661.153 11 22
          "001078" 22280 2020     72548 11 33
          "001078" 22645 2021     75196 11 33
          "001078" 23010 2022     74438 11 33
          "001078" 23375 2023     73214 11 33
          "001186" 22280 2020  9614.755 11 21
          "001186" 22645 2021 10186.776 11 21
          "001186" 23010 2022 23494.808 11 21
          "001186" 23375 2023 28684.949 11 21
          "001209" 22188 2020   25168.5 11 32
          "001209" 22553 2021   26859.2 11 32
          "001209" 22918 2022   27192.6 11 32
          "001209" 23283 2023   32002.5 11 32
          "001230" 22280 2020     14046 11 48
          "001230" 22645 2021     13951 11 48
          "001230" 23010 2022     14186 11 48
          "001230" 23375 2023     14613 11 48
          "001254" 22280 2020    2900.6 11 48
          "001254" 22645 2021    3693.1 11 48
          "001254" 23010 2022      4330 11 48
          "001254" 23375 2023    4294.6 11 48
          "001257" 22280 2020  1404.138 11 53
          "001257" 22645 2021  1391.965 11 53
          "001257" 23010 2022  1397.776 11 53
          "001257" 23375 2023   1403.68 11 53
          "001380" 22280 2020     18821 11 21
          "001380" 22645 2021     20515 11 21
          "001380" 23010 2022     21695 11 21
          "001380" 23375 2023     24007 11 21
          "001393" 22005 2019 13438.024 11 53
          "001393" 22370 2020 14651.606 11 53
          "001393" 22735 2021 17299.581 11 53
          "001393" 23100 2022 18124.648 11 53
          "001393" 23466 2023 19058.758 11 53
          "001410" 22219 2020    3776.9 11 56
          "001410" 22584 2021    4436.2 11 56
          "001410" 22949 2022    4868.9 11 56
          "001410" 23314 2023    4933.7 11 56
          "001545" 22280 2020   865.764 11 53
          "001545" 22645 2021   770.569 11 53
          "001545" 23010 2022  1197.479 11 53
          "001545" 23375 2023  1023.484 11 53
          "001585" 22280 2020   680.293 11 32
          "001585" 22645 2021    694.16 11 32
          "001585" 23010 2022   726.313 11 32
          "001585" 23375 2023   767.548 11 32
          "001598" 22280 2020 10357.483 11 33
          "001598" 22645 2021 11898.187 11 33
          "001598" 23010 2022  12431.12 11 33
          "001598" 23375 2023 15023.533 11 33
          "001613" 22280 2020   463.208 11 33
          "001613" 22645 2021   485.632 11 33
          "001613" 23010 2022   502.774 11 33
          "001613" 23375 2023   565.654 11 33
          "001618" 22400 2020    97.366 11 23
          "001618" 22765 2021    94.917 11 23
          "001618" 23130 2022   115.895 11 23
          "001618" 23496 2023    122.83 11 23
          "001661" 22280 2020  5503.428 11 21
          "001661" 22645 2021  5525.364 11 21
          "001661" 23010 2022  4729.854 11 21
          "001661" 23375 2023  5277.965 11 21
          "001706" 22158 2020   824.294 11 33
          "001706" 22523 2021   820.247 11 33
          "001706" 22888 2022   757.312 11 33
          "001706" 23253 2023   762.597 11 33
          "001712" 22280 2020   316.833 11 32
          "001712" 22645 2021    293.54 11 32
          "001722" 22280 2020     49719 11 31
          "001722" 22645 2021     56136 11 31
          "001722" 23010 2022     59774 11 31
          "001722" 23375 2023     54631 11 31
          "001773" 22280 2020 17053.911 11 42
          "001773" 22645 2021  19535.54 11 42
          "001773" 23010 2022 21763.182 11 42
          "001773" 23375 2023 21726.168 11 42
          "001794" 22188 2020      6877 11 32
          "001794" 22553 2021      6612 11 32
          "001794" 22918 2022      6213 11 32
          "001794" 23283 2023      5939 11 32
          "001864" 21945 2019   500.502 11 32
          "001864" 22311 2020   479.345 11 32
          "001864" 22676 2021   550.361 11 32
          "001864" 23041 2022   579.579 11 32
          "001864" 23406 2023   664.802 11 32
          "001913" 22280 2020    6083.9 11 32
          "001913" 22645 2021    7971.6 11 32
          "001913" 23010 2022    7950.5 11 32
          "001913" 23375 2023    8209.8 11 32
          "001926" 21974 2019  1073.831 11 33
          "001926" 22339 2020   996.442 11 33
          "001926" 22704 2021  1133.028 11 33
          end
          format %tdnn/dd/CCYY DataDate
          Again, apologies for the length of this reply; my aim is to be as clear as possible about what I am trying to do and about the errors Stata generates.



          • #6
            I ran the code in #4 before I posted it, and it did not produce any of the errors you describe. And I have repeated that check again, and it again runs without error messages.

            Looking at the errors you are getting, they are in every instance attributable to the code being run incorrectly. This code relies heavily on the use of local macros. I believe you are trying to run it line by line, or in small chunks. You cannot do that with code that has local macros. It must be run beginning to end, without interruption, in one fell swoop. The same is true of code with -preserve- and -restore- commands. If you interrupt code between a -preserve- and a -restore-, the preserved data set vanishes, and when the -restore- command is then attempted, there is nothing there to restore.
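
            A minimal illustration of the local-macro point (a toy example, not taken from your code):

            Code:
            * Run together, these two lines display 42:
            local answer 42
            display `answer'

            * But if the -display- line is run later as a separate chunk, the local has
            * already vanished, `answer' expands to nothing, and what actually runs is
            * just -display-. In the same way, -save `SE11'- run on its own becomes a
            * bare -save- and complains about an invalid file specification.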

            You should definitely undo the changes to it that you made in what you show in #5. In particular adding .abc to `SE11' results in your saving the data in a permanent, not a temporary file, and that file will have some funny name and will clutter up your hard drive until you remove it. Using `SE11' as originally coded will result in a temporary file that will conveniently disappear once the code has finished running. Even more important, changing _N to _n completely kills the algorithm and will produce incorrect results.

            So go back to the code in #4, and with the understanding that it must be run uninterruptedly beginning to end, you will find it works. Then all you have to do is perhaps adapt the variable names to your actual data set and run it with that instead of the demonstration data by removing everything before // SEPARATE INTO TWO DATA SETS and replacing that by a command to -use- your actual data.



            • #7
              Yes! You were right: the issue was that I was running the code in chunks rather than all together in one go (I feel slightly foolish!). Now that matching both with and without replacement works, what is the most effective way to show the matches? E.g., a table which has the US firm's ID and the Indian firm's ID as a pair, showing what industry they are part of and what their asset values are? I tried the -list- command, but the output seems to be too long for Stata, as I cannot see the top of it, nor can I quite fathom what data the output is showing.

              Thanks!



              • #8
                "Now that both matching with and without replacement works, what is the most effective way to show the matches?"
                That depends on who you are showing them to and what they need/want to know about them.

                At a bare minimum, a table of the matches would contain the ID's of the US and Indian firms in each pair, their common industry code (or the corresponding industry name), and the two asset sizes so that people can see that the match fulfills its stated matching criteria. What other variables to show depends on your audience and what will be important in your subsequent analyses using the matched pairs.

                The -list- command is fine when the variables are few or short so that each line fits on a screen without breaking. If you need to show a bit more than that, you might use -putdocx- to export the needed variables to a table in Microsoft Word--there you can use landscape orientation, and, if necessary and desirable, reduce the type size. If the number of crucial variables or the size is such that each observation's data cannot readably fit across a page in Word, probably you need to export to a spreadsheet. (See -help export excel-.)
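
                For instance, a minimal sketch of the -putdocx- route (variable names follow the demonstration code in #4; the file name is just an example):

                Code:
                putdocx begin, landscape
                putdocx paragraph
                putdocx text ("Matched pairs: US (id_11) and Indian (id_120) firms")
                putdocx table matches = data(id_11 id_120 industry_code assets_11 assets_120), varnames
                putdocx save matched_pairs.docx, replace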



                • #9
                  The goal of the whole matching process is to get like-for-like firms listed on different stock exchanges. Once I can export these pairs, the aim is to find their underlying fundamentals and ratios (from whichever financial database) and then run a regression to show that, per the fundamentals, the Indian firms, for example, have systematically lower valuations than the US firms. Therefore, the data/table I need to export would need to show each pair of firms' IDs, the industry they are in, and their relative assets, as this will then allow me to identify them and then search for their B/M ratio, P/E ratio, etc. In trying to export to Excel, the code I used was:

                  // EXPORT TO EXCEL
                  export excel using "final_matched_data.xlsx", firstrow(variables) replace

                  Although this did save a file, it does not seem to have actually captured the matches made by the previous code; it is seemingly just a file containing roughly 85% of the original Indian firms (with no US firms in the file). Per your advice, I then copied (or tried to copy) the format in -help export excel- and tried the code below; however, I get an error that the cell command is unrecognised.

                  sh("MATCHEDFIRMS"[,modify|replace])
                  cell(start)
                  first(var|varl)
                  nol
                  keepcellfmt
                  replace

                  Apologies if I am missing something rudimentary.



                  • #10
                    When writing #8, I had misunderstood you to be looking for a way to get a table of the matched pairs intended for human eyes to read. But what you actually need, based on what you say in #9, is a Stata data set containing the key variables of the matched firms. You will then need to -merge- that with other Stata data sets created from the financial databases that will provide your fundamentals variables. (You probably will need to create those Stata data sets using -import delimited- or -import excel-, but that's for another day.) To make this work as well as possible, the Stata data set of matched pairs needs to be reorganized into long layout.

                    Code:
                    keep id_* assets_* industry_code
                    gen `c(obs_t)' pair_num = _n
                    reshape long id assets, i(pair_num) j(stock_exchange) string
                    destring stock_exchange, ignore("_") replace
                    Save the results of that as a Stata data set with some name that is convenient and suggestive.
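
                    A sketch of what that might look like (the file names matched_pairs_long.dta and fundamentals.dta are invented for illustration; substitute your own):

                    Code:
                    save matched_pairs_long.dta, replace

                    * later, once a fundamentals file with one record per firm id has been assembled:
                    use matched_pairs_long.dta, clear
                    merge m:1 id using fundamentals.dta, keep(master match)
                    * add the year to the merge key if the fundamentals are at the firm-year level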

                    Now, I am for the first time noticing that in #5 you said
                    "I have also included the date and year variables to show that the dataset covers many firms across multiple years – it was an oversight on my part not to include this initially, as I imagine it is useful information to know."
                    It was an oversight on my part not to pay attention to that earlier. The matching algorithm we have been using has not taken this into account. The question is what kind of matching you want to do with this kind of data. The algorithm we are using now will end up matching firm A in year Y1 with firm B in year Y2, and in another observation it will match A in year Y3 with firm C in year Y4. In other words, the same firm gets matched with different firms in different years, and there is no correspondence between the years of the two firms in any matched pair. That's sometimes acceptable, but often it is not.

                    There are three common alternative ways to do things. The first approach is to do the matching using a reduced data set that contains only the first year each firm appears in the data set. You then form matched pairs from these first-year observations. Then you merge that back to the original data set so that you consistently pair each firm with the same partner in all years. Of course, if one of the firms' data extends over fewer years than the other, then you will not have a pair for some years.

                    The second approach has the opposite requirements: it will require each pair to consist of firms observed in the same year, but a given firm may be paired with different firms in different years.

                    The third approach is the most stringent: it will require each firm to be consistently matched to the same partner in each year, and it will require that the matched observation for any year also be the same for the other firm.

                    Summarizing, there are four levels of stringency that can be imposed on the match:
                    1. Anything goes. That's what we have now.
                    2. Any given firm must match the same partner in every year, although the matched firm's data need not come from the same year.
                    3. All matched pairs must couple firms from the same year, but across the years a firm may match with different partner firms.
                    4. All matched pairs must couple firms from the same year and any given firm must match the same partner in every year.

                    Bear in mind that the choice of one of the more stringent approaches may reduce the size of the matched pair data sets, as matches meeting the additional requirements may not be available (especially when matching without replacement). That said, you need to think substantively about the meaning of the other variables you are analyzing. If they are time sensitive, then probably a good matching process would need both firms in each pair observed in the same year. If you think that the other variables in your analysis capture all of the relevant information about a firm and there are no unmeasured attributes of the firms that substantively matter for your purposes, then you can forego firm-to-firm consistency across years. But if there are idiosyncratic attributes of the firms that affect the relationships among your other variables, then it is better to require firm-to-firm consistency. These are substantive, not statistical, concerns and not having any expertise in finance or economics, I cannot advise you about this choice. If you decide to go with 2, 3, or 4, post back to let me know and I will modify the matching code accordingly.



                    • #11
                      Apologies if my previous posts have not been clear - I should have specified the year and date variables in the initial -dataex-. The methodology used in the paper I am aiming to replicate states "we match each Canadian firm with a US firm based on industry and firm size for a given year using one-to-one matching (without replacement)" (I believe this is mentioned in the screenshot attached in the OP), where the industry match means the same NAICS code and firm size is given by total assets, with the delta of 0.75 to 1.25 as in your previous code. As such, option 3 appears to be the closest match, as the paper specifies "for a given year", implying that a firm can be part of a different pair year on year, assuming the next year has a closer match. I hope this has cleared up any confusion about what I am trying to do!



                      • #12
                        Thanks. I saw that quote "we match each Canadian firm with a US firm based on industry and firm size for a given year using one-to-one matching (without replacement)", but I didn't know how to interpret "for a given year." So, now that you have cleared that up, here is modified code.

                        Code:
                        // SEPARATE INTO TWO DATA SETS
                        preserve
                        keep if stock_exchange == 11
                        rename (id assets) =_11 // SUFFIX IN RENAME MUST MATCH STOCK EXCHANGE
                        drop stock_exchange
                        tempfile SE11 // DECLARATION IN TEMPFILE MUST MATCH SUBSEQUENT USE OF THE FILE
                        save `SE11' // N.B.  NO .dta
                        
                        restore
                        keep if stock_exchange == 120
                        rename (id assets) =_120
                        drop stock_exchange
                        
                        // COMBINE POTENTIAL MATCHES & SELECT BEST 3, BREAKING TIES AT RANDOM
                        joinby industry_code year using `SE11'
                        gen double shuffle = runiform()
                        gen delta = assets_120/assets_11
                        keep if inrange(delta, 0.75, 1.25)
                        replace delta = abs(log(delta))
                        
                        //    MATCHING WITHOUT REPLACEMENT
                        local allocation_ratio 1
                        local current 1
                        
                        sort id_11 year (delta shuffle)
                        while `current' < _N {
                            local end_current = `current' + `allocation_ratio' - 1
                            while id_11[`end_current'] != id_11[`current'] ///
                                & year[`end_current'] != year[`current'] {
                                local end_current = `end_current' - 1
                            }
                            //    KEEP REQUIRED # OF MATCHES FOR THE CURRENT CASE
                            drop if id_11 == id_11[`current'] & year == year[`current'] in `=`end_current'+1'/L    
                            //    REMOVE THE SELECTED MATCHES FROM FURTHER CONSIDERATION
                            forvalues i = 0/`=`allocation_ratio'-1' {
                                drop if id_120 == id_120[`current'+`i'] & year == year[`current' + `i'] & _n > `end_current'
                            }
                            local current = `end_current' + 1
                        }
                        Compared with the code in #4, the changes are the addition of year to the -joinby- and -sort- commands and to the conditions inside the matching loop, and the removal of the with-replacement section.
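
                        After the loop finishes, a couple of quick checks (a sketch, using the variables created above) can confirm that the matching behaved as intended: with 1:1 matching without replacement, no firm should appear in more than one pair within a given year.

                        Code:
                        isid id_11 year       // each US firm-year appears in at most one pair
                        isid id_120 year      // each Indian firm-year appears in at most one pair
                        tab year              // number of matched pairs per year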



                        • #13
                          You could use -cem- and do a k2k match.
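
                          Roughly along these lines (a sketch only: us_firm is an invented 0/1 treatment indicator, and this does not reproduce the exact 0.75 to 1.25 asset-ratio caliper; -cem- matches within coarsened asset bins instead):

                          Code:
                          ssc install cem                                // user-written command; install once
                          gen byte us_firm = StockExchangeCode == 11     // hypothetical treatment indicator
                          cem NAICS AssetsTotal DataYearFiscal, treatment(us_firm) k2k
                          tab cem_matched                                // flags the matched observations
                          * note: by default -cem- coarsens every variable, including NAICS and year;
                          * see -help cem- for how to supply your own cutpoints.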



                          • #14
                            Amazing, thank you, the code works perfectly! I appreciate the changes being pointed out, as it makes it easier to understand the differences and learn. This code finds matches when there are only two stock exchanges; if I wanted to do a similar exercise for multiple stock exchanges (say ten), such that each "match" is a group of, e.g., ten similar firms each year on different stock exchanges but in the same industry and with similar assets, is the change to the code simply separating into ten data sets rather than two, i.e., building on the first section of the code as below? Or does the final section (i.e., line 32 onward) also need altering by just adding the additional stock exchange codes to the end of what is currently there?

                            Code:
                            // SEPARATE INTO MANY DATA SETS
                            preserve
                            keep if StockExchangeCode == 11
                            rename (GlobalCompanyKey AssetsTotal) =_11 // SUFFIX IN RENAME MUST MATCH STOCK EXCHANGE
                            drop StockExchangeCode
                            tempfile SE11 // DECLARATION IN TEMPFILE MUST MATCH SUBSEQUENT USE OF THE FILE
                            save `SE11' // N.B. NO .dta

                            restore
                            keep if StockExchangeCode == 120
                            rename (GlobalCompanyKey AssetsTotal) =_120
                            drop StockExchangeCode

                            restore
                            keep if StockExchangeCode == 130 // For example
                            rename (GlobalCompanyKey AssetsTotal) =_120
                            drop StockExchangeCode

                            restore
                            keep if StockExchangeCode == 19 // For example
                            rename (GlobalCompanyKey AssetsTotal) =_120
                            drop StockExchangeCode

                            //Etc. for all stock exchange codes

                            // Assume middle section with delta generation remains the same?

                            sort GlobalCompanyKey_11 DataYearFiscal (delta shuffle)
                            while `current' < _N {
                            local end_current = `current' + `allocation_ratio' - 1
                            while GlobalCompanyKey_11[`end_current'] != GlobalCompanyKey_11[`current'] ///
                            & DataYearFiscal[`end_current'] != DataYearFiscal[`current'] {
                            local end_current = `end_current' - 1
                            }
                            // KEEP REQUIRED # OF MATCHES FOR THE CURRENT CASE
                            drop if GlobalCompanyKey_11 == GlobalCompanyKey_11[`current'] & DataYearFiscal == DataYearFiscal[`current'] in `=`end_current'+1'/L
                            // REMOVE THE SELECTED MATCHES FROM FURTHER CONSIDERATION
                            forvalues i = 0/`=`allocation_ratio'-1' {
                            drop if GlobalCompanyKey_120 == GlobalCompanyKey_120[`current'+`i'] & DataYearFiscal == DataYearFiscal[`current' + `i'] & _n > `end_current'
                            }
                            local current = `end_current' + 1
                            drop if GlobalCompanyKey_130 == GlobalCompanyKey_120[`current'+`i'] & DataYearFiscal == DataYearFiscal[`current' + `i'] & _n > `end_current'
                            }
                            local current = `end_current' + 1
                            drop if GlobalCompanyKey_19 == GlobalCompanyKey_120[`current'+`i'] & DataYearFiscal == DataYearFiscal[`current' + `i'] & _n > `end_current'
                            }
                            local current = `end_current' + 1
                            }
                            Last edited by Alex Conju; 11 Aug 2024, 16:30.



                            • #15
                              George Ford, thank you for the suggestion. My best attempt at using -cem- is below; however, I cannot quite figure out how I would incorporate the delta side of things.

                              cem NAICS AssetsTotal DataYearFiscal, treatment(treatment_var) keep(matched_strata)
                              reg outcome_var treatment_var [pw=cem_weights]
