Matching sample - Statalist

Federica Tortorella

Join Date: Nov 2017

Posts: 4
#1

Matching sample

30 Nov 2017, 05:25

Hello,

I am currently writing my master thesis and I am struggling with Stata.
The aim of my thesis is to understand if the stakeholder orientation has an influence in the phase of the selection of target companies for an acquisition (considering other factors such as Sic codes, number of employees and revenues being equal among a sample of companies).
I have a sample of 2573 companies for which I downloaded 224 variables that tell me if the companies are or not stakeholder oriented. But before considering the impact of the stakeholder orientation I need to find companies that share the same characteristics for the factors I named above, in order to later understand if given those factors being equals, the reason why one company has been chosen instead of another is the stakeholder orientation (condition that I want to verify in my thesis).
Among these 2573 companies I selected, 269 of them have been target (each of this 269 in a specific year). For this 2573 companies (in which the 269 targets are included) I have collected data for 15 years (from 2002 to 2016) about the number of employees each company has, the Sic codes and the revenues. I identify each company through the ISIN number. What I need to do is to match those 269 target companies with other companies of my sample according to the Sic codes, the number of employees and the revenues. I want to find groups of companies from the sample of 2573 companies that are similar to each of the 269 target companies according to those three criteria and exactly in the year in which the announcement of the acquisition took place (I have a column in which I have the announcement years).
How can I do? Which code can I use?
Also, with the SIc code I want an exact match, but with the number of employees and the revenues I want that they match in a range of +/- 10%. The problem is also that the revenues and the employees are extremely variable for the companies in the sample, so the standard deviation is very high.

Thank you in advance,

Federica.
Tags: matching, matching sample, propensity score, stata
George Yiallouros

Join Date: Nov 2017

Posts: 10
#2

30 Nov 2017, 06:26

I guess you have to do that with if statements or with clustering, yet i don't know how your data is organised
Comment
Patrick Abi Nader

Join Date: Jun 2014

Posts: 174
#3

30 Nov 2017, 07:19

Hello Fedrica,

Your best bet to receive any useful advice would be to post a sample of your data using dataex from ssc see FAQ 12.2
You can also show us an example of what you would want your data to look like.

Best wishes,
Patrick
Comment
Federica Tortorella

Join Date: Nov 2017

Posts: 4
#4

30 Nov 2017, 08:19

Hello Patrick,

with dataex you would not understand it properly, because you would need to look at it as a whole in order to have a clear picture so I should post the excel but I read it is not really appropriate to do it so I will try to explain it in the best way I can.
My goal is to obtain for each of the 269 targets other companies of the sample of 2573 companies that have the same characteristics of the targets in terms of sic code, employees and revenues exactly in the year before the acquisition of each target took place. I need stata to give me back the exact list of matching with the name and the ISIN numbers of the companies of each group matched.
I created two databases, one with all the information of the targets and one with the information of the other companies I want the targets to be matched with (control group). This control group also contains the targets but only until the year in which they have been acquired.
So let's assume that company X has been acquired in the year 2013. In the target database I have all the information (SIC, employees, revenues, Year, ISIN, Name) about X and I want to find companies from the control group that in the year before the acquisition of X took place (so in 2012) had the same characteristics of X and I want to know the ISIN numbers of these companies matched. In the control group I have the same type of info that I have in the target group (ISIN, SIC, revenues ecc) but about companies that have not been targeted. Moreover in the control group I also have the information of the target companies that are in the target group but only until the year before the acquisition. So, considering the example I made before, and considering that for all the companies I have data from 2002 until 2016, I would have company X in both databases. In the target database I will have the information about X only for 2012 (that is the year I am interested in in this case because it is the year before the acquisition took place) and in the control database I will have the information about X from 2002 and only until 2012. Obviously the acquisition date varies for each company.

Tell me if something is not clear.
Thank you very much,

Federica.
Comment

Federica Tortorella

Join Date: Nov 2017
Posts: 4

30 Nov 2017, 10:11

This is a sample of both databases with dataex.

THIS IS THE SAMPLE FROM THE TARGET DATABASE

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str12 ISIN str45 NAME str3 target int(ann1 Announcementdata) long(NetSales1 N_EMPLOYEES1) int SIC_Primary
"US0008863096" "ADC TELECOM. DEAD - DELIST.10/12/10"           "YES" 2009 2010  1087310  9050 3661
"US0081901003" "AFFILIATED CMP.SVS.'A' DEAD - DELIST.18/02/10" "YES" 2008 2009  6160550 65000 7374
"US0012041069" "AGL RESOURCES DEAD - DELIST.01/07/16"          "YES" 2014 2015  5385000  5165 4924
"CA0089161081" "AGRIUM (NYS)"                                  "YES" 2015 2016 19057883 15200 2873
"US0093631028" "AIRGAS DEAD - DELIST.23/05/16"                 "YES" 2014 2015  5072537 16000 2813
"US0130781000" "ALBERTO CULVER DEAD - DELIST.11/05/11"         "YES" 2009 2010  1433324  2500 2844
"US0173611064" "ALLEGHENY EN. DEAD - DELIST.28/02/11"          "YES" 2009 2010  3287300  4383 4911
"US0184901025" "ALLERGAN DEAD - DELIST.17/03/15"               "YES" 2013 2014  6300400 11400 2834
"US01903Q1085" "ALLIED CAP.NEW DEAD - DELIST.15/04/10"         "YES" 2008 2009   502244   130 6726
"US0195893088" "ALLIED WASTE INDS. DEAD - DELIST.18/12/08"     "YES" 2007 2008  6068700 23300 4953
end

THIS THE SAMPLE OF THE CONTROL DATABASE

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str12 ISIN str10 NAME int(year SIC_Primary) long NET_SALES_OR_REVENUES int N_EMPLOYEES
"US3369011032" "1ST SOURCE" 2002 6022 281440 1233
"US3369011032" "1ST SOURCE" 2003 6022 245508 1220
"US3369011032" "1ST SOURCE" 2004 6022 225190 1200
"US3369011032" "1ST SOURCE" 2005 6022 237675 1200
"US3369011032" "1ST SOURCE" 2006 6022 285579 1200
"US3369011032" "1ST SOURCE" 2007 6022 328316 1350
"US3369011032" "1ST SOURCE" 2008 6022 318639 1280
"US3369011032" "1ST SOURCE" 2009 6022 285942 1170
"US3369011032" "1ST SOURCE" 2010 6022 286869 1160
"US3369011032" "1ST SOURCE" 2011 6022 267807 1160
"US3369011032" "1ST SOURCE" 2012 6022 262340 1180
"US3369011032" "1ST SOURCE" 2013 6022 256797 1100
"US3369011032" "1ST SOURCE" 2014 6022 256441 1100
"US3369011032" "1ST SOURCE" 2015 6022 268000 1150
"US3369011032" "1ST SOURCE" 2016 6022 280705 1150
"US90214J1016" "2U"         2002 7371      .    .
"US90214J1016" "2U"         2003 7371      .    .
"US90214J1016" "2U"         2004 7371      .    .
"US90214J1016" "2U"         2005 7371      .    .
"US90214J1016" "2U"         2006 7371      .    .
"US90214J1016" "2U"         2007 7371      .    .
"US90214J1016" "2U"         2008 7371      .    .
"US90214J1016" "2U"         2009 7371      .    .
"US90214J1016" "2U"         2010 7371      .    .
"US90214J1016" "2U"         2011 7371  29733  301
"US90214J1016" "2U"         2012 7371  55879  426
"US90214J1016" "2U"         2013 7371  83127  594
"US90214J1016" "2U"         2014 7371 110239  784
"US90214J1016" "2U"         2015 7371 150194 1004
"US90214J1016" "2U"         2016 7371 205864 1209
end

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30118

30 Nov 2017, 11:11

Thank you for the -dataex- posts. Unfortunately, the example data you show does not contain any potential matches among the controls for any of the cases. Try it by hand--you will see that there are no controls shown that could meet your matching criteria with any of the cases. In fact, even if we disregard the requirements for near-matching on NetSales and N_EMPLOYEES, and just look at matching on year and SIC_Primary, there are no possible matches shown. So, the following code is not fully tested and may contain errors. But it may point you in the right direction:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str12 ISIN str45 NAME str3 target int(ann1 Announcementdata) long(NetSales1 N_EMPLOYEES1) int SIC_Primary
"US0008863096" "ADC TELECOM. DEAD - DELIST.10/12/10"           "YES" 2009 2010  1087310  9050 3661
"US0081901003" "AFFILIATED CMP.SVS.'A' DEAD - DELIST.18/02/10" "YES" 2008 2009  6160550 65000 7374
"US0012041069" "AGL RESOURCES DEAD - DELIST.01/07/16"          "YES" 2014 2015  5385000  5165 4924
"CA0089161081" "AGRIUM (NYS)"                                  "YES" 2015 2016 19057883 15200 2873
"US0093631028" "AIRGAS DEAD - DELIST.23/05/16"                 "YES" 2014 2015  5072537 16000 2813
"US0130781000" "ALBERTO CULVER DEAD - DELIST.11/05/11"         "YES" 2009 2010  1433324  2500 2844
"US0173611064" "ALLEGHENY EN. DEAD - DELIST.28/02/11"          "YES" 2009 2010  3287300  4383 4911
"US0184901025" "ALLERGAN DEAD - DELIST.17/03/15"               "YES" 2013 2014  6300400 11400 2834
"US01903Q1085" "ALLIED CAP.NEW DEAD - DELIST.15/04/10"         "YES" 2008 2009   502244   130 6726
"US0195893088" "ALLIED WASTE INDS. DEAD - DELIST.18/12/08"     "YES" 2007 2008  6068700 23300 4953
end
tempfile cases
save `cases'

* Example generated by -dataex-. To install: ssc install dataex
clear
input str12 ISIN str10 NAME int(year SIC_Primary) long NET_SALES_OR_REVENUES int N_EMPLOYEES
"US3369011032" "1ST SOURCE" 2002 6022 281440 1233
"US3369011032" "1ST SOURCE" 2003 6022 245508 1220
"US3369011032" "1ST SOURCE" 2004 6022 225190 1200
"US3369011032" "1ST SOURCE" 2005 6022 237675 1200
"US3369011032" "1ST SOURCE" 2006 6022 285579 1200
"US3369011032" "1ST SOURCE" 2007 6022 328316 1350
"US3369011032" "1ST SOURCE" 2008 6022 318639 1280
"US3369011032" "1ST SOURCE" 2009 6022 285942 1170
"US3369011032" "1ST SOURCE" 2010 6022 286869 1160
"US3369011032" "1ST SOURCE" 2011 6022 267807 1160
"US3369011032" "1ST SOURCE" 2012 6022 262340 1180
"US3369011032" "1ST SOURCE" 2013 6022 256797 1100
"US3369011032" "1ST SOURCE" 2014 6022 256441 1100
"US3369011032" "1ST SOURCE" 2015 6022 268000 1150
"US3369011032" "1ST SOURCE" 2016 6022 280705 1150
"US90214J1016" "2U"         2002 7371      .    .
"US90214J1016" "2U"         2003 7371      .    .
"US90214J1016" "2U"         2004 7371      .    .
"US90214J1016" "2U"         2005 7371      .    .
"US90214J1016" "2U"         2006 7371      .    .
"US90214J1016" "2U"         2007 7371      .    .
"US90214J1016" "2U"         2008 7371      .    .
"US90214J1016" "2U"         2009 7371      .    .
"US90214J1016" "2U"         2010 7371      .    .
"US90214J1016" "2U"         2011 7371  29733  301
"US90214J1016" "2U"         2012 7371  55879  426
"US90214J1016" "2U"         2013 7371  83127  594
"US90214J1016" "2U"         2014 7371 110239  784
"US90214J1016" "2U"         2015 7371 150194 1004
"US90214J1016" "2U"         2016 7371 205864 1209
end

tempfile controls
save `controls'

//    CREATE A CONTROLS FILE WITH VARIABLE NAMES SUITABLE FOR MATCHING TO
//    THE CASES
use `controls', clear
rename year ann1 // IN CASES DATA ann1 IS THE YEAR IN WHICH THEY SHOULD MATCH
rename NET_SALES_OR_REVENUES NetSales1
rename N_EMPLOYEES N_EMPLOYEES1
drop if missing(ann1, NetSales1, N_EMPLOYEES)
tempfile potential_matches
list, noobs clean
save `potential_matches'

//    BEGIN MATCHING BY JOINING CASES WITH POTENTIAL MATCHES
//    RESTRICT MATCING TO THOSE WITHIN 10% ON NetSales1
use `cases', clear
gen ub = 1.1*NetSales1
gen lb = 0.9*NetSales1
rangejoin NetSales1 lb ub using `potential_matches', by(ann1 SIC_Primary)
//    NOW ELIMINATE PAIRINGS THAT DO NOT MATCH WITHIN 10% ON N_EMPLOYEES
keep if inrange(N_EMPLOYEES1_U, 0.9*N_EMPLOYEES1, 1.1*N_EMPLOYEES1)

//    SELECT A SINGLE MATCH AT RANDOM FROM AMONG THE POTENTIAL MATCHES
set seed 1234 // OR YOUR FAVORITE NUMBER
gen double shuffle = runiform()
by ISIN (shuffle), sort: keep if _n == 1

Notes:

1. You must install -rangejoin- from SSC, and in order for that to work you also need -rangestat-, also from SSC. These are user-written programs (Robert Picard, and Robert Picard, Nick Cox, & Roberto Ferrer, respectively) that you will probably find extremely useful for other purposes going forward as well.

2. You never said how many matched controls per case you want. The above code selects 1. By changing the final line, you can get (up to) whatever number you desire. Remember that there is no guarantee that the number of matches you would like to have actually exist.

3. The same control may end up matched to more than one case. This is not a problem from a statistical perspective.

4. Note that the code involves fiddling around with the names of the variables in the two data sets so that the corresponding variables have the same name in both. You might have wanted to do that anyway; it's usually a good practice to call the same thing by the same name across data sets in a project.

5. The number specified in the -set seed- command is arbitrary, and you can pick any integer you like; it determines the starting state of the random number generator, so you will get different random assignments with different seeds, but all of them are equally valid.

Comment

Federica Tortorella

Join Date: Nov 2017

Posts: 4
#7

30 Nov 2017, 11:32

Thank you Clyde.
The reason why there are no matches is that for dataex I took only two companies of the 2573.
I want Stata to find me for the companies of the target group and for the year corresponding to the announcement year, companies from the control sample that in that specific year had the same characteristic of the target. The same company of the control sample could be similar to more than one target companies but in different years. So in which of two databases should I rename the label as year ann1?
Also, I need an exact match for the SIC codes and a match within a range for the number of employees and revenues. Is this possible to do it in a single command?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#8

30 Nov 2017, 11:45

I want Stata to find me for the companies of the target group and for the year corresponding to the announcement year, companies from the control sample that in that specific year had the same characteristic of the target. The same company of the control sample could be similar to more than one target companies but in different years.

You seem to be contradicting yourself here, what am I missing? In earlier posts you said you want each case to be matched with controls based on the controls' data in the year of announcement. But the last sentence seems to be backing off of this. The code shown in #6 requires the match to be based on data in the announcement year.

So in which of two databases should I rename the label as year ann1?

In your example data, the case data set does not contain a variable named year: it has two variables that look, to my eyes, like years, ann1 and Announcementdata. And, at least in the example you show, ann1 = Announcementdata - 1 in every case. So I interpreted this as meaning that ann1 is the year that you want to have the control firm's information matching based on. For that reason, the code in #6 renames the variable year from the second dataset to ann1. If my interpretation is not correct, then you need to modify the code accordingly.

Also, I need an exact match for the SIC codes and a match within a range for the number of employees and revenues.

Did you read the code in #6? That's what it does (or, rather, will do if provided with suitable data and if it doesn't have errors I've missed.)

Is this possible to do it in a single command?

I don't think so.
Comment