
  • #46
    What would the code be if I want to match on two continuous variables?
    The point of what I had said earlier is that there is no general answer to this question because the question itself is incompletely specified.

    Suppose for a given M&A acquirer we have several potential matches. Say one of the matches agrees exactly with the acquirer on Probability but is very different on leverage deficit. And suppose another match agrees exactly on leverage deficit but is appreciably different on Probability. And suppose there is a third potential match that is in good, but imperfect, agreement with the M&A acquirer on both Probability and leverage deficit. You have to specify a rule for which one to select in this situation (which, in the normal course of events, will happen frequently).

    It will only require one use of -joinby- (or, using a more modern approach, -rangejoin- or -runby-), but the details depend on how you handle the situation I describe in the preceding paragraph. You need a specific rule or set of rules for making those decisions, and those rules have to be implemented in the code.

    So the bottom line is: finish spelling out your question, and then there will be an answer. At the moment the question is incomplete and unanswerable.



    • #47
      Hello Clyde Schechter,

      Thank you for all your help so far. My problem is very similar to Florian's, in the sense that I also have two deltas that I want to match my firms on.

      Originally posted by Clyde Schechter:
      Perhaps you want the match with the smallest value of delta1+delta2; that's a way of saying the closeness on delta1 and delta2 are equally important.
      My deltas are of equal importance, so I would want to match with the smallest value of delta1+delta2. Could you help me put that into code?

      Thanks again for your time.

      Michail



      • #48
        You wouldn't try to get driving directions to a place without saying where you're starting from. Similarly, you can't get code without showing example data. Please use the -dataex- command to do that. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

        When asking for help with code, always show example data. When showing example data, always use -dataex-.



        • #49
          I'm sorry, you are absolutely right. My problem is similar to the original poster's, as I would like to match 32 socially responsible (SRI for short) bond funds to conventional ones. The conventional fund sample has 864 funds. The matching has to be done on objective code, fund age, and fund size (in USD). These variables are available for both samples. Matching by objective code will be relatively straightforward using the -joinby- command, as you indicated in the first reply of this thread. In order to match by age and size, two delta variables have to be generated (delta1 and delta2), since finding exact age and size matches will probably be impossible. Following your instructions in the original response, the code so far looks something like this:

          Code:
          use 32_SRI_funds, clear
          rename fund_ticker SRI_fund_ticker
          rename fund_size SRI_fund_size
          rename fund_age SRI_fund_age

          joinby objective_code using 864_conventional_funds_sample
          gen delta1 = abs(fund_size - SRI_fund_size)
          gen delta2 = abs(fund_age - SRI_fund_age)
          drop if delta1 >= 485761571
          drop if delta2 >= 3

          This is what the 32 SRI funds look like from -dataex-:
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str5 fund_ticker byte fund_age long fund_size str4 objective_code
          "CFICX" 36  489039540 "CBG" 
          "CSDAX" 16 1340362813 "CBG" 
          "DSBFX" 18  152210418 "CBG" 
          "CBFVX"  5  882341440 "CBG" 
          "PRFIX" 26  228713811 "CBG" 
          "SEBFX"  3   25779042 "CBHQ"
          "CYBIX" 17  194073923 "CBHY"
          "PAXHX" 19  406923139 "CBHY"
          "RGHYX"  7   39828879 "CBHY"
          "TPHAX" 11   58230930 "CBHY"
          "WISEX"  8  110724397 "GI"  
          "KCCIX"  3   76286423 "GI"  
          "KCLIX"  3   98386190 "GI"  
          "PTSAX" 27 1000140173 "GI"  
          "CUBIX"  4  122222046 "IN"  
          "CSIBX" 31  977966276 "IN"  
          "CGBIX"  5  118601318 "IN"  
          "CLDAX" 14   86599999 "IN"  
          "CULAX" 12  982195310 "IN"  
          "SEACX" 14  163090613 "IN"  
          "GEDYX" 17  236689001 "IN"  
          "GLDZX" 17  924745416 "IN"  
          "GMDYX" 17 1491284459 "IN"  
          "PLDIX" 22  245377759 "IN"  
          "MIIAX" 19  519014467 "IN"  
          "CFVAX"  3  151014291 "IN"  
          "TSBIX"  6 1897417529 "IN"  
          "EPIAX"  8   22729912 "MSB" 
          "GGBFX" 12  544240473 "MSB" 
          "NCICX" 19  314619117 "MSB" 
          "TFIAX" 19   83053046 "MSB" 
          "TCPYX" 27  301228444 "MSB" 
          end
          (the fund ticker is simply an identifier for each fund)

          The part which I'm not sure how to implement is the matching using the two deltas. As you have indicated, because they are of equal importance, matching with the smallest value of delta1+delta2 would be what I need.

          Thanks again for your help. My apologies for the omitted information.



          • #50
            As you have indicated, because they are of equal importance, matching with the smallest value of delta1+delta2 would be what I need.
            Well, looking at your data, I would say this is the wrong conclusion. Your two delta variables are on very different scales. The standard deviation of fund_age is just 8.8, whereas that of fund_size is 4.9×10^8. So delta_size is almost always going to dwarf delta_age, which means that the simple sum delta_size+delta_age will differ from delta_size itself by little more than rounding error, and this code will be almost exclusively matching on fund size. (This effect will be somewhat mitigated by the line of code that drops observations where delta_size exceeds 485761571, but it does not defeat the thrust of my reasoning here: the imbalance will be a bit smaller, but you are still, in effect, matching only on fund size.)

            To combine these variables in a way that makes delta_age and delta_size about equally important, you will have to rescale them so that they are on more similar scales. Dividing size by 10^7 might do it. Or consider using log size instead of size for the purpose at hand (though you are then looking at a difference in age vs a ratio in size).
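
            Untested, but a sketch of this first approach, reusing the dataset and variable names from #49 (the divisor of 10^7 is only an illustrative choice, not a recommendation for these particular data), might look like this:
            Code:
            use 32_SRI_funds, clear
            rename fund_ticker SRI_fund_ticker
            rename fund_size SRI_fund_size
            rename fund_age SRI_fund_age
            joinby objective_code using 864_conventional_funds_sample

            * rescale so the two deltas are on roughly comparable scales
            gen delta1 = abs(fund_size - SRI_fund_size)/1e7
            gen delta2 = abs(fund_age - SRI_fund_age)
            gen total_delta = delta1 + delta2

            * keep, for each SRI fund, the candidate with the smallest combined delta
            bysort SRI_fund_ticker (total_delta): keep if _n == 1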

            Another approach to matching on multiple continuous variables is not to combine them algebraically but to define an acceptable match by specifying a maximum delta on each variable and accepting only pairs that meet the criteria for both variables, which is what your two -drop if- commands do. You might want to just make both of those criteria more stringent, and then randomly sample from the surviving pairs. This would not assure a "best" match, in any reasonable sense of the word best, for any observations but it would assure that there are no terrible matches for any.
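
            Again untested, a sketch of this second approach, starting over from the -joinby- step in #49 (the calipers and the seed here are arbitrary illustrative values):
            Code:
            gen delta1 = abs(fund_size - SRI_fund_size)
            gen delta2 = abs(fund_age - SRI_fund_age)
            drop if delta1 >= 1e8    // stricter size caliper (example value only)
            drop if delta2 >= 2      // stricter age caliper (example value only)

            * draw one acceptable match at random for each SRI fund
            set seed 1234
            gen double shuffle = runiform()
            bysort SRI_fund_ticker (shuffle): keep if _n == 1
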
            Last edited by Clyde Schechter; 23 Mar 2018, 19:29.



            • #51
              Dear Clyde Schechter,

              You were right about the size. Of course it seems very obvious now, but I totally missed it in the beginning. Even after scaling fund_size to be somewhat comparable to fund_age, the algorithm would still match some funds that differ by more than 3 years in age, which would not be acceptable for my research. So in the end I had to prioritize age over size and not use a delta1+delta2 variable for the matching. I managed to find 3 matches for each SRI fund with the exception of a single SRI fund, for which I could find just one match according to the set criteria. But since my samples are so small anyway, I suppose this will have to do for now. Thanks again for your advice and patience.



              • #52
                Hello,

                I went through this thread on matching samples. I would like to request some help. This is my first post, so apologies in case I have not followed the prescribed format.

                I have been using statalist's help for a while. In the past, I got clarifications from going through other users' posts.

                I am trying to calculate the performance matched discretionary accruals as per the paper by Kothari et al (2005) - Kothari, S. P., Leone, A. J., & Wasley, C. E. (2005). Performance matched discretionary accrual measures. Journal of accounting and economics, 39(1), 163-197.

                The paper describes calculating performance matched discretionary accruals by matching firms in the same industry on Return on Assets (ROA). The next step is to calculate the difference between the discretionary accruals of the original firms and the discretionary accruals of the corresponding matched firms. This difference is the performance matched discretionary accrual, as far as I understand. The discretionary accruals for the original as well as the matched firms are calculated using the Jones Model (1991). I have used code similar to https://robsonglasscock.wordpress.co...nary-accruals/ for this.

                However, I am unable to understand how to first match firms on their ROA. I plan to match firms from the original sample itself (so the same firm cannot be accepted as its own match), and I am unsure what to add to the Jones (1991) code so that it produces performance matched discretionary accruals. I am using Stata 12.0. Please find below the code that I have used to calculate discretionary accruals from the Jones Model (1991), and my data.

                I seek the community's help in writing code for (a) matching the firms by industry and ROA (the margin can be +/- 0.5%), and (b) the additions to the current code so that it gives performance matched discretionary accruals as output. After the code below I have also included my own rough attempt at the matching step, in case it helps show what I am after.

                Thanks in advance.


                Code:
                 clear
                 input long companycode str10 slotdate float roa long niccode float(ta rev ar gfa cfo ibet)
                  11 "01-03-1998"   7.19 8   507.9   300.3   99.9   231.5       .    29.2
                  11 "01-03-1999"   4.97 8   592.5   304.9    108   374.4       .    27.1
                  11 "01-03-2000"   5.31 8   685.7   354.2  129.6   400.8       .    33.4
                  11 "01-03-2001"   3.32 8   779.4   442.2  154.7   405.6       .    23.8
                  11 "01-03-2002"    .95 8   829.9   401.5  153.1   422.7       .     7.5
                  11 "01-03-2003"    -.9 8   850.5   462.5  129.6   479.6    43.1    -7.4
                  11 "01-03-2004"  -6.71 8   768.5   533.6  123.2   592.5    17.4   -53.6
                  11 "01-03-2005"  -8.78 8   698.5   578.3   80.4   666.3    68.2   -64.4
                  11 "01-03-2006"   2.58 8   742.9   720.8  106.9   681.5    53.8    18.6
                  11 "01-03-2007"   4.51 8   845.1   933.6  141.8   712.8    61.9    35.8
                  11 "01-03-2008"   4.11 8   978.4  1147.9  200.3   760.7      69    37.5
                  11 "01-03-2009"   4.38 8  1210.4  1454.8  233.7   811.1     1.4    47.9
                  11 "01-03-2010"   4.42 8  1583.7  1808.6  291.3   939.2    32.7    61.7
                  11 "01-03-2011"   3.02 8  1984.2  2408.3  403.8  1076.1    73.5    53.9
                  11 "01-03-2012"    4.4 8    2548  2678.4  448.3  1152.4   117.9    99.8
                  11 "01-03-2013"    .55 8  2805.9  2855.8  459.5  1876.8   405.9    14.6
                  11 "01-03-2014"   -.28 8  3207.7  3010.4    539  2038.7   231.1    -8.3
                  11 "01-03-2015"   -1.8 8  3289.8  3252.7  442.7  2098.2   244.6   -58.6
                  11 "01-03-2016"   2.98 8  3342.3  3495.1  560.7  2158.8   184.1    98.7
                  11 "01-03-2017"   3.72 8  3394.3  3731.2  548.5  2210.4   544.6   125.4
                 771 "01-03-2014"   8.35 8   968.6   766.6  279.1   307.6   -85.6    72.1
                 771 "01-03-2015"   6.05 8    1204   978.1  379.5   352.8   -23.1    65.6
                 771 "01-03-2016"   5.04 8    1776  1251.4  581.2   402.1  -283.2      75
                 771 "01-03-2017"   5.77 8  1941.9  1605.2    788   443.8    21.8   107.2
                 783 "01-03-1998"  -3.02 8   478.7     427   70.7   215.1     4.2   -14.8
                 783 "01-03-1999"  -2.16 8   426.1   401.4   68.7   218.3    44.7    -9.7
                 783 "01-03-2000"   -.72 8   460.3   388.1  109.3   220.7    10.8    -3.2
                 783 "01-03-2001"     .6 8   443.1   433.1   84.9   222.7    13.2     2.7
                 783 "01-03-2002"   1.16 8   437.7   444.1  106.9   228.1    27.1     5.1
                 783 "01-03-2003"    3.4 8   461.9   501.5  111.5   230.1    95.2    15.3
                 783 "01-03-2004"   5.46 8   415.8   620.8   67.8   232.9    61.4    23.9
                 783 "01-03-2005"    4.2 8   487.9   618.3   87.3   273.5     8.8    18.9
                 783 "01-03-2006"   5.05 8   594.8   629.3   64.4   450.9    67.8    27.3
                 783 "01-03-2007"  12.55 8   783.8   785.5   59.7     495   118.8    86.5
                 783 "01-03-2008"   9.16 8   987.6     892   73.5   598.2     100    81.1
                 783 "01-03-2009"   9.96 8  1084.6   967.6   78.8   617.6   167.8   103.2
                 783 "01-03-2010"  13.47 8  1256.4  1008.2  100.6   687.8   166.2   157.7
                 783 "01-03-2011"  11.85 8  1583.9  1109.7  271.1   756.6   224.3   168.3
                 783 "01-03-2012"   6.87 8  1705.5  1130.2    270   777.8    79.1     113
                 783 "01-03-2013"   4.38 8  2042.5  1306.6  333.4  1091.7      -2      82
                 783 "01-03-2014"   5.75 8  2158.3  1436.3  281.7  1115.1   225.8   120.7
                 783 "01-03-2015"   5.79 8  2252.8  1498.6  309.6  1280.1   -12.9   127.8
                 783 "01-03-2016"   6.18 8  2433.4  1609.6  311.9  1290.3   443.3   144.7
                 783 "01-03-2017"   4.87 8  2424.3  1665.8  314.2    1327   178.7   118.2
                1120 "01-03-1998"  10.13 8   570.5  1042.9  261.5   160.7       .    54.6
                1120 "01-03-1999"  11.42 8   602.5    1001    279   163.7       .      67
                1120 "01-03-2000"  15.38 8   804.6  1090.8  459.6   173.4       .   108.2
                1120 "01-03-2001"  11.93 8   882.5  1135.2  370.7   187.5       .   100.6
                1120 "01-03-2002"  11.46 8  1387.8  1210.1  450.9   196.3       .   130.1
                1120 "01-03-2003"  12.31 8  1138.6  1517.8  612.4   208.1    26.4   155.5
                1120 "01-03-2004"  16.11 8  1422.5  1667.3  661.3   227.5       .   206.3
                1120 "01-03-2005"  12.81 8  1716.5  2256.7  806.4     276    81.8   201.1
                1120 "01-03-2006"  14.34 8  3479.7  3045.9 1073.8   331.7   181.3   372.5
                1120 "01-03-2007"  14.73 8  5553.5  4282.7 1051.5   645.5   416.9   665.2
                1120 "01-03-2008"  17.16 8  7061.7  6400.8 1387.2  1712.8    56.5  1082.5
                1120 "01-03-2009"  15.45 8  9692.1    9805 1541.3  2568.6  1540.2  1294.1
                1120 "01-03-2010"  11.86 8 10981.9  8458.4 1839.3  2788.7   868.7  1225.5
                1120 "01-03-2011"  12.49 8  9766.2 10243.4 2475.2  3369.8   360.3  1296.1
                1120 "01-03-2012"   14.2 8   11426   13185 3027.1  4343.7   535.8  1504.8
                1120 "01-03-2013"  13.05 8 14308.4 16448.8 2344.3  4881.2  1654.4  1678.6
                1120 "01-03-2014"  20.15 8 16921.8 18446.3 2221.9  5541.2  3795.8  3147.1
                1120 "01-03-2015"  22.29 8 20245.2 21504.4 2275.6  7862.5  3246.9  4143.1
                1120 "01-03-2016"  27.64 8 24324.3 18856.2 4020.3  9660.9  3355.7    6159
                1120 "01-03-2017"  15.84 8 28345.1 21227.6 6622.4 10354.1   456.4  4170.1
                Code:
                 * note: the leading -clear- commands are commented out here so that the
                 * example data entered above stays in memory
                 * clear
                 * clear matrix

                 set more off
                 gen date = date(slotdate, "DMY", 2017)
                 gen year = year(date)

                 tostring niccode, replace
                 gen nic_2 = substr(niccode, 1, 2)    // first two digits of the NIC code
                 destring niccode nic_2, replace

                 egen combo = group(nic_2 year)       // industry-year groups
                 levelsof combo, local(a)

                 gen Jones_1991 = .

                 egen firm_id = group(companycode)
                 xtset firm_id year

                 gen obs = _n
                 summ obs
                 scalar e = r(min)
                 scalar f = r(max)

                 /*rename totalassets ta
                 rename tradereceivablesbillsreceivable ar
                 rename sales rev
                 rename grossfixedassets gfa
                 rename netcashflowfromoperatingactiviti cfo
                 rename patnetofpe ibet*/

                 drop if ta == .

                 gen delta_rec = d.ar                 // change in receivables (not used below)
                 gen delta_rev = d.rev                // change in revenue
                 gen lagged_ta = l.ta                 // lagged total assets
                 gen part1 = (ibet - cfo)/lagged_ta   // total accruals scaled by lagged TA
                 gen part2 = 1/lagged_ta
                 gen part3 = delta_rev/lagged_ta
                 gen part5 = gfa/lagged_ta

                 * Jones (1991): for each industry-year group, estimate the model on all
                 * other firm-years in the group and take the residual for the left-out
                 * observation
                 foreach k in `a' {
                     forvalues j = `=scalar(e)'/`=scalar(f)' {
                         if combo[`j'] == `k' {
                             capture noisily reg part1 part2 part3 part5 if combo == `k' & obs != `j', nocons
                             capture noisily predict uhat_2, resid
                             capture noisily replace uhat_2 = . if e(N) < 10
                             capture noisily replace Jones_1991 = uhat_2 if combo == `k' & obs == `j'
                             capture noisily drop uhat_2
                         }
                         di `k', `j'
                     }
                 }

                 bys nic_2: summ Jones_1991
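
                And this is my own rough attempt at the matching step, run after the code above has produced Jones_1991; I am not at all sure it is correct, the match_* and pm_DA variable names are just placeholders I made up, and the +/- 0.5 caliper is in percentage points because roa is stored as a percentage:
                Code:
                 preserve
                 keep companycode year nic_2 roa Jones_1991
                 rename companycode match_code
                 rename roa match_roa
                 rename Jones_1991 match_DA
                 tempfile candidates
                 save `candidates'
                 restore

                 joinby nic_2 year using `candidates'     // all same industry-year pairs
                 drop if companycode == match_code        // a firm cannot be its own match
                 drop if missing(match_DA)
                 gen delta_roa = abs(roa - match_roa)
                 keep if delta_roa <= 0.5                 // ROA within +/- 0.5 percentage points

                 * keep the single closest-ROA match for each firm-year
                 bysort companycode year (delta_roa): keep if _n == 1
                 gen pm_DA = Jones_1991 - match_DA        // performance matched discretionary accruals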



                • #53
                  Dear Stata Members
                  I am doing an analysis based on public firms and private firms (similar to the one posted at https://www.statalist.org/forums/for...46#post1346546). For matching, I read this post. However, there are a few doubts I have in mind before learning the matching technique.
                  1) In many papers, it is written in the robustness section that their results hold (based on their analyses) even for the unmatched sample. In that case, would it be a good starting point to do the analysis with unmatched data first to get some preliminary results? Can we trust those results?
                  2) Instead of matching, would it be alright to do the analysis in two sub-samples? Continuing my example, say my dependent variable is Profitability and my variable of interest is Managerial quality. What if I run a panel regression in the two samples (public and private, indicated by a dummy coded 0 and 1)? Is my beta on Managerial quality not comparable between the private and public samples? Is it because of the sample size, or because of the innumerable factors that differentiate public and private firms?

