Matching sample - Statalist

Marko Pucelj

Join Date: May 2015

Posts: 15
#16

30 May 2015, 17:57

With regard to the selection of year to match on, you say that company 1 has observations for every year from 2000-2014 but you only want to match it in 2007. Where does 2007 come from? Why 2007 and not 2000, or 2006, or 2008, or 2014, or any of the other years?
[/QUOTE]

The year comes from the thing I am trying to test later in a difference in difference test. I am looking at what happens with the company after it delists from a stock exchange. Companies delist at different times. Sorry for not making this clear before.
And I don't see how I could make a rule for that, so I will just filter them out, should not be too much work.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#17

30 May 2015, 19:21

Well, you may or may not have to. It depends on how your data sets are structured. I can see that in responding to your post I have assumed many things about your data, some of which have turned out to be untrue. Among these, I assumed that company_name uniquely identifies the observations in both of the data sets involved.

The purpose of the code here is to purge compustat_global_sample.dta of firms that also occur in my_sample_global.dta. If I am correct in assuming that the company_name variable in my_sample_global.dta do uniquely identify the observations in that data set, then it's as simple as just changing -1:1- to -m:1- in the -merge- command.

If the company name variable does not uniquely identify the observations in my_sample_global.dta, then I would reverse the approach:

Code:

used /Applications/Stata/final_names_matching_COMPUSTAT/my_sample_global, clear keep company_name duplicates drop merge 1:m company_name using /Applications/Stata/final_names_matching_COMPUSTAT/compustat_global_sample, /// keep(using) nogenerate

You will now have in memory the compustat_global_sample.dta purged of all observations where the company_name also appears in my_sample_global.dta.
Comment
Marko Pucelj

Join Date: May 2015

Posts: 15
#18

31 May 2015, 11:36

I changed the -1:1- to -m:1- in the code and I get the same error.
Currently my sample is in format that is straight from compustat.
So
year name country assets
2000 A CAN 2345
2001 A CAN 2453
2002 A CAN 2297
2000 B CAN 9876
2001 B CAN 876
2005 C CAN 9865
2010 D CAN 9862

Would there be a way to make the company_name as an unique identifier?
Comment

Marko Pucelj

Join Date: May 2015
Posts: 15

#19

31 May 2015, 11:44

Code:

use /Applications/Stata/final_names_matching_COMPUSTAT/my_sample_canada_filtered
> .dta, clear

. keep company_name

. duplicates drop

Duplicates in terms of all variables

(128 observations deleted)

merge 1:m company_name using /Applications/Stata/final_names_matching_COMPUSTAT/
> compustat_global_sample,
(note: variable company_name was str27, now str33 to accommodate using data's
       values)

    Result                           # of obs.
    -----------------------------------------
    not matched                       405,635
        from master                         9  (_merge==1)
        from using                    405,626  (_merge==2)

    matched                                15  (_merge==3)
    -----------------------------------------

.     keep(using) nogenerate
invalid syntax
r(198);

. keep(using) company_name nogenerate
invalid syntax
r(198);

If I try the other approach I get the error in the last line. I assume there is something missing after keep(using)?

Comment

Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#20

31 May 2015, 12:13

the guidance the Clyde gave you was a single command (the "///" means to continue to the next line as though there was no line break); you, however, are treating it as two lines and that does not and will not work; if you are using a "do" file, follow exactly what Clyde sent; if you are doing it interactively, ignore the "///" and just keep typing
Comment
Marko Pucelj

Join Date: May 2015

Posts: 15
#21

31 May 2015, 16:00

Thank you for all the help. It worked out great in the end.
Comment
David Tranv

Join Date: Jun 2016

Posts: 47
#22

23 Jun 2016, 13:52

Hi everyone,
I'm new in Statalist and using Stata v14, 64bit. I'm sorry if I have something not compliance with the rule of forum. I tried to post code as the post above but I cannot find it.

Actually, I have quite the same problem as Marko, but a little bit different. In fact, I have to match each public firms (master file contains 102826 firms-quarter observations) with a private fimrs (using file contains 738111 firms-quarter observations), and closest in size (gta_w) from every beginning quarter of sample(I created "c_1quarter" the first quarter by firms). And in case that no match can be found, I discard the observation and look for a match in the following quarter. I follow this procedure from the paper (page 9/50, paragraph 3,http://rfs.oxfordjournals.org/content/28/2/342.abstract).

I tried to do the same code as suggested Clyde Schechter
Code:
. joinby gta_w c_1quarter using "~\1.1.Match\1.control.dta"

It creates a new file containing 37958 observations but I dont know why the variables "c_1quarter" contains only missing value.

Furthermore, it seems that the match is exact and not closest: I mean that "joinby" find an exact match between "gta_w" of 2 samples, then when I run:
. gen delta = abs(m_gta_w - c_gta_w) /* m_ designs the variables from master file; c_ designs the variables from control file */
. sum delta
"delta" contains only 0.

Could you find the errors in my case?

Many thanks for your help

David

Last edited by David Tranv; 23 Jun 2016, 14:42.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#23

23 Jun 2016, 14:25

I don't think your question can be answered without some example data to work with. The -joinby- command you wrote appears to be correct, but it is not at all clear why it would produce the results you have gotten. Please use the -dataex- command (-ssc install dataex-, -help dataex-) to show small representative samples of your master and using data (and please choose examples where there will actually be some matches).
Comment

David Tranv

Join Date: Jun 2016
Posts: 47

#24

23 Jun 2016, 15:01

I suppose that the sample after "joinby" is already matched (but just based on "gta_w").
Here are the data,
"c_quarter / m_quarter" is the quarter in control / master file
I create m_1quarter m_2quarter to design the first quarter of each observation in the master file. Similar with control file.

Code:

* Master file (Public firms). I put "m_" before each variable to design that is from master file.
* 
clear
input float(m_id m_gta_w m_quarter m_1quarter m_2quarter m_qr1 m_qr2)
 3  89377 120 120   . 1 0
 3  90986 121   . 121 0 1
 3  92664 122   .   . 0 0
 3  89527 123   .   . 0 0
 3  87458 124   .   . 0 0
 3  98440 125   .   . 0 0
 3 100648 126   .   . 0 0
 3  96856 127   .   . 0 0
 3 112091 128   .   . 0 0
 3 180284 133   .   . 0 0
 3 200175 134   .   . 0 0
10  80139 120 120   . 1 0
10  82049 121   . 121 0 1
10  84082 122   .   . 0 0
10  87355 123   .   . 0 0
10  88751 124   .   . 0 0
10  96750 125   .   . 0 0
10  92917 126   .   . 0 0
10  85805 127   .   . 0 0
10  88820 128   .   . 0 0
end
format %tq m_quarter
format %tq m_1quarter
format %tq m_2quarter

Code:

* Control file (Private firms). "c_" means variables from control file.
clear
input float(c_id c_gta_w c_quarter c_1quarter c_2quarter c_qr1 c_qr2)
1 26550 121 121   . 1 0
1 27076 122   . 122 0 1
1 27403 123   .   . 0 0
1 28722 124   .   . 0 0
1 29917 125   .   . 0 0
1 30049 126   .   . 0 0
1 30459 127   .   . 0 0
1 32496 128   .   . 0 0
1 34221 129   .   . 0 0
1 35438 130   .   . 0 0
1 37575 131   .   . 0 0
1 39247 132   .   . 0 0
1 44611 133   .   . 0 0
1 46069 134   .   . 0 0
2 29295 104 104   . 1 0
2 30015 105   . 105 0 1
2 30641 106   .   . 0 0
2 32130 107   .   . 0 0
2 32310 108   .   . 0 0
2 33360 109   .   . 0 0
end
format %tq c_quarter
format %tq c_1quarter
format %tq c_2quarter

And for the joinby sample,

Code:

. joinby gta_w c_1quarter using "~\1.1.Match\1.control.dta"

Code:

clear
input float(c_id c_gta_w c_quarter c_1quarter c_2quarter c_qr1 c_qr2 ///
m_id m_gta_w m_1quarter m_2quarter m_qr1 m_qr2)

 3451  90986 136 . . 0 0  3  90986 . 121 0 1
 2203  89527 148 . . 0 0  3  89527 .   . 0 0
 6712  87458 111 . . 0 0  3  87458 .   . 0 0
 2489  98440 119 . . 0 0  3  98440 .   . 0 0
 4474 100648 122 . . 0 0  3 100648 .   . 0 0
 1883  96856 148 . . 0 0  3  96856 .   . 0 0
 3497 112091 126 . . 0 0  3 112091 .   . 0 0
 7225 200175 176 . . 0 0  3 200175 .   . 0 0
 1715  82049 208 . . 0 0 10  82049 . 121 0 1
 2427  84082 135 . . 0 0 10  84082 .   . 0 0
12540  87355 116 . . 0 0 10  87355 .   . 0 0
 2409  88751 202 . . 0 0 10  88751 .   . 0 0
  615  96750 188 . . 0 0 10  96750 .   . 0 0
  122  92917 154 . . 0 0 10  92917 .   . 0 0
 7132  85805 196 . . 0 0 10  85805 .   . 0 0
  184  88820 111 . . 0 0 10  88820 .   . 0 0
 6285  84636 117 . . 0 0 10  84636 .   . 0 0
 3680  85708 176 . . 0 0 10  85708 .   . 0 0
15714  85815 200 . . 0 0 10  85815 .   . 0 0
 7214  87076 115 . . 0 0 10  87076 .   . 0 0
end
format %tq c_quarter
format %tq c_1quarter
format %tq c_2quarter
format %tq m_1quarter
format %tq m_2quarter

Thanks you

David

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#25

23 Jun 2016, 18:12

OK, you have two problems here.

One is a syntax problem. In order for -joinby- to work, the variables specified as the join link have to be in both data sets, and have the same name in both data sets. By having one named m_gta_w and the other c_gta_w (ditto for quarter) join fails because there are no variables with common names in the two data sets. So you have to rename the link variables in both data sets to harmonize the names:

Code:

use control, clear rename c_gta_w gta_w rename c_quarter quarter tempfile control save `control' use master, clear rename m_gta_w gta_w rename m_quarter quarter joinby gta_w quarter using `control'

Now, with the data example you posted, you will still come up with an empty join because there is also a data problem. Look at the numbers in gta_w in both files. There are no overlaps at all: none of the observations in master has the same value of gta_w as any observation in control. So, of course, there will be no matches. I don't know if this is just an artifact of the particular samples you selected to illustrate your situation, or if this characterizes your data as a whole. If it characterizes the data as a whole, there are no matches to be found, and you need to investigate why your data sets don't match up at all.
Comment
David Tranv

Join Date: Jun 2016

Posts: 47
#26

23 Jun 2016, 19:04

Thanks for your return.
1/ For the first problem, I think I don't have this problem since I created gta_w for both sample, as well as c_1quarter in master file. I'm sorry to not show you all code before!
2/ For the 2nd problem: I'm not sure to understand what you mean "overlap": With c_id ==3451, c_gta_w==90986, and that is the gta_w of m_id==3 (First line of results I show above).

The data is real. You can find here: "https://www.dropbox.com/s/8at1p8obpdvk7gg/1.2.Joinby.rar?dl=0."
(I cannot upload to forum, even with zip file)

Here are all my code for joinby:

Code:

use ~\1.master.dta" cap drop c_1quarter gen c_1quarter = m_1quarter /*Create c_1quarter in master file to use in joinby*/ format c_1quarter %tq duplicates report gta_w c_1quarter duplicates drop gta_w c_1quarter,force /*drop all duplicates gta_w c_1quarter in master file*/ joinby gta_w c_1quarter using `control'
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#27

23 Jun 2016, 19:27

1/ For the first problem, I think I don't have this problem since I created gta_w for both sample, as well as c_1quarter in master file. I'm sorry to not show you all code before!
2/ For the 2nd problem: I'm not sure to understand what you mean "overlap": With c_id ==3451, c_gta_w==90986, and that is the gta_w of m_id==3 (First line of results I show above).

Maybe in your real data, but in the example data you posted in #24, the variables in the master data all begin with m_, and those in the control data all begin with c_, so there are no variables names common to both data sets. And, again in your example data in #24, gta_2 appears in the master file but not in the control file.

In the code you show in #26, if there is a variable named gta_w (not m_gta_w or c_gta_w) and a variable named c1_quarter in both data sets, that code should work to find whatever matches there are in the data. But if the data look like what you showed in #24, there aren't any matches to be found.

If you're still having problems, I suggest you try posting different sample data. In both the master and control files, be sure to name the link variables gta_w and quarter, with no prefixes, and then do something like -keep if inrange(gta_w, 20000, 25000)- in both files so that the gta_w numbers in what you post will be more likely to yield matches.
Comment

David Tranv

Join Date: Jun 2016
Posts: 47

#28

23 Jun 2016, 20:26

I still have problem! the quarter1 (formerly c_1quarter) have no observation.

I did the following code (with and without -keep if inrange(gta_w, 25000,100000)-, I did not do -keep if inrange(gta_w, 20000, 25000)- as you suppose since gta_w in master file have min of 25000.

Code:

global keep2 gta_w rssd9001 rssd9999 id quarter listed2 qr1 qr2 
 use "~\1.control.dta", clear
    drop gta_w /*I created it to match, then in control file, there are gta_w and c_gta_w*/
    foreach var in $keep2 {
    rename c_`var' `var'
    }
    rename c_1quarter quarter1
    rename c_2quarter quarter2
    keep if inrange(gta_w, 25000,100000)
    tempfile control
    save `control'
    
use "~\1.master.dta",clear //96293 obs
    drop gta_w
    foreach var in $keep2 {
    rename m_`var' `var'
    }
    rename m_1quarter quarter1
    rename m_2quarter quarter2
    drop quarter2 /*I try to drop 1 obs in master data to see how it match */
    joinby gta_w quarter1 using `control'

Here is control file:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(rssd9001 rssd9999) float(quarter id listed2 gta_w qr1 qr2 quarter1)
 46 19900331 120  3 1  89377 1 0 120
 46 19900630 121  3 1  90986 0 1   .
 46 19900930 122  3 1  92664 0 0   .
 46 19901231 123  3 1  89527 0 0   .
 46 19910331 124  3 1  87458 0 0   .
 46 19910630 125  3 1  98440 0 0   .
 46 19910930 126  3 1 100648 0 0   .
 46 19911231 127  3 1  96856 0 0   .
 46 19920331 128  3 1 112091 0 0   .
 46 19930630 133  3 1 180284 0 0   .
 46 19930930 134  3 1 200175 0 0   .
532 19900331 120 10 1  80139 1 0 120
532 19900630 121 10 1  82049 0 1   .
532 19900930 122 10 1  84082 0 0   .
532 19901231 123 10 1  87355 0 0   .
532 19910331 124 10 1  88751 0 0   .
532 19910630 125 10 1  96750 0 0   .
532 19910930 126 10 1  92917 0 0   .
532 19911231 127 10 1  85805 0 0   .
532 19920331 128 10 1  88820 0 0   .
532 19920630 129 10 1  85918 0 0   .
532 19920930 130 10 1  84636 0 0   .
532 19921231 131 10 1  85708 0 0   .
532 19930331 132 10 1  85815 0 0   .
532 19930630 133 10 1  87076 0 0   .
532 19930930 134 10 1  89105 0 0   .
532 19931231 135 10 1  97989 0 0   .
532 19940331 136 10 1  98254 0 0   .
532 19940630 137 10 1 100325 0 0   .
532 19940930 138 10 1 103915 0 0   .
end
format %tq quarter
format %tq quarter1

Master file:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(rssd9001 rssd9999) float(quarter id listed2 gta_w qr1 qr2 quarter1 quarter2)
28 19900630 121 1 0 26550 1 0 121   .
28 19900930 122 1 0 27076 0 1   . 122
28 19901231 123 1 0 27403 0 0   .   .
28 19910331 124 1 0 28722 0 0   .   .
28 19910630 125 1 0 29917 0 0   .   .
28 19910930 126 1 0 30049 0 0   .   .
28 19911231 127 1 0 30459 0 0   .   .
28 19920331 128 1 0 32496 0 0   .   .
28 19920630 129 1 0 34221 0 0   .   .
28 19920930 130 1 0 35438 0 0   .   .
28 19921231 131 1 0 37575 0 0   .   .
28 19930331 132 1 0 39247 0 0   .   .
28 19930630 133 1 0 44611 0 0   .   .
28 19930930 134 1 0 46069 0 0   .   .
37 19860331 104 2 0 29295 1 0 104   .
37 19860630 105 2 0 30015 0 1   . 105
37 19860930 106 2 0 30641 0 0   .   .
37 19861231 107 2 0 32130 0 0   .   .
37 19870331 108 2 0 32310 0 0   .   .
37 19870630 109 2 0 33360 0 0   .   .
37 19870930 110 2 0 33958 0 0   .   .
37 19871231 111 2 0 35910 0 0   .   .
37 19880331 112 2 0 35675 0 0   .   .
37 19880630 113 2 0 36266 0 0   .   .
37 19880930 114 2 0 36321 0 0   .   .
37 19881231 115 2 0 37796 0 0   .   .
37 19890331 116 2 0 39716 0 0   .   .
37 19890630 117 2 0 40275 0 0   .   .
37 19890930 118 2 0 40807 0 0   .   .
37 19891231 119 2 0 43824 0 0   .   .
end
format %tq quarter
format %tq quarter1
format %tq quarter2

And the result after all: Joinby sample with 19192 obs from Control: 271842 obs, Master: 108826 obs

Code:

clear
input double(rssd9001 rssd9999) float(quarter id listed2 gta_w qr1 qr2 quarter1 quarter2)
  46 19900630 121  3 1 90986 0 1 . 121
  46 19901231 123  3 1 89527 0 0 .   .
  46 19910331 124  3 1 87458 0 0 .   .
  46 19910630 125  3 1 98440 0 0 .   .
  46 19911231 127  3 1 96856 0 0 .   .
 532 19900630 121 10 1 82049 0 1 . 121
 532 19900930 122 10 1 84082 0 0 .   .
 532 19901231 123 10 1 87355 0 0 .   .
 532 19910331 124 10 1 88751 0 0 .   .
 532 19910630 125 10 1 96750 0 0 .   .
 532 19910930 126 10 1 92917 0 0 .   .
 532 19911231 127 10 1 85805 0 0 .   .
 532 19920331 128 10 1 88820 0 0 .   .
 532 19920930 130 10 1 84636 0 0 .   .
 532 19921231 131 10 1 85708 0 0 .   .
 532 19930331 132 10 1 85815 0 0 .   .
 532 19930630 133 10 1 87076 0 0 .   .
 532 19930930 134 10 1 89105 0 0 .   .
 532 19931231 135 10 1 97989 0 0 .   .
 532 19940331 136 10 1 98254 0 0 .   .
1258 19900630 121 18 1 58199 0 1 . 121
4558 19940331 136 65 1 87107 0 1 . 136
4558 19940630 137 65 1 90832 0 0 .   .
4558 19940930 138 65 1 89911 0 0 .   .
4558 19941231 139 65 1 93196 0 0 .   .
4558 19950331 140 65 1 90216 0 0 .   .
4558 19950630 141 65 1 96155 0 0 .   .
5555 19900630 121 78 1 67989 0 1 . 121
5555 19900930 122 78 1 68851 0 0 .   .
5555 19901231 123 78 1 70079 0 0 .   .
end
format %tq quarter
format %tq quarter1
format %tq quarter2

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#29

24 Jun 2016, 06:59

Once again, in the data example you have posted there are no matches to be found:

In the controls:

Code:

. summ gta_w quarter1 Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- gta_w | 30 98130.5 26156.93 80139 200175 quarter1 | 2 120 0 120 120

And in the master data:

Code:

. summ gta_w quarter1 Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- gta_w | 30 34604.4 5297.923 26550 46069 quarter1 | 2 112.5 12.02082 104 121 . count if quarter1 == 120 0

So in this example data, there is no possibility of a match on gta_w, nor quarter1. Stata appropriately comes up with an empty join. If you can post an example of your data where the same values of gta_w and quarter1 appear in both data sets, then Stata will give an empty join and I can try to experiment with that to see if I can a) reproduce your problem, and b) solve it. But I can't do anything helpful with what you've posted so far.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#30

24 Jun 2016, 13:25

I guess I did not feel like doing what I'm supposed to do today so I downloaded the datasets linked to in #26 and had a look-see.

I suspect that both dataset come from the same source and in both, the data is quarterly. The "1.master.dta" dataset contains public firms while the "1.control.dta" dataset contains data about private firms. There are a few issues with dates and some variables that are derivative of other variables. The following code checks a bunch of things, removes variables that can easily be regenerated and appends both datasets. Once combined, there are no duplicates and there are a number of firms with some quarters public and some quarters private.

Code:

* ------------ check master, remove all derivative variables ------------------
use "1.master.dta", clear

assert m_gta_w == gta_w
drop gta_w

* use Stata numeric dates
gen sdate = string(m_rssd9999,"%15.0g")
gen ndate = daily(sdate,"YMD")
format %td ndate
drop sdate

* check that m_quarter is correctly calculated
gen yq = qofd(ndate)
format %tq yq
assert m_quarter == yq
drop m_quarter 

* the m_qr1 m_1quarter are derivative of the yq variable
isid m_rssd9001 yq, sort
by m_rssd9001: gen one = _n == 1
assert one == m_qr1
assert m_1quarter == yq if one
assert m_1quarter == . if !one
drop one m_qr1 m_1quarter

* the m_qr2 m_2quarter are derivative of the yq variable
by m_rssd9001: gen two = _n == 2
assert two == m_qr2
assert m_2quarter == yq if two
assert m_2quarter == . if !two
drop two m_qr2 m_2quarter

* we need to append to compare the master and controls, adjust names
rename m_* *

* all obs are "listed"
assert listed2 == 1
save "master_clean.dta", replace

* ------------ check control, remove all derivative variables ------------------
use "1.control.dta", clear

* use Stata numeric dates
gen sdate = string(rssd9999,"%15.0g")
gen ndate = daily(sdate,"YMD")
format %td ndate
drop sdate

* check that quarter is correctly calculated
gen yq = qofd(ndate)
format %tq yq
assert quarter == yq
drop quarter

* no obs are "listed"
assert listed2 == 0
save "control_clean.dta", replace

* ------------ combine master and control and check overlap --------------------
append using "master_clean.dta"

* the id variable is derivative
bysort rssd9001: assert id == id[1]
bysort id: assert rssd9001 == rssd9001[1]
drop id

* there are no duplicates
isid rssd9001 yq, sort

* adjust the naming of the variable that identifies public firms
rename listed2 public
tab public

save "combo.dta", replace

* some firm have both private and public observations
bysort rssd9001 (public): gen is_public = public[_N]
by rssd9001: gen is_private = public[1] == 0
tab is_private is_public
by rssd9001: gen rssd_one = _n == 1
tab is_private is_public if rssd_one

The paper cited in #22 is

Code:

Corporate Investment and Stock Market Listing: A Puzzle?
John Asker, Joan Farre-Mensa, Alexander Ljungqvist,
Rev. Financ. Stud. (2015) 28 (2): 342-390.

On page 350, the authors describe the matching procedure. They call it a "caliper-based nearest-neighbor match adapted to a panel setting". They match public firms to private firms in the same year and industry based on total assets. The caliper used requires that the ratio of total assets be less than 2:

max(TApublic, TAprivate) / min(TApublic, TAprivate) <2

The matching is with replacement. Additional variables (characteristics) were considered but the results are based simply on Total Assets in the same year/industry.

In David's data, there is no industry variable. So the task is to find the nearest private firm in terms of size in the same quarter (and within the caliper). Given that there is only one dimension, you can use rangestat (from SSC) to identify the best match per quarter. The following code is a bit complicated and requires careful reading of the rangestat help file and the code used to understand what's going on. The intuition is easy to explain. For each observation of a public firm, find the nearest gta_w value of private firms in the same quarter. This is done in two steps, first you look for firms that are larger (up to twice the size). Second, look for private firms that are smaller, up to half as small. The rest is index arithmetics to get the firm identifier. Note that if there are multiple nearest neighbors, the code will just pick one of them, based on the firm identifier.

Code:

use "combo.dta", clear

* an overall observation identifier
gen long index = _n

* ------- find nearest but larger private firm in the quarter ------------------
* target larger firms, no need to look beyond gta_w * 2, that would exceed
* an assert ratio of 2. Use values that will match nothing for private obs.
gen low = cond(public, gta_w, c(minfloat))
gen high = cond(public, gta_w * 2, c(minfloat))

* we are looking for private gta_w values only
gen gta2use = cond(public, c(maxfloat), gta_w)

* find the smallest gta_w value of private firms in the quarter
rangestat (min) gta_high = gta_w, interval(gta2use low high) by(yq)

* find the index of the matching private firm
replace gta_high = 0 if mi(gta_high)
gen id2use = cond(public, 0, index)
rangestat (max) id_high = id2use, interval(gta2use gta_high gta_high) by(yq)

* the matching private firm for that quarter with the nearest higher gta_w value
gen double rssd_high = rssd9001[id_high]


* ------- find nearest but smaller private firm in the quarter -----------------
* ignore firms half the size, they are outside the desired asset ratio of 2
drop low high
gen low = cond(public, gta_w / 2, c(minfloat))
gen high = cond(public, gta_w, c(minfloat))

* find the largest gta_w value of private firms in the quarter
rangestat (max) gta_low = gta_w, interval(gta2use low high) by(yq)

* find the index of the matching private firm
replace gta_low = 0 if mi(gta_low)
rangestat (max) id_low = id2use, interval(gta2use gta_low gta_low) by(yq)

* the matching private firm for that quarter with the nearest higher gta_w value
gen double rssd_low = rssd9001[id_low]


* --------- select the nearest neighbor using the size difference --------------
gen dhigher = gta_high - gta_w
gen dlower  = gta_w - gta_low
gen double rssd_near = cond(dlower < dhigher, rssd_low, rssd_high)

keep if public
isid rssd9001 yq, sort
save "nearest_by_yq.dta", replace

One of the great thing about rangestat is that all calculations are done per observation. It's quite easy to spot check results using the simplest, albeit least efficient way to address the problem. The following shows that all matches are within the prescribed caliper and checks for obs 15 and 50000:

Code:

* double-check that all matches are within the caliper
use "nearest_by_yq.dta", clear
gen gta_win = cond(dlower < dhigher, gta_low, gta_high)
gen x = max(gta_w, gta_win) / min(gta_w, gta_win)
assert x < 2

* check the observation 15
use "nearest_by_yq.dta", clear
local i = 15
list in `i'
local test_yq = yq[`i']
local test_gta = gta_w[`i']
local test_id = rssd_near[`i']

use if yq == `test_yq' & public == 0 using "combo.dta"
gen dgta = abs(gta_w - `test_gta')
sort dgta rssd9001
keep in 1
list
assert rssd9001 == `test_id'


* check the observation 50000
use "nearest_by_yq.dta", clear
local i = 50000
list in `i'
local test_yq = yq[`i']
local test_gta = gta_w[`i']
local test_id = rssd_near[`i']

use if yq == `test_yq' & public == 0 using "combo.dta"
gen dgta = abs(gta_w - `test_gta')
sort dgta rssd9001
keep in 1
list
assert rssd9001 == `test_id'

The final step is to decide on the final matching private firm, which is the first firm to match. This can be as simple as:

Code:

use "nearest_by_yq.dta", clear

* the data is already sorted but double check
isid rssd9001 yq, sort

* order matches (do not count missing values)
by rssd9001: gen match_i = sum(!mi(rssd_near))

* drop quarters until the first match
drop if match_i == 0

* the overall matched firm is the first found
by rssd9001: gen final_near = rssd_near[1]

Contrary to what happens in the paper, there are no cases where a match was not found in the first quarter of a public firm. The code above would find such cases and drop the initial observation(s) without a match. The paper states that:

If a matched private firm exits the panel, a new match is spliced in.

I'm not sure this is needed here so I skip this step.

I'm attaching datasets for 1990q1 for those who would like to play with the code above.

Attached Files

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment