Matching/ rangejoin/ Invalid syntax

Aga Smith

Join Date: Jan 2021

Posts: 8
#1

Matching/ rangejoin/ Invalid syntax

04 Nov 2021, 11:21

Delete post

Last edited by Aga Smith; 04 Nov 2021, 11:37. Reason: Delete
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

04 Nov 2021, 11:46

The error message is misleading--it arises pretty deep inside -rangejoin- at a place where rangejoin is not able to figure out what went wrong, but knows that it's in trouble and has to give up.

The actual problem is that it cannot create a new variable named workplace_safety_violator_y_control, because that would be 35 characters, and the maximum length of a variable name is 32 characters. If you rename workplace_safety_violator_y to something shorter, it will run.
1 like
Comment

Aga Smith

Join Date: Jan 2021
Posts: 8

04 Nov 2021, 11:51

Thank you Clyde, I have adjusted the name of variable and now it works.

More importantly I have been trying to match my firms using size (within +/- 30% range) , industry (identical), and year (identical). I aim to match my observations, so that each observation is paired with another and observations can be used only once for matching (i.e. matching without replacement).

However, I am not entirely sure how do I join within 30% deviation from size
Is the following command correct? Is percentage calculated in the memory? The command seems to be incorrect to me, since I am loosing too many observations.

rangejoin size -.3 .3 using `controls', by(sector year) prefix(_control)

My code is :
gen long obs_no = _n
gen byte case =formatch

preserve
keep if case
tempfile cases
save `cases'
restore
keep if !case
tempfile controls
save `controls'

use `cases', clear

rangejoin size -.3 .3 using `controls', by(sector year) suffix(_control)
drop if missing(gvkey_control)

Dataex

. dataex

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double gvkey float(size sector year formatch)
1004740.998 41 1999 0
1004701.854 41 2000 0
1004710.199 41 2001 0
1004686.621 41 2002 0
1004709.292 41 2003 0
1004732.23 41 2004 0
1004978.819 41 2005 1
10041067.633 41 2006 0
10041362.01 41 2007 1
10041377.511 41 2008 0
10041501.042 41 2009 1
10041703.727 41 2010 1
10042195.653 41 2011 1
10042136.9 41 2012 1
10042199.5 41 2013 0
10041515 41 2014 0
10041504.1 41 2016 1
10041524.7 41 2017 0
10341610.435 13 2000 0
10342390.008 13 2001 0
10342296.924 13 2002 0
10342329.268 13 2003 0
10342003.842 13 2004 0
10341623.383 13 2005 0
1034927.239 13 2006 0
10341288.165 13 2007 0
104526213 40 2000 0
104529330 40 2003 1
104528773 40 2004 1
104529495 40 2005 1
104529145 40 2006 1
104528571 40 2007 1
104525175 40 2008 1
104525438 40 2009 1
104525088 40 2010 1
104543771 40 2014 1
104548415 40 2015 1
104551396 40 2017 1
1056248.707 37 2000 0
1056310.252 37 2001 0
1056318.465 37 2002 0
1056330.616 37 2003 0
1056551.391 37 2004 0
1056589.849 37 2005 0
1056638.022 37 2006 0
10721885.098 36 2000 0
10721691.599 36 2001 0
10721700.513 36 2002 0
10721667.877 36 2003 0
10721689.749 36 2004 0
10721675.208 36 2005 0
10721899.536 36 2006 0
10722109.078 36 2007 0
10721872.529 36 2008 0
10722051.492 36 2009 0
10722319.482 36 2010 0
10722468.012 36 2011 0
10722601.995 36 2012 0
10722384.988 36 2013 0
10722459.015 36 2014 0
10722409.819 36 2015 0
10722477.413 36 2016 0
10722672.766 36 2017 0
1076555.292 34 2003 0
1076700.288 34 2004 1
1076858.515 34 2005 0
1076979.606 34 2006 0
10761113.176 34 2007 0
10761233.27 34 2008 0
10761321.456 34 2009 1
10761502.072 34 2010 0
10761735.149 34 2011 0
10761812.929 34 2012 1
10761827.176 34 2013 0
10762456.844 34 2014 1
10762658.875 34 2015 0
10762615.736 34 2016 1
10762692.264 34 2017 0
107824259.1 13 2002 0
107826715.34 13 2003 1
1078 28767.494 13 2004 1
1078 29141.203 13 2005 0
107836178.17 13 2006 0
107839713.93 13 2007 1
107842419.2 13 2008 0
107852416.62 13 2009 1
107859462.27 13 2010 0
107860276.89 13 2011 0
1078 67234.945 13 2012 0
107842953 13 2013 0
107841275 13 2014 0
107841247 13 2015 0
107852666 13 2016 1
107876250 13 2017 1
107867173 13 2018 0
1094188.478 41 2007 0
1094222.243 41 2008 0
1094205.464 41 2009 0
1094231.851 41 2010 0
1094311.665 41 2011 0
end

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

04 Nov 2021, 12:03

No, it's not correct. It is matching firms whose sizes differ (absolutely, not as a percent) by less than 0.3.

-rangejoin- is not designed to do ranges based on ratios (percentage different); it is based on absolute numbers. But you can accomplish it in one of two ways:

1. Generate a new variable containing the log of size. And then match on log_size to within +/- log(.3). So something like:

Code:

gen log_size = log(size) local limit = log(0.3) rangejoin log_size -`limit' `limit' using `controls', by(sector year) suffix(_control)

2. OR, generate new variables representing the upper and lower acceptable size limits and use those variables, rather than constants, in -rangejoin-'s interval:

Code:

gen upper = size*1.3 gen lower = size*0.7 rangejoin size lower upper using `controls', by(sector year) suffix(_control)
1 like
Comment
Aga Smith

Join Date: Jan 2021

Posts: 8
#5

05 Nov 2021, 02:55

Thank you, Clyde. This is really helpful.

So that I have full understanding, would you be able to tell what is the matching technique of range join? Is it nearest neighbor within a range? All I can find in the stata helpline is the following:
rangejoin forms pairwise combinations of observations in memory and observations from the using_dataset when the value of keyvar in the using_dataset is within the range specified by low and high in the data in
memory.

Also, later I attempt to match my observations, so that I achieve one-to-one matching (without replacement). This attempt, however, seems to fail since I still can see duplicates among controls. Would you have any suggestions on how could I fix this?

duplicates report gvkey_control year

Duplicates in terms of gvkey_control year

--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 1593 0
2 | 496 248
3 | 126 84
4 | 48 36
--------------------------------------

My full code:
gen long obs_no = _n
gen byte case =formatch

preserve
keep if case
tempfile cases
save `cases'
restore
keep if !case
tempfile controls
save `controls'

use `cases', clear
*/*
gen log_size = log(size)
local limit = log(0.3)
rangejoin log_size -`limit' `limit' using `controls', by(sector year) suffix(_control)
*/

gen upper = size*1.3
gen lower = size*0.7
rangejoin size lower upper using `controls', by(sector year) suffix(_control)

drop if missing(gvkey_control)

set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
by gvkey year (shuffle1 shuffle2), sort: keep if _n == 1
drop shuffle*

Thank you,
Aga

Last edited by Aga Smith; 05 Nov 2021, 03:24.
Comment
Aga Smith

Join Date: Jan 2021

Posts: 8
#6

05 Nov 2021, 10:32

Following up on my earlier post, I sorted out the duplicates problem using a simple code:

sort gvkey year //since I want to keep as old treatment as possible
by gvkey_control, sort : gen dup = cond(_N==1,0,_n)
drop if dup>1
drop dup

I purposely removed all duplicates of gvkey_control, so that none of the firms acts as control twice.

My question regarding the rangejoin still stands, i.e. would you be able to tell what is the matching technique of rangejoin? Is it nearest neighbor within a range? All I can find in the stata helpline is the following:
rangejoin forms pairwise combinations of observations in memory and observations from the using_dataset when the value of keyvar in the using_dataset is within the range specified by low and high in the data in
memory.

Also, I am quite puzzled with the part of the code that should assure matching without replacement, i.e.

set seed 1234
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
by gvkey year (shuffle1 shuffle2), sort: keep if _n == 1
drop shuffle*

It seems like the code arrives at different number of observations, each time I use it. Is there any other way to impose no-replacement restriction, please?

Thank you.

Last edited by Aga Smith; 05 Nov 2021, 11:14.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#7

05 Nov 2021, 10:41

-rangejoin- links each of your cases to all controls that agree with the case on sector and year and fall within 30% of it on size. It is then up to you to write additional code that selects which of these potential matches you want to retain for your study.

The code you have listed there is designed to select matches with replacement, which means that the same control can be matched to multiple cases. Selection without replacement requires different, slightly more complicated code. To develop that code, I will need example data. I know you used -dataex- in your initial post to give example data, but somehow it is corrupted and does not run. In particular, even though its initial -input- command specifies 5 variables to be read in, the data lines only show four variables. Just visually scanning it, it seems the variable workplace safety violator is not there. So please post back with a corrected -dataex-.

I should add that from a statistical perspective there is no reason to prefer matching without replacement. There is no statistical benefit to preventing the same control from matching to the same case. It is purely an esthetic matter. And it does have possible drawbacks: you may end up with some cases that go unmatched if you prevent matching to a previously used control.

Added: Crossed with #6.

Last edited by Clyde Schechter; 05 Nov 2021, 10:44.
Comment

Aga Smith

Join Date: Jan 2021
Posts: 8

05 Nov 2021, 16:31

Thanks, Clyde.

So suppose I stay with a choice with replacement. When I use the following code to limit the number of controls to one per case, is the selection done randomly? It would explain why I end up with different matches each time.

set seed 1234
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
by gvkey year (shuffle1 shuffle2), sort: keep if _n == 1
drop shuffle*

Here is my code:

gen long obs_no = _n
gen byte case =formatch

preserve
keep if case
tempfile cases
save `cases'
restore
keep if !case
tempfile controls
save `controls'

use `cases', clear
*/*
gen log_size = log(size)
local limit = log(0.3)
rangejoin log_size -`limit' `limit' using `controls', by(sector year) suffix(_control)
*/

gen upper = size*1.3
gen lower = size*0.7
rangejoin size lower upper using `controls', by(sector year) suffix(_control)

drop if missing(gvkey_control)

//keep one control per case
set seed 1234
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
by gvkey year (shuffle1 shuffle2), sort: keep if _n == 1
drop shuffle*

sort gvkey year
by gvkey_control, sort : gen dup = cond(_N==1,0,_n)
drop if dup>1
drop dup

rename year fyear
gen m8=1

save match, replace

and my data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double gvkey float(size sector year formatch)
1004  6.258968 41 2005 1
1034  6.814104 13 2000 0
1034   7.47556 13 2001 0
1034  7.066647 13 2002 0
1034  6.417892 13 2003 0
1034  6.952617 13 2004 0
1034  6.797158 13 2005 0
1034  7.340683 13 2006 0
1034  6.945712 13 2007 0
1045  6.937496 40 2003 1
1056  5.896753 37 2000 0
1056   7.23158 37 2001 0
1056  6.440205 37 2002 0
1056   6.03312 37 2003 0
1056  6.134929 37 2004 0
1056  6.966675 37 2005 0
1056   6.44056 37 2006 0
1072  8.796995 36 2000 0
1072  8.011474 36 2001 0
1072  8.204611 36 2002 0
1072  7.354906 36 2003 0
1072  7.959785 36 2004 0
1072  7.658557 36 2005 0
1072  8.022314 36 2006 0
1072  7.866893 36 2007 0
1072  7.692276 36 2008 0
1072  7.344129 36 2009 0
1072  7.789475 36 2010 0
1072  7.838665 36 2011 0
1072  7.718201 36 2012 0
1072  7.604263 36 2013 0
1072  7.703979 36 2014 0
1072  7.783254 36 2015 0
1072  7.652248 36 2016 0
1072  7.919608 36 2017 0
1076  6.477864 34 2004 1
1078  10.89571 13 2003 1
1094  5.123987 41 2007 0
1094  5.416334 41 2008 0
1094  5.229864 41 2009 0
1094  5.107294 41 2010 0
1094  4.981055 41 2011 0
1094  5.185262 41 2012 0
1094  5.494053 41 2013 0
1094  5.960195 41 2014 0
1094  6.257522 41 2015 0
1094  6.473635 41 2017 0
1161   7.44495 36 2013 1
1209  9.412098 14 2006 1
1213  6.976413 40 2000 0
1213  6.570662 40 2002 0
1230  7.619003 40 2011 1
1239  7.096351  9 2000 0
1239  7.330117  9 2001 0
1239  7.634803  9 2002 0
1239  7.929049  9 2003 0
1239  8.140529  9 2004 0
1239  8.280563  9 2005 0
1239 8.4589205  9 2007 0
1239   7.79599  9 2008 0
1239  7.888255  9 2009 0
1239  7.908348  9 2010 0
1240  9.614973 42 1999 0
1240  9.348515 42 2001 0
1240  9.367451 42 2002 0
1240  8.986947 42 2003 0
1240  9.059108 42 2004 0
1240  9.038346 42 2005 0
1243  9.293663 19 2001 0
1243  9.352709 19 2002 0
1243  9.749476 19 2004 0
1243  9.631033 19 2006 0
1246  7.374483 41 2000 0
1246  6.346355 41 2001 0
1246  7.039108 41 2003 0
1246  6.979948 41 2004 0
1246   7.44773 41 2005 0
1246  7.211398 41 2006 0
1246  7.457497 41 2007 0
1254  6.877424 40 2000 1
1281  5.775748 36 2000 0
1281  6.219883 36 2001 0
1281  5.555331 36 2002 0
1300  10.73355 48 2000 1
1318 10.165283 32 2000 0
1318  9.880318 32 2001 0
1318  9.861051 32 2002 0
1318  9.672207 32 2003 0
1318  9.586236 32 2004 0
1318  9.784776 32 2005 0
1318 10.094337 32 2006 0
1359  8.171508 13 2000 0
1372  4.956957 23 2000 0
1372 4.6138525 23 2001 0
1372 4.2713466 23 2002 0
1408  8.593639  4 2000 0
1408  8.434957  4 2001 0
1408  8.675776  4 2002 0
1408  8.830039  4 2003 0
1408  9.254977  4 2004 0
end

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#9

05 Nov 2021, 18:22

When I use the following code to limit the number of controls to one per case, is the selection done randomly?

Well, if you are using my code with the shuffle1 shuffle2 variables to do it, it is done "randomly" in the sense that the random numbers shuffle1 and shuffle2 determine which control gets selected for each case. But, with the random number seed set, you should get the same results each time you run it. That is, you should if gvkey and year uniquely identify observations in the original data set. In fact, I have just run that code with your example data five times and got identical results each time. Your example data does have observations uniquely identified by gvkey and year. And, in your example data, there are only three cases that find a match at all with -rangejoin-, and each of those finds only a single match, so this is not really a good test case.

If gvkey and year do not uniquely identify the observations in the starting data, then there would be indeterminacy in the selection of the control, and you would need to modify the -by- prefixes in the code to include whatever other variables are needed to uniquely identify the observations at the outset. (Or, make it -by obs_no (shuffle1 shuffle2):-)

If, however, you are using the code with -gen dup = ...-, that code is non-deterministic and can produce different results each time.

By the way, if you want the selected control to be the nearest neighbor on size, then change the code after -rangejoin- to:

Code:

drop if missing(gvkey_control) //keep one control per case set seed 1234 gen double shuffle1 = runiform() gen double shuffle2 = runiform() gen delta = abs(size - size_control) by gvkey year (delta shuffle1 shuffle2), sort: keep if _n == 1 drop shuffle*

This will give you a nearest neighbor match. You still need the random numbers shuffle1 and shuffle2 to randomly select a control in the event that there are cases that have ties for nearest neighbor among the matchable controls. If that should happen, the code will select one of the nearest neighbors at random (but reproducibly).

Last edited by Clyde Schechter; 05 Nov 2021, 18:32.
Comment

Announcement