Match sample of firms with a control sample

krishantha Ainsworth

Join Date: Dec 2020

Posts: 35
#1

Match sample of firms with a control sample

22 Apr 2022, 09:59

Dear Statalist,

I have a group of firms ( a panel). They include my treatment group and some other firms of my population.
I need to match the treatment group with rest of the firms in a way within same industry (variable sic)
and within 10% (plus or minus) of size. So each firm can have one two three or none matching firms.
After that I have to remove rest of the firms that are neither matching nor treatment.

How can I do it ?
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30058
#2

22 Apr 2022, 10:53

The code for matching is moderately complicated and it depends on the details of the data you are starting with. No description in words can provide adequate information. You must show example data, and you must use the -datatex- command to do that in order to get a useful answer. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When you post back with example data from -dataex-, please also indicate whether you want matching with or without replacement. In matching with replacement, the same firm can be a control form for more than one treatment firm. In matching without replacement any control firm can only serve as the control for a single treatment firm. Replacement without matching usually results in more cases having no available match, and there are no statistical advantages to replacement without matching. Some people, however, prefer it for aesthetic reasons that I have never understood.

Also, since you are talking about panel data, can the same treatment firm have different controls at different times? Or does each firm just get a single control that is paired with it for the entire duration.
Comment

krishantha Ainsworth

Join Date: Dec 2020
Posts: 35

26 Apr 2022, 21:59

Dear Clyde,

Please find herewith the example.
In this case Firms X,Y,Z or W are control firms and Firms A to H are the firms that needed to be matched with them based on size and industry.
1) There can be new control firms enter or leave in any given year.
2) There can be non-control firms enter or leave in any given year.
3) There can be more than one non-control firms matched with control firms (e.g in 1991 both firms A and D are matched with firm Z)
4) "

can the same treatment firm have different controls at different times?

" Here I made it possible for same treatment firm have different controls at different times. Yes, that's fine.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int year byte id str6 firm int(sic size) byte(controlfirm matches_sic) int(lowerbound upperbound) byte(wihin_10pct match matchid)
1991  1 "Firm X" 5812 1000 1 .  900 1100 . . .
1991  2 "Firm Y" 5400 2000 1 . 1800 2200 . . .
1991  3 "Firm Z" 5990 3000 1 . 2700 3300 . . .
1991  4 "Firm A" 5990 3200 0 1    .    . 1 1 3
1991  5 "Firm B" 6000 3200 0 0    .    . 0 0 0
1991  6 "Firm C" 5400 2100 0 1    .    . 1 1 2
1991  7 "Firm D" 5400 2050 0 1    .    . 1 1 3
1991  8 "Firm E" 5990 3400 0 1    .    . 0 0 0
1992  1 "Firm X" 5812 1000 1 .  900 1100 . . .
1992  2 "Firm Y" 5400 2000 1 . 1800 2200 . . .
1992  3 "Firm Z" 5990 3000 1 . 2700 3300 . . .
1992  4 "Firm A" 5990 3200 0 1    .    . 1 1 3
1992  5 "Firm B" 6000 3200 0 0    .    . 0 0 0
1992  6 "Firm C" 5400 2100 0 1    .    . 1 1 2
1992  7 "Firm D" 5400 2050 0 1    .    . 1 1 3
1992  9 "Firm H" 5812 1100 0 1    .    . 1 1 1
1993 10 "Firm W" 6200 1500 1 . 1350 1650 . . .
1993  2 "Firm Y" 5400 2000 1 . 1800 2200 . . .
1993  3 "Firm Z" 5990 3000 1 . 2700 3300 . . .
1993  4 "Firm A" 5990 3200 0 1    .    . 1 1 3
1993  5 "Firm B" 6000 3200 0 0    .    . 0 0 0
1993  6 "Firm C" 5400 2100 0 1    .    . 1 1 2
1993  7 "Firm D" 5400 2050 0 1    .    . 1 1 3
1993  9 "Firm H" 5812 1100 0 0    .    . 0 0 0
end

Last edited by krishantha Ainsworth; 26 Apr 2022, 22:01.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30058

26 Apr 2022, 22:44

Code:

//  SEPARATE THE CASES AND CONTROLS
preserve
keep if controlfirm == 1
drop controlfirm
tempfile controls
save `controls'

restore
keep if controlfirm == 0
drop controlfirm

//  MATCH THEM
gen lower = 0.9*size
gen upper = 1.1*size
rangejoin size lower upper using `controls', by(sic year) suffix(0)

//  ELIMINATE THE UNMATCHED
drop if missing(id0)

//  KEEP AT MOST THREE MATCHES
by year firm, sort: keep if _n < = 3

//  RE-ORGANIZE TO LONG
gen year0 = year
gen sic0 = sic
drop upper lower
rename (year-size) =1
gen match_id = id1
reshape long year id firm sic size, i(id1 year1) j(controlfirm)
drop id1 year1
sort id year

Comment

krishantha Ainsworth

Join Date: Dec 2020

Posts: 35
#5

27 Apr 2022, 08:51

Dear Clyde,

Thank you so much for the reply. Here I have a few issues with the code.
1) Where do you compare the non-control firs with the control firms ( whether they are in the same industry and within lower and upper bounds)
2) When I run this code for the full data set it give me an error

Code:

reshape long year id firm sic size, i(companyid1 fyear1) j(franch) (note: j = 0 1) variable id does not uniquely identify the observations Your data are currently wide. You are performing a reshape long. You specified i(companyid1 fyear1) and j(franch). In the current wide form, variable id1 year1 should uniquely identify the observations. Remember this picture:...

Then what I did was removed the duplicates

Code:

duplicates drop id1 year1, force duplicates drop id0 year0, force

After that when I run the reshape code it gives me an output (50/50) matched but it seems some of the control firms are also missing in that output.

For example when I run

Code:

tabulate sic if controlfirm == 1

I can see a drop in control firms at the very beginning and after the last line of the code.
However when I run only

Code:

duplicates drop id1 year1, force

it allows me to run the reshape code but this time number of control firms are higher than the initial.

Ideally what I expect is to have the same number of observations at the beginning and the end and allowing me to recognize matched firms though the variable "match".

Last edited by krishantha Ainsworth; 27 Apr 2022, 08:57.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30058
#6

27 Apr 2022, 09:13

The code in #4 ran without error messages in the example data you provided. In that data, the combination of id and year uniquely identified observations, and the code relied on that. Evidently in your real data, that does not hold. The following code removes that reliance so you won't have to remove anything. That said, based on what you show in #5, it seems that you have some instances where the same firm in the same year appears more than once in the data set and with conflicting information about the other variables. That might mean you have bad data. I suggest you run -duplicates tag id year, gen(flag)- and then -browse if flag- to see these observations and figure out what is going on. If these observations represent data errors, you should remove the incorrect ones. But you should also investigate the data management that led up to the creation of the data set and fix those problems, along with any others you find in the process.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input int year byte id str6 firm int(sic size) byte controlfirm 1991 1 "Firm X" 5812 1000 1 1991 2 "Firm Y" 5400 2000 1 1991 3 "Firm Z" 5990 3000 1 1991 4 "Firm A" 5990 3200 0 1991 5 "Firm B" 6000 3200 0 1991 6 "Firm C" 5400 2100 0 1991 7 "Firm D" 5400 2050 0 1991 8 "Firm E" 5990 3400 0 1992 1 "Firm X" 5812 1000 1 1992 2 "Firm Y" 5400 2000 1 1992 3 "Firm Z" 5990 3000 1 1992 4 "Firm A" 5990 3200 0 1992 5 "Firm B" 6000 3200 0 1992 6 "Firm C" 5400 2100 0 1992 7 "Firm D" 5400 2050 0 1992 9 "Firm H" 5812 1100 0 1993 10 "Firm W" 6200 1500 1 1993 2 "Firm Y" 5400 2000 1 1993 3 "Firm Z" 5990 3000 1 1993 4 "Firm A" 5990 3200 0 1993 5 "Firm B" 6000 3200 0 1993 6 "Firm C" 5400 2100 0 1993 7 "Firm D" 5400 2050 0 1993 9 "Firm H" 5812 1100 0 end // SEPARATE THE CASES AND CONTROLS preserve keep if controlfirm == 1 drop controlfirm tempfile controls save `controls' restore keep if controlfirm == 0 drop controlfirm gen tuple_id = _n // MATCH THEM gen lower = 0.9*size gen upper = 1.1*size rangejoin size lower upper using `controls', by(sic year) suffix(0) // ELIMINATE THE UNMATCHED drop if missing(id0) // KEEP AT MOST THREE MATCHES by tuple_id, sort: keep if _n < = 3 // RE-ORGANIZE TO LONG drop upper lower rename (id firm size) =1 reshape long id firm size, i(tuple_id) j(controlfirm) sort id year

Where do you compare the non-control firs with the control firms ( whether they are in the same industry and within lower and upper bounds)

Look at the -rangejoin- command. That's where that happens.

By the way, I forgot to mention that -rangejoin- is written by Robert Picard and is available from SSC. To use it you also need the -rangestat- command, by Robert Picard, Nick Cox, and Roberto Ferrer. It is also available from SSC.
Comment
krishantha Ainsworth

Join Date: Dec 2020

Posts: 35
#7

27 Apr 2022, 22:05

Dear Clyde,

I had a look into my actual data. When I check for duplicates -duplicates tag id year, gen(flag)- at beginning there are no id year duplicates.
however when I reach to the code -reshape long id firm size, i(tuple_id) j(controlfirm)- somehow it creates duplicates in terms of
year and id1 that I found through -duplicates tag id year, gen(flag)-. They are created while running the code up to reshape.

So the reshape again generates an error like this.

Code:

. reshape long id firm size, i(tuple_id) j(controlfirm) (note: j = 0 1) variable id does not uniquely identify the observations Your data are currently wide. You are performing a reshape long. You specified i(tuple_id) and j(controlfirm). In the current wide form, variable tuple_id should uniquely identify the observations. Remember this picture: long wide +---------------+ +------------------+ | i j a b | | i a1 a2 b1 b2 | |---------------| <--- reshape ---> |------------------| | 1 1 1 2 | | 1 1 3 2 4 | | 1 2 3 4 | | 2 5 7 6 8 | | 2 1 5 6 | +------------------+ | 2 2 7 8 | +---------------+ Type reshape error for a list of the problem observations.

I can remove them by -duplicates drop year id, force- and run reshape but it also reduces contolfirms.

The differences in the actual data than the example are:
1) There can be unmatched control firms in the actual with non-controls as there is no suitable firm to match
3) The number of observations each year could be different.

I'm not sure if above contribute to forming duplicates when reach to the command reshape from the original data.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30058
#8

27 Apr 2022, 22:57

Sorry, this is an error in my code. The part I wrote about re-organizing the data to long would be correct if we were keeping only a single match to each case. And in your example data, that is what happened. But it breaks with multiple matching. The duplicates you are seeing are actually supposed to be there: they are the multiple matches to the same case. So, the code below will properly overcome this, not be removing them, but by distinguishing them for -reshape-:

Code:

// SEPARATE THE CASES AND CONTROLS preserve keep if controlfirm == 1 drop controlfirm tempfile controls save `controls' restore keep if controlfirm == 0 drop controlfirm gen tuple_id = _n // MATCH THEM gen lower = 0.9*size gen upper = 1.1*size rangejoin size lower upper using `controls', by(sic year) suffix(0) // ELIMINATE THE UNMATCHED drop if missing(id0) // KEEP AT MOST THREE MATCHES by tuple_id, sort: keep if _n < = 3 // RE-ORGANIZE TO LONG drop upper lower rename (id firm size) =1 gen _j = id0 reshape long id firm size, i(tuple_id _j) j(controlfirm) drop _j sort id year

Last edited by Clyde Schechter; 27 Apr 2022, 22:59.
Comment
krishantha Ainsworth

Join Date: Dec 2020

Posts: 35
#9

27 Apr 2022, 23:26

That's fine Clayed.

But now when I run this no error pops up.
I get a final data sample exactly double the size of controls and each industry has same number of control and non-control firms.
This is strange ( I'm not sure how I exactly got somethig like that as for some firms there are multiple matches and for some firms there are no matches).

Next, most firms I marked as control in the original sample (i.e controlfirm = 1) ends up as
non-control firms (i.e controlfirm = 0) after processing data .

I also don't get a column "match" which says which non-control firm matches to which control firm.

Last edited by krishantha Ainsworth; 27 Apr 2022, 23:32.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30058
#10

27 Apr 2022, 23:50

I get a final data sample exactly double the size of controls and each industry has same number of control and non-control firms.
This is strange ( I'm not sure how I exactly got somethig like that as for some firms there are multiple matches and for some firms there are no matches).

I think this is likely just a coincidence.

Next, most firms I marked as control in the original sample (i.e controlfirm = 1) ends up as
non-control firms (i.e controlfirm = 0) after processing data .

Ah, you are right. I reversed the coding of the controlfirm variable. I think I did that because usually when I do these matching operations I have a variable for case = 1 control = 0, whereas you are working the other way around. Anyway, I think the simplest solution is to simply

Code:

replace controlfirm = !controlfirm

at the end of the code, rather than making the changes at several places along the way.

I also don't get a column "match" which says which non-control firm matches to which control firm.

Well, I didn't call it match. But the variable tuple_id gives this information. If you sort the data by tuple_id, you will see that each group of observations with the same tuple_id contains one case and up to three controls that are matched to it. If you want a variable that presents this same information differently, as giving the id number of the case to which each control matches, you can do this (after you run the command above to fix the controlfirm variable):

Code:

by tuple_id, sort: egen matching_case = max(cond(controlfirm == 0, id, .)

In most analyses of matched data, however, you will not find this variable useful. Most matched-pair analysis commands (all that I am personally familiar with) identify a group of matches by a tuple_id, not by a variable identifying the case. But perhaps you have some other use for a case-identifier variable.
Comment

krishantha Ainsworth

Join Date: Dec 2020
Posts: 35

#11

28 Apr 2022, 01:31

Originally posted by Clyde Schechter View Post

I think this is likely just a coincidence.

Ah, you are right. I reversed the coding of the controlfirm variable. I think I did that because usually when I do these matching operations I have a variable for case = 1 control = 0, whereas you are working the other way around. Anyway, I think the simplest solution is to simply

Code:

replace controlfirm = !controlfirm

at the end of the code, rather than making the changes at several places along the way.

Well, I didn't call it match. But the variable tuple_id gives this information. If you sort the data by tuple_id, you will see that each group of observations with the same tuple_id contains one case and up to three controls that are matched to it. If you want a variable that presents this same information differently, as giving the id number of the case to which each control matches, you can do this (after you run the command above to fix the controlfirm variable):

Code:

by tuple_id, sort: egen matching_case = max(cond(controlfirm == 0, id, .)

In most analyses of matched data, however, you will not find this variable useful. Most matched-pair analysis commands (all that I am personally familiar with) identify a group of matches by a tuple_id, not by a variable identifying the case. But perhaps you have some other use for a case-identifier variable.

It seems however that's not quite so the issue. Some original controls (i.e. controlfirms == 1) remain the same and some have changed. Another thing to note is that, despite the fact that no errors occur, duplicates are created in this sample after running the code. I think that what makes them equal at the end. There are repeated id year combinations in this sample after running this code. So I believe it has also removed some unmatched control firms.
A sample is here. Here Firm D has four entries for 2000. They match with different firms and two of them are below.

Code:

 tuple_id
controlfirm
year
firm
size
id
sic
matching_case

8810
1
2000
Firm D
85.929
2163
5812
28511

9664
1
2000
Firm D
85.929
2163
5812
30818

4093
1
2000
Firm D
85.929
2163
5812
14292

13928
1
2000
Firm D
85.929
2163
5812
65434










tuple_id
controlfirm
year
firm
size
id
sic
matching_case

8810
0
2000
Firm AB
89.575
28511
5812
28511










tuple_id
controlfirm
year
firm
size
id
sic
matching_case

9664
0
2000
Firm AC
79.37
30818
5812
30818

If this panel full matching is prone to errors is there a way I can match year by year without this rangejoin command ?.

Last edited by krishantha Ainsworth; 28 Apr 2022, 01:36.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30058
#12

28 Apr 2022, 11:00

Some original controls (i.e. controlfirms == 1) remain the same and some have changed.

That does not happen in the example data you showed, and I do not see where the code would have introduced that error. Nevertheless, in the code at bottom, I have revised the code so that it does not change the controlfirm variable at all during processing. This should eliminate this problem. If it does not, please post back with a new data example which exhibits the problem so I can troubleshoot it.

What you show about firm D are not errors. Firm D is a control firm. It is matched to multiple cases. This is what matching with replacement does. I asked you in #2 whether you wanted to match with or without replacement. You didn't respond to that question. Since I prefer matching with replacement (because it is easier to code, leaves fewer cases unmatched, and because it has no statistical disadvantages compared to matching without replacement) I chose to go that route.

That said, it does seem that sometimes control firms are being reused unnecessarily and are "crowding out" some others that "never get a chance" even though they are suitable matches to some case. The code shown at the bottom here mitigates that problem by randomizing which controls are kept for each case. I should have done this in the first place.

So I believe it has also removed some unmatched control firms.

Yes, any unmatched control firms, and any unmatched case firms are removed from the data by the code. I understood this to be what you wanted. And I have not changed it.

If this panel full matching is prone to errors is there a way I can match year by year without this rangejoin command ?

Well, except for the issue with controlfirms being wrong, these are not, as I have just explained, errors. And in any case they have nothing to do with full sample vs year-by-year. In fact, the matching already is year-by-year because I included year in the -by()- option of -rangejoin-. You can verify for yourself in the results that all observations in any tuple have the same value of year. (I had assumed you wanted year by year matching, although looking back over the post I see now that you did not actually say that.)

Code:

* Example generated by -dataex-. For more info, type help dataex clear input int year byte id str6 firm int(sic size) byte controlfirm 1991 1 "Firm X" 5812 1000 1 1991 2 "Firm Y" 5400 2000 1 1991 3 "Firm Z" 5990 3000 1 1991 4 "Firm A" 5990 3200 0 1991 5 "Firm B" 6000 3200 0 1991 6 "Firm C" 5400 2100 0 1991 7 "Firm D" 5400 2050 0 1991 8 "Firm E" 5990 3400 0 1992 1 "Firm X" 5812 1000 1 1992 2 "Firm Y" 5400 2000 1 1992 3 "Firm Z" 5990 3000 1 1992 4 "Firm A" 5990 3200 0 1992 5 "Firm B" 6000 3200 0 1992 6 "Firm C" 5400 2100 0 1992 7 "Firm D" 5400 2050 0 1992 9 "Firm H" 5812 1100 0 1993 10 "Firm W" 6200 1500 1 1993 2 "Firm Y" 5400 2000 1 1993 3 "Firm Z" 5990 3000 1 1993 4 "Firm A" 5990 3200 0 1993 5 "Firm B" 6000 3200 0 1993 6 "Firm C" 5400 2100 0 1993 7 "Firm D" 5400 2050 0 1993 9 "Firm H" 5812 1100 0 end // SEPARATE THE CASES AND CONTROLS preserve keep if controlfirm == 1 tempfile controls save `controls' restore keep if controlfirm == 0 gen tuple_id = _n // MATCH THEM gen lower = 0.9*size gen upper = 1.1*size rangejoin size lower upper using `controls', by(sic year) suffix(1) // ELIMINATE THE UNMATCHED drop if missing(id1) // KEEP AT MOST THREE MATCHES SELECTED AT RANDOM set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED gen double shuffle = runiform() by tuple_id (shuffle), sort: keep if _n < = 3 // RE-ORGANIZE TO LONG drop upper lower rename (id firm size controlfirm) =0 gen matching_case_id = id0 assert controlfirm0 == 0 & controlfirm1 == 1 reshape long id firm size controlfirm, i(tuple_id matching_case_id) drop _j shuffle

Note: The revisions from previous code here are fairly extensive, so I have not put them in bold face. Please use this code as a whole, rather than editing earlier code.
Comment
javes omwansa

Join Date: May 2022

Posts: 2
#13

26 May 2022, 22:45

I am working with an almost similar data set for stunting on 189 cases and 380 controls which i have been trying to match without replacement on cases and gender unsuccessfully. Could you please offer any commands to fit since each time I do i end up with only cases while the controls disappeared? I was working after modifying this command for my situation but it did not work thanks.
local totalcases = 100 // for example local totalcontrols = 10000 local nctl = 4 // I'm choosing 4 controls per case // mock control data set seed 8846 set obs `totalcontrols' gen int id_ctl = _n gen byte case = 0 gen byte age = 20 +ceil(65*runiform()) // broad age range assumed gen int sex = runiform() > 0.5 tempfile controls save `controls' clear // mock case data set obs `totalcases' gen int id = _n gen byte case = 1 gen byte age = 20 +ceil(65*runiform()) gen sex = runiform() > 0.5 tempfile cases save `cases' // ************************************************** ***************************** // Actual solution starts here use `controls' gen rand = runiform() sort rand // random order for controls drop rand save `controls', replace // use `cases' // exact match on sex, within +/- 3 years for age compress rangejoin age -3 3 using `controls', by(sex) drop *_U // clean up rename (id age sex) (id_case age_case sex_case) // clean up // Sample `nctl' controls w/o replacement bysort id_ctl: keep if _n ==1 // use each control only once bysort id_case: keep if _n <= `nctl' // keep up to `nctl' controls for each case // // put the control data onto the file and check it out merge 1:1 id_ctl using `controls', keep(match) rename (age sex) (age_ctl sex_ctl) // Check how many controls were found for every case bysort id_case: gen byte numcontrols = _N if _n ==1 tab numcontrols
Comment

Announcement