How can I choose a random sample matched on certain variables?

Chris Sean

Join Date: Mar 2022

Posts: 18
#1

25 Dec 2023, 07:03

I have the following datasets. The first one is called "countdata.dta" and a minimal working example of it is as follows:

Code:

input id num_treat num_control
1 2 5
2 3 1
end

and another dataset called "fulldatafirm.dta" as follows:

Code:

input id firm trait
1 1004 4056
1 1013 4021
1 1072 5064
1 1075 4056
1 1117 4056
1 1121 4056
1 1161 4021
1 1209 5064
1 1210 5064
1 1230 5064
1 1239 4021
1 1254 4021
1 1266 4021
1 1279 4021
1 1300 4021
2 3734 8942
2 3761 8942
2 3814 8942
2 3833 8942
2 3835 8942
2 3851 8942
2 3855 8942
2 3897 7284
2 3937 7284
2 3946 7284
2 3969 7284
2 4001 7284
2 4029 7284
2 4036 5622
2 4049 5622
2 4052 5622
2 4058 5622
2 4077 5622
2 4087 5622
end

The "countdata.dta" tells you how many treatment (=1) and control (=0) firms to randomly select (without replacement) for each id in the "fulldatafirm.dta" dataset. However, the value of "trait" for the control firms MUST be in the union of the "traits" of the randomly selected treatment firms.

Let me illustrate how this works. For id=1, in "countdata.dta", num_treat=2, this means we want to randomly select 2 firms from "fulldatafirm.dta" and set treatment=1. Then, for the same id=1, we have num_control=5, meaning we want to randomly select 5 firms from "fulldatafirm.dta" (which are different from the two we just picked) and set treatment=0. However, the "trait" value of the control firms MUST be contained in the union of the trait values from the treatment firms. An example of one possible randomly selected sample is the following dataset:

Code:

input id firm treatment trait
1 1075 1 4056
1 1161 1 4021
1 1004 0 4056
1 1013 0 4021
1 1117 0 4056
1 1121 0 4056
1 1239 0 4021
2 3833 1 8942
2 3835 1 8942
2 3851 1 8942
2 3734 0 8942
end

So, for id=1, we randomly picked firm=1075 and firm=1161 for treatment=1. Note that firm=1075 has trait=4056 and firm=1161 has trait=4021. This means that for the 5 control firms we randomly select, the trait of these 5 observations must EITHER be 4056 OR 4021. Same goes for id=2.
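
The trait rule can also be checked mechanically on any candidate sample. The following is a minimal sketch, assuming the example sample shown above; the variable name trait_ok is introduced here purely for illustration.

```stata
clear
input id firm treatment trait
1 1075 1 4056
1 1161 1 4021
1 1004 0 4056
1 1013 0 4021
1 1117 0 4056
1 1121 0 4056
1 1239 0 4021
2 3833 1 8942
2 3835 1 8942
2 3851 1 8942
2 3734 0 8942
end

// Within each (id, trait) cell, record whether at least one treated firm is present
bysort id trait: egen byte trait_ok = max(treatment)

// Every control firm must sit in a cell that also contains a treated firm;
// assert stops with an error if any control violates the rule
assert trait_ok == 1 if treatment == 0
```

If a control firm carried a trait that no treated firm in the same id has, the assert would fail and halt the do-file.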

Now, I want to repeat this random selection, say, 1000 times. So at the end, I have 1000 different datasets.

What I have done so far is construct the following code, which does the random selection but without the "trait" restriction (the randomly selected datasets are not saved in this code; that isn't important, I am just showing how each iteration works):

Code:

set seed 1234
local reps 1000

forvalues i = 1/`reps' {
    use countdata, clear
    gen num_tot=num_treat+num_control
    merge 1:m id using fulldatafirm
    keep if _merge == 3
    drop _merge
    duplicates drop

    gen double shuffle = runiform()
    gen byte treatment = .
    gen byte selected_tot = .

    by id (shuffle), sort: replace treatment = _n <= num_treat
    by id (shuffle), sort: replace selected_tot = _n <= num_tot

    keep if selected_tot==1
    keep id firm treatment
}

However, how can I add in the "trait" restriction? The above code just chooses the control sample without ensuring the control firm comes from the same set of "trait" as the randomly selected treatment firm.

Last edited by Chris Sean; 25 Dec 2023, 07:09.

Clyde Schechter

Join Date: Apr 2014
Posts: 30143

25 Dec 2023, 11:47

This should do it:

Code:

clear*
input id    num_treat    num_control
1    2    5
2    3    1
end
tempfile countdata
save `countdata'

clear
input id    firm    trait
1    1004    4056
1    1013    4021
1    1072    5064
1    1075    4056
1    1117    4056
1    1121    4056
1    1161    4021
1    1209    5064
1    1210    5064
1    1230    5064
1    1239    4021
1    1254    4021
1    1266    4021
1    1279    4021
1    1300    4021
2    3734    8942
2    3761    8942
2    3814    8942
2    3833    8942
2    3835    8942
2    3851    8942
2    3855    8942
2    3897    7284
2    3937    7284
2    3946    7284
2    3969    7284
2    4001    7284
2    4029    7284
2    4036    5622
2    4049    5622
2    4052    5622
2    4058    5622
2    4077    5622
2    4087    5622
end
tempfile fulldatafirm
save `fulldatafirm'

set seed 1234
local reps 1000

forvalues i = 1/`reps' {
    use `countdata', clear
    merge 1:m id  using `fulldatafirm', keep(match) nogenerate
    duplicates drop

    gen double shuffle = runiform()
    gen byte selected_tot = .
    
    //    SELECT TREATED GROUP
    by id (shuffle), sort: gen byte treatment = _n <= num_treat
    
    //    CREATE A FRAME WITH THE LEVELS OF TRAIT AMONG THE SELECTED TREATMENT GROUP,
    //    KEPT SEPARATELY FOR EACH ID SO TRAITS SHARED ACROSS IDS DO NOT LEAK
    frame put id trait if treatment, into(allowable)
    frame allowable {
        duplicates drop
    }
    
    //    SELECT CONTROLS RESTRICTING TRAIT TO VALUES IN THE TREATMENT GROUP OF THE SAME ID
    frlink m:1 id trait, frame(allowable)
    gen byte ng = missing(allowable)
    by id (ng treatment shuffle), sort: gen byte control = _n <= num_control ///
        & !ng & !treatment

    keep if treatment | control
    keep id firm treatment trait
    frame drop allowable
    
    // CODE TO SAVE THE SAMPLE GOES HERE
}
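
The placeholder at the end of the loop could be filled in along these lines. This is only a sketch: the replication variable rep, the tempfile allsamples, and the commented-out file name sample_`i' are assumptions introduced here, and the tiny fake sample merely stands in for the sampling steps so the block runs on its own.

```stata
clear all
set seed 1234
local reps 3                       // small number, just for illustration

tempfile allsamples                // hypothetical stacked output file

forvalues i = 1/`reps' {
    // Stand-in for the sampling steps: a tiny fake sample
    clear
    set obs 4
    gen id = 1
    gen firm = 1000 + _n
    gen byte treatment = _n <= 2
    gen trait = 4056

    gen int rep = `i'              // tag the replication number

    // Option A: one file per replication (file name is an assumption)
    // save "sample_`i'", replace

    // Option B: stack all replications into a single file
    if `i' > 1 {
        append using `allsamples'
    }
    save `allsamples', replace
}

use `allsamples', clear
tab rep
```

Stacking into one file with a rep identifier tends to be easier to work with afterwards than 1000 separate files, since later analysis can run with a by rep: prefix, but either option should work.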

Note: There is a simpler way to do this if the values of num_treat and num_control are always small, but it does not extend to large numbers. The code above is a bit more complicated, but it is entirely general in this regard.

You do not say how you want to handle the situation, which could easily arise and probably will, where too few controls with the requisite values of trait are available to select num_control of them. This code resolves that situation by selecting as many as are available, leaving the sample a bit short. I chose that not because it is necessarily the best way to handle this, but because it was the easiest from a coding perspective.
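
To see how often a sample comes up short, the selected controls can be counted per id and compared with num_control. A minimal sketch, assuming a toy sample in place of the real one; the names n_control, shortfall, and first are introduced here. Inside the loop, the same three commands would have to run after control is created and before the final keep id firm treatment trait, while num_control is still in memory.

```stata
clear
input id num_control treatment control
1 5 1 0
1 5 1 0
1 5 0 1
1 5 0 1
1 5 0 1
2 1 1 0
2 1 0 1
end

// Count the controls actually selected within each id
bysort id: egen int n_control = total(control)

// Positive values mean fewer controls were found than requested
gen int shortfall = num_control - n_control

// Show one row per id that fell short
egen byte first = tag(id)
list id num_control n_control shortfall if first & shortfall > 0, noobs
```

In this toy data id 1 asks for 5 controls but has only 3 selected, so it is the id that gets listed. Whether to accept such short samples, redraw the treated firms, or drop the id is a design decision the code leaves open.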

Chris Sean

Join Date: Mar 2022

Posts: 18
#3

27 Dec 2023, 05:57

This is perfect, thank you Clyde.