I have two datasets. The first is called "countdata.dta" and the second is called "fulldatafirm.dta". A minimal working example of each is given in the code blocks below.
"countdata.dta" tells you how many treatment (=1) and control (=0) firms to randomly select (without replacement) for each id in "fulldatafirm.dta". However, the value of "trait" for each control firm MUST be in the union of the "trait" values of the randomly selected treatment firms for that id.
Let me illustrate how this works. For id=1, "countdata.dta" has num_treat=2, which means we want to randomly select 2 firms with id=1 from "fulldatafirm.dta" and set treatment=1. For the same id=1, we have num_control=5, meaning we want to randomly select 5 more firms with id=1 from "fulldatafirm.dta" (different from the two we just picked) and set treatment=0. However, the "trait" value of each control firm MUST be contained in the union of the trait values of the treatment firms. One possible randomly selected sample is the example dataset with the treatment variable shown further below.
So, for id=1, we randomly picked firm=1075 and firm=1161 for treatment=1. Note that firm=1075 has trait=4056 and firm=1161 has trait=4021. This means that each of the 5 control firms we randomly select must have trait EITHER 4056 OR 4021. The same goes for id=2: all three treatment firms have trait=8942, so the single control firm must have trait=8942 as well.
Now, I want to repeat this random selection, say, 1000 times. So at the end, I have 1000 different datasets.
So far I have written the code shown in the last code block below, which does the random selection but without the "trait" restriction. Each randomly selected dataset is not saved in that code; that isn't important here, since I am only showing how each iteration works.
How can I add the "trait" restriction? My code just chooses the control sample without ensuring that each control firm has a "trait" value found among the randomly selected treatment firms for the same id.
Code:
* countdata.dta: number of treatment and control firms to draw for each id
clear
input id num_treat num_control
1 2 5
2 3 1
end
Code:
* fulldatafirm.dta: all candidate firms and their trait, by id
clear
input id firm trait
1 1004 4056
1 1013 4021
1 1072 5064
1 1075 4056
1 1117 4056
1 1121 4056
1 1161 4021
1 1209 5064
1 1210 5064
1 1230 5064
1 1239 4021
1 1254 4021
1 1266 4021
1 1279 4021
1 1300 4021
2 3734 8942
2 3761 8942
2 3814 8942
2 3833 8942
2 3835 8942
2 3851 8942
2 3855 8942
2 3897 7284
2 3937 7284
2 3946 7284
2 3969 7284
2 4001 7284
2 4029 7284
2 4036 5622
2 4049 5622
2 4052 5622
2 4058 5622
2 4077 5622
2 4087 5622
end
Code:
* one possible randomly selected sample that satisfies the trait restriction
clear
input id firm treatment trait
1 1075 1 4056
1 1161 1 4021
1 1004 0 4056
1 1013 0 4021
1 1117 0 4056
1 1121 0 4056
1 1239 0 4021
2 3833 1 8942
2 3835 1 8942
2 3851 1 8942
2 3734 0 8942
end
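To make the rule concrete, here is a minimal sketch of how a drawn sample could be checked against the trait restriction. It uses only the id=1 rows of the example sample above, and the variable name trait_treated is made up for illustration.

```stata
clear
input id firm treatment trait
1 1075 1 4056
1 1161 1 4021
1 1004 0 4056
1 1013 0 4021
1 1117 0 4056
1 1121 0 4056
1 1239 0 4021
end

* trait_treated = 1 if at least one treatment firm in this id has this trait
bysort id trait: egen byte trait_treated = max(treatment)

* every control firm must share its trait with some treatment firm of the same id
assert trait_treated == 1 if treatment == 0

* no firm may be picked twice within an id (selection without replacement)
isid id firm
```

If a control firm had a trait that no treatment firm in its id carries (for example trait=5064 here), the first assert would fail.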
Code:
* current attempt: random selection WITHOUT the trait restriction
set seed 1234
local reps 1000
forvalues i = 1/`reps' {
    use countdata, clear
    gen num_tot = num_treat + num_control
    merge 1:m id using fulldatafirm
    keep if _merge == 3
    drop _merge
    duplicates drop
    * random order within id: first num_treat firms are treatment, next num_control are control
    gen double shuffle = runiform()
    gen byte treatment = .
    gen byte selected_tot = .
    by id (shuffle), sort: replace treatment = _n <= num_treat
    by id (shuffle), sort: replace selected_tot = _n <= num_tot
    keep if selected_tot == 1
    keep id firm treatment
}
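One possible way to layer the trait restriction on top of the loop above is to draw in two stages: first the treatment firms, then the control firms from only those remaining firms whose trait appears among the drawn treatment firms. The following is a sketch under assumptions, not a tested solution for the real data: the variable names trait_ok, eligible, shuffle2 and control and the output file names sample_1.dta, sample_2.dta, ... are made up for illustration, and it assumes countdata.dta and fulldatafirm.dta are in the working directory.

```stata
set seed 1234
local reps 1000

forvalues i = 1/`reps' {
    use countdata, clear
    merge 1:m id using fulldatafirm, keep(match) nogenerate

    * stage 1: random order within id; the first num_treat firms are treatment
    gen double shuffle = runiform()
    sort id shuffle firm
    by id: gen byte treatment = _n <= num_treat

    * flag traits that occur among the drawn treatment firms of each id
    bysort id trait: egen byte trait_ok = max(treatment)

    * stage 2: only untreated firms with a flagged trait can be controls
    gen byte eligible = (treatment == 0) & (trait_ok == 1)
    gen double shuffle2 = runiform()
    bysort id eligible (shuffle2 firm): gen byte control = eligible & (_n <= num_control)

    keep if treatment == 1 | control == 1
    keep id firm treatment trait

    * hypothetical output name: one file per replication
    save sample_`i', replace
}
```

One caveat with this sketch: if an id has fewer eligible firms than num_control in some draw (for id=1, this happens when both treatment firms have trait=5064, which leaves only two eligible firms), that replication ends up with fewer controls than requested. Whether to accept, redraw, or drop such replications is a choice the sketch does not make.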