  • How can I choose a random sample matched on certain variables?

    I have the following datasets. The first one is called "countdata.dta" and a minimal working example of it is as follows:

    Code:
    input id    num_treat    num_control
    1    2    5
    2    3    1
    and another dataset called "fulldatafirm.dta" as follows:

    Code:
    input id    firm    trait
    1    1004    4056
    1    1013    4021
    1    1072    5064
    1    1075    4056
    1    1117    4056
    1    1121    4056
    1    1161    4021
    1    1209    5064
    1    1210    5064
    1    1230    5064
    1    1239    4021
    1    1254    4021
    1    1266    4021
    1    1279    4021
    1    1300    4021
    2    3734    8942
    2    3761    8942
    2    3814    8942
    2    3833    8942
    2    3835    8942
    2    3851    8942
    2    3855    8942
    2    3897    7284
    2    3937    7284
    2    3946    7284
    2    3969    7284
    2    4001    7284
    2    4029    7284
    2    4036    5622
    2    4049    5622
    2    4052    5622
    2    4058    5622
    2    4077    5622
    2    4087    5622
    The "countdata.dta" tells you how many treatment (=1) and control (=0) firms to randomly select (without replacement) for each id in the "fulldatafirm.dta" dataset. However, the value of "trait" for the control firms MUST be in the union of the "traits" of the randomly selected treatment firms.

    Let me illustrate how this works. For id=1, in "countdata.dta", num_treat=2, this means we want to randomly select 2 firms from "fulldatafirm.dta" and set treatment=1. Then, for the same id=1, we have num_control=5, meaning we want to randomly select 5 firms from "fulldatafirm.dta" (which are different from the two we just picked) and set treatment=0. However, the "trait" value of the control firms MUST be contained in the union of the trait values from the treatment firms. An example of one possible randomly selected sample is the following dataset:

    Code:
    input id    firm    treatment    trait
    1    1075    1    4056
    1    1161    1    4021
    1    1004    0    4056
    1    1013    0    4021
    1    1117    0    4056
    1    1121    0    4056
    1    1239    0    4021
    2    3833    1    8942
    2    3835    1    8942
    2    3851    1    8942
    2    3734    0    8942
    So, for id=1, we randomly picked firm=1075 and firm=1161 for treatment=1. Note that firm=1075 has trait=4056 and firm=1161 has trait=4021. This means that for the 5 control firms we randomly select, the trait of these 5 observations must EITHER be 4056 OR 4021. Same goes for id=2.
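The restriction described above can be checked mechanically on any candidate sample. The following is a minimal sketch, not part of the original post, that re-enters the example sample and verifies it. The variable name has_treated is hypothetical and introduced only for illustration.

```stata
clear
input id firm treatment trait
1 1075 1 4056
1 1161 1 4021
1 1004 0 4056
1 1013 0 4021
1 1117 0 4056
1 1121 0 4056
1 1239 0 4021
2 3833 1 8942
2 3835 1 8942
2 3851 1 8942
2 3734 0 8942
end

// each firm may appear only once within an id (selection without replacement)
isid id firm

// has_treated (hypothetical name) = 1 if at least one treated firm
// in this id has this trait value
bysort id trait: egen byte has_treated = max(treatment)

// every control firm must share its trait with some treated firm of the same id
assert has_treated == 1 if treatment == 0
```

If a control firm had a trait that no treated firm in its id has, the final assert would stop with an error, so the same lines could be run on each simulated sample as a sanity check.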

    Now, I want to repeat this random selection, say, 1000 times. So at the end, I have 1000 different datasets.

    What I have so far is the following code, which does the random selection but without the "trait" restriction. (The randomly selected dataset is not saved in each iteration; that isn't important here, I am just showing how each iteration works.)

    Code:
    set seed 1234
    local reps 1000
    
    
    forvalues i = 1/`reps' {
        use countdata, clear
        gen num_tot = num_treat + num_control
        merge 1:m id using fulldatafirm
        keep if _merge == 3
        drop _merge
        duplicates drop
    
        gen double shuffle = runiform()
        gen byte treatment = .
        gen byte selected_tot = .
    
        by id (shuffle), sort: replace treatment = _n <= num_treat
        by id (shuffle), sort: replace selected_tot = _n <= num_tot
    
        keep if selected_tot == 1
        keep id firm treatment
    }
    However, how can I add in the "trait" restriction? The above code just chooses the control sample without ensuring that each control firm's "trait" is among the "trait" values of the randomly selected treatment firms.
    Last edited by Chris Sean; 25 Dec 2023, 07:09.

  • #2
    This should do it:
    Code:
    clear*
    input id    num_treat    num_control
    1    2    5
    2    3    1
    end
    tempfile countdata
    save `countdata'
    
    clear
    input id    firm    trait
    1    1004    4056
    1    1013    4021
    1    1072    5064
    1    1075    4056
    1    1117    4056
    1    1121    4056
    1    1161    4021
    1    1209    5064
    1    1210    5064
    1    1230    5064
    1    1239    4021
    1    1254    4021
    1    1266    4021
    1    1279    4021
    1    1300    4021
    2    3734    8942
    2    3761    8942
    2    3814    8942
    2    3833    8942
    2    3835    8942
    2    3851    8942
    2    3855    8942
    2    3897    7284
    2    3937    7284
    2    3946    7284
    2    3969    7284
    2    4001    7284
    2    4029    7284
    2    4036    5622
    2    4049    5622
    2    4052    5622
    2    4058    5622
    2    4077    5622
    2    4087    5622
    end
    tempfile fulldatafirm
    save `fulldatafirm'
    
    set seed 1234
    local reps 1000
    
    forvalues i = 1/`reps' {
        use `countdata', clear
        merge 1:m id  using `fulldatafirm', keep(match) nogenerate
        duplicates drop
    
        gen double shuffle = runiform()
        gen byte selected_tot = .
        
        //    SELECT TREATED GROUP
        by id (shuffle), sort: gen byte treatment = _n <= num_treat
        
        //    CREATE A FRAME WITH THE LEVELS OF TRAIT AMONG THE SELECTED TREATMENT GROUP,
        //    KEPT SEPARATELY FOR EACH ID SO THAT TRAITS SHARED ACROSS IDS DO NOT LEAK
        frame put id trait if treatment, into(allowable)
        frame allowable {
            duplicates drop
        }
        
        //    SELECT CONTROLS RESTRICTING TRAIT TO VALUES IN THE TREATMENT GROUP OF THE SAME ID
        frlink m:1 id trait, frame(allowable)
        gen byte ng = missing(allowable)
        by id (ng treatment shuffle), sort: gen byte control = _n <= num_control ///
            & !ng & !treatment
    
        keep if treatment | control
        keep id firm treatment trait
        frame drop allowable
        
        // CODE TO SAVE THE SAMPLE GOES HERE
    }
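The placeholder comment at the end of the loop could be filled in several ways. As a minimal sketch, assuming the goal is one stacked file with a replication identifier, the following uses a hypothetical tempfile allsamples and a hypothetical variable rep. A small toy dataset stands in for the selected sample so that the snippet runs on its own.

```stata
clear
set seed 1234
local reps 3                       // small number, for illustration only

tempfile allsamples                // hypothetical collector file
forvalues i = 1/`reps' {
    // stand-in for the sample produced by the selection code in the loop
    clear
    set obs 4
    gen id = ceil(_n/2)
    gen firm = 1000 + _n
    gen byte treatment = runiform() < 0.5

    // tag the rows with the replication number (hypothetical variable name)
    gen int rep = `i'

    // stack this replication on top of the ones collected so far
    if `i' > 1 append using `allsamples'
    save `allsamples', replace
}

use `allsamples', clear
sort rep id firm
```

If separate datasets are preferred, the two stacking lines could instead be replaced by a single save with the replication number in the file name. Appending to a growing file gets slower as reps increases, so for many large samples separate files may be the better choice.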
    Note: There is a simpler way to do this if the values of num_treat and num_control are always small, but it does not extend to large numbers. The code above is a bit more complicated, but it is entirely general in this regard.

    You do not say how you want to handle the situation, which could easily arise and probably will, where too few controls with the requisite values of trait are available to select num_control of them. This code resolves that situation by selecting as many as are available, leaving the sample a bit short. I chose that not because it is necessarily the best way to handle it, but because it was the easiest from a coding perspective.
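To see how often the sample falls short, the requested and obtained numbers of controls can be compared within each id. The sketch below uses a small made-up dataset that mimics the state of the data just before the final keep statements; the firm rows and the variables n_control, shortfall and tag are hypothetical and for illustration only.

```stata
clear
input id firm num_control treatment control
1 1075 5 1 0
1 1004 5 0 1
1 1013 5 0 1
2 3833 1 1 0
2 3734 1 0 1
end

// number of controls actually selected in each id
by id, sort: egen n_control = total(control)

// flag ids where fewer controls were selected than requested
gen byte shortfall = n_control < num_control

// one line per id showing requested versus obtained counts
egen byte tag = tag(id)
list id num_control n_control shortfall if tag, noobs
```

In the loop, these lines would have to run before keep id firm treatment trait, because that statement drops num_control and control.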



    • #3
      This is perfect, thank you Clyde.
