Randomly assigning treatment at a proportion to a variable

Jordan Fisher

Join Date: Apr 2019

Posts: 5
#1

17 Apr 2019, 16:08

Hello. I have a variable X with, say, 10000 observations. I want to randomly assign a treatment value of 1 to this variable but cap the number of treatments to 50%, or 5000, observations. How can I do so in my do-file?

Thank you!
Joseph Coveney

Join Date: Apr 2014

Posts: 4422
#2

17 Apr 2019, 17:00

Maybe something like the following.

Code:

set seed 1234
generate double randu = runiform()
isid randu
sort randu
replace X = 1 in 1/5000
Jordan Fisher

Join Date: Apr 2019

Posts: 5
#3

17 Apr 2019, 18:31

STATA reads that the 5000 is an 'invalid observation number'. Is there a workaround to this?

Thanks for your help, Joseph.
Joseph Coveney

Join Date: Apr 2014

Posts: 4422
#4

17 Apr 2019, 18:45

Originally posted by Jordan Fisher:

STATA reads that the 5000 is an 'invalid observation number'.

Stata won't say that if what you said above is true. See below.

. 
. version 15.1

. 
. clear *

. 
. /* "I have a variable X with, say, 10000 observations" */
. quietly set obs 10000

. generate byte X = 0

. 
. /* "I want to randomly assign a treatment value of 1 to this variable 
>     but cap the number of treatments to 50%, or 5000, observations" */
. 
. set seed 1234

. generate double randu = runiform()

. isid randu

. sort randu

. replace X = 1 in 1/5000
(5,000 real changes made)

. 
. tabulate X

          X |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      5,000       50.00       50.00
          1 |      5,000       50.00      100.00
------------+-----------------------------------
      Total |     10,000      100.00

. 
. exit

end of do-file

.
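
The range `in 1/5000` only works when at least 5,000 observations are in memory, so a dataset smaller than the 10,000 described is one possible cause of the error reported in #3, though the thread does not confirm it. Below is a minimal sketch that avoids the hard-coded count by deriving it from `_N`. The variable name `treated` and the 50% share are illustrative assumptions, and the simulated data merely stand in for a real dataset.

```stata
// Stand-in data; with a real dataset in memory, skip these two lines.
clear
set obs 10000

set seed 1234
generate double randu = runiform()
sort randu

// Treat the first half of the randomly ordered observations.
// floor() keeps the treated count at or below 50% when _N is odd.
generate byte treated = (_n <= floor(_N/2))
tabulate treated
```

Because the cutoff is computed from `_N`, the same lines work unchanged whatever the number of observations.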
Jordan Fisher

Join Date: Apr 2019

Posts: 5
#5

17 Apr 2019, 19:09

Got it, thanks for your help.
Aditi Roy

Join Date: Jul 2017

Posts: 29
#6

07 Nov 2019, 18:28

Hi, my problem is very similar to this one, yet different.

I have a variable X with, say, 12000 observations. I want to randomly assign a treatment value of 1 to this variable but cap a different number of treatments in different years.
I want to cap
29 variables in 1990
42 variables in 1991
57 variables in 1992 and so on.

I am doing this manually by looking up the observation numbers, but how can I do it with code alone (without looking up the observation numbers)? How can I do so in my do-file?

Code:

generate random = uniform()
sort year random
gen treat = 0
replace treat = 1 in 1/29        // 29 treatments in 1990
replace treat = 1 in 296/337     // 42 treatments in 1991
replace treat = 1 in 1423/1479   // 57 treatments in 1992

Thanks
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#7

08 Nov 2019, 09:15

Your terminology of "cap" "variables" is quite unusual to me. Translating that, I take your goal to be: "Of all the observations with year = YYYY, randomly assign some stipulated number of the observations to have the treatment variable set at 1, with the remainder set at 0." You were on the right track. Here's one way to do what I think you want:

Code:

// Simulate data for illustration purposes.
clear
set obs 12000
gen int year = 1990 + ceil(20* runiform())
//
// Randomly assign treatment within each year.
generate random= uniform()
gen byte treat = .
sort year random
by year: replace treat = (_n <=29) if (year == 1991)
by year: replace treat = (_n <=42) if (year == 1992)
// etc.
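
Writing one -replace- line per year gets long when there are many years. Here is a sketch of an alternative, assuming the per-year counts can be held in a variable. The name `ntreat` is hypothetical and is filled in by hand here for three simulated years; in practice it might be merged in from a small year-level file.

```stata
// Simulated stand-in data with three years.
clear
set obs 12000
set seed 2019
gen int year = 1990 + floor(3 * runiform())

// Hypothetical per-year treatment counts (29, 42, 57 as in #6).
gen int ntreat = .
replace ntreat = 29 if year == 1990
replace ntreat = 42 if year == 1991
replace ntreat = 57 if year == 1992

// Random order within year, then treat the first ntreat observations.
gen double random = runiform()
bysort year (random): gen byte treat = (_n <= ntreat)

// Each year's count of treat == 1 should equal its ntreat value.
tabulate year treat
```

The single -bysort- line replaces the whole stack of year-specific -replace- commands, and no observation numbers need to be looked up.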

Aditi Roy

Join Date: Jul 2017
Posts: 29
#8

08 Nov 2019, 23:11

Thanks for your reply Mike Lacy . I already have a year variable in my dataset. I am sorry but can you please explain to me what is this line performing? Also, as I already have a year variable in my dataset and thus STATA returns "Variable year already defined."

Code:

  
 gen int year = 1990 + ceil(20* runiform())

Further, I do want to simulate the regression after randomly assigning treatment for 1000 times. So I do the following without including the above-mentioned code, as the variable year is already there (and even if I do include it, Stata issues "Variable year already defined" in addition to the other error). However, Stata is issuing the following message: "an error occurred when simulate executed my_model", and it is only returning 1 observation.

Code:

set matsize 8000
set more off
cap program drop my_model
program define my_model,rclass


set obs 12000
set seed 8600
gen int year = 1990 + ceil(20* runiform())
generate random=uniform()
gen byte treat= .
sort year random
by year: replace treat= (_n <=29) if (year == 1990)    
by year: replace treat= (_n <=42) if (year == 1991)    
by year: replace treat= (_n <=42) if (year == 1992)    
.
.
.
.
.
.
.
by year: replace treat= (_n <=19) if (year == 2010)  


reg wheat treat i.country_id i.year i.country_id#c.line_time_trend [aw=ypop], cluster( country_id)
return scalar coeff = _b[treat]
return scalar se = _se[treat]
return scalar r2 = e(r2)

exit
end
    

simulate  coeff=r(coeff) se=r(se) r2=r(r2), reps(1000): my_model    

    
sum
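
The thread does not pin down why -simulate- failed here, but a few things in the program above would cause trouble in any case: `gen int year` fails when `year` already exists, `set seed` inside the program restarts the random-number stream on every call so that each replication would draw the same assignment, and `generate random` would fail on a second call if the variable were still in memory from the first. Below is a hedged sketch of a program that avoids these points. It uses simulated stand-in data and a plain regression; the variables `x` and `y` are made up, and the weights, fixed effects and clustering of the original command are left out.

```stata
clear all
set seed 8600          // set once, outside the program

// Hypothetical stand-in data with three years.
set obs 3000
gen int year = 1990 + floor(3 * runiform())
gen x = runiform()
gen y = x + rnormal()

capture program drop my_model
program define my_model, rclass
    // Remove variables left over from the previous replication.
    capture drop random treat
    generate double random = runiform()
    bysort year (random): generate byte treat = (_n <= 29) if year == 1990
    bysort year (random): replace treat = (_n <= 42) if year == 1991
    bysort year (random): replace treat = (_n <= 57) if year == 1992
    regress y x treat
    return scalar coeff = _b[treat]
    return scalar se = _se[treat]
    return scalar r2 = e(r2)
end

simulate coeff=r(coeff) se=r(se) r2=r(r2), reps(100) nodots: my_model
summarize
```

After -simulate- finishes, the data in memory hold one row per replication, so -summarize- describes the spread of the coefficient across random assignments.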


Mike Lacy

Join Date: Apr 2014

Posts: 2416
#9

09 Nov 2019, 10:01

"can you please explain to me what is this line performing" : Per the comment at the top of my code, that was part of simulating some data for *illustration.*

"Further, I do want to simulate the regression after randomly assigning treatment for 1000 times": That's an entirely different question than what you originally asked, which I had no way to know.

Per the StataList FAQ: If you want us to help you know why you are getting certain error messages, you must show us exactly what you typed, and exactly how Stata responded. I don't think that matters now for your situation, given my answer below, but you'll want to know that for the future.

Leaving everything before aside:

I'm guessing now that you want to repeatedly run your regression command while randomly assigning the treatment variable within year, but keeping your stipulated within-year distribution on the treatment variable. You want to see how your parameter estimates and so forth vary across this random shuffling of the treatment variable.

This could be done with -simulate-, I suppose, but as it happens, this is a well-known kind of procedure, known as a "permutation test." Stata implements this procedure with the built-in -permute- command. See -help permute-. This command will repeatedly shuffle the values of your treatment variable, run a command, and save some chosen results from that command into a Stata data set. This shuffling can be done within strata, which is the year variable in your case. After this is complete, you can then inspect a file of the estimates obtained at each repetition.

Here is an *illustration*, using just three years of simulated data. If this is not what you want, you will need to get some help from a colleague in more clearly describing your problem and goal.

Code:

// Create some "fake" data to illustrate the procedure.
// This is not anything you need to do but rather just some data to work with.
clear
set obs 12000
gen int year = 1990 + floor(3* runiform())
gen x = runiform() // Just one other explanatory variable here.
gen y = rnormal()  // An arbitrary outcome variable so that the regression below can run.
// Create a treatment variable with the 0/1 distribution you want,
// I'm *guessing* that a treatment variable distributed in this way
// exists already in your data set.  If that is true, you will not include
// the following. I had to include the next four lines to make data like yours.
sort year
by year: gen byte treat = (_n <=29) if (year == 1990)
by year: replace treat = (_n <=42) if (year == 1991)
by year: replace treat = (_n <=57) if (year == 1992)
// end of creating fake data for illustration.
//
// -permute- does everything for you.
permute treat bt = _b[treat] se = _se[treat] r2 = e(r2), ///
   reps(1000) strata(year) saving("YourChosenOutputFile.dta"): ///
   reg y x treat
// Look at the results from each repetition
clear
use "YourChosenOutputFile.dta"
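
To make the within-year shuffling concrete, here is a rough sketch of a single permutation done by hand for a 0/1 treatment. It only illustrates the idea and is not a description of how -permute- is implemented. The names `ntr`, `u` and `treat_perm` are hypothetical, and the data are simulated.

```stata
clear
set obs 3000
set seed 4321
gen int year = 1990 + floor(3 * runiform())

// An initial 0/1 treatment with fixed counts per year.
sort year
by year: gen byte treat = (_n <= 29) if year == 1990
by year: replace treat = (_n <= 42) if year == 1991
by year: replace treat = (_n <= 57) if year == 1992

// Count the treated observations in each year.
bysort year: egen int ntr = total(treat)

// Give the same number of 1s to randomly chosen observations within year.
gen double u = runiform()
bysort year (u): gen byte treat_perm = (_n <= ntr)

// Both tables should show the same count of 1s in each year.
tabulate year treat
tabulate year treat_perm
```

The per-year totals are unchanged by the shuffle; only which observations carry the 1s differs, which is the sense in which the stipulated within-year distribution is kept.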

Last edited by Mike Lacy; 09 Nov 2019, 10:04.

Aditi Roy

Join Date: Jul 2017
Posts: 29

#10

09 Nov 2019, 23:22

Thanks a lot, Mike Lacy. Yes, this is what I want to execute. I wanted to assign the treatment randomly within years (with a cap) each time and run the regression 1000 times. I will try to be more explicit, and sorry for not posting the entire code. I will keep this in mind for the future.

Stata is issuing ---

weights not allowed
r(101);

as we cannot include weights with permuting.

How do I get around this? Your help is really appreciated. I tried the following code with three years:

Code:

  
set matsize 8000, permanently
set obs 12000
set seed 8600
//gen int year = 1990 + floor(3* runiform())  // As an year variable already exists in my dataset.
generate random=uniform()
gen byte treat= .
sort year random
by year: replace treat= (_n <=29) if (year == 1990)    
by year: replace treat= (_n <=42) if (year == 1991)    
by year: replace treat= (_n <=42) if (year == 1992)    
 

permute treat bt = _b[treat] se = _se[treat] p_value = (2*ttail(e(df_r), abs(_b[treat]/_se[treat])))r2 = e(r2), ///  
reps(1000) strata(year) saving("OutputFile.dta"): ///
reg wheat treat i.country_id i.year i.country_id#c.line_time_trend [aw=ypop], cluster( country_id)  
  

clear
use "OutputFile.dta"

Last edited by Aditi Roy; 09 Nov 2019, 23:30.


Mike Lacy

Join Date: Apr 2014

Posts: 2416
#11

10 Nov 2019, 12:53

Searching in -help permute- (did you try this?) shows that Stata does not allow weights for -permute-, although there is a -force- option to allow the use of weights. The help says:
"force suppresses the restriction that command may not specify weights or be a svy command. permute is not suited for weighted estimation, thus permute should not be used with weights or svy. permute reports an error when it encounters weights or svy in command if the force option is not specified. This is a seldom used option, so use it only if you know what you are doing!"

Almost always, if Stata will not do something like this, there is a good reason, i.e., the professional statisticians at Stata Corp think it's not a valid approach. This would apply to your do-it-yourself approach to permutation methods as well as Stata's built-in approach. I'd suggest several things you might try:

1) Investigate the literature on using weighting in the context of a permutation test. Good keywords would include "permutation test" or "randomization test" with "weighting." That might help you find out why permutation methods with weights are generally not valid, or perhaps some limited circumstances under which they are.

2) Reconsider the need for a permutation test. While I'm a fan of them, perhaps that sort of approach is not needed here, i.e., the asymptotic theory will reasonably apply. I'm not knowledgeable enough to advise you, although others here (see 3 below) might be.

3) Post a new question here with a carefully prepared subject heading on either a) permutation tests and weighting or b) the need for a permutation test in your situation. The current subject heading for your posting on StataList will not indicate to people this aspect of your problem.
Aditi Roy

Join Date: Jul 2017

Posts: 29
#12

10 Nov 2019, 22:42

Thanks a lot, Mike Lacy . Also, I appreciate your ability to interpret my problem correctly when I failed to articulate it properly.
Aditi Roy

Join Date: Jul 2017

Posts: 29
#13

14 Nov 2019, 02:09

Sorry to bother you again, Mike Lacy. Something is not clear to me. I ran the code from #10 with 100 repetitions, and I wanted to see the average effect as it shows in simulate. Using permute, I get the following result for reps(100). Then, when I use sum to see the average of the coefficient, I see a different mean result. Now I am confused: as I want to report the average effect of the coefficient, is reporting the mean of bt (beta) from sum fine or not? I read help permute and saw the results is the difference in p values. Through permute, I have the stored p, beta, se, and r-squared. So for the average effect of the regression, should I summarize their values and report that, or report the outcome from the permute command?

Looking forward to your reply.

Attached Files
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#14

14 Nov 2019, 08:02

Because you did not show exactly what Stata code you executed, I can't tell what you did, and so I can't diagnose your problem. That is, if you want more help from people here, you will need to post the exact code you executed, and the results you received. In doing this, I'd encourage you not to post your results as an image, but rather as text between code delimiters. (I believe the FAQ describes this suggestion.)

Various things you said don't make sense to me or are unclear, and make me think you're having some difficulty with understanding the meaning of what you're doing, as well as how to do it in Stata:

1) "simulate" does not show any average effect. It simply repeatedly runs the user's command, and saves the result. If simulate is showing some average value, it must be something you requested. I have no way to know what you averaged or how you did it.
2) What kind of average you intend is unclear to me. You *might* mean "the mean value of the coefficient across all random permutations."
3) "results is the difference in p values." I don't know what this means. -permute- provides a simulated (empirical) sampling distribution under conditions that enforce a particular null hypothesis. The p-value reported by -permute- is the fraction of times that a parameter estimate on a random shuffling of the data set is at least as extreme as the value in the observed data set.
4) If you really did what was listed in #10, you are not getting a permutation test, as you have started off with a random assignment of your treatment. You need to start off with the observed assignment of your treatment variable, since that's what you want to compare with the permuted values.
5) Here's an issue that may not be causing you problems, but is worth noting: Any use of -permute- or -simulate- will give slightly different results each time unless you use a chosen random number seed before starting. See -help set seed-.
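
Points 3 to 5 can be pulled together in a short sketch: fix the seed, start from the treatment as given (simulated here, since the real data are not shown), let -permute- shuffle it within year, and then compute the share of permuted coefficients that are at least as large in absolute value as the observed one. All variable names and the data-generating lines are assumptions for illustration, and the hand-computed share is a simple two-sided proportion that need not match the p-value -permute- prints in every Stata version.

```stata
clear
set obs 3000
set seed 8600   // makes the results reproducible across runs
gen int year = 1990 + floor(3 * runiform())
gen x = runiform()
gen y = x + rnormal()

// Stand-in for the observed treatment assignment.
bysort year: gen byte treat = (_n <= 50)

// Coefficient in the data as observed.
quietly regress y x treat
scalar b_obs = _b[treat]

// Shuffle treat within year and collect the coefficient each time.
tempfile perms
permute treat bt = _b[treat], reps(200) strata(year) ///
    saving(`perms') nodots: regress y x treat

// Inspect the permutation distribution.
use `perms', clear
summarize bt
count if abs(bt) >= abs(b_obs)
display "share at least as extreme as observed: " r(N)/_N
```

The mean of `bt` across repetitions describes the distribution under the null hypothesis of no treatment effect, so it is not an estimate of the effect; the estimate is the coefficient from the regression on the unpermuted data, stored here in `b_obs`.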
