Randomized Control Trials - An Interesting Exercise

danishussalam

Join Date: Jul 2014

Posts: 140
#1


16 Mar 2016, 08:30

Hello Everyone,

In an upcoming workshop, I intend to demonstrate the following question: does randomization really ensure statistically similar samples on unobserved variables? In theory, randomization ensures statistically similar samples on both observed and unobserved characteristics.
To check this, I shall delete the gender variable from a dataset (case1) and draw a number of random samples (with replacement). Since the dataset will then have no gender variable, gender is effectively unobserved in this setting. I will then add the gender variable back to each sample (manually) to see whether the sample gender proportions match the gender proportions in the original dataset (case1, which has the gender variable).

Assuming my original dataset is "case1", can someone please share a list of commands I can use to collect a large number of such samples? Only then will I be able to show that, on average, the mean of the sample proportions matches the population proportion.

The list of tasks is as follows:

1) Observe gender proportions in a data-set
2) Delete variable gender
3) Select a random sample
4) Add a column of gender in the sample
5) Observe gender proportions in the sample
6) Repeat

Is there code that can help me do this in Stata, say, 10,000 times? Please note that I want to keep the gender proportions from all samples and eventually report their mean across samples.

Thanks!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

16 Mar 2016, 09:13

I'll assume your data set case1.dta contains a unique identifier, call it id, and that your gender variable is called sex. You don't say what sample size you want, nor how many times you want to repeat. So I've written the code for sample size of 250 and 25 repetitions.

Code:

use case1, clear

// VERIFY UNIQUE ID VARIABLE
isid id

// INITIALIZE RANDOM NUMBER GENERATOR FOR REPLICABILITY
set seed 1234 // OR YOUR FAVORITE SEED

// OBSERVE PROPORTIONS IN DATA SET
tab sex

forvalues i = 1/25 { // REPLACE 25 BY DESIRED NUMBER OF SAMPLES
    // DELETE GENDER VARIABLE
    preserve
    drop sex
    // SELECT RANDOM SAMPLE
    sample 250, count // REPLACE 250 BY YOUR DESIRED SAMPLE SIZE
    // BRING BACK THE GENDER VARIABLE
    merge 1:1 id using case1, assert(match using) keep(match) nogenerate keepusing(sex)
    // OBSERVE GENDER PROPORTIONS IN THE SAMPLE
    tab sex
    restore
}

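You also don't say whether you want to sample with or without replacement; the above code assumes without. If you want sampling with replacement, use -bsample 250- instead of -sample 250, count-. Note that with -bsample- the same id can be drawn more than once, so id is no longer unique in the sample and the merge would need to be -merge m:1- rather than -merge 1:1-.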
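
For sampling with replacement, a minimal sketch of the same loop might look as follows. It assumes, as above, that case1.dta has a unique identifier id and a gender variable sex; the sample size of 250 and the 25 repetitions are placeholders. The merge is m:1 because a bootstrap sample can contain the same id several times.

```stata
use case1, clear
isid id                      // id must be unique in the full data set
set seed 1234                // any seed, for replicability

forvalues i = 1/25 {         // placeholder number of samples
    preserve
    drop sex                 // gender is now "unobserved"
    bsample 250              // draw 250 observations WITH replacement
    // ids may repeat in the sample, hence m:1 rather than 1:1
    merge m:1 id using case1, assert(match using) keep(match) nogenerate keepusing(sex)
    tab sex                  // gender proportions in this sample
    restore
}
```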

Note also that the above code is deliberately written to mimic your sequence of steps. In terms of getting the actual results, a Monte Carlo simulation of the sampling distribution of the gender variable, you could do this more concisely using the -simulate- command.
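
As an illustration of the -simulate- approach, here is a hedged sketch rather than a definitive implementation. The program name one_draw and the result name p_sex are hypothetical, and the code assumes sex is coded 0/1, so that its sample mean is the sample proportion. It skips the drop-and-merge steps, since they do not affect which observations are drawn.

```stata
// Hypothetical helper: draw one sample and return the proportion with sex == 1
capture program drop one_draw
program define one_draw, rclass
    use case1, clear
    sample 250, count        // use -bsample 250- for sampling with replacement
    summarize sex, meanonly  // assumes sex is coded 0/1
    return scalar p_sex = r(mean)
end

set seed 1234
// Run the helper 10,000 times; the data in memory are replaced by the results
simulate p_sex = r(p_sex), reps(10000) nodots: one_draw

// Mean and spread of the 10,000 sample proportions
summarize p_sex
```

The mean reported by the final -summarize- can then be compared with the proportion from -tab sex- in the full data set. Re-reading case1.dta on every repetition is simple but slow; that trade-off is a choice made for clarity in this sketch.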
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

16 Mar 2016, 09:39

Am I missing something? The following is a question for the original poster; Clyde's code does what the poster requested, but I don't see that dropping sex before creating the sample, and then subsequently merging it back in, serves any purpose.

The process of simple random sample selection is independent of the values of any of the variables, including sex. The question to be addressed in the workshop is, it seems to me, to what extent the outcome of the process ensures statistically similar values of both observed and unobserved variables. Another way of selecting the sample would be to generate a variable with a random number, sort on that variable, and keep the first 250 observations. This would make the sample selection transparently independent of sex. So what is gained by dropping sex before the sample selection, other than assuring the students that sex was not used in sample selection? That step, it seems to me, has the potential to confuse the audience about the process of random sample selection. Put another way, the process of simple random sampling makes no differentiation between observed and unobserved variables.
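
The random-number approach described above could be sketched like this. The variable name u is hypothetical, and the code assumes the same case1.dta with a gender variable sex and at least 250 observations.

```stata
use case1, clear
set seed 1234                    // any seed, for replicability
generate double u = runiform()   // one random number per observation
sort u                           // random order, unrelated to sex
keep in 1/250                    // first 250 observations form the sample
tab sex                          // gender proportions in the sample
```

Because u is generated without reference to any other variable, it is visible in the code that sex plays no role in which observations are kept.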

Alternatively, was it your desire to select a sample that is somehow representative of the distribution of the observed variables in the population? In that case the process would depend on the values of the observed variables, the outcome should have observed variables that reflect in some way their distribution in the population, and the issue becomes to what extent the sample also reflects the distribution of the unobserved variables. That isn't what is accomplished with a simple random sample, which is what Clyde's code implements.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

16 Mar 2016, 10:14

Clyde's code does what the poster requested, but I don't see that dropping sex before creating the sample, and then subsequently merging it back in, serves any purpose.

I think that what the original poster wants is to demonstrate the phenomenon. The dropping of sex and merging it back in changes nothing in the sampling process, of course. But those steps are there to create a "visual" demonstration of that fact, as it were. The sampling is carried out on a data set in which sex is "unobserved" and then when the sex distribution in the sample is "subsequently revealed," lo and behold, it will resemble that of the "population." It's theater, not statistics.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

16 Mar 2016, 10:57

If you accept that the presence of the gender variable may affect the sampling process, how do you then rule out its potential for having a homeopathic effect - the lingering bits from its previous presence in the data may affect the sampling draw?
danishussalam

Join Date: Jul 2014

Posts: 140
#6

21 Mar 2016, 01:10

Originally posted by Clyde Schechter:

I think that what the original poster wants is to demonstrate the phenomenon. The dropping of sex and merging it back in changes nothing in the sampling process, of course. But those steps are there to create a "visual" demonstration of that fact, as it were. The sampling is carried out on a data set in which sex is "unobserved" and then when the sex distribution in the sample is "subsequently revealed," lo and behold, it will resemble that of the "population." It's theater, not statistics.

Thank you Clyde. This is exactly what I meant. I just wanted to show the audience a visual demonstration of something that holds well in sampling theory.
