  • Randomized Control Trials - An Interesting Exercise

    Hello Everyone,

    In an upcoming workshop, I intend to demonstrate the following question: does randomization really ensure statistically similar samples on unobserved variables? In theory, randomization ensures statistically similar samples on both observed and unobserved characteristics.
    To check this, I shall delete the gender variable from a dataset (case1) and draw a number of random samples (with replacement). Since the dataset will then have no gender variable, gender is unobserved in this setting. I will then add the gender variable back to each sample (manually) to see whether the sample gender proportions match the gender proportions in the original dataset (case1, which has the gender variable).

    Assuming my original dataset is "case1", can someone please share a list of commands I can use to collect a large number of such samples? Only then will I be able to show that, on average, the mean of the sample proportions matches the population proportion.

    The list of tasks is as follows:

    1) Observe gender proportions in a data-set
    2) Delete variable gender
    3) Select a random sample
    4) Add a column of gender in the sample
    5) Observe gender proportions in the sample
    6) Repeat

    Is there code that can help me do this in Stata, say 10,000 times? Please note that I want to keep the gender proportions from all samples and eventually report the mean across all samples.

    Thanks!


  • #2
    I'll assume your data set case1.dta contains a unique identifier, call it id, and that your gender variable is called sex. You don't say what sample size you want, nor how many times you want to repeat. So I've written the code for sample size of 250 and 25 repetitions.

    Code:
    use case1, clear
    
    // VERIFY UNIQUE ID VARIABLE
    isid id
    
    // INITIALIZE RANDOM NUMBER GENERATOR FOR REPLICABILITY
    set seed 1234 // OR YOUR FAVORITE SEED
    
    // OBSERVE PROPORTIONS IN DATA SET
    tab sex
    
    forvalues i = 1/25 { // REPLACE 25 BY DESIRED NUMBER OF SAMPLES
        // DELETE GENDER VARIABLE
        preserve
    
        drop sex
    
        // SELECT RANDOM SAMPLE
        sample 250, count // REPLACE 250 BY YOUR DESIRED SAMPLE SIZE
    
        // BRING BACK THE GENDER VARIABLE
        merge 1:1 id using case1, assert(match using) keep(match) nogenerate keepusing(sex)
    
        // OBSERVE GENDER PROPORTIONS IN THE SAMPLE
        tab sex
    
        restore
    }
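    You also don't say whether you want to sample with or without replacement; the above code assumes without. If you want sampling with replacement, use -bsample 250- instead of -sample 250, count-, and change -merge 1:1- to -merge m:1-, because the same id can appear more than once in a sample drawn with replacement.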

    Note also that the above code is deliberately written to mimic your sequence of steps. In terms of getting the actual results, a Monte Carlo simulation of the sampling distribution of the gender variable, you could do this more concisely using the -simulate- command.
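The -simulate- approach mentioned above might look roughly like the following. This is a sketch under stated assumptions, not the code from this thread: the program name propsex is made up, sex is assumed to be coded 0/1 so that its mean is a proportion, and a made-up stand-in population is built first so the example runs on its own. With real data, delete the stand-in lines and pass the path to case1.dta instead. It samples with replacement, as the original question asked.

```stata
// Hypothetical helper: draws one sample and returns the proportion with sex == 1
capture program drop propsex
program define propsex, rclass
    args datafile n
    use id using "`datafile'", clear   // load id only, so sex is "unobserved"
    bsample `n'                        // sample n observations with replacement
    // ids can repeat after bsample, hence m:1 rather than 1:1
    merge m:1 id using "`datafile'", assert(match using) keep(match) ///
        nogenerate keepusing(sex)
    summarize sex, meanonly
    return scalar p = r(mean)
end

// Made-up stand-in for case1.dta (ASSUMPTION: 5,000 rows, sex coded 0/1)
clear
set seed 1234
set obs 5000
generate long id = _n
generate byte sex = runiform() < 0.45
tempfile pop
save "`pop'"
summarize sex, meanonly
scalar p_pop = r(mean)                 // population proportion, kept for comparison

// One row per sample; raise reps() to 10000 for the full exercise
simulate p = r(p), reps(1000) nodots: propsex "`pop'" 250
summarize p
display "population proportion = " scalar(p_pop)
```

After -simulate- finishes, the data in memory hold one observation per sample, so -summarize p- reports the mean of the sample proportions and -histogram p- would show their sampling distribution.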



    • #3
      Am I missing something? The following is a question for the original poster; Clyde's code does what the poster requested, but I don't see that dropping sex before creating the sample, and then subsequently merging it back in, serves any purpose.

      The process of simple random sample selection is independent of the values of any of the variables, including sex. The question to be addressed in the workshop, it seems to me, is to what extent the outcome of the process ensures statistically similar values of both observed and unobserved variables. Another way of selecting the sample would be to generate a variable with a random number, sort on that variable, and keep the first 250 observations. This would make the sample selection transparently independent of sex. So what is gained by dropping sex before the sample selection, other than assuring the students that sex was not used in sample selection? It seems to me that this step has the potential to confuse the audience about the process of random sample selection. Put another way, the process of simple random sampling makes no differentiation between observed and unobserved variables.

      Alternatively, was it your desire to select a sample that is somehow representative of the distribution of the observed variables in the population? In that case the process would depend on the values of the observed variables, the resulting sample should reflect in some way the distribution of those variables in the population, and the issue becomes to what extent it also reflects the distribution of the unobserved variables. That isn't what is accomplished with a simple random sample, which is what Clyde's code implements.
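The random-number selection described above (generate a random variable, sort on it, keep the first 250) can be sketched as follows. The variable name u and the stand-in population are assumptions made for illustration; with real data, replace the stand-in lines with -use case1, clear-.

```stata
// Made-up stand-in for case1.dta (ASSUMPTION: 5,000 rows, sex coded 0/1)
clear
set seed 1234
set obs 5000
generate long id = _n
generate byte sex = runiform() < 0.45
tab sex                        // proportions in the "population"

// Selection uses only the random number u, never sex
generate double u = runiform()
sort u
keep in 1/250                  // simple random sample without replacement
tab sex                        // proportions in the sample
```

Because the draw depends only on u, sex plays no role in the selection whether or not it is present in the data set.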



      • #4
        "Clyde's code does what the poster requested, but I don't see that dropping sex before creating the sample, and then subsequently merging it back in, serves any purpose."
        I think that what the original poster wants is to demonstrate the phenomenon. The dropping of sex and merging it back in changes nothing in the sampling process, of course. But those steps are there to create a "visual" demonstration of that fact, as it were. The sampling is carried out on a data set in which sex is "unobserved" and then when the sex distribution in the sample is "subsequently revealed," lo and behold, it will resemble that of the "population." It's theater, not statistics.



        • #5
          If you accept that the presence of the gender variable may affect the sampling process, how do you then rule out its potential for having a homeopathic effect - the lingering bits from its previous presence in the data may affect the sampling draw?



          • #6
            Originally posted by Clyde Schechter

            I think that what the original poster wants is to demonstrate the phenomenon. The dropping of sex and merging it back in changes nothing in the sampling process, of course. But those steps are there to create a "visual" demonstration of that fact, as it were. The sampling is carried out on a data set in which sex is "unobserved" and then when the sex distribution in the sample is "subsequently revealed," lo and behold, it will resemble that of the "population." It's theater, not statistics.

            Thank you Clyde. This is exactly what I meant. I just wanted to show the audience a visual demonstration of something that holds well in sampling theory.
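The sampling-theory result behind the demonstration can be written out briefly. The notation here is not from the thread and is only a sketch: let p be the population proportion, X_i (0 or 1) the gender indicator of the i-th sampled unit, and \hat{p} the proportion in a sample of size n drawn with replacement.

```latex
\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\mathbb{E}[\hat{p}] = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_i] = p, \qquad
\operatorname{Var}(\hat{p}) = \frac{p(1-p)}{n}
```

For example, with n = 250 and p near 0.5, the standard deviation of \hat{p} is about \sqrt{0.25/250} \approx 0.032, so individual samples differ from the population by a few percentage points while their average over many samples settles on p.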

