  • Which Stata command can I use to conduct a randomization check?

    Hi all. As part of my dissertation, I conducted a between-subjects factorial experiment with two factors, a video treatment (treat_video_recoded) and a news treatment (treat_news_recoded). I need to run a randomization check to verify whether the covariates differ between treatment and control for these two factors. Should I conduct several t-tests, or is there a Stata command that performs a randomization check? I've tried to find such a command, to no avail. A data sample follows below:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(treat_video_recoded treat_news_recoded age_group_recoded turnout_recoded news_attent_recoded pol_interest_recoded inv_unemp_recoded media_trust_recoded party_pref_recoded)
    2 2 1 0 4 3 3 3 0
    2 2 3 1 5 2 5 3 1
    2 2 5 1 2 2 5 4 0
    2 2 1 1 5 3 4 3 0
    2 2 1 1 3 2 5 3 0
    2 2 3 1 5 2 3 2 0
    2 2 2 0 4 2 4 2 0
    2 2 1 1 4 3 3 1 0
    2 2 2 1 5 2 5 3 0
    2 2 3 1 5 3 4 3 0
    2 2 3 0 5 5 5 3 1
    2 2 1 0 4 3 5 3 1
    2 2 3 1 5 5 5 5 0
    2 2 5 1 5 3 5 3 0
    2 2 2 1 4 2 3 2 0
    2 2 2 1 5 3 3 4 1
    2 2 3 1 5 3 5 3 1
    2 2 3 1 5 5 5 2 0
    2 2 1 1 5 4 5 5 0
    2 2 3 1 4 3 5 3 1
    2 2 1 1 5 4 5 3 1
    2 2 2 1 4 2 2 3 0
    2 2 1 1 5 5 4 4 0
    2 2 4 1 5 5 5 3 0
    2 2 2 1 4 2 3 3 0
    2 2 3 0 5 3 5 2 0
    2 2 1 1 4 3 5 3 0
    2 2 1 1 5 4 5 3 1
    2 2 4 1 5 5 5 5 0
    2 3 4 1 5 4 4 4 1
    2 3 5 1 4 2 3 3 1
    2 3 1 1 4 3 5 2 1
    2 3 3 1 4 3 4 3 0
    2 3 4 1 4 1 3 2 0
    2 3 5 1 5 4 5 3 1
    2 3 4 1 4 3 5 3 0
    2 3 3 1 5 3 5 3 0
    2 3 3 1 4 3 5 3 1
    2 3 1 1 5 3 5 5 0
    2 3 2 1 5 4 5 3 1
    2 1 1 1 5 5 5 5 1
    2 1 2 1 4 2 5 3 0
    2 1 1 1 3 3 5 2 0
    2 1 1 1 5 2 2 3 1
    2 1 1 1 4 2 5 2 0
    2 1 1 1 4 3 5 3 0
    2 1 1 1 5 2 5 3 0
    2 1 3 1 5 4 5 3 0
    2 1 3 1 5 2 4 3 0
    2 1 3 1 5 3 5 2 0
    2 1 2 1 4 3 5 3 1
    2 1 2 1 5 4 5 3 0
    2 1 3 1 5 4 5 4 1
    2 1 1 0 5 3 5 3 0
    1 2 3 1 5 4 5 3 1
    1 2 3 1 4 2 5 3 0
    1 2 5 1 5 2 5 3 1
    1 2 6 0 5 3 4 3 0
    1 2 2 1 5 5 5 3 0
    1 2 3 0 5 5 5 4 0
    1 2 2 1 4 2 4 5 0
    1 2 1 1 4 2 5 3 1
    1 2 4 1 5 2 4 3 0
    1 2 1 1 5 3 5 3 0
    1 2 3 1 5 3 5 4 0
    1 2 2 1 5 3 5 3 1
    1 2 3 1 5 4 5 3 1
    1 2 1 1 5 3 3 4 0
    1 2 2 1 5 4 4 3 1
    1 2 2 1 4 2 4 2 0
    1 2 2 1 5 3 5 3 0
    1 2 1 1 4 4 5 3 1
    1 2 2 1 5 1 5 3 0
    1 2 2 1 5 3 5 3 0
    1 2 4 0 4 3 5 3 0
    1 3 1 0 4 2 3 2 0
    1 3 6 1 5 5 5 4 0
    1 3 4 1 5 3 5 4 1
    1 3 1 1 4 2 5 3 0
    1 3 5 1 5 4 4 3 0
    1 3 1 1 5 3 3 4 0
    1 3 3 1 5 4 3 5 0
    1 3 3 1 5 3 4 1 1
    1 3 1 1 5 2 5 3 1
    1 3 3 1 5 3 3 3 0
    1 3 1 1 5 2 3 2 0
    1 3 5 1 4 3 5 3 1
    1 3 3 1 4 4 5 4 1
    1 3 2 1 4 4 5 4 1
    1 3 5 1 5 5 5 3 1
    1 3 2 1 5 2 4 3 0
    1 1 1 1 5 3 5 3 1
    1 1 2 1 4 3 4 3 1
    1 1 1 1 5 2 5 3 0
    1 1 2 1 5 5 5 5 0
    1 1 4 1 4 2 5 3 0
    1 1 4 1 5 2 4 3 0
    1 1 1 1 4 3 5 3 0
    1 1 4 0 5 2 5 3 0
    3 2 4 1 5 4 5 4 1
    end

  • #2
    There are several reasons why a series of t-tests would be wrong. The first is that you have 6 treatment groups, not 2, and you really need an omnibus comparison of the covariates among the 6 groups, not just pairwise comparisons. The second is that, at least in the example data you give, the covariates appear to be discrete variables, which would imply doing cross-tabulations with the treatment groups instead.

    At a deeper level, are you aware that the American Statistical Association has recommended that significance tests no longer be used? See https://www.tandfonline.com/doi/full...5.2019.1583913 for the "executive summary" and
    https://www.tandfonline.com/toc/utas20/73/sup1 for all 43 supporting articles. Or https://www.nature.com/articles/d41586-019-00857-9 for the tl;dr.

    However, even if you want to persist with outdated and misleading statistical practice because your dissertation committee requires you to, statistical testing was never appropriate for "randomization checking." There are two reasons for this. One is that the null hypothesis, namely that the distribution of the covariates from which the groups were sampled is identical in each group, is always true in a randomized study. Unless there is some reason to believe that the randomization procedure itself was breached, a "significant difference" between groups on anything can only be a Type I error. Now, you might think that these Type I errors are precisely what you need to check for. But they are not: see the second reason.

    For the second reason, think about why we care whether or not the randomization balances the covariates. We care because of the concern that an imbalance in the covariates will bias our estimates of the treatment effects. Notice that this is a sample-level issue. Covariate imbalance introduces bias into the estimated treatment effects whether the study is randomized or not: it has nothing to do with population-level distributions; it is all about the sample distributions of the covariates. So statistical hypothesis testing, which is about inference from samples to populations, is irrelevant: it is the answer to the wrong question! (Assuming you still believe it is a valid answer to any question at all.)

    So what should you be doing to check randomization? First, you should just look at the sample distributions of the covariates in the 6 groups. If they are, as I perceived, discrete variables, then this would be done with
    Code:
    egen group = group(treat_*), label
    
    foreach v of varlist age_group_recoded-party_pref_recoded {
        tab `v' group, col
    }
    You can then see the percentage at each level of each covariate in each randomization group and they should be roughly equal. By roughly equal, I mean close enough to equal that the observed degree of difference isn't large enough to make a meaningful difference in the outcome variable.

    If the covariates are, in fact, integer variables that you want to treat as continuous, then replace the -tab- command with -tabstat `v', by(group)- to see the mean values in each group. Again, look for rough equality, close enough that the differences won't meaningfully alter the outcome.
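    As a concrete sketch of that variant, assuming the same -group- variable and covariate range created above:
    Code:
    foreach v of varlist age_group_recoded-party_pref_recoded {
        tabstat `v', by(group) statistics(mean sd)
    }
    The sd statistic is optional; it simply helps you judge whether a difference in means is small relative to the spread of the covariate.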

    What if you are not sure whether an observed difference is large enough to affect the outcomes? It depends on how strongly the covariate is associated with the outcome, of course. In fact, if the covariate were independent of the outcome, no amount of difference in the covariate would alter the outcome. So a borderline difference in the covariate is only likely to matter if the covariate has a fairly strong association with the outcome. But to reassure yourself, there is a simple check: regress the outcome on the treatment groups both with and without the covariates. If the covariates are adequately balanced, the regression coefficients for the six groups will be roughly the same in both regressions. Here, assuming the covariates are continuous:
    Code:
    regress outcome i.group age_group_recoded-party_pref_recoded
    regress outcome i.group
    Then just look to see whether the coefficients of the groups in the two regressions are close enough that there is no practical, real-world difference between them. (If the covariates are discrete, then they have to be introduced with i. prefixes in the regression. If they are all discrete, then you will have a large number of indicator ["dummy"] variables in the model, more than the data set can reasonably handle. In that case, rather than throwing in all of the covariates at once, I would do a separate regression for each covariate.)
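    If the covariates are all discrete, the one-covariate-at-a-time version might look like this (a sketch; "outcome" stands in for your actual outcome variable, which does not appear in the data example):
    Code:
    * baseline: treatment group indicators only
    regress outcome i.group
    
    * one regression per discrete covariate, each entered with i. indicators
    foreach v of varlist age_group_recoded-party_pref_recoded {
        regress outcome i.group i.`v'
    }
    Compare the i.group coefficients across these regressions; if they barely move when a covariate is added, imbalance in that covariate is not materially affecting the treatment effect estimates.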

    • #3
      Thank you very much, Clyde. The idea of regressing the outcome on the treatment groups both with and without the covariates is very easy to implement and very valuable. Looking further in the literature, I found more criticisms like yours of statistical tests for randomization checking. For instance, http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf ; http://iscap.upenn.edu/sites/default...oofs150311.pdf . It seems that a randomization check has no bearing on whether imbalance actually matters to the experimental analyses.
