Organizing experimental data

Daryosh Jan

Join Date: Jul 2016

Posts: 22
#1

01 Nov 2017, 10:50

Dear Stata users,

I ran an experiment in which I randomly assigned people to control and treatment groups. I collected data for both groups in a baseline survey (2015) and an end-line survey (2016), assuming there is a policy change. How should I organize my data in Stata to run a difference-in-differences model?

Thank you,

Daryosh
Chris Larkin

Join Date: Apr 2016

Posts: 296
#2

01 Nov 2017, 12:58

A difference-in-differences model allows you to draw experimental (i.e. causal) conclusions from non-experimental data. Because you've randomly assigned participants to receive the intervention, you needn't run a difference-in-differences analysis.

In this instance I would go with an OLS model with robust standard errors, of the form regress depvar treatmentindicator anycovariates, robust. This model assumes your dependent variable is the difference in survey scores between baseline and follow-up. If it's not, then controlling for baseline scores will likely improve the fit of your model, although in an ideal experiment, with a large sample and good randomization, it should not affect your ATE much.
Daryosh Jan

Join Date: Jul 2016

Posts: 22
#3

01 Nov 2017, 14:55

Thank you so much for the quick response, Chris Larkin. I have two separate datasets: one for the baseline survey (control and treatment) and another for the end-line survey (control and treatment). Should I use append to stack them together? Also, I didn't understand your explanation of the dependent variable. In my case, the dependent variable that I want to analyze is the number of contraceptives used. I would be very grateful if you could clarify a little more.

Last edited by Daryosh Jan; 01 Nov 2017, 15:02.
Chris Larkin

Join Date: Apr 2016

Posts: 296
#4

01 Nov 2017, 15:42

Hi Daryosh, it'd be great if you read through the FAQs about how to post questions that are easy to answer. Specifically, I'm referring to the suggestion to include a data example (using dataex) and any code. It's difficult for me and other forum users to give you good advice if you don't.

Nevertheless, I can speak generally. You will want to merge (not append) the two datasets you mention. Let's assume that both datasets contain only three variables (unique participant ID, treatment allocation, and survey score); you should merge 1:1 the two datasets on unique participant ID. This will leave you with four variables: unique ID, treatment allocation, pre-survey score, and post-survey score. You could construct your dependent variable as the follow-up survey score minus the baseline survey score, which gives you the difference from baseline (the marginal gain, over the period of the intervention, in the thing you're measuring). Or you could simply use the follow-up survey score as your dependent variable and control for each individual's baseline score within your model.
Daryosh Jan

Join Date: Jul 2016

Posts: 22
#5

01 Nov 2017, 21:04

Thank you, sir. I am trying to upload my sample datasets here (with only 10 observations each). However, I couldn't find an option to do so. Would you please let me know how to upload datasets for your information? Many thanks.
Chris Larkin

Join Date: Apr 2016

Posts: 296
#6

02 Nov 2017, 08:20

If you have further questions then, yes, you can provide a data snippet and I'll have a look.

You don't upload it, though; you use a program called dataex. In your copy of Stata, type ssc install dataex into the Command window. Then read the instructions for the program by typing help dataex. This will give you the information you need to post a properly formatted data example on Statalist. The help file says to include the [CODE] delimiters when you copy and paste your data output. Don't forget to do this!
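
For illustration only, here is a minimal sketch of a dataex call. It assumes dataex is already installed as described above, and it uses the auto dataset that ships with Stata rather than the survey data in this thread; the variable list and the in 1/5 range are arbitrary choices.

```stata
* Load a dataset that ships with Stata (illustration only)
sysuse auto, clear

* Print the first five observations of three variables as pasteable input code
dataex make price mpg in 1/5
```

With your own data in memory, you would list your own variables instead and paste the Results window output, including the [CODE] delimiters, into your post.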

Daryosh Jan

Join Date: Jul 2016
Posts: 22
#7

02 Nov 2017, 15:06

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(id treat y) float age
 1 0 12 35
 2 1 13 20
 3 1 15 26
 4 1 13 44
 5 0 11 34
 6 1 11 19
 7 0  9 26
 8 0 12 28
 9 1 11 40
10 0  8 34
end

Code:


* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(id treat y1) float age
 1 0 14 36
 2 1 17 21
 3 1 14 27
 4 1 14 43
 5 0 13 35
 6 1 17 20
 7 0 11 27
 8 0 11 29
 9 1 15 41
10 0  8 35
end

The first one is the baseline survey and the second one is the end-line survey. The first column is the unique id, the second column is the treatment indicator, the third column is the dependent variable (number of contraceptives used), and the fourth column is a covariate (women's age). Would you please kindly run a model using these datasets?


Chris Larkin

Join Date: Apr 2016
Posts: 296
#8

02 Nov 2017, 16:52

Code:

clear
input byte(id treat y_pre) float age_pre
 1 0 12 35
 2 1 13 20
 3 1 15 26
 4 1 13 44
 5 0 11 34
 6 1 11 19
 7 0  9 26
 8 0 12 28
 9 1 11 40
10 0  8 34
end
tempfile pre
save `pre'


clear
input byte(id treat y_post) float age_post
 1 0 14 36
 2 1 17 21
 3 1 14 27
 4 1 14 43
 5 0 13 35
 6 1 17 20
 7 0 11 27
 8 0 11 29
 9 1 15 41
10 0  8 35
end

merge 1:1 id using `pre', nogen
order id age_pre age_post treat y_pre y_post

*Option one
gen y_diff = y_post - y_pre
regress y_diff treat, robust

*Option two
regress y_post treat y_pre, robust
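
As a cross-check on option one, the sketch below reshapes the same ten observations to long format and fits a difference-in-differences regression. With two waves and every participant observed in both, the coefficient on the treat-by-post interaction should equal the coefficient on treat in the change-score regression. The data are the values posted above, entered directly in merged form; the scalar names b_diff and b_did, the y0/y1 renaming, and the choice to cluster on id are assumptions made for this illustration.

```stata
* Data posted above, entered directly in merged (wide) form
clear
input byte(id treat y_pre y_post)
 1 0 12 14
 2 1 13 17
 3 1 15 14
 4 1 13 14
 5 0 11 13
 6 1 11 17
 7 0  9 11
 8 0 12 11
 9 1 11 15
10 0  8  8
end

* Change-score regression (option one)
generate y_diff = y_post - y_pre
regress y_diff treat, vce(robust)
scalar b_diff = _b[treat]
drop y_diff

* Reshape to long: one row per participant per survey wave (post = 0 or 1)
rename (y_pre y_post) (y0 y1)
reshape long y, i(id) j(post)

* Difference-in-differences: the treat x post interaction is the estimate
regress y i.treat##i.post, vce(cluster id)
scalar b_did = _b[1.treat#1.post]

display "change-score estimate: " b_diff
display "DiD interaction:       " b_did
```

With these ten observations both point estimates are 1.8 (a mean change of 2.8 in the treatment group minus 1.0 in the control group); only the standard errors can differ between the two specifications.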


Daryosh Jan

Join Date: Jul 2016

Posts: 22
#9

02 Nov 2017, 17:59

Thank you so much for the quick response. I was following your instructions earlier today and ran models for both options. However, I did not get the same coefficient for the "treat" variable in options one and two. In the first option the coefficient of interest is insignificant, while in the second it is significant at the 10% level. Would you please explain the reason? Also, I have seen a number of experimental papers with the same pre and post surveys that include a bunch of control variables in the model. You didn't (we could have controlled for age); is there a justification for that?
Chris Larkin

Join Date: Apr 2016

Posts: 296
#10

02 Nov 2017, 19:50

There's greater variance in the follow-up survey scores between treatment and control, so if we just look at that, it is significant at the 0.05 level. However, this is probably an artefact of your small sample and imbalance across trial arms rather than a true effect of the intervention. If you look at the mean number of condoms reportedly used at baseline, it is about 20% higher in the treatment group. When you model the difference between baseline and follow-up, the variance is reduced.
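
One way to look at the baseline imbalance directly is sketched below, using the ten observations posted earlier in the thread, entered in merged form. The scalar names m_treat and m_ctrl are assumptions made for this illustration.

```stata
* Data posted earlier in the thread, entered directly in merged (wide) form
clear
input byte(id treat y_pre y_post)
 1 0 12 14
 2 1 13 17
 3 1 15 14
 4 1 13 14
 5 0 11 13
 6 1 11 17
 7 0  9 11
 8 0 12 11
 9 1 11 15
10 0  8  8
end

* Mean, SD and N of the baseline and follow-up outcome by trial arm
tabstat y_pre y_post, by(treat) statistics(mean sd n)

* Ratio of baseline means, treatment over control
summarize y_pre if treat == 1, meanonly
scalar m_treat = r(mean)
summarize y_pre if treat == 0, meanonly
scalar m_ctrl = r(mean)
display "baseline mean ratio (treat/control): " m_treat / m_ctrl
```

The baseline means are 12.6 in the treatment group and 10.4 in the control group, a ratio of roughly 1.21.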

On your second question, it's completely up to you whether you want to include covariates! I'm not doing this analysis for you, just offering advice on how to construct the models you want in Stata. My advice, though, is to include them only if it makes theoretical sense to do so. If you have a small sample, for example, including a covariate could help with your estimation, as it allows proportions of women (for example) to vary between treatment and control. In practice, though, with a large enough sample and good balance, including extra covariates shouldn't matter unless you want to conduct subgroup analysis.
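
If you do decide to adjust for age, a minimal sketch of how option two could be extended is below. It uses the ten observations posted earlier in the thread, entered in merged form; choosing baseline age (age_pre) rather than end-line age, and the stored-estimate names base and with_age, are assumptions made for this illustration rather than a recommendation.

```stata
* Data posted earlier in the thread, entered directly in merged (wide) form
clear
input byte(id treat y_pre y_post) float age_pre
 1 0 12 14 35
 2 1 13 17 20
 3 1 15 14 26
 4 1 13 14 44
 5 0 11 13 34
 6 1 11 17 19
 7 0  9 11 26
 8 0 12 11 28
 9 1 11 15 40
10 0  8  8 34
end

* Follow-up score on treatment, adjusting for baseline score only
regress y_post treat y_pre, vce(robust)
estimates store base

* Same model with baseline age added as a covariate
regress y_post treat y_pre age_pre, vce(robust)
estimates store with_age

* Coefficients and standard errors side by side
estimates table base with_age, b(%9.3f) se(%9.3f) stats(N r2)
```

Comparing the treat row across the two columns shows how much the estimate moves once age is included.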