Using a group variable to set data for panel analysis

Patricia Alfonzo

Join Date: Oct 2015

Posts: 16
#1


26 Oct 2015, 20:20

Hi everyone,

I have a dataset with information for teachers at baseline (time==0) and endline (time==1). Some teachers participated in a training (treatment==1) and some did not (control: treatment==0). The teachers at baseline are not the same as the teachers at endline, but they were selected at random from the same schools and we know they are similar based on several demographic characteristics (e.g. age, education level, socioeconomic level, etc.). Because the teachers are not the same, when I try to set my dataset for panel analysis, Stata shows the following message: "repeated time values within panel"

Could I create a group variable including the time variable, as shown below, to conduct panel data analysis? In other words, would that grouped variable allow me to estimate the difference in difference (i.e. difference between the treatment and control groups across time)? Please see the code below (also note that I am controlling for other variables, but for simplicity here I only included the dependent variable and main predictor in the command below):

Code:
egen panelvar=group(school_id class_id time)
xtset panelvar
xtreg teacher_performance treatment

The xtset command with the panelvar works fine, but I want to make sure the analysis is accounting for differences between the treatment and control groups across time and not just estimating before-and-after differences.

I thought of collapsing my dataset at the school level so that I have one school id per time unit, but my sample becomes very small, as I only have 30 schools between treatment and control schools. I don't think my results will be robust enough with such a sample.

Many thanks in advance!
P
Clyde Schechter

Join Date: Apr 2014

Posts: 30024
#2

27 Oct 2015, 09:11

You need to better explain your data and what you originally tried. Your assertion `"Because the teachers are not the same, when I try to set my dataset for panel analysis, Stata shows the following message: "repeated time values within panel""' is not really consistent with having a data set where each teacher has one observation at baseline and one at endline. If each teacher had only those two observations, -xtset teacher_id time- would work just fine.

The commands you show above do not leave time as a within-panel variable; rather, time is part of the definition of a single unit of analysis. This is unlikely to be what you want, and it certainly isn't what I would use to set up a difference in difference analysis to test the effect of training.

You need to figure out what variable, or combination of variables, uniquely identifies the two observations associated with a single teacher. If it is a single variable (i.e. a teacher ID variable) then you can use it as the panel variable in -xtset teacher_id time-. If it takes multiple variables to identify a single teacher (e.g., perhaps school_id and class_id), then you would want

Code:

egen panelvar = group(school_id class_id)
xtset panelvar time

as your setup. And then for a difference in differences analysis you would want:

Code:

xtreg teacher_performance i.treatment##i.time

(In the last code block, I'm ignoring any possible covariates you might want to adjust for in the analysis.)
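
For readers following along, here is a minimal sketch of how one might check which variables identify a teacher before choosing a panel variable. It assumes the variable names used in this thread (teacher_id, school_id, class_id, time); n_obs is a hypothetical helper variable introduced only for illustration.

```stata
* Sketch only; variable names follow the thread, n_obs is a made-up helper.
* Count observations per teacher: a value of 2 for everyone would indicate
* a true two-wave panel on teacher_id.
bysort teacher_id: generate n_obs = _N
tabulate n_obs

* Check whether school, class, and time together pick out a single row.
duplicates report school_id class_id time
```

If every teacher turns out to have only one observation, then teacher_id cannot serve as a panel variable, whatever -xtset- accepts.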

Patricia Alfonzo

Join Date: Oct 2015
Posts: 16
#3

27 Oct 2015, 11:01

Clyde Schechter thank you for your response. Let me clarify my description of the dataset. Teachers are not the same at baseline and endline, meaning that each teacher in my dataset does not have one observation at t==0 and one at t==1. In other words, there's no unique teacher_id with values for both baseline and endline. The treatment is at the school level. Please see a sample of the dataset below:

school_id	class_id	teacher_id	time	treatment	teacher_performance
1	1	1	0	0	16
1	2	2	0	0	29
1	3	3	0	0	16
1	4	4	0	0	18
1	5	5	0	0	14
1	1	6	1	0	21
1	2	7	1	0	22
1	3	8	1	0	12
1	4	9	1	0	27
1	5	10	1	0	20
2	1	11	0	1	16
2	2	12	0	1	22
2	3	13	0	1	25
2	4	14	0	1	13
2	5	15	0	1	18
2	1	16	1	1	16
2	2	17	1	1	28
2	3	18	1	1	9
2	4	19	1	1	27
2	5	20	1	1	22

So, even if I use the unique identifier at the school level, I always end up with repeated time values. I know I don't have panel data but, given that I know teachers are similar based on several demographic characteristics, I was wondering if there was any way I could do difference in difference analysis (without having to collapse the dataset at the school level) to see differences between teachers who got the training and those who didn't and also control for time.


Clyde Schechter

Join Date: Apr 2014

Posts: 30024
#4

27 Oct 2015, 12:06

Since you don't have panel data, it doesn't make sense to analyze it as if you did. What you have here is a two-level model, with a treatment implemented at the higher level, and outcomes measured at the lower level. A mixed-effects model is probably your best bet, something like:

Code:

mixed teacher_performance i.treatment##i.time || school_id:

Note: no -xtset- needed, nor even helpful, nor even possible.

In fact, it might make sense to go a step farther and see this as a three-level model, with classes nested within schools:

Code:

mixed teacher_performance i.treatment##i.time || school_id: || class_id:

Now, for what it's worth, fitting these models to your example data, there is essentially zero variance at the school and class levels, so you could then just revert to a simple flat -regress teacher_performance i.treatment##i.time-. Of course, this example may not be representative of the larger data set, and the school (and class) effects there might be appreciable. You'll have to try it to find out.
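
One possible way to gauge whether the school and class effects are appreciable is sketched below. It assumes the variable names used in this thread and is an illustration, not a prescribed workflow.

```stata
* Sketch only; variable names follow the thread.
* Three-level model: classes nested within schools.
mixed teacher_performance i.treatment##i.time || school_id: || class_id:

* Intraclass correlations: the share of variance sitting at each level.
estat icc

* Flat comparison model with the same fixed-effects specification.
regress teacher_performance i.treatment##i.time
```

The -mixed- output also reports a likelihood-ratio test against the linear model at the bottom, which offers another view of the same question.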
Patricia Alfonzo

Join Date: Oct 2015

Posts: 16
#5

27 Oct 2015, 12:43

Clyde Schechter I tried the second command you suggested above on the larger dataset and attached is the output. Is the difference between treatment and control across time the coefficient for treatment#time1#endline (2.921132)? I also ran the regular regression and the coefficient (treatment#time1#endline) is similar and also significant - I'm attaching the output for that one as well. Is treatment#time1#endline the correct coefficient to interpret?
Attached Files

mixed-effects output.pdf

reg output.pdf
Clyde Schechter

Join Date: Apr 2014

Posts: 30024
#6

27 Oct 2015, 12:53

Yes, the coefficient of 1#endline treatment#time is the estimator of the difference in differences.
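
To make the interpretation concrete, the arithmetic can be worked by hand on the 20-row sample posted in #3 (not the full dataset, whose estimate is the 2.92 quoted above). Writing C and T for control and treatment, and 0 and 1 for baseline and endline, the four cell means and their difference in differences are:

```latex
\begin{align*}
\bar{y}_{C,0} &= (16+29+16+18+14)/5 = 18.6 \\
\bar{y}_{C,1} &= (21+22+12+27+20)/5 = 20.4 \\
\bar{y}_{T,0} &= (16+22+25+13+18)/5 = 18.8 \\
\bar{y}_{T,1} &= (16+28+9+27+22)/5 = 20.4 \\
\widehat{\mathrm{DiD}} &= (\bar{y}_{T,1}-\bar{y}_{T,0}) - (\bar{y}_{C,1}-\bar{y}_{C,0})
  = 1.6 - 1.8 = -0.2
\end{align*}
```

In a flat -regress- with the full treatment-by-time interaction and no covariates, the interaction coefficient equals this quantity exactly. In the mixed model it is the corresponding model-based estimate, which need not match the raw cell means exactly.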

I notice that in your real data, you have appreciable variance at both the school and class levels. And I call your attention to the penultimate line in the mixed-effects output file where it shows that the likelihood ratio test contrasting the mixed model with the linear model has a huge chi square value and is highly significant. It follows that you should report the mixed effects results, not the ordinary linear regression. It is true that the fixed effects estimates, in this case, don't differ much, but still the mixed-effects model here is clearly a superior model of the data.
Patricia Alfonzo

Join Date: Oct 2015

Posts: 16
#7

27 Oct 2015, 13:06

Clyde Schechter when I add control variables to the mixed-effects model, the Stata output does not show standard errors for the control variables. Is this how it should be or should I be including the control variables in a different format? In the attached output, I added age (continuous variable), gender (binary), and PPI (socioeconomic index; continuous variable). When I add these variables one at a time, the output does show standard errors, but when I add them all, it doesn't.
Attached Files

mixed-effects with controls.pdf
Clyde Schechter

Join Date: Apr 2014

Posts: 30024
#8

27 Oct 2015, 13:37

The model you show is not the correct code for adjusting for effects of covariates, and, in fact, the model you show really doesn't make sense at all except under extremely unlikely conditions. What you have done is add random slopes for age, gender, and PPI. Moreover, by not including these same variables among the fixed effects, you have, in effect, constrained the mean slope for each of these variables to be zero--constraints that are unlikely to reflect reality.

If what you want to do is incorporate covariates into the model and adjust for them (which is what is usually meant by the terminology "control variables"), the code would be:

Code:

mixed teacher_performance i.treatment##i.time i.gender c.age c.ppi || school_id: || class_id:

The inclusion of these covariates in the fixed-effects equation is appropriate even though, I imagine, in your data they are defined at the class level and are constant within class.
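
For readers without access to the attachment, the contrast can be sketched as follows. The first command is a hypothetical reconstruction of what a random-slopes specification might look like; the command actually run is in the attached file and is not shown in this thread, and the level at which the variables were placed is a guess.

```stata
* Hypothetical reconstruction, NOT the command from the attachment:
* variables listed after a level's colon become random slopes at that level,
* and they get no fixed-effect coefficient unless also listed before the ||.
* mixed teacher_performance i.treatment##i.time || school_id: || class_id: age gender ppi

* Covariates entered as fixed effects (the form given above):
mixed teacher_performance i.treatment##i.time i.gender c.age c.ppi || school_id: || class_id:
```

The position relative to the first || is what separates an adjusted covariate from a random slope.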
Patricia Alfonzo

Join Date: Oct 2015

Posts: 16
#9

27 Oct 2015, 13:42

Clyde Schechter Thanks for the correct code. Yes, I want to adjust for the effects of these covariates, which are indeed defined at the class level. Many thanks!