Difference-in-differences on a cross-sectional dataset by group averages

Aditi Roy

Join Date: Jul 2017

Posts: 29
#1


02 Jan 2019, 04:01

Hi,

I want to implement difference-in-differences on a cross-sectional dataset using group averages (e.g., birth cohorts within states), as shown in this paper (attached).

My basic regression is for individual i, living in state s, born in year t. Rather than this, I want to run it on group averages, where each group is a birth cohort in state s. I averaged the outcome variable and ran the regression, but I don't think that is how it should be done. I believe that if I am using group averages, the number of observations should also decrease, because the analysis is at the group level rather than the individual level.

Can anyone please help me with the Stata loop code?

I am a neophyte, please do forgive me if I am not on point.

Thanks in advance. Looking forward.

Cheers,
A

MIT Press Journals

https://www.mitpressjournals.org

Tags: cross-sectional, difference-in-difference, group averages, loop, regression
Clyde Schechter

Join Date: Apr 2014

Posts: 30115
#2

02 Jan 2019, 09:52

I doubt you will get anything beyond a vague suggestion of how to do this given the scanty information you have provided.

To get a helpful response, I think you need to show an example of the data you started with, followed by the exact code you used getting from there to the results. Then people will actually know what you're talking about, and what you've done and might be able to either validate your approach or troubleshoot it for you in concrete terms.

When posting back, be sure to use the -dataex- command to show the example data. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Aditi Roy

Join Date: Jul 2017

Posts: 29
#3

04 Jan 2019, 00:42

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id year_of_birth years_of_education Post_X_policy) str1 states float(sex_ratio literacy_rate Current_age Household_asset Male_head survey_year)
1 1985 10 1 "A" .43 36 35 1 1 2000
2 1986 11 0 "B" .55 21 31 0 1 2005
3 1987 15 1 "A" .46 26 30 1 1 2000
4 1985 17 1 "C" .15 29 25 0 0 2000
5 1991 15 1 "B" .36 24 26 1 1 2005
6 1991 14 1 "A" .66 21 30 0 1 2000
7 1993 15 0 "B" .55 23 22 1 1 2000
8 1985 20 0 "C" .74 55 27 1 0 2005
end

I have two state-level variables (sex ratio and literacy rate) and three individual-level variables (current age, male household head, and household assets).

Now, rather than running the following regression for individual i in state s born in cohort year t:

reg years_of_education Post_X_policy sex_ratio literacy_rate Current_age Household_asset Male_head i.states i.year_of_birth i.survey_year, cluster(states)

I want to run the regression by group averages of individuals born in state s and year t.

I am doing the following

bys year_of_birth: egen avg_years_of_education = mean(years_of_education)
bys year_of_birth: egen avg_Post_X_policy = mean(Post_X_policy)
bys year_of_birth: egen avg_sex_ratio = mean(sex_ratio)
bys year_of_birth: egen avg_literacy_rate = mean(literacy_rate)
bys year_of_birth: egen avg_Current_age = mean(Current_age)
bys year_of_birth: egen avg_Household_asset = mean(Household_asset)
bys year_of_birth: egen avg_Male_head = mean(Male_head)

after this I am running the following reg
reg avg_years_of_education avg_Post_X_policy avg_sex_ratio avg_literacy_rate avg_Current_age avg_Household_asset avg_Male_head i.states i.survey_year, cluster(states)

Am I doing the correct thing?
Clyde Schechter

Join Date: Apr 2014

Posts: 30115
#4

04 Jan 2019, 09:16

Am I doing the correct thing?

It depends on what your specific research hypothesis is. By doing this you are recasting your unit of analysis to state birth cohorts instead of individuals. You will not be able to infer anything like "a person is expected to have x more years of education if they have w more dollars in household assets" from this analysis. But you will have sharper estimates of these effects on birth cohorts. So you will be able to say things like "the average years of education is x years greater among those people born in a state and year where and when average household assets were w dollars higher." Since you haven't actually said what question you are trying to ask, nobody can tell you whether this will give you the right answer. But it seems from what you wrote in #1 that this is what you want.

That said, assuming you want to pursue this approach, there is a problem with your code. Because you are taking averages over year of birth, every state will have the same values for these variables in a given year of birth. So you will have a single observation for each state in each birth year, but they will all be the same. And your state variable will end up dropping from the regression. Similar concerns arise with regard to survey year. Anyway, it doesn't make sense. If your unit of analysis is to be the birth cohort in a state, then you have to lose the survey year variable in your analysis and aggregate up to the state-birth cohort level (which will, indeed, decrease the number of observations).

So drop all those -egen- commands and instead do this:

Code:

collapse years_of_education Post_X_policy sex_ratio ///
    literacy_rate Current_age Household_asset Male_head, ///
    by(states year_of_birth)

to calculate the means and aggregate up.

Now you can run your regression as

Code:

regress years_of_education Post_X_policy sex_ratio ///
    literacy_rate Current_age Household_asset Male_head ///
    i.states, vce(cluster states)

(Note the absence of survey_year.)
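To see the drop in the number of observations directly, here is a minimal sketch that assumes the -dataex- example from #3 is in memory. The names n_cell and state_id are made up for this illustration. Because states is a string variable in that example, the sketch also creates a numeric copy with -encode-, which factor-variable notation requires.

```stata
* Assumes the -dataex- example from #3 is in memory (8 individuals).
count                                   // 8 individual-level observations

* One row per state x birth-year cell; n_cell (hypothetical name) records
* how many individuals were averaged into each cell.
collapse (mean) years_of_education Post_X_policy sex_ratio   ///
    literacy_rate Current_age Household_asset Male_head      ///
    (count) n_cell = id, by(states year_of_birth)

count                                   // 7 cells: state C, 1985 had two people
list states year_of_birth years_of_education n_cell, sepby(states)

* Factor-variable notation needs a numeric variable, so encode the string.
encode states, gen(state_id)
```

With the full data set, the cell-level regression could then use i.state_id in place of i.states. Whether to weight cells by n_cell is a separate modeling decision that this sketch does not settle.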
Aditi Roy

Join Date: Jul 2017

Posts: 29
#5

05 Jan 2019, 03:17

Thanks a lot, Clyde for enlightening me. I am really grateful.

Also, I am including a linear time trend in the regression i.e. i.state#c.year_of_birth. How should I deal with this?
Clyde Schechter

Join Date: Apr 2014

Posts: 30115
#6

05 Jan 2019, 10:03

Adding i.state#c.year_of_birth is not adding a linear time trend to the model. Moreover, it will lead to a mis-specified and uninterpretable model.

If you really want to add a linear time trend, just add c.year_of_birth. If you want to allow the linear time trend to be different from one state to the next then add i.state##c.year_of_birth. The use of the ##, not #, is crucial. In order for a model with interactions to be properly specified it must contain not just the interaction term but all of the constituents of the interaction. Your model already contains i.state, which is good. But without including c.year_of_birth along with the interaction, you will not get a plain c.year_of_birth term, and that leaves an invalid model. That is why it is best to add interactions to regression models using the ## operator: Stata automatically expands that to include both the interaction and both constituents. (You can, of course, do this as i.state c.year_of_birth i.state#c.year_of_birth. But that's a lot more typing, and, it seems that people frequently forget something. The use of ## makes this foolproof.)
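A quick way to convince yourself that ## is just shorthand is to fit both forms and compare them. The sketch below uses Stata's built-in auto data as a stand-in (foreign plays the role of state, weight the role of year_of_birth); it is an illustration, not your model.

```stata
* Stand-in example on a shipped dataset, not the poster's data.
sysuse auto, clear

* Shorthand: ## expands to both main effects plus the interaction.
regress price i.foreign##c.weight
estimates store shorthand

* Longhand: the same three pieces typed out explicitly.
regress price i.foreign c.weight i.foreign#c.weight
estimates store longhand

* The two columns of coefficients should match term for term.
estimates table shorthand longhand
```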

That said, I presume you have a fairly large number of states in your model, so this interaction is going to both eat up a large number of degrees of freedom and give you a very long regression output table with lots of state-specific time-trend slopes. If these time trends are being included simply for adjustment purposes, then probably no harm. But if you actually intend to draw conclusions about time trends, you are going to have a lot of results on your hands that may be difficult to summarize and say something meaningful about.

Finally, as I imagine you know, the norm in this community is to use our real given and family names as our username to promote collegiality and professionalism. So are you Naina, or are you Aditi? If the username you are working under is not your real name, please click on CONTACT US in the lower right corner of this page. Then send a message to the system administrator requesting a change of your username to your real name. Thank you.
Aditi Roy

Join Date: Jul 2017

Posts: 29
#7

06 Jan 2019, 19:30

Thanks, Clyde.

I am using the state-specific linear year-of-birth trends for adjustment purposes only.

I am Aditi as well as Naina. My official name is Aditi and my nickname is Naina. As per the norm, I have removed my nickname from the signature post.
My apologies for the inconvenience.
Yonathan Adm

Join Date: Mar 2019

Posts: 3
#8

10 May 2019, 12:14

Dear Clyde,
I'm using difference-in-differences to estimate the impact of a policy reform on child labor using survey data. My unit of analysis is a child between the ages of 10 and 17. The problem is that I do not observe the same child in both periods, before and after, since the data come from the National Labour Force Survey.

I have been thinking of using year of birth to follow children by birth cohort across the pre- and post-reform periods. The problem with this approach is that only two years of birth are common to both periods. So I tried averaging all the variables by child age and by year of birth to overcome the problem, changing the unit of analysis to child age and year of birth separately, as indicated below. But there is a significant loss of observations. I'm also not sure whether I'm doing it right.

So, can I do the individual-level analysis on the assumption that the household-level variation will not be cancelled out entirely, since this is a representative National Labour Force Survey? Or how can I solve this problem?

Thank you very much!

collapse id07 id08 weight childwhour zemporment fathereduc nchildren ///
fatherage yearXtreated, by(childage)
collapse id07 id08 weight childage childwhour zemporment fathereduc nchildren ///
fatherage yearXtreated, by(childage)
svy: reg childwhour yearXtreated zemporment nchildren fathereduc childfemale childage i.region

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte region int id07 byte id08 double weight float(childage ybirth childwhour nchildren childfemale fathereduc zemporment yearXtreated)
14 40 1 97.65 15 1990 56 3 1 0 -.9472483 2005
14 40 7 97.65 14 1991 11 2 1 4 -.781602 2005
14 40 10 97.65 16 1989 35 2 0 0 -.9472483 2005
14 40 10 97.65 12 1993 35 2 0 0 -.3738648 2005
14 40 17 97.65 16 1989 34 3 0 6 -1.558854 2005
14 40 17 97.65 13 1992 34 3 0 6 -.9854704 2005
14 40 19 97.65 15 1990 42 3 1 9 -1.864657 2005
14 40 27 97.65 15 1990 50 3 0 2 -1.1511168 2005
14 70 15 89.85 16 1989 34 2 1 5 -2.8191965 2005
14 70 20 89.85 16 1989 11 3 1 6 -1.558854 2005
14 40 1 79.97 15 1990 35 3 1 6 -1.558854 2005
end
label values region lf01
label def lf01 14 "addis ababa", modify

------------------ copy up to and including the previous line ------------------

Listed 11 out of 27925 observations
Clyde Schechter

Join Date: Apr 2014

Posts: 30115
#9

10 May 2019, 12:31

Aggregating up the data the way you describe does not make sense to me. And certainly averaging the survey weights in that way is not valid at all. There is no reason you cannot do the analysis on data with multiple cross sections. To help you more specifically you need to explain:

1. Did the policy reform take effect at the same time for all of the children who were affected by it?

2. Does the policy, in principle, apply to all of the children in the data set, or only some? That is, for those children who appear only in the pre-reform data, would they have been subject to the reform if you also had data on them in the post-reform period?

3. Are there any children for whom data is available in both the pre- and post-reform periods?

4. What is the outcome variable you are trying to estimate the reform's effect on?

5. Please explain why you think the year of birth or age of the child are important here? In what ways, and by what mechanism, would those affect the outcome variable? This is important for correct modeling of those effects.

6. I see you have a variable you call yearXtreated. I imagine this is an interaction variable you have created. Do you have separate year and treated variables? And what is year: is it the year of the observation?

In the future, it is best not to address your posts to me, or any other person in particular. There are many people who might respond to this if it did not appear to be targeted to me. And, as it happens, I only saw this post because a last minute cancellation of something on my schedule freed up some time for me to be on Statalist now. I originally was not planning to visit Statalist for the rest of today--and I would never have seen this post. So don't foreclose the opportunity to get a response. Just address yourself to whoever happens to be reading your question.
Yonathan Adm

Join Date: Mar 2019

Posts: 3
#10

10 May 2019, 15:25

Thank you very much for your prompt reply. To answer your questions:

1. The policy took effect between 2000 and 2005 for all children in the treated regions (treated = 1, the reform-implementing regions). That means children in the non-implementing regions (treated = 0) were not affected by the policy reform until 2005, and they represent my control group.

2. The policy applies only to children in the implementing regions. I have data for all children (though not the same children) in the treated and non-treated regions for both the pre-treatment period, 1999, and the post-treatment period, 2005.

3. Yes.

4. Child labor, measured in terms of the weekly number of hours of work in paid

5. Because the data are multiple cross sections, I just wanted to observe the children in both the pre- and post-treatment periods. I thought that aggregating by age would give me observations by age group across the two periods and make the results easier to interpret. The weight, id07 and id08 variables were included simply so that I could run the svy: command in the regression. When I tried excluding them, the error message said that weight is not available. Maybe I am not clear at all about what the aggregation does. Sorry about that!

6. Yes, yearXtreated is an interaction variable that I created, and I have separate year and treated variables. Year stands for the year of observation, 1999 or 2005.

I hope this addresses your questions. In the future I will not address a question to a particular person and foreclose the opportunity to get responses from others.

Best,
Clyde Schechter

Join Date: Apr 2014

Posts: 30115
#11

10 May 2019, 17:18

OK. So don't worry about the fact that you have cross-sectional rather than longitudinal data. Longitudinal data is better and gives more efficient estimates, but you can still work with what you have.

Forget your yearXtreated variable. We'll do this with factor-variable notation and let Stata create the interaction term for you. The main advantage of this is that you will be able to use the -margins- command afterward, which simplifies the interpretation of the results. To prepare yourself for this, I suggest reading -help fvvarlist- first, and then https://www3.nd.edu/~rwilliam/stats/Margins01.pdf, which is, in my opinion, the clearest explanation of the -margins- command around. It has some worked examples, including some that are similar to your project and it will help you work with the -margins- output.

Your basic regression will be carried out with the individual level data:
Code:

svy: regress hours_employed i.treated##i.year perhaps_some_covariates

And you can follow that up with:

Code:

margins treated#year

to get the expected value of hours_employed in the treated group both pre and post, and the untreated group both pre and post.

And you can further follow that up with:

Code:

margins year, dydx(treated)

That will give you the difference in expected value of hours_employed between the treated and untreated groups in each year.

And last, but actually probably most important: the coefficient of 1.treated#2005.year in the -regress- output will be your difference-in-differences estimate of the effect of the policy reform.
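If it helps to see these pieces together before running them on the survey data, here is a small sketch on simulated data. Everything in it is made up for illustration: the sample size, the seed, and a built-in effect of -4 hours. It leaves out the svy: prefix and any covariates, so it only shows how the interaction coefficient, the -margins- table and the difference-in-differences estimate relate to each other.

```stata
* Simulated toy data; all numbers below are assumptions for illustration.
clear
set seed 12345
set obs 2000
gen byte treated = runiform() < 0.5
gen int  year    = cond(runiform() < 0.5, 1999, 2005)

* Outcome with a built-in reform effect of -4 hours for treated children in 2005.
gen hours_employed = 30 + 2*treated + 1*(year == 2005) ///
    - 4*treated*(year == 2005) + rnormal(0, 5)

regress hours_employed i.treated##i.year

* Expected hours in each of the four treated-by-year cells.
margins treated#year

* Treated-minus-untreated gap in each year.
margins year, dydx(treated)

* The DiD estimate: the interaction coefficient, which should be near -4 here.
lincom 1.treated#2005.year
```

The -lincom- result equals the change between the two gaps reported by the second -margins- command, which is the sense in which the interaction coefficient is the difference in differences.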

Now there are some nuances not dealt with in the above.

If you have a large number of children who were observed in both years, then I think you need to make this a random-effects model with a random intercept at the level of the child. So something like

Code:

svy: mixed hours_employed i.treated##i.year perhaps_some_covariates || child_id:

If you have only a small number of children observed in both years, I would just go with the original -regress- model. If you are in a grey zone, you could consider a) leaving them out altogether, or, better, b) selecting at random one and only one observation for each child and then sticking with -regress-.

As I've indicated with the placeholder perhaps_some_covariates, you may want to add other variables to your model to adjust for their effects. I already have the sense that either age or year of birth is in your mind for this purpose. I should point out that you can't add both of them to the model at the same time, because age, year of birth, and year are collinear (age = year - year of birth). You need the year variable to make your diff-in-diff estimation comprehensible, so I would suggest leaving out age or year of birth. Pick one only. I see other variables in your data set that might be of interest; this content is way out of my area of knowledge, so I won't pretend to advise you which ones are important and which are not. Remember that you want to have, at a bare minimum, 25 observations for each predictor in the model, and preferably 50 or more, to avoid overfitting the noise in the data. So unless your data set is very large, choose your covariates wisely, and resist the temptation to just throw in everything that might be relevant.
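The collinearity is easy to demonstrate on made-up data. The sketch below simulates survey years 1999 and 2005 and ages 10 to 17, reusing the variable names childage and ybirth from the example in #8; the outcome y is pure noise and exists only so the regression will run.

```stata
* Simulated toy data; only the identity ybirth = year - childage matters.
clear
set seed 2019
set obs 500
gen int  year     = cond(runiform() < 0.5, 1999, 2005)
gen byte childage = 10 + floor(8*runiform())     // ages 10 through 17
gen int  ybirth   = year - childage
gen y = rnormal()                                // placeholder outcome

* With year, age and year of birth all included, Stata has to omit one
* term, and flags it as omitted because of collinearity.
regress y i.year childage ybirth
```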

I suppose it is also possible that not just the level of childhood employment, but the actual impact on childhood employment made by the policy reform itself might vary according to age or year of birth. Well, if that's true you need a more complicated model that includes interaction terms between the effect and age (resp. year of birth). Such a model gets pretty complicated, but if reality is likely to be like that, you can't really escape it. Post back if you go down this road and want help with it.
Nursena Sagir

Join Date: Jan 2022

Posts: 27
#12

21 Feb 2023, 02:18

Hi all,

I have a follow-up question on this thread.

I'm using a difference-in-differences design to estimate the impact of a policy reform on depression prevalence using administrative data. My unit of analysis is an 18-year-old student enrolled in one of two types of higher education (Type A and Type B). The reform took effect at the same time (2015) for all students who enrolled in Type A education after 2015, but not for students in Type B education. Students who enrolled in Type A education before 2015 are not affected by the reform. That is why I cannot compare all cohorts enrolled in Type A and B education across years, although the pre- and post-reform data are available. But I can compare outcomes for one specific cohort (i.e., depression diagnosis for first-year Type A and B students who enrolled in 2014 vs. 2015).

The problem with this approach is that I use repeated cross sections. So I think I have to use the average of the outcome variable by cohort and type of education. If I should not use the collapse command to create averages, as Clyde mentioned, can I use the following model?

Code:

* Create depression diagnosis for the first year in school
g depression_prevalence = depression_dummy & ///
    inlist(school_year, "1")
* Create treatment variable
g Treated = school_type == "A" & ///
    inlist(year, "2015","2016","2017")
* Regression
reghdfe depression_prevalence Treated, a(school_type year) vce(cluster school_type)

After using this model, can I still use the subscripts "it" in the model equation, although I observe each student only once?

As my outcome is binary, can I use the reghdfe package?

Thanks in advance for your reply!

Best regards,
Nursena
Nursena Sagir

Join Date: Jan 2022

Posts: 27
#13

21 Feb 2023, 02:18

Sorry for duplication.

Last edited by Nursena Sagir; 21 Feb 2023, 02:24.
Nursena Sagir

Join Date: Jan 2022

Posts: 27
#14

21 Feb 2023, 02:23

Sorry for duplication.