Difference in differences regression?

Dana Wray

Join Date: Mar 2017

Posts: 3
#1

09 Mar 2017, 17:43

Hi all,

I have two probably simple questions that I can't figure out, as I'm fairly new to DD models. I'm computing a relatively simple difference-in-differences regression model, where I have two control provinces and one treatment province. I have 22 years of data (7 pre, 3 during, and 12 post); it is uneven due to the policy interventions before mine that I am studying.

1) When I am including two control groups, do I simply code them as below (where 0 = both of the control provinces and 1 = the treatment province) – or do I have to separate them somehow? If it is the former case, how would I interpret the results of the control group, then? Obviously I am looking mainly at the interaction variable but I'm confused as to what "province" would tell me. Before doing this, I simply ran two different DD models (one with each different province as the control).

Code:

reg outcome post##province, vce(cluster province)

2) I would like to measure three time periods instead of two, as the policy in question was 'rolled out' over three years. Would it make sense to add in those 3 years in the regression, like the code example below? Or should I interact this time period with province as well (dur##province)?

Code:

reg outcome post##province dur, vce(cluster province)

Thank you in advance!
Tags: difference in differences, fixed effects, interaction
Clyde Schechter

Join Date: Apr 2014

Posts: 30192
#2

09 Mar 2017, 19:19

Well, it looks like you have longitudinal data here, so you should be using a panel-data estimator, not -regress-.

With a panel estimator, the whole issue of whether to treat the two control groups as one gets finessed.

Here's what I would do. First, you create a treatment variable that is coded 0 for both control provinces and 1 for the intervention province. Let's call that treatment. Next, you create a three-level time variable coded 0 for pre, 1 for during, and 2 for post. Let's call that era. Then you run:

Code:

xtset province
xtreg outcome i.treatment##i.era, fe vce(robust)
margins treatment#era
margins era, dydx(treatment)

This will give you estimates of the within-province effects of treatment in each era, as well as the expected values of outcome in treatment and control in each era. The use of the -xtreg, fe- estimator will allow for a difference in outcome level in your outcome variable. Do be aware that in this fixed-effects estimator, the treatment variable will be omitted because it is constant within each province. That is not a problem. It doesn't mean anything anyway.
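
As a rough illustration of the coding described above, here is a sketch on made-up data that builds the treatment and era variables for three provinces over 22 years (7 pre, 3 during, 12 post). The names province_id and year, and the choice of province 3 as the treated one, are assumptions for the example only.

```stata
* Hypothetical layout: provinces 1 and 2 are controls, province 3 is treated
clear
set obs 66
gen province_id = ceil(_n/22)
bysort province_id: gen year = _n

* 0 for both control provinces, 1 for the intervention province
gen byte treatment = (province_id == 3)

* Three-level time variable: 0 = pre (years 1-7), 1 = during (8-10), 2 = post (11-22)
gen byte era = cond(year <= 7, 0, cond(year <= 10, 1, 2))
label define era 0 "pre" 1 "during" 2 "post"
label values era era

* Check the coding before running any model
tabulate era treatment
```

The cross-tabulation should show 14, 6 and 24 observations in the control column and 7, 3 and 12 in the treated column.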
Dana Wray

Join Date: Mar 2017

Posts: 3
#3

09 Mar 2017, 19:54

Thank you so much, Clyde!

I do have a small additional question. With any DD model with multiple control groups (but only one treatment group), would I use this approach – even if I didn't have the three time periods? I thought the more 'classic' approach was the code I presented above. Would that be in a situation where the multiple years are collapsed into simple pre- and post- periods? Or would that be in a situation where I have one control and one treatment?

Edited to add: I'm getting my idea of what DD models "generally" look like from this post here, if that clarifies it a bit.

Last edited by Dana Wray; 09 Mar 2017, 20:50.
Clyde Schechter

Join Date: Apr 2014

Posts: 30192
#4

09 Mar 2017, 22:23

If you had only the usual 2 time periods, the approach would still be the same. The only difference would be that the era variable would just have two levels instead of 3.

The code you presented in #1 is not quite the classic DID model. The classic DID model has a treatment group and a control group, and a pre- era and a post-era.

Code:

appropriate_regression_command i.treatment##i.era

Heterogeneity within the treatment and control groups is not encompassed in the classical DID model; though you can modify it to deal with that when appropriate.
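
To make the classic two-group, two-period case concrete, here is a small sketch with invented numbers. It uses -regress- as one possible stand-in for the regression command; the data and variable values are purely illustrative.

```stata
* Toy data: two observations per treatment-by-era cell (all values invented)
clear
input byte treatment byte era double outcome
0 0 10
0 0 12
0 1 13
0 1 15
1 0 20
1 0 22
1 1 28
1 1 30
end

* Cell means: control 11 -> 14, treated 21 -> 29
regress outcome i.treatment##i.era

* The interaction coefficient is the DID estimate: (29 - 21) - (14 - 11) = 5
display _b[1.treatment#1.era]
```

Because the model is saturated, the interaction coefficient reproduces the difference of the two before-after differences in cell means exactly.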

What you presented in #1 substitutes province for treatment, but that is different. If you did not have longitudinal data and if there were only 1 control province, what you wrote in #1 would be equivalent to the classic DID model. But having longitudinal data gives you some real advantages here: the use of a panel-data estimator such as -xtreg, fe- enables you to automatically account for heterogeneity among panels (in either treatment or control group) "for free." And it also gives you a within-province estimator of treatment effect, which is what you actually want since the treatment is inherently a distinction that acts within the treated entities.

There are many modifications of the classic DID model. What they all have in common is reliance on a treatment#time interaction term. But there can be multiple treatments. Or there can be more than 2 time periods. And sometimes we are interested not so much in the levels of the outcome variable as in their rates of change over time. And there are sampling designs based on matched pairs. There is cross-sectional data and there is longitudinal data. The number of combinations seems endless: new ones keep cropping up here on Statalist on a regular basis!

I want to change one aspect of the advice I gave earlier. When I wrote -vce(robust)- in the -xtreg- command, I forgot that you only have three provinces. With only three clusters, the cluster robust vce estimator is not valid. So you should not use that with -xtreg-, nor with any other command.
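
A sketch of the revised command, without -vce(robust)-, run on simulated data. Everything about the data here (province_id, the seed, the built-in effect of 2 in the post era) is invented for illustration and is not the poster's data.

```stata
* Simulated panel: 3 provinces x 22 years, province 3 treated (hypothetical)
clear
set seed 12345
set obs 66
gen province_id = ceil(_n/22)
bysort province_id: gen year = _n
gen byte treatment = (province_id == 3)
gen byte era = cond(year <= 7, 0, cond(year <= 10, 1, 2))

* Province-specific level plus an assumed effect of 2 in the treated province, post era
gen outcome = province_id + 2*(treatment & era == 2) + rnormal()

xtset province_id year

* Same model as before, but with the default (non-cluster) standard errors
xtreg outcome i.treatment##i.era, fe

* The interaction coefficients are the DID estimates for each era
display "DID estimate, during era: " _b[1.treatment#1.era]
display "DID estimate, post era:   " _b[1.treatment#2.era]
```

Stata will note that 1.treatment is omitted because of collinearity with the province fixed effects, which is the behaviour described in #2.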
Sophie Verhoef

Join Date: Mar 2017

Posts: 6
#5

19 Mar 2017, 05:31

Hi there!

I am using Stata IC 14.2 and I am trying to do a DiD, but I could really use some help as I am also rather new to DiD (and Stata) and my model requires taking quite a few things into account.

I am trying to figure out whether individual BMI decreases in a county when diabetes prevalence in that county increases, with 0, 2 and 4 year lags (as it may take some time to decrease BMI). I have data on the years 2000 until 2006. Different individuals are surveyed each year, so the data do not follow the same individuals over time. The county in which an individual lives is indicated by "ctycode".

I am not 100% sure, but I think this should be my model:

$$\Delta \text{BMI}_{i} = \beta_1 + \beta_2 (\text{increase in number of people within your county diagnosed with diabetes})_{c,t-1} + \beta_3 \text{Year FE}_{t} + \beta_4 \text{County FE}_{c} + \beta_5 \text{Indiv. FE}_{i} + \varepsilon_{ictd}$$

The first problem I encounter is that my variable "iyear" which indicates in which year the survey was conducted is a string variable, therefore xtset is not allowed. However, the values of iyear are simply 2000, or 2001 etc.

Also, I used the command "ssc install diff" because it seemed to make doing a DiD even more convenient, but then I started doubting what the treatment would be in my model. I think it is *living in a county where diabetes prevalence increased*, yet how am I going to indicate this?

Then, I have age, race, gender, and marital status which I may want to include when controlling for individual fixed effects. I am not sure how to make sure these are all included in individual fixed effects, or preferably, how to estimate multiple models whereby I vary which variables I include.

If someone has an idea on which commands are most suitable for my DiD that would be greatly appreciated!

Kind regards,

Sophie
Clyde Schechter

Join Date: Apr 2014

Posts: 30192
#6

19 Mar 2017, 11:16

First the problem of your year variable. If it is a string that looks to human eyes like the numbers 2000 through 2006, then the -destring- command will convert this to a numeric variable that you can use for analytic purposes. -help destring-
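
A minimal sketch of that conversion on a made-up three-row dataset; only the variable name iyear comes from the question, the values are invented.

```stata
* Toy data: year stored as a string, as described in #5
clear
input str4 iyear
"2000"
"2001"
"2006"
end

* Convert in place; -destring- refuses if any value contains non-numeric characters
destring iyear, replace

* iyear should now be a numeric storage type
describe iyear
```

If some values contain stray characters, -destring- leaves the variable unchanged and says so, which is a useful prompt to inspect the data before forcing anything.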

It isn't at all clear to me why this would be a DID analysis. Do you have a group of counties in which the number of people diagnosed with diabetes increased and another group in which it did not? This seems more like you just have a bunch of counties, with varying degrees of increase in incident diabetes cases. Moreover, I don't see a clear before-after distinction in your design. After all, the incidence of diabetes has been on the rise since before 2000. I also worry about basing this on count of incident cases: it would seem that incidence rates are more relevant here. In fact, pretty much the only counties in which I would expect to see anything other than an increase in the number of cases are in counties which are losing population! Also, you seem to go back and forth between incidence and prevalence--they are different and you need to be clear which you want to deal with.

If I understand your data design correctly, there are different people in the different waves of the survey. So this is not panel (longitudinal) data: it's pooled cross-sectional data. There is no reason, then, to include person level fixed effects: each person appears only once and the person-level fixed effects will completely exhaust all degrees of freedom. Even if I have that wrong, you cannot include unchanging person-level attributes such as race and (usually) gender along with person-level fixed effects. If you include person-level fixed effects, then all effects of unchanging attributes of people are automatically adjusted for in the analysis, and those same effects are also non-estimable. You can include age and marital status as these can change. But, as I say, it doesn't sound to me as if there is any reason for, or even possibility of, including person-level fixed effects. If you do not include person level fixed effects then there should be no difficulty including race and gender.

Anyway, your situation is, in my mind, unclear. So, I think that before discussing particular code, we need to clarify the actual data design and the research goals.

Being new to Stata, I recommend that you pause and prepare yourself better before undertaking a complicated problem. The PDF manuals that come installed with your Stata include two overviews that are essential reading for all Stata users: Getting Started [GS] and the User's Guide [U]. Read these two in their entirety. They will take you through the fundamental commands that are used in everyday data management and routine data analysis. It's a fair amount of reading, and you won't retain all the details. But when you've done it, you will know what the basic commands are and how, in general, they work. You will be able to approach problem-solving armed with a general sense of what commands are likely to be needed; and you will be able then to refresh yourself on the details by referring to the help files and command-specific manual sections.
Sophie Verhoef

Join Date: Mar 2017

Posts: 6
#7

21 Mar 2017, 05:57

Thank you for your response Clyde!

You are right, I think I skipped a few steps there.

It was very easy to destring the variable with that command, so that's one "problem" solved.
Also, you are right about it being different people and indeed there is no reason to include person level fixed effects.

I have read the overviews you suggested and I rethought my research goals and design. Let me try to explain more clearly where I would like to go with this research:

In essence, I want to study "peer effects" on health. I think I need to provide a little background information in order for it to make sense:

We tend to make "bad" health decisions like smoking, drinking, little exercise + too many calories, and we are not easily motivated to change these health behaviours. We are aware of the major health risks, but we do not personally feel vulnerable to them - "it won't happen to me". For smoking, for example, the most effective "motivational tool" is when we encounter a life threatening event - a.k.a. become ill. Of course, we would like to prevent this from happening and hope there are other tools that motivate people to make better health decisions - before the stage of experiencing the harmful consequences is reached. For example, an individual lung test has also been proven to be effective. Why so? Because it personalizes the health risks - it hits close to home then!
So then I started thinking: are we motivated to change our health behaviour when one of our peers experiences the harmful consequences (a negative health shock)? E.g. if one of our friends or family members gets/dies from lung cancer, are we then incentivized to quit smoking? Or if someone we know has to deal with the daily struggles of living with diabetes (type 2), will we then reconsider our eating/exercising habits? Do we then feel more personally vulnerable to the health risks of our lifestyle decisions, because we have "seen it with our own eyes"?

With regard to "peers", I would ideally study friends/family networks. However, there is not (yet) an intergenerational data set in which social networks are registered that makes this feasible. The "next best thing" - the smallest scale I can get data on - is therefore county-level. I know that in the best possible scenario, my "evidence" will be suggestive as this is a very ad hoc definition of peers.
Yet, (in my view) it is not unrealistic that if there is an increase in diabetes prevalence (I have the percentages of diagnosed diabetes prevalence by county for the years 2004-2010, I also have incidence but you are right that that does not say much) in a county, it is more likely that you know someone (who knows someone) within your county who either died from or struggles with diabetes. As a result, (and taking into account the data/variables I have - which are limited) I expect to see more people doing any form of exercise in the short term, and in the "medium" term (maybe 2 and 4 year lags, also taking into account my data runs from 2004-2010) lower BMI in that county. Or if deaths from lung cancer in a county (I only have data on this for 2009 and 2010) increased, that this would result in a higher quitting rate in that county within that year or the year thereafter, compared to a county where lung cancer death rates stayed stable or decreased.

So, in brief, a negative health shock to a "peer" (j) (= a person within your county) will result in an individual (i) changing health behaviour (for the better - so quit smoking/start exercising/etc.)

As I said, this area is still very scarcely researched and suitable data is very limited (quite a large number of observations though). I am aware of the many limitations and assumptions that will have to be made, but this will be more of an exploratory study!

I hope this clarifies where I wish to go with this research.
And with the individual level data (which also includes the county code) and the diabetes prevalence by county (2004-2010) and the lung cancer death rates by county (only 2009 and 2010), I thought a DiD would be the best way to see whether a negative health shock to our peers affects our health behaviour (for the better). As in "before" and "after" the negative health shock? But maybe it is just cross-section then?

Once again, thank you so much for thinking along, an "outsider" asking critical questions is very helpful!
Clyde Schechter

Join Date: Apr 2014

Posts: 30192
#8

21 Mar 2017, 09:21

Thanks for the additional information. I don't see even faint traces of a DID design here. What you have is pooled cross-sectional data, but there is no "treatment" or "control group" nor is there any time point at which a "shock" or "policy" or "event" takes place. If you had individual level-data on "peer gets diabetes" or "peer dies of lung cancer," then you would have these things and a DID analysis would make sense. But your exposure variable here is just a county-level aggregate, so everybody in the same county has equivalent exposure in any given year.

I would see approaching the problem along these lines:

1. Start with the county level aggregate data. For each county and each year you have a diabetes prevalence and (for years 2009-10 only) lung cancer mortality rate. Create new variables reflecting the two and four year lags of diabetes prevalence, and the one-year lag for lung cancer mortality rate. So something like this:

Code:

use county_data, clear
xtset county year
gen dm_prev_lag_2 = L2.dm_prev
gen dm_prev_lag_4 = L4.dm_prev
gen lc_mort_lag_1 = L1.lc_mort
save lagged_county_data, replace

2. Now merge this with your individual-level health behavior data.

Code:

use health_survey, clear
merge m:1 county year using lagged_county_data, keep(match) nogenerate

3. Before doing analyses, I would probably explore graphically the relationships between your exposure variables and your outcome variables. (-lowess-, with the -logit- option when looking at dichotomous outcomes, is an approach I find very helpful for these purposes.)

4. Run some regression analyses. Maybe something like:

Code:

logit quit_smoking lc_mort_lag_1
logit weight_loss dm_prev_lag_2 dm_prev_lag_4

etc. If you have continuous health behavior variables (such as amount of weight lost) that would be even better. Be guided by your graphical analysis in choosing what regressions to do: transform variables in ways that the graphs suggest.
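
The four steps above can be sketched end to end on simulated data. This is only a toy run under stated assumptions: 50 hypothetical counties, years 2004-2010, an invented prevalence series, and an invented relationship between lagged prevalence and weight loss. None of it reflects the real survey.

```stata
clear
set seed 2017

* Step 1: toy county-year panel with a two-year lag of diabetes prevalence
set obs 50
gen county = _n
expand 7
bysort county: gen year = 2003 + _n
gen dm_prev = 8 + 0.3*(year - 2004) + rnormal(0, 1)
xtset county year
gen dm_prev_lag_2 = L2.dm_prev
tempfile lagged
save `lagged'

* Step 2: toy survey, each respondent seen once (pooled cross-sections)
clear
set obs 2000
gen county = ceil(50*runiform())
gen year = 2004 + floor(7*runiform())
merge m:1 county year using `lagged', keep(match) nogenerate

* Invented outcome; missing where the lag is unavailable (2004 and 2005)
gen byte weight_loss = runiform() < invlogit(-1 + 0.05*dm_prev_lag_2) if !missing(dm_prev_lag_2)

* Step 3: graphical exploration on the logit scale
lowess weight_loss dm_prev_lag_2, logit

* Step 4: regression guided by the graph
logit weight_loss dm_prev_lag_2
```

Note that the lag is missing for the first two years of the county panel, so respondents surveyed in those years drop out of the analysis; with real data that loss of observations is worth checking before settling on lag lengths.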

Obviously there are problems with the overall design that will be difficult to overcome with analysis. For example, diabetes prevalence is likely to be associated with aspects of the food-marketing-distribution process that prevails in the county, and that in turn will influence the individual's eating behaviors as well, presumably in the same direction as it influences diabetes prevalence. It isn't clear how one could restructure the analysis to overcome this kind of effect. Similarly lung cancer mortality rates are likely higher in counties where smoking prevalence is (or was a few decades back) higher, and I'm pretty sure it's established that you are more likely to smoke if your peers smoke. So living in a county with high lung cancer mortality rate may well be associated with a greater likelihood of being a smoker yourself. So the actual processes here are rather complicated with effects operating potentially in both directions. I'm pessimistic that you will find the results you are looking for. But you will learn something in the process of trying in any case.

Your initial idea of looking at the effect of a "shock"ing event in an individual's peer network would be a much stronger test of your ideas. While some of the same circularity and effects in both directions would still apply, the use of a DID analysis in this kind of data would overcome some of these limitations. Moreover, with individual data, you could go farther to reduce bias by adjusting for individual-level covariates and peer-group level covariates. But, you have to work with the data you can get. Good luck.
Sophie Verhoef

Join Date: Mar 2017

Posts: 6
#9

22 Mar 2017, 02:52

Thank you so much Clyde!!!! It is all starting to make sense now, thanks to your clear comments and explanations!! Very grateful for you sharing your expertise & helping me speed up my learning process :-)
SODIQ OLAWALE OJODU

Join Date: Jul 2018

Posts: 33
#10

14 Jul 2018, 11:34

Good day Clyde Schechter
Thanks for your help and active support on this platform.

Actually, I am currently working on my dissertation, on the topic of the impact of the UK minimum wage on the mental health of low-wage earners. I believe difference-in-differences is the only approach to this study. To be honest, I have not made much progress, as I have just downloaded the data from the British Household Panel Survey (from wave 1 to wave 18), and everything looks somewhat cumbersome to me as I feel the data need cleaning. The approach of my study is to investigate mental health before and after the introduction of the minimum wage, between 1997 and 2001. Just in case: the UK minimum wage was introduced in 1999. That means Time<1999 is my pre-treatment period and Time>1999 is my post-treatment period. Also, the BHPS data include the General Health Questionnaire, which I intend to use to assess the mental health of poor people. I would also be glad if you could pinpoint the best approach to assigning the treatment and control groups (either RCT or self-selection). I really don't have much knowledge of the Stata commands needed to carry out the data cleaning. I'm stuck and I really need help.

Thanks in advance
Clyde Schechter

Join Date: Apr 2014

Posts: 30192
#11

14 Jul 2018, 17:00

Well, your description of the problem does not include any information that would help figure out how to assign the treatment and control groups. The treatment group would be those people who became subject to minimum wage in 1999, and the control group would be those who did not. But you will probably encounter appreciable difficulties with this DID approach if you try to apply it to the entire population, because you can look at it a couple of ways:

1. The minimum wage law applies to everybody. OR
2. While, in theory it applies to everybody, in practice it only applies to people who were previously earning less than minimum wage.

If you use definition 1, then there is no control group at all. If you use definition 2, there is a high probability that the parallel trends before intervention assumption fails. I would expect measures of mental health to follow rather different paths before 1999 for those who were earning very little and those who earned more. To make this DID approach work, if it can be made to work at all, you would have to restrict the analysis to those who were earning no more than a slight increment above the minimum wage, or something like that. Then those who were originally below minimum wage could be treatment and those just above it would be controls. But even this seems questionable to me. Alternatively, perhaps the law applied only in certain parts of the UK? If that is the case, this might define natural treatment and control groups, although, again, I would be concerned about parallel trends before 1999.

I think for this you really need input from an economist, and I am not one of those and I think the value of my advice on this is quite limited.

As for the data cleaning, that is a very broad topic. Your intuition that your data probably needs cleaning is almost certainly correct. Even the best, professionally curated data sets need cleaning, sometimes extensive cleaning. But, I am not familiar with the British Household Panel Survey and know nothing about it. So again, I can't advise you as to what kind of cleaning it will need. I would be happy to help you with specific Stata commands for specific, defined data-cleaning problems. But I can't begin to help you with such a broad-based question.

Frankly, if this is your dissertation, the best advice I can give you is to make your dissertation advisor earn his or her pay by getting you started on this. As I say, I'm happy to contribute help on the details of using Stata to implement your study design or do specific data cleaning tasks. But the general design of your study is outside my domain of expertise, and the general approach to the data set is as well. Sorry I can't be more helpful at this point.
SODIQ OLAWALE OJODU

Join Date: Jul 2018

Posts: 33
#12

15 Jul 2018, 12:52

Thanks for your time. I will stick to your advice and get back to you.

Regards
SODIQ OLAWALE OJODU

Join Date: Jul 2018

Posts: 33
#13

20 Jul 2018, 15:19

Hi Clyde Schechter
Please, I need your help with Stata commands.

I have gotten started with my work on the impact of the minimum wage on the mental health of low-wage workers.
For my study I have data from the British Household Survey for 1997, 1998, 1999, 2000, 2001, 2002 and 2003. NB: the UK minimum wage was introduced in 1999. The minimum wage was £3.60 for people aged 22 and above.
My goal is to study the impact on the mental health of low-paid workers.
To measure mental health, the British Household Survey data include the General Health Questionnaire.

I would be happy if you could help me with Stata commands to achieve the following.
My concept is to have two groups: the treatment group and the control group.
Treatment Group
The treatment group includes people who were earning below the £3.60 minimum wage in 1997 and 1998, and had an increase in their wage after it was introduced in 1999 (1999-2003). I have variables for labour income per month, number of hours normally worked per week, number of overtime hours per week, and number of hours worked as paid overtime. Basically, I am interested in generating an hourly wage for each worker.

Control Group
People who have similar characteristics to the treatment group but were not exposed to the treatment. This means they did not receive any wage rise, because they were not affected by the minimum wage. By my estimation, I want it to be the group that received between 20% below and 20% above the minimum wage. I also need the Stata command to generate the hourly wage.
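
For readers following along, one possible way to derive an hourly wage and flag the two groups is sketched below. All variable names (monthly_pay, usual_hours, paid_overtime) are hypothetical placeholders rather than actual BHPS names, the data are invented, and the group split follows the "just below versus just above" idea from #11, which is only one reading of the design. It also assumes paid overtime is paid at the normal rate, which may not hold.

```stata
* Invented example rows; names are placeholders, not BHPS variable names
clear
input double monthly_pay double usual_hours double paid_overtime
500 33 2
700 40 0
900 40 0
end

* Weekly paid hours converted to monthly hours (52 weeks / 12 months)
gen double hourly_wage = monthly_pay / ((usual_hours + paid_overtime) * 52/12)

* Treated: below the 3.60 minimum; control: from 3.60 up to 20% above it
* In a real analysis these flags would be based on pre-1999 wages only
gen byte treated = hourly_wage < 3.60 if !missing(hourly_wage)
gen byte control = inrange(hourly_wage, 3.60, 1.2*3.60) if !missing(hourly_wage)

list
```

In this toy data the first row falls in the treated group, the second in the control band, and the third in neither, so it would be excluded from the comparison.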

Lastly I have GHQ 12 version. I would like to incorporate this following Difference in Difference approach.

Thanks a lot.
Clyde Schechter

Join Date: Apr 2014

Posts: 30192
#14

20 Jul 2018, 15:47

You may have already gone some ways towards doing this, but from what is written I cannot tell how far, if at all. It reads like you are asking me to do this entire project for you, from importing the British Household Survey all the way through designing the model, coding it, and interpreting the results. I think you will agree that that would not be appropriate. And perhaps it really is not what you intend, but I cannot tell.

So I suggest that you take this one step at a time. The first step is importing your data into Stata and cleaning and managing it. Start that on your own. If you run into specific problems figuring out Stata code along the way, post that as a new thread to see if somebody here can help with that specific problem. Read the Forum FAQ before posting, and be sure to follow the advice on how to make your question clear, include all the necessary information, and display that information in the best ways (-dataex- for example data, code delimiters for code and Stata output).

You appear to have made some of the major design decisions already. When you have the design completely chosen, you should try implementing it in Stata code. There are many examples of difference-in-differences models on this Forum, and if you review several of them, you may well catch on to how it's done; you can try to model your code after what you see elsewhere on the Forum, adapting it to your particular variables. Since a difference-in-differences design is usually best interpreted by using the -margins- command, be sure to read the excellent Richard Williams' https://www3.nd.edu/~rwilliam/stats/Margins01.pdf for a very well-written explanation of that command and several worked examples. With that under your belt, there is a good chance you will be able to do this part independently. If you hit snags, by all means start a new thread about the difficulties you encounter.

While you're reading the FAQ in preparation for your next post, you will also learn that an important part of writing clear questions is to not use abbreviations or jargon that would not be understood by people outside your discipline. In particular, what on earth is the GHQ 12 version?

Finally, I would emphasize that it is not a good idea to direct a question at a specific responder. While anyone is free to respond to any question here, nobody is obligated to do so. Moreover, if Responder 1 sees a question directed explicitly to Responder 2, Responder 1 may pass over it, even though Responder 1 could easily provide an answer sooner, and possibly even a better answer. Also, if Responder 2 is on vacation, your post may languish unattended to for weeks, or even forever. So it is best just to post your question without reference to a particular responder.
SODIQ OLAWALE OJODU

Join Date: Jul 2018

Posts: 33
#15

20 Jul 2018, 16:22

Thanks a lot