Question regarding diff in diff

Daniel Earp

Join Date: Dec 2017

Posts: 19
#1

15 Dec 2017, 05:03

Hello everyone, I have a problem with a study I'm trying to do.

I have panel data on 158 neighborhoods, from January 2006 to December 2016 (132 months in total). I aim to show the effect of the creation of primary medical facilities in some (not all, around 50%) of the neighborhoods on those neighborhoods' death rate from diabetes mellitus. The facilities started to be built in 2009, but it's an ongoing process, meaning that more are being created as I type this. I thought of using a diff-in-diff approach to separate the treated group from the untreated group, but my main problem is that the implementation was gradual: some facilities were created in February 2009, some in December, some in June 2010, and so on. Is it possible to do a "dynamic diff in diff" like this?

Also, I tried a regular panel regression with fixed effects for month and neighborhood. It looks like this:

Code:

xtreg taxa_obito1 taxa_cf1 i.meseano2, fe

("taxa_obito1" is the death rate from diabetes, and "taxa_cf1" is the rate of primary medical facilities in the neighborhood in a specific month and year.) The results are what I expected (and wanted) them to be: more primary medical facilities are associated with a lower death rate from diabetes (not saying anything about causality yet). However, when I run

Code:

xtgls taxa_obito1 taxa_cf1

the effect is positive! It says that more medical facilities are positively correlated with deaths from diabetes, which seems weird to me. I should note that many values of taxa_cf1 are zero, since not all neighborhoods have medical facilities and those that do didn't have them until a certain period. Could that be affecting the model? With this panel approach, should I just drop the neighborhoods that haven't received any facilities yet and focus on the ones that have?

I should say that the diff-in-diff approach seems more thorough to me, although I don't know how to get around the "implementation in different periods for different neighborhoods" problem.

Sorry if the post was sort of confusing. Any help would be appreciated.
Daniel Earp

Join Date: Dec 2017

Posts: 19
#2

15 Dec 2017, 09:26

Anybody?
German Rodriguez

Join Date: Feb 2017

Posts: 169
#3

15 Dec 2017, 10:12

Perhaps the problem was not clear, or not enough details were provided. I, for one, am not exactly sure what you mean by the "implementation in different periods for different neighborhoods" problem, which wouldn't be a problem if you know when the implementation occurred in each neighborhood.

I'll assume that's the case, that the data are in long format with one row per neighborhood month/year, and that taxa_cf1 is coded 1 if that neighborhood had a medical facility in that particular month and year, and 0 otherwise. I will also assume that you xtset the data with neighborhood as the grouping factor. As you can see, a few assumptions here.

If that's the case, I think your fixed-effects model is preferable. The fixed effect for neighborhood controls for persistent unobserved characteristics of each neighborhood, and the time dummies control for an overall time trend. The GLS approach does neither. In the fixed-effects model identification comes from neighborhoods that changed status over time, so you are already focusing on those.
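
To make the comparison concrete, here is a minimal sketch, assuming the data have been xtset with numbairro as the panel variable and meseano2 as the time variable (both names come from this thread; the areg line is only an illustration and was not run by anyone above). The within estimator from xtreg, fe and least squares with a dummy for every neighborhood should give the same slope on taxa_cf1, which is why the fixed-effects model already nets out persistent neighborhood differences.

Code:

* Sketch only: assumes numbairro identifies neighborhoods and meseano2 the month
xtset numbairro meseano2

* Within (fixed-effects) estimator with month dummies, as in post #1
xtreg taxa_obito1 taxa_cf1 i.meseano2, fe

* Same slope from least squares with the neighborhood dummies absorbed
areg taxa_obito1 taxa_cf1 i.meseano2, absorb(numbairro)

The xtgls command as written in post #1 includes neither the neighborhood effects nor the month dummies, which is one plausible reason its sign differs.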
Daniel Earp

Join Date: Dec 2017

Posts: 19
#4

15 Dec 2017, 11:04

First of all, thank you for answering me and for your patience, I'll try to explain myself better.

First, what I meant by "implementation in different periods for different neighborhoods" is that the treatment starts in different periods (year and month) in the treated neighborhoods: some received medical facilities in March 2010, some in May 2011, and so on. I'm not familiar with how to do a diff-in-diff in this case, with several treatment start dates.

Indeed I did

Code:

xtset numbairro meseano2

where numbairro is the number that identifies the neighborhood and meseano2 is the month and year (from January 2006 to December 2016). taxa_cf1 is the rate of medical facilities per capita in a given neighborhood and period, multiplied by 100,000, but I also have a variable like the one you described (coded 1 if the neighborhood had a medical facility in a particular month and year, and 0 otherwise).

So basically I'm between a panel regression with fixed effects for time and neighborhood that tries to capture the effect of an increase in the medical facilities rate per neighborhood on the death rate, or the diff-in-diff model with multiple time periods specified above.

Again, thank you for your patience, I'm new to Stata so still struggling with it...

Last edited by Daniel Earp; 15 Dec 2017, 11:06.
German Rodriguez

Join Date: Feb 2017

Posts: 169
#5

15 Dec 2017, 13:47

This may be a question of terminology, because there are situations where diff-in-diff and fixed-effects models are exactly equivalent. Perhaps a simple example where they give the same answer will help? You may also consider providing your own sample data; check out dataex, available from SSC and built into Stata 15.1. This will make Statalist members more likely to respond.

Consider the webuse data on the weight of pigs. We'll keep just two weeks, randomly pick some pigs to get a special diet on the second week, and make them gain a bit more weight:

Code:

webuse pig, clear
keep if week < 3
set seed 1234
gen diet = runiform() > .5 & week == 2
replace weight = weight + rnormal(10,15) if diet

First I'll fit a fixed effects model with group and time effects like yours. No xtset, just making everything explicit:

Code:

. quietly xtreg weight i.week diet, i(id) fe

. di _b[diet]
8.9272067

Looks like our pigs gained on average 9 units more when on the diet. Now let us compute a difference (weight gain), keep just one observation per pig, and run a simple regression:

Code:

. bysort id (week): gen diff = weight[2] - weight[1]

. drop if week == 1
(48 observations deleted)

. quietly reg diff diet

. di _b[diet]
8.9272067

We get exactly the same estimate. And this one is a diff-in-diff estimate because it is the difference in weight gain between those who went on the diet and those who didn't. It is also a fixed-effects estimator.
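
The same number can also be reached through the textbook group-by-period interaction. The following is a minimal sketch rather than part of the original example: it rebuilds the two-week data with the same seed (it has to run before week 1 is dropped), and treated is a hypothetical helper variable marking pigs that ever go on the diet. In a balanced two-period panel the interaction coefficient should reproduce the estimate above.

Code:

* Rebuild the two-week example with the same seed as above
webuse pig, clear
keep if week < 3
set seed 1234
gen diet = runiform() > .5 & week == 2
replace weight = weight + rnormal(10,15) if diet

* treated = 1 for pigs that ever go on the diet (hypothetical helper variable)
bysort id: egen byte treated = max(diet)

* The group-by-period interaction is the diff-in-diff estimate
quietly regress weight i.treated##i.week
di _b[1.treated#2.week]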

The nice thing about the fixed-effects approach is that it doesn't have to be balanced, and it can accommodate situations where the treatment starts at different times for different units, which I think is your concern. I am not sure how best to handle the number of facilities, but would be inclined to use a time-varying indicator of whether they had received facilities, analogous to the diet. But others may have different views.
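
As a rough sketch of such a time-varying indicator with staggered start dates: first_cf and has_cf are hypothetical names, with first_cf assumed to hold the period of a neighborhood's first facility on the same scale as meseano2, and to be missing for neighborhoods that never received one.

Code:

* Assumes the data are xtset by neighborhood (numbairro) and month (meseano2)
xtset numbairro meseano2

* has_cf switches from 0 to 1 in the month the first facility opens;
* first_cf is a hypothetical variable, missing for never-treated neighborhoods
gen byte has_cf = !missing(first_cf) & meseano2 >= first_cf

* Two-way fixed effects: each neighborhood has its own start date
xtreg taxa_obito1 has_cf i.meseano2, fe

Neighborhoods that never receive a facility keep has_cf = 0 throughout and still help estimate the month effects.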

Daniel Earp

Join Date: Dec 2017

Posts: 19
#6

15 Dec 2017, 19:05

Thanks for the answer! I have more questions if I could be so bold...

So do you recommend that I run

Code:

xtreg taxa_obito1 treat i.meseano2, fe

where treat = 1 if the neighborhood had a medical facility in a particular month and 0 otherwise? Will that give me results as if I were doing a diff-in-diff with multiple treatment periods?

As I said, taxa_obito1 is the death rate from diabetes (per capita, multiplied by 100,000) in a specific neighborhood in a given month and year, numbairro is the ID number of each neighborhood, and meseano2 is the month and year (e.g. March 2007). Since it runs from Jan 2006 to Dec 2016, there are 132 time observations. They look like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int numbairro double taxa_obito1 float meseano2 byte treat
1          0 16802 0
1          0 16833 0
1          0 16861 0
1          0 16892 0
1          0 16922 0
1          0 16953 0
1 .015821668 16983 0
1          0 17014 0
1          0 17045 0
1          0 17075 0
1          0 17106 0
1 .015821668 17136 0
1          0 17167 0
1          0 17198 0
1 .031643337 17226 0
1 .015821668 17257 0
1 .015821668 17287 0
1          0 17318 0
1          0 17348 0
1 .015821668 17379 0
1          0 17410 0
1          0 17440 0
1 .015821668 17471 0
1          0 17501 0
1          0 17532 0
1          0 17563 0
1 .015821668 17592 0
1          0 17623 0
1          0 17653 0
1          0 17684 0
1 .015821668 17714 0
1          0 17745 0
1          0 17776 0
1          0 17806 0
1 .015821668 17837 0
1          0 17867 0
1 .047465005 17898 0
1          0 17929 0
1          0 17957 0
1          0 17988 0
1          0 18018 0
1          0 18049 0
1          0 18079 0
1 .031643337 18110 0
1          0 18141 0
1 .015821668 18171 0
1          0 18202 0
1          0 18232 0
1          0 18263 0
1          0 18294 0
1 .015821668 18322 0
1          0 18353 0
1          0 18383 0
1          0 18414 0
1 .015821668 18444 0
1 .015821668 18475 0
1          0 18506 0
1          0 18536 0
1          0 18567 0
1 .015821668 18597 0
1          0 18628 0
1 .015821668 18659 0
1          0 18687 0
1 .015821668 18718 0
1          0 18748 0
1          0 18779 0
1          0 18809 0
1 .015821668 18840 0
1          0 18871 0
1          0 18901 0
1          0 18932 0
1 .015821668 18962 0
1          0 18993 0
1          0 19024 0
1          0 19053 0
1          0 19084 0
1          0 19114 0
1          0 19145 0
1          0 19175 0
1          0 19206 0
1          0 19237 0
1          0 19267 0
1          0 19298 0
1          0 19328 0
1          0 19359 0
1          0 19390 0
1 .015821668 19418 0
1          0 19449 0
1          0 19479 0
1          0 19510 0
1          0 19540 0
1          0 19571 0
1          0 19602 0
1          0 19632 0
1          0 19663 0
1          0 19693 0
1 .015821668 19724 0
1          0 19755 0
1          0 19783 0
1          0 19814 0
end
format %tm meseano2

This preview shows only the first neighborhood, but the data keep going. This particular neighborhood (number 1) is very small and has very few deaths, hence the number of zeroes. This is not the rule, though: many neighborhoods have far more deaths in any given period, with more variability.
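
One detail worth checking in the listing: the meseano2 values look like daily dates (16802 is 1 January 2006 on Stata's daily scale, and 16833 is 1 February 2006), while the variable carries a %tm display format. A minimal sketch of a conversion, assuming that reading is right; mes is a hypothetical name for the new variable.

Code:

* Convert the daily date (first day of each month) to a monthly date
gen mes = mofd(meseano2)
format mes %tm

* Declare the panel on the monthly scale so periods are one unit apart
xtset numbairro mes

The month dummies in i.meseano2 work either way; the conversion matters mainly for how the dates display and for time-series operators such as lags.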

Again, thanks for taking your time to help me!


Daniel Earp

Join Date: Dec 2017

Posts: 19
#7

16 Dec 2017, 05:39

I found this answer on Statalist by Jeff Wooldridge to a problem I think was similar to mine, and attempted to do what he said. I ran

Code:

xtreg taxa_obito1 treat i.meseano2, fe

Here is a sample of the regression output:

Code:

  xtreg taxa_obito1 treat i.meseano2, fe

Fixed-effects (within) regression               Number of obs     =     20,856
Group variable: numbairro                       Number of groups  =        158

R-sq:                                           Obs per group:
     within  = 0.0203                                         min =        132
     between = 0.1188                                         avg =      132.0
     overall = 0.0031                                         max =        132

                                                F(132,20566)      =       3.22
corr(u_i, Xb)  = -0.0323                        Prob > F          =     0.0000

------------------------------------------------------------------------------
 taxa_obito1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       treat |  -.0010505    .000554    -1.90   0.058    -.0021364    .0000355
             |
    meseano2 |
      16833  |  -.0011015   .0021561    -0.51   0.609    -.0053276    .0031245
      16861  |  -.0024033   .0021561    -1.11   0.265    -.0066293    .0018228
      16892  |  -.0018025   .0021561    -0.84   0.403    -.0060285    .0024236
      16922  |   .0021029   .0021561     0.98   0.329    -.0021232    .0063289
      16953  |   .0008011   .0021561     0.37   0.710     -.003425    .0050271

I ran testparm i.meseano2 and it shows Prob > F = 0.0000, so I'm guessing I was correct to include i.meseano2 in the xtreg.

However, when I run the same regression with clustered standard errors, this is what happens:

Code:

xtreg taxa_obito1 treat i.meseano2, fe cluster(numbairro)

Fixed-effects (within) regression               Number of obs     =     20,856
Group variable: numbairro                       Number of groups  =        158

R-sq:                                           Obs per group:
     within  = 0.0203                                         min =        132
     between = 0.1188                                         avg =      132.0
     overall = 0.0031                                         max =        132

                                                F(132,157)        =      14.01
corr(u_i, Xb)  = -0.0323                        Prob > F          =     0.0000

                            (Std. Err. adjusted for 158 clusters in numbairro)
------------------------------------------------------------------------------
             |               Robust
 taxa_obito1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       treat |  -.0010505   .0011209    -0.94   0.350    -.0032645    .0011635
             |
    meseano2 |
      16833  |  -.0011015   .0025178    -0.44   0.662    -.0060747    .0038717
      16861  |  -.0024033   .0021528    -1.12   0.266    -.0066555    .0018489
      16892  |  -.0018025   .0018852    -0.96   0.340    -.0055261    .0019211
      16922  |   .0021029   .0023681     0.89   0.376    -.0025746    .0067804
      16953  |   .0008011   .0021368     0.37   0.708    -.0034195    .0050217
      16983  |   .0036049   .0021562     1.67   0.097     -.000654    .0078638

The coefficient remains the same, but the p-value changed from a borderline one (0.058) to a high one (0.350). What is the main difference between these two models? Am I on the right path here?


German Rodriguez

Join Date: Feb 2017

Posts: 169
#8

16 Dec 2017, 08:01

This seems to be the same advice I gave you yesterday: a fixed-effects model that is equivalent to diff-in-diff with multiple time periods. The choice of predictor depends on whether you view the treatment as adding a medical facility or as increasing the ratio of facilities to population. Clustering the standard errors produces a robust variance estimator that allows for correlation among observations from the same neighborhood, so I am not surprised the standard errors differ; the point estimate is unchanged. Something else you might consider, if you have the actual counts of deaths, is a fixed-effects Poisson model with the death counts as the outcome and the population as exposure, a standard approach to count data.
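
For a side-by-side view of the two sets of standard errors, something along these lines may help. It is a sketch using the variable names from the posts above; conv and clust are arbitrary names for the stored results.

Code:

* Conventional standard errors
quietly xtreg taxa_obito1 treat i.meseano2, fe
estimates store conv

* Cluster-robust standard errors by neighborhood
quietly xtreg taxa_obito1 treat i.meseano2, fe vce(cluster numbairro)
estimates store clust

* Same coefficient on treat in both columns, different standard errors
estimates table conv clust, keep(treat) b(%9.7f) se(%9.7f)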

Daniel Earp

Join Date: Dec 2017

Posts: 19
#9

16 Dec 2017, 18:08

Thanks for the answer. This is the model I came up with, and I would very much appreciate opinions on it. I defined the dummy variables trat1, trat2, trat3, ..., trat9 so that trat1 = 1 if the given neighborhood received its first medical facility in a certain time period and 0 otherwise, and so on; i.e., trati = 1 if the given neighborhood received its ith medical facility in a certain time period and 0 otherwise. In some cases more than one facility was installed in the same time period, which means that a neighborhood could have, for instance, trat3 = 1 and trat4 = 1 in the same period.

From what I can tell, I'm basically distinguishing the treatments by saying that having another facility is a different type of treatment. My goal was to capture the effect of extra facilities instead of the "has facility / doesn't have facility" logic. Is this reasonable?

I changed to yearly dates to simplify my reasoning and because I don't expect new medical facilities to have an immediate impact in the following months.

Code:

xtreg taxa_obito1 trat1 trat2 trat3 trat4 trat5 trat6 trat7 trat8 trat9 i.ano, fe

Fixed-effects (within) regression               Number of obs     =      1,738
Group variable: numbairro                       Number of groups  =        158

R-sq:                                           Obs per group:
     within  = 0.0910                                         min =         11
     between = 0.4553                                         avg =       11.0
     overall = 0.0138                                         max =         11

                                                F(19,1561)        =       8.23
corr(u_i, Xb)  = -0.1929                        Prob > F          =     0.0000

------------------------------------------------------------------------------
 taxa_obito1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       trat1 |  -.0101113   .0106687    -0.95   0.343    -.0310378    .0108152
       trat2 |  -.0006348   .0188673    -0.03   0.973    -.0376428    .0363731
       trat3 |  -.0297648   .0405268    -0.73   0.463    -.1092575    .0497279
       trat4 |  -.0830819   .0497743    -1.67   0.095    -.1807135    .0145498
       trat5 |  -.1731677   .0585419    -2.96   0.003    -.2879967   -.0583386
       trat6 |  -.0021817   .0616618    -0.04   0.972    -.1231304     .118767
       trat7 |   -.007754   .0840263    -0.09   0.926    -.1725704    .1570623
       trat8 |  -.1134312   .0840175    -1.35   0.177    -.2782303     .051368
       trat9 |   -.394849   .0840684    -4.70   0.000    -.5597479   -.2299502
             |
         ano |
       2007  |   .0142195   .0088531     1.61   0.108    -.0031457    .0315846
       2008  |    .022631   .0088531     2.56   0.011     .0052658    .0399962
       2009  |   .0309341   .0088554     3.49   0.000     .0135644    .0483038
       2010  |   .0561215   .0089108     6.30   0.000      .038643       .0736
       2011  |   .0294726   .0089858     3.28   0.001     .0118471    .0470981
       2012  |   .0085764   .0089047     0.96   0.336    -.0088901    .0260428
       2013  |   .0009944   .0088596     0.11   0.911    -.0163836    .0183724
       2014  |  -.0092809   .0088545    -1.05   0.295    -.0266488    .0080871
       2015  |  -.0068273   .0088772    -0.77   0.442    -.0242399    .0105852
       2016  |   .0052042    .009007     0.58   0.563    -.0124629    .0228714
             |
       _cons |   .2406295   .0062601    38.44   0.000     .2283505    .2529086
-------------+----------------------------------------------------------------
     sigma_u |  .33445532
     sigma_e |  .07868779
         rho |  .94755056   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(157, 1561) = 174.91                 Prob > F = 0.0000

Does this model make sense?


German Rodriguez

Join Date: Feb 2017

Posts: 169
#10

18 Dec 2017, 07:34

I would recommend using a fixed-effects Poisson model with monthly death counts as outcome and population as exposure, as noted earlier. Your predictor could be the number of facilities in existence each month treated as a factor variable. This gives you diff-in-diff estimates. Your latest proposal appears to aggregate the data by combining months with different conditions, which may dilute effects. It is not clear to me how your treatment variables are coded and how they change over time. So I don't see the advantages of this approach.
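
A minimal sketch of that suggestion follows. The variable names obitos (monthly death count), pop (neighborhood population) and n_cf (number of facilities in place that month) are hypothetical, since none of them appears in the data shown in this thread.

Code:

* obitos, pop and n_cf are hypothetical names for the monthly death count,
* the neighborhood population, and the number of facilities in place
xtset numbairro meseano2

* Conditional fixed-effects Poisson; i.n_cf gives one effect per facility count,
* exposure(pop) turns counts into rates, irr reports incidence-rate ratios
xtpoisson obitos i.n_cf i.meseano2, fe exposure(pop) vce(robust) irr

Neighborhoods with no deaths in any month carry no information in the conditional fixed-effects Poisson model and are dropped from the estimation.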