  • understanding a treatment effect with group fixed effects

    I am trying to understand why the estimated effect of a treatment variable in a regression changes when a) group fixed effects are included and b) the base group changes.

    I am using a sample of this dataset from an experiment (https://www.aeaweb.org/articles?id=1...aer.102.7.3317). The -dataex- of my subset is at the end of this post. Basically, subjects are assigned to groups, and groups are assigned to either treatment or control. The variable "y" is some response variable that subjects choose over time.

    Pooling the data and running a random-effects regression aimed at finding the average treatment effect results in:

    Code:
    . xtreg y round i.treatment i.group_id, re baselevels
    note: 305.group_id omitted because of collinearity
    
    Random-effects GLS regression                   Number of obs     =        150
    Group variable: subject                         Number of groups  =         30
    
    R-sq:                                           Obs per group:
         within  = 0.0270                                         min =          5
         between = 0.7425                                         avg =        5.0
         overall = 0.5514                                         max =          5
    
                                                    Wald chi2(10)     =      60.96
    corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
    
    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           round |        -.6   .3301812    -1.82   0.069    -1.247143    .0471432
                 |
       treatment |
              0  |          0  (base)
              1  |  -6.666667   4.341019    -1.54   0.125    -15.17491    1.841574
                 |
        group_id |
            101  |          0  (base)
            102  |   9.333333   4.341019     2.15   0.032     .8250928    17.84157
            103  |  -6.666667   4.341019    -1.54   0.125    -15.17491    1.841574
            104  |   10.66667   4.341019     2.46   0.014     2.158426    19.17491
            105  |  -2.666667   4.341019    -0.61   0.539    -11.17491    5.841574
            301  |   6.666667   4.341019     1.54   0.125    -1.841574    15.17491
            302  |        -12   4.341019    -2.76   0.006    -20.50824   -3.491759
            303  |         -8   4.341019    -1.84   0.065    -16.50824    .5082406
            304  |   6.666667   4.341019     1.54   0.125    -1.841574    15.17491
            305  |          0  (omitted)
                 |
           _cons |   15.13333    3.22543     4.69   0.000     8.811607    21.45506
    -------------+----------------------------------------------------------------
         sigma_u |  4.6610611
         sigma_e |  5.7189057
             rho |  .39913545   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    where one group is the base level and another group is dropped to avoid collinearity with the treatment indicator.

    Now, if I change which group is the base level, the treatment coefficient takes different values (some positive, some negative, some significant, some not):

    Code:
    . qui levelsof group, local(groups)
    
    . foreach group in `groups' {
      2.         qui xtreg y round i.treatment ib(`group').group_id, re 
      3.         di "estimated coefficient = " _b[1.treatment] 
      4. }
    estimated coefficient = -6.6666667
    estimated coefficient = 2.6666667
    estimated coefficient = -13.333333
    estimated coefficient = 4
    estimated coefficient = -9.3333333
    estimated coefficient = -16
    estimated coefficient = 2.6666667
    estimated coefficient = -1.3333333
    estimated coefficient = -16
    estimated coefficient = -9.3333333
    Why does this happen? And is it because there are large differences between the group means?

    Code:
    . tabstat y, by(group_id) stats(mean sd) nototal
    
    Summary for variables: y
         by categories of: group_id 
    
    group_id |      mean        sd
    ---------+--------------------
         101 |  6.666667  9.759001
         102 |        16  8.280787
         103 |         0         0
         104 |  17.33333  7.037316
         105 |         4  8.280787
         301 |        20         0
         302 |  1.333333  5.163978
         303 |  5.333333  9.154754
         304 |        20         0
         305 |  13.33333  9.759001
    ------------------------------
    Here is the data:

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte treatment int group_id float subject byte(round y)
    1 101 1101 1 20
    1 101 1101 2  0
    1 101 1101 3 20
    1 101 1101 4  0
    1 101 1101 5  0
    1 102 1102 1  0
    1 102 1102 2 20
    1 102 1102 3 20
    1 102 1102 4 20
    1 102 1102 5 20
    1 103 1103 1  0
    1 103 1103 2  0
    1 103 1103 3  0
    1 103 1103 4  0
    1 103 1103 5  0
    1 104 1104 1 20
    1 104 1104 2 20
    1 104 1104 3 20
    1 104 1104 4 20
    1 104 1104 5 20
    1 105 1105 1  0
    1 105 1105 2  0
    1 105 1105 3  0
    1 105 1105 4  0
    1 105 1105 5  0
    0 301 1301 1 20
    0 301 1301 2 20
    0 301 1301 3 20
    0 301 1301 4 20
    0 301 1301 5 20
    0 302 1302 1  0
    0 302 1302 2 20
    0 302 1302 3  0
    0 302 1302 4  0
    0 302 1302 5  0
    0 303 1303 1  0
    0 303 1303 2  0
    0 303 1303 3  0
    0 303 1303 4  0
    0 303 1303 5  0
    0 304 1304 1 20
    0 304 1304 2 20
    0 304 1304 3 20
    0 304 1304 4 20
    0 304 1304 5 20
    0 305 1305 1 20
    0 305 1305 2 20
    0 305 1305 3 20
    0 305 1305 4 20
    0 305 1305 5 20
    1 101 2101 1 20
    1 101 2101 2  0
    1 101 2101 3 20
    1 101 2101 4 20
    1 101 2101 5  0
    1 102 2102 1 20
    1 102 2102 2 20
    1 102 2102 3 20
    1 102 2102 4 20
    1 102 2102 5 20
    1 103 2103 1  0
    1 103 2103 2  0
    1 103 2103 3  0
    1 103 2103 4  0
    1 103 2103 5  0
    1 104 2104 1  0
    1 104 2104 2  0
    1 104 2104 3 20
    1 104 2104 4 20
    1 104 2104 5 20
    1 105 2105 1  0
    1 105 2105 2  0
    1 105 2105 3  0
    1 105 2105 4  0
    1 105 2105 5  0
    0 301 2301 1 20
    0 301 2301 2 20
    0 301 2301 3 20
    0 301 2301 4 20
    0 301 2301 5 20
    0 302 2302 1  0
    0 302 2302 2  0
    0 302 2302 3  0
    0 302 2302 4  0
    0 302 2302 5  0
    0 303 2303 1 20
    0 303 2303 2  0
    0 303 2303 3 20
    0 303 2303 4  0
    0 303 2303 5  0
    0 304 2304 1 20
    0 304 2304 2 20
    0 304 2304 3 20
    0 304 2304 4 20
    0 304 2304 5 20
    0 305 2305 1 20
    0 305 2305 2 20
    0 305 2305 3 20
    0 305 2305 4 20
    0 305 2305 5 20
    1 101 3101 1  0
    1 101 3101 2  0
    1 101 3101 3  0
    1 101 3101 4  0
    1 101 3101 5  0
    1 102 3102 1 20
    1 102 3102 2 20
    1 102 3102 3 20
    1 102 3102 4  0
    1 102 3102 5  0
    1 103 3103 1  0
    1 103 3103 2  0
    1 103 3103 3  0
    1 103 3103 4  0
    1 103 3103 5  0
    1 104 3104 1 20
    1 104 3104 2 20
    1 104 3104 3 20
    1 104 3104 4 20
    1 104 3104 5 20
    1 105 3105 1 20
    1 105 3105 2 20
    1 105 3105 3 20
    1 105 3105 4  0
    1 105 3105 5  0
    0 301 3301 1 20
    0 301 3301 2 20
    0 301 3301 3 20
    0 301 3301 4 20
    0 301 3301 5 20
    0 302 3302 1  0
    0 302 3302 2  0
    0 302 3302 3  0
    0 302 3302 4  0
    0 302 3302 5  0
    0 303 3303 1 20
    0 303 3303 2  0
    0 303 3303 3 20
    0 303 3303 4  0
    0 303 3303 5  0
    0 304 3304 1 20
    0 304 3304 2 20
    0 304 3304 3 20
    0 304 3304 4 20
    0 304 3304 5 20
    0 305 3305 1  0
    0 305 3305 2  0
    0 305 3305 3  0
    0 305 3305 4  0
    0 305 3305 5  0
    end
    ------------------ copy up to and including the previous line ------------------


    Last edited by Guy Tournesol; 24 Nov 2018, 11:49.

  • #2
    It happens because your model is not identifiable. There is perfect collinearity among your treatment variable and the group variables. So none of those effects is separately identifiable in your model. Stata identifies the model by picking one of the collinear variables to omit, but depending on which one it picks, the coefficients of everything else involved in the collinearity change. None of the outputs you have gotten actually represents a treatment effect, nor, for that matter, an effect of a given group. In fact, no treatment effect can ever be identified from this model, nor from any model that includes group-level fixed effects and the treatment variable.
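
    To make the collinearity concrete, here is a minimal sketch, assuming the -dataex- example from #1 is loaded in memory. Every group sits entirely in one arm, so the treatment indicator is the sum of the indicators for groups 101 to 105. With group 101 as the base and 305 omitted, the number reported for 1.treatment in #1 should be nothing more than the gap between those two groups' means (6.666667 and 13.33333 in the -tabstat- output), which is why it moves whenever the base group changes.

    ```stata
    * Sketch only: assumes the -dataex- data from #1 are in memory.
    * Each group_id appears in exactly one column, so treatment is a
    * linear combination of the group indicators.
    tabulate group_id treatment

    * Gap between the base group (101) and the omitted group (305).
    quietly summarize y if group_id == 101
    scalar m_base = r(mean)
    quietly summarize y if group_id == 305
    scalar m_omitted = r(mean)
    scalar gap = m_base - m_omitted
    display "mean(101) - mean(305) = " gap
    ```

    The displayed gap can be compared with the coefficient on 1.treatment in the first regression of #1. The same comparison with other base groups should reproduce the other values printed by the loop.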

    In addition to that, although this is not related directly to the problem you are discussing here, your model is misspecifying a three-level data structure as if it were only two levels. You fail to account for the nesting of subjects within groups.

    As best I can guess from your data example, though you do not really describe how these data were collected, you then have repeated assessments of y over several "rounds." Round is also probably not properly specified in your model because you have treated it as a continuous variable. Now, if you have reason to believe that there is a single fixed increment to y associated with the progression of each successive round, then your modeling of round is appropriate, but if the effect of round is not like that, then this, too, is incorrect. My best guess is that your model needs to look more like this:

    Code:
    mixed y i.treatment i.round || group_id: || subject:
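
    One way to check whether a single linear increment per round is adequate is to fit both versions and compare them. The following is only a sketch under the assumption that the data from #1 are in memory and that both models converge, which is not guaranteed with so few groups. -mixed- fits by maximum likelihood by default, so a likelihood-ratio test of the nested fixed-effects specifications is available.

    ```stata
    * Sketch only: assumes the -dataex- data from #1 are in memory.
    * Round entered as a single linear trend.
    mixed y i.treatment c.round || group_id: || subject:
    estimates store linear_round

    * Round entered as a set of indicators, one per round.
    mixed y i.treatment i.round || group_id: || subject:
    estimates store categorical_round

    * Likelihood-ratio test of the linear trend against the indicators.
    lrtest linear_round categorical_round
    ```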



    • #3
      Thanks, Clyde. Let me make sure I understand you. I thought group fixed effects controlled for the fact that a subject is nested inside a group, which itself is nested inside a treatment/control. But this is not the case simply because including the treatment and group indicators leads to multicollinearity?

      Oh, and one follow up. Using -mixed- the way you did implies including random effects at the subject and group level. Is there a way one can cluster the standard errors within the group? Edit: yes, just use -cluster(group)- at the end of the call. Sorry.
      Last edited by Guy Tournesol; 24 Nov 2018, 12:58. Reason: Found answer to one of the questions



      • #4
        I thought group fixed effects controlled for the fact that a subject is nested inside a group, which itself is nested inside a treatment/control. But this is not the case simply because including the treatment and group indicators leads to multicollinearity?
        Well, basically that's right. In this case, had you not needed to identify a treatment effect in your model, the use of i.group as fixed effects would have been one way of representing subjects within groups. (In fact, if the number of groups is small, it would be the preferred way.) But, even in that situation, one has to be careful about how the variables are coded. The way you have done it, each subject has a unique id number, so having i.group in the model with a random effect at the subject level would accomplish this. So in your situation, but for the treatment effect issue, you could have used group-level fixed effects in the way you tried. But sometimes people code the subjects as 1, 2, 3, 4, 5,... and re-use those numbers in every group. In that case, it would represent group crossed with subject, and that would be a mis-specification for the data description you gave.
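
        The coding issue can be checked directly. This is a minimal sketch using the variable names from #1; the names reused and subject_uid are made up for illustration. It flags subject ids that appear in more than one group and, if any are found, builds an identifier that is unique within group.

        ```stata
        * Sketch only: variable names follow the example in #1.
        * Flag subject ids that show up under more than one group_id.
        bysort subject (group_id): generate byte reused = group_id[1] != group_id[_N]
        count if reused

        * If the count is not zero, create an id unique to each
        * group-subject pair and use it in place of subject.
        egen long subject_uid = group(group_id subject)
        ```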



        • #5
          Clyde Schechter Hello Professor Schechter and everyone,

          I have a query regarding the inclusion of fixed effects. I read Professor Schechter’s advice regarding fixed effects and I think I have a similar setup.

          My setup: I have pooled cross-sectional data on manufacturing units from 2011 to 2022. Each unit is classified by a 5-digit industrial code, with the first 2 digits denoting the broad manufacturing sector. Treatment is assigned at the sector-level. Units in the treated sector (identified by 2-digit code 18) are treated from 2015 onwards. My control group consists of units in sector identified by code 19. I currently include year fixed effects.

          Code:
           reg y treated##post i.sector i.year
          I’m unsure whether to include sector fixed effects in this specification. If I include sector fixed effects, my sample data in Stata will look like this:
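          Unit | Year | Sector | 5-digit code | Treated | Post | Treated*Post | Sector Binary Variable | Year Binary Variable
          -----+------+--------+--------------+---------+------+--------------+------------------------+---------------------
          A    | 2014 | 18     | 18101        | 1       | 0    | 0            | 0                      | 0
          B    | 2015 | 18     | 18201        | 1       | 1    | 1            | 0                      | 1
          C    | 2014 | 19     | 19101        | 0       | 0    | 0            | 1                      | 0
          D    | 2015 | 19     | 19301        | 0       | 1    | 0            | 1                      | 1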
          Note: Reference category for sector binary variable is sector 18. Reference category for year binary variable is 2014.

          As I understand it, including fixed effects for a category controls for observable and unobservable variables that stay constant within that category. In my case, treatment does not vary within a sector over the entire period: the treated variable is constant within each sector and is perfectly predicted by the sector binary variable. Further, for sector 19, the sector binary variable also predicts treated*post: whenever the sector binary variable is 1, treated*post is 0.

          1) I am not sure whether I can correctly identify the coefficient on treated*post when I include sector fixed effects. 2) If not, how can I control for baseline sectoral differences between the treated and control groups? If I remove sector fixed effects, I get results that are far from the difference between the pre and post means of the treated and control groups.

          Thank you very much.
          Last edited by Chinmay Korgaonkar; 14 Apr 2025, 21:11.



          • #6
            Because all of your treatment group begins treatment at the same time, you can either just run -regress outcome i.treated##i.post, cluster(sector)- with no fixed effects for sector or year, or you can do a two-way fixed effects analysis, -xtreg outcome i.treated##i.post i.year, fe cluster(sector)-. Either way you will get exactly the same DID estimate of the causal effect (i.e. the coefficient of 1.treated#1.post). Their standard errors will differ slightly, but usually not very much. Here's an illustration:
            Code:
            . clear*
            
            .
            . webuse grunfeld
            
            .
            . gen byte treated = 1.company
            
            . gen byte post = (year >= 1945)
            
            .
            . regress mvalue i.treated##i.post, cluster(company)
            
            Linear regression                               Number of obs     =        200
                                                            F(1, 9)           =          .
                                                            Prob > F          =          .
                                                            R-squared         =     0.6875
                                                            Root MSE          =     740.41
            
                                           (Std. err. adjusted for 10 clusters in company)
            ------------------------------------------------------------------------------
                         |               Robust
                  mvalue | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
               1.treated |   3434.714     252.73    13.59   0.000     2862.999    4006.429
                  1.post |   89.10622   42.46387     2.10   0.065    -6.953736    185.1662
                         |
            treated#post |
                    1 1  |   357.6037   42.46387     8.42   0.000     261.5438    453.6637
                         |
                   _cons |   675.7764     252.73     2.67   0.025     104.0614    1247.491
            ------------------------------------------------------------------------------
            
            .
            . xtreg mvalue i.treated##i.post i.year, fe cluster(company)
            note: 1.treated omitted because of collinearity.
            note: 1954.year omitted because of collinearity.
            
            Fixed-effects (within) regression               Number of obs     =        200
            Group variable: company                         Number of groups  =         10
            
            R-squared:                                      Obs per group:
                 Within  = 0.3567                                         min =         20
                 Between = 0.7327                                         avg =       20.0
                 Overall = 0.1303                                         max =         20
            
                                                            F(9, 9)           =          .
            corr(u_i, Xb) = 0.2158                          Prob > F          =          .
            
                                           (Std. err. adjusted for 10 clusters in company)
            ------------------------------------------------------------------------------
                         |               Robust
                  mvalue | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
               1.treated |          0  (omitted)
                  1.post |   694.7106   239.2749     2.90   0.017     153.4331    1235.988
                         |
            treated#post |
                    1 1  |   357.6037   44.43458     8.05   0.000     257.0857    458.1218
                         |
                    year |
                   1936  |    372.103   169.3019     2.20   0.056    -10.88448    755.0905
                   1937  |    644.819   274.0373     2.35   0.043     24.90364    1264.734
                   1938  |    139.671   109.7413     1.27   0.235    -108.5812    387.9231
                   1939  |    372.783   151.4932     2.46   0.036      30.0816    715.4844
                   1940  |    426.546   177.8517     2.40   0.040     24.21748    828.8745
                   1941  |    380.477   171.3888     2.22   0.054    -7.231496    768.1855
                   1942  |    173.375   93.12809     1.86   0.096    -37.29535    384.0454
                   1943  |    287.693   114.9123     2.50   0.034     27.74334    547.6427
                   1944  |    320.301   132.9679     2.41   0.039     19.50677    621.0952
                   1945  |   -297.837   107.6743    -2.77   0.022    -541.4132   -54.26086
                   1946  |   -233.978    97.5294    -2.40   0.040    -454.6048   -13.35119
                   1947  |   -510.645   213.5307    -2.39   0.040     -993.685   -27.60498
                   1948  |   -540.962   241.4719    -2.24   0.052    -1087.209    5.285364
                   1949  |   -520.388   207.2732    -2.51   0.033    -989.2725   -51.50354
                   1950  |   -460.887    198.712    -2.32   0.046    -910.4047   -11.36931
                   1951  |   -229.858   125.5264    -1.83   0.100    -513.8184    54.10237
                   1952  |    -183.56   95.10005    -1.93   0.086    -398.6913    31.57122
                   1953  |   39.83903   91.00977     0.44   0.672    -166.0394    245.7174
                   1954  |          0  (omitted)
                         |
                   _cons |    707.471   122.6399     5.77   0.000     430.0403    984.9017
            -------------+----------------------------------------------------------------
                 sigma_u |  1286.8485
                 sigma_e |  295.51883
                     rho |  .94990486   (fraction of variance due to u_i)
            ------------------------------------------------------------------------------
            In the simple -regress- model with no sector or year effects, the terms for non-interacted treated and post variables will be retained in the model. If desired, you can interpret the coefficient of treated as the pre-treatment mean difference in outcome level between the treated and untreated groups, and the coefficient of post as the difference between the mean outcome levels before and after treatment in the untreated group.

            If you use the two-way fixed effects analysis with exactly the code I referred to, the non-interacted treated variable will be omitted, and the 1.post variable will be retained. But there will be an extra year indicator variable omitted from the analysis. And no interpretation at all can be given to the coefficient of 1.post. (There are other ways to code the same two-way fixed effects models, and if you use one of those, you might have a different set of variables retained and omitted. But regardless, the treated#post interaction coefficient will always be the same, and no interpretation at all can be assigned to the coefficients of whichever of the 1.treated, 1.post, and sector or year indicator variables are retained in the model.)
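
            The interaction coefficient can also be rebuilt by hand from the four cell means, which shows why it does not depend on how the rest of the model is coded. This is a minimal sketch on the same -grunfeld- example; the scalar names m00 to m11 and did are made up for illustration, and treated is generated with (company == 1), which is meant to be equivalent to the 1.company expression used above. Because the simple -regress- model is saturated in treated and post, the result should match the 357.6037 reported for 1.treated#1.post.

            ```stata
            * Sketch only: rebuilds the 2x2 cell means behind the DID estimate.
            webuse grunfeld, clear
            generate byte treated = (company == 1)
            generate byte post    = (year >= 1945)

            * Mean of mvalue in each treated-by-post cell, stored as m00 ... m11
            * (first digit = treated, second digit = post).
            forvalues t = 0/1 {
                forvalues p = 0/1 {
                    quietly summarize mvalue if treated == `t' & post == `p'
                    scalar m`t'`p' = r(mean)
                }
            }

            * Change in the treated group minus change in the untreated group.
            scalar did = (m11 - m10) - (m01 - m00)
            display "DID from cell means = " did
            ```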



            • #7
              To me, this looks like common timing with repeated cross sections, not panel data. There appear to be two sectors, 18 and 19, with sector 18 treated. It doesn't matter whether the sector dummy variable is one for sector 18 or 19, but once you have defined "treat," you can use treat rather than the sector dummy. While you can do this using hdidregress in Stata 18 and 19 or jwdid, you can also do it "by hand." You have to have year dummies, yr2015, ..., yr2022. The second command gives the average effect:

              Code:
              reg y i.treat i.year c.treat#c.yr2015 c.treat#c.yr2016 c.treat#c.yr2017 c.treat#c.yr2018 c.treat#c.yr2019 c.treat#c.yr2020 c.treat#c.yr2021 c.treat#c.yr2022, vce(robust)
              lincom  (c.treat#c.yr2015 + c.treat#c.yr2016 + c.treat#c.yr2017 + c.treat#c.yr2018 + c.treat#c.yr2019 + c.treat#c.yr2020 + c.treat#c.yr2021 + c.treat#c.yr2022)/8
              Or
              Code:
              gen cohort = 0
              replace cohort = 2015 if treat
              jwdid y, tvar(year) gvar(cohort)
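
              The "by hand" regression needs the year dummies yr2015, ..., yr2022 to exist first. A minimal sketch for creating them, assuming a numeric variable named year that covers 2015 to 2022 and a 0/1 indicator named treat, as in the commands above:

              ```stata
              * Sketch only: assumes a numeric year variable and that
              * yr2015 ... yr2022 do not already exist in the dataset.
              forvalues t = 2015/2022 {
                  generate byte yr`t' = (year == `t')
              }
              ```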



              • #8
                Clyde Schechter and Jeff Wooldridge Hello Professor Clyde and Professor Wooldridge, Thank you very much for your time and detailed advice.

                I first used i.time##i.treated and then jwdid. As I understand it, i.treated in the first setup is equivalent to including sector fixed effects. I hope this is fine given that the treatment varies at the sector level (sector 18 is treated and 19 is control). I cannot include unit fixed effects as my data is pooled cross-sectional data.

                I get similar results in both setups. These are along expected lines given the difference between the pre and post means for the treated and control groups. [Although the standard errors are very large compared to the point estimate.]

                [I also use a wild cluster bootstrap-t procedure because of the small number of clusters with highly unequal sizes.]

                Thank you very much once again. Sorry for the delayed reply as I was preoccupied with another assignment.

                Code:
                generate time=0
                replace time=1 if year >= 2015
                generate treated=0
                replace treated=1 if sector== 18
                
                reghdfe Total_workers_ln i.time##i.treated [pweight=sampling_weight], absorb(state_cd) vce(cluster five_
                > digit_ind_code) 
                (MWFE estimator converged in 1 iterations)
                
                HDFE Linear regression                            Number of obs   =     23,368
                Absorbing 1 HDFE group                            F(   3,     23) =       7.24
                Statistics robust to heteroskedasticity           Prob > F        =     0.0014
                                                                  R-squared       =     0.1067
                                                                  Adj R-squared   =     0.1056
                                                                  Within R-sq.    =     0.0267
                Number of clusters (five_digit_ind_code) =         24Root MSE     =     1.3527
                
                                   (Std. err. adjusted for 24 clusters in five_digit_ind_code)
                ------------------------------------------------------------------------------
                             |               Robust
                Total_work~n | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                      1.time |   .0661731   .0911284     0.73   0.475    -.1223403    .2546866
                   1.treated |    .505581   .2561522     1.97   0.061    -.0243103    1.035472
                             |
                time#treated |
                        1 1  |   .0117752   .0941938     0.13   0.902    -.1830796      .20663
                             |
                       _cons |   3.335856   .2455563    13.58   0.000     2.827884    3.843828
                ------------------------------------------------------------------------------
                Code:
                gen cohort = 0
                replace cohort = 2015 if treated
                
                jwdid Total_workers_ln [pweight=sampling_weight], fevar(state_cd) tvar(year) gvar(cohort) cluster(five_digit_ind_code)
                WARNING: Singleton observations not dropped; statistical significance is biased (link)
                (MWFE estimator converged in 5 iterations)
                
                HDFE Linear regression                            Number of obs   =     23,368
                Absorbing 3 HDFE groups                           F(   2,     23) =       1.48
                Statistics robust to heteroskedasticity           Prob > F        =     0.2490
                                                                  R-squared       =     0.1070
                                                                  Adj R-squared   =     0.1057
                                                                  Within R-sq.    =     0.0001
                Number of clusters (five_digit_ind_code) =         24Root MSE     =     1.3526
                
                                           (Std. err. adjusted for 24 clusters in five_digit_ind_code)
                --------------------------------------------------------------------------------------
                                     |               Robust
                    Total_workers_ln | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                ---------------------+----------------------------------------------------------------
                cohort#year#c.__tr__ |
                          2015 2015  |  -.0346398   .0921738    -0.38   0.711    -.2253159    .1560363
                          2015 2016  |   .0579633   .1032216     0.56   0.580    -.1555669    .2714935
                                     |
                               _cons |   3.687416   .0801337    46.02   0.000     3.521647    3.853185
                --------------------------------------------------------------------------------------
                
                         
                . estat simple
                ------------------------------------------------------------------------------
                             |            Delta-method
                             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                      simple |   .0123043   .0940785     0.13   0.896    -.1720861    .1966947
                ------------------------------------------------------------------------------
