  • Difference-in-Difference with Panel Data

    Hello, everyone!

    I am fairly new to Stata and I am trying to work out how to complete a DID analysis using Panel Data. My data set contains 12 countries in a Panel Data format between 1980 and 2015. For each country, I have a list of observed variables over the time period.

    During the time series, a policy change is implemented within 3 of the 12 countries (2004). I would like to use these 3 countries as a treatment group and the remaining 9 as the control group.

    I have included three treatment variables that take the following values:

    Code:
     
    Treatment Variable          Indicators
    Treat                       1 if unit of observation is Treated Unit
                                0 if unit of observation is Control Unit
    Post                        1 if period is post-treatment
                                0 if period is pre-treatment
    TreatPost (Treat * Post)    1 if unit is treated and in post-treatment period
                                0 otherwise

    In order to determine the significance of the policy change, I would like to use the DID approach. I constructed a panel data regression with the following command:

    Code:
    xtreg W_Trade_M Treat Post TreatPost Population Unemployment Avg_Month_Wage, re
    (Population, Unemployment, Avg_Month_Wage are observed variables within the data set.)

    Code:
    -------------------------------------------------------------------------------
        W_Trade_M |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
            Treat |   2.17e+10   3.05e+10     0.71   0.476    -3.80e+10    8.15e+10
             Post |   1.13e+10   5.41e+09     2.08   0.037     6.65e+08    2.19e+10
        TreatPost |   9.14e+10   7.40e+09    12.35   0.000     7.69e+10    1.06e+11
       Population |   1863.556   605.2821     3.08   0.002     677.2246    3049.887
     Unemployment |  -3.82e+09   8.54e+08    -4.48   0.000    -5.49e+09   -2.15e+09
    Avg_Month_W~e |   2.21e+07    2943090     7.52   0.000     1.64e+07    2.79e+07
            _cons |   6.64e+09   2.17e+10     0.31   0.759    -3.59e+10    4.92e+10
    --------------+----------------------------------------------------------------
          sigma_u |  4.186e+10
          sigma_e |  2.315e+10
              rho |  .76576531   (fraction of variance due to u_i)
    -------------------------------------------------------------------------------
    Stata produces the results, and the coefficients on my explanatory variables seem legitimate. However, the coefficient on the TreatPost variable, which I read as the DID estimate, seems unrealistically large. Am I interpreting the coefficient incorrectly, or is there something wrong with my setup?

    Any help would be greatly appreciated.

    Many thanks,
    Last edited by Will Page; 20 Apr 2017, 09:20.

  • #2
    Your interpretation looks correct: according to this model, the control group experienced an increase in W_Trade_M of about 1.13e10 units, whereas in the treatment group it went up by about 10.27e10, a difference of 9.14e10. And although you don't show us the code that implemented your model, the description you give of the variables is correct.

    I think that it is not a good idea to trust your intuitions on what is "too large" when everything is in such astronomical numbers. It can also be a problem estimating regressions when you have variables whose scales differ by so many orders of magnitude (0-1 vs numbers of order 10^10). I imagine that your outcome variable is denominated in some relatively small units, perhaps currency units like dollars, or euros, or maybe yen or yuan. If I were you I would change the units on that variable to millions or even billions of currency units, so that the numbers in the regression data will all be of similar magnitudes. At the least, the results will be easier to wrap your mind around, and perhaps some numerical errors in the estimation will be avoided.
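
    As a rough sketch of that rescaling (it assumes Will's data are in memory and that W_Trade_M is recorded in single currency units; W_Trade_bn is just a hypothetical name for the rescaled copy):

    ```stata
    * Sketch only: W_Trade_bn is a hypothetical name, not a variable from the thread.
    * Assumes W_Trade_M is stored in single currency units.
    generate double W_Trade_bn = W_Trade_M / 1e9
    label variable W_Trade_bn "W_Trade_M in billions of currency units"

    * Same model as in #1, with the rescaled outcome.
    xtreg W_Trade_bn Treat Post TreatPost Population Unemployment Avg_Month_Wage, re
    ```

    Because the model is linear, every coefficient should simply shrink by the same factor of 1e9, so the TreatPost estimate would be expected to read roughly 91.4 (billions) instead of 9.14e+10.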

    One more thing: if you are using a modern version of Stata, you should use factor-variable notation (-help fvvarlist-) for this model. Scrap your TreatPost variable and code it this way:

    Code:
    xtreg W_Trade_M i.Treat##i.Post Population Unemployment Avg_Month_Wage, re
    The main advantage is that after that you will be able to calculate predicted means and marginal effects using the -margins- command.
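
    For instance, a minimal sketch of what -margins- could look like after the factor-variable model (it assumes Will's data are in memory; nothing here has been run on the actual data):

    ```stata
    * Sketch only: refit the model from this post, then ask -margins- for summaries.
    xtreg W_Trade_M i.Treat##i.Post Population Unemployment Avg_Month_Wage, re

    * Predicted mean of W_Trade_M in each Treat-by-Post cell.
    margins Treat#Post

    * Pre-to-post change within the control group and within the treatment group.
    margins Treat, dydx(Post)
    ```

    The difference between the two rows of the second table is the DID estimate, which is the same quantity as the 1.Treat#1.Post coefficient in the regression output.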

    I'm also curious why you're using a random effects model. There are only 12 countries, so you are not getting a very thorough sampling of the country-effect space in your data. Why not fixed-effects here? (If you go to a fixed effects model, then you will lose the Treat variable due to collinearity with the fixed effects, but that doesn't matter because it's just a nuisance parameter in that model anyway. You still want to interpret the treatment effect as coming from the Treat#Post interaction term.)
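
    A sketch of the fixed-effects version, under the assumption that the panel and time identifiers are called country and year (those names are guesses; substitute whatever the data set actually uses):

    ```stata
    * Sketch only: country and year are assumed identifier names.
    * The panel variable must be numeric for -xtset-.
    xtset country year

    * Treat is constant within country, so Stata will note that 1.Treat is omitted.
    xtreg W_Trade_M i.Treat##i.Post Population Unemployment Avg_Month_Wage, fe

    * The DID estimate is the interaction coefficient.
    lincom 1.Treat#1.Post
    ```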



    • #3
      Clyde Schechter,

      Could I ask you a question? Treat, Post, as well as Treat#Post are time-invariant variables. Will they be excluded from the model if Will uses fixed effects?
      --------------------
      (Stata 15.1 MP)



      • #4
        Linh:
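        yes for any variable that is constant within each panel, as you can see from the following toy example. (In Will's model that applies only to Treat: Post varies over time within every country and Treat#Post varies within the treated countries, so a fixed-effects model can still estimate them.)
        Code:
        . use "http://www.stata-press.com/data/r15/nlswork.dta"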
        . xtreg ln_wage i.race##i.birth_yr, fe
        note: 2.race omitted because of collinearity
        note: 3.race omitted because of collinearity
        note: 42.birth_yr omitted because of collinearity
        note: 43.birth_yr omitted because of collinearity
        note: 44.birth_yr omitted because of collinearity
        note: 45.birth_yr omitted because of collinearity
        note: 46.birth_yr omitted because of collinearity
        note: 47.birth_yr omitted because of collinearity
        note: 48.birth_yr omitted because of collinearity
        note: 49.birth_yr omitted because of collinearity
        note: 50.birth_yr omitted because of collinearity
        note: 51.birth_yr omitted because of collinearity
        note: 52.birth_yr omitted because of collinearity
        note: 53.birth_yr omitted because of collinearity
        note: 54.birth_yr omitted because of collinearity
        note: 1b.race#54.birth_yr identifies no observations in the sample
        note: 2.race#42.birth_yr omitted because of collinearity
        note: 2.race#43.birth_yr omitted because of collinearity
        note: 2.race#44.birth_yr omitted because of collinearity
        note: 2.race#45.birth_yr omitted because of collinearity
        note: 2.race#46.birth_yr omitted because of collinearity
        note: 2.race#47.birth_yr omitted because of collinearity
        note: 2.race#48.birth_yr omitted because of collinearity
        note: 2.race#49.birth_yr omitted because of collinearity
        note: 2.race#50.birth_yr omitted because of collinearity
        note: 2.race#51.birth_yr omitted because of collinearity
        note: 2.race#52.birth_yr omitted because of collinearity
        note: 2.race#53.birth_yr omitted because of collinearity
        note: 2.race#54.birth_yr omitted because of collinearity
        note: 3.race#41b.birth_yr identifies no observations in the sample
        note: 3.race#42.birth_yr identifies no observations in the sample
        note: 3.race#43.birth_yr omitted because of collinearity
        note: 3.race#44.birth_yr omitted because of collinearity
        note: 3.race#45.birth_yr omitted because of collinearity
        note: 3.race#46.birth_yr omitted because of collinearity
        note: 3.race#47.birth_yr omitted because of collinearity
        note: 3.race#48.birth_yr omitted because of collinearity
        note: 3.race#49.birth_yr omitted because of collinearity
        note: 3.race#50.birth_yr omitted because of collinearity
        note: 3.race#51.birth_yr omitted because of collinearity
        note: 3.race#52.birth_yr omitted because of collinearity
        note: 3.race#53.birth_yr omitted because of collinearity
        note: 3.race#54.birth_yr identifies no observations in the sample
        
        Fixed-effects (within) regression               Number of obs     =     28,534
        Group variable: idcode                          Number of groups  =      4,711
        
        R-sq:                                           Obs per group:
             within  = 0.0000                                         min =          1
             between = 0.0050                                         avg =        6.1
             overall =      .                                         max =         15
        
                                                        F(0,23823)        =       0.00
        corr(u_i, Xb)  =      .                         Prob > F          =          .
        
        -------------------------------------------------------------------------------
              ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        --------------+----------------------------------------------------------------
                 race |
               black  |          0  (omitted)
               other  |          0  (omitted)
                      |
             birth_yr |
                  42  |          0  (omitted)
                  43  |          0  (omitted)
                  44  |          0  (omitted)
                  45  |          0  (omitted)
                  46  |          0  (omitted)
                  47  |          0  (omitted)
                  48  |          0  (omitted)
                  49  |          0  (omitted)
                  50  |          0  (omitted)
                  51  |          0  (omitted)
                  52  |          0  (omitted)
                  53  |          0  (omitted)
                  54  |          0  (omitted)
                      |
        race#birth_yr |
            white#54  |          0  (empty)
            black#42  |          0  (omitted)
            black#43  |          0  (omitted)
            black#44  |          0  (omitted)
            black#45  |          0  (omitted)
            black#46  |          0  (omitted)
            black#47  |          0  (omitted)
            black#48  |          0  (omitted)
            black#49  |          0  (omitted)
            black#50  |          0  (omitted)
            black#51  |          0  (omitted)
            black#52  |          0  (omitted)
            black#53  |          0  (omitted)
            black#54  |          0  (omitted)
            other#41  |          0  (empty)
            other#42  |          0  (empty)
            other#43  |          0  (omitted)
            other#44  |          0  (omitted)
            other#45  |          0  (omitted)
            other#46  |          0  (omitted)
            other#47  |          0  (omitted)
            other#48  |          0  (omitted)
            other#49  |          0  (omitted)
            other#50  |          0  (omitted)
            other#51  |          0  (omitted)
            other#52  |          0  (omitted)
            other#53  |          0  (omitted)
            other#54  |          0  (empty)
                      |
                _cons |   1.674907   .0018961   883.35   0.000     1.671191    1.678624
        --------------+----------------------------------------------------------------
              sigma_u |  .42456905
              sigma_e |  .32028665
                  rho |  .63731204   (fraction of variance due to u_i)
        -------------------------------------------------------------------------------
        F test that all u_i=0: F(4710, 23823) = 8.44                 Prob > F = 0.0000
        Kind regards,
        Carlo
        (Stata 18.0 SE)



        • #5
          Can somebody help me interpret the Stata results below? d1 is my time variable, d2 is the treatment, and d1d2 is the interaction.
          xtreg $ylist $xlist, re

          Random-effects GLS regression                   Number of obs     =       3726
          Group variable: id                              Number of groups  =        414

          R-sq:                                           Obs per group:
               within  = 0.0685                                         min =          9
               between = 0.5135                                         avg =        9.0
               overall = 0.1334                                         max =          9

          Random effects u_i ~ Gaussian                   Wald chi2(10)     =     571.64
          corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

          ------------------------------------------------------------------------------
                  csri |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                   adi |  -.0010732    .000624    -1.72   0.085    -.0022962    .0001497
                   rdi |   .0043439   .0061877     0.70   0.483    -.0077838    .0164717
                   fs2 |  -.0009373   .0004181    -2.24   0.025    -.0017567   -.0001179
                    sr |   9.23e-10   3.68e-10     2.51   0.012     2.02e-10    1.64e-09
                    d1 |  -.0045756   .0019018    -2.41   0.016    -.0083031   -.0008481
                  lroa |   .0000163   8.79e-07    18.51   0.000     .0000146     .000018
                   aan |   .0001739   .0002009     0.87   0.387    -.0002198    .0005675
                sqswti |   .4564973   .0655897     6.96   0.000     .3279438    .5850508
                    d2 |  -.0099561    .001486    -6.70   0.000    -.0128686   -.0070436
                  d1d2 |   .0066732   .0019774     3.37   0.001     .0027977    .0105487
                 _cons |   .0124624   .0021619     5.76   0.000     .0082252    .0166996
          -------------+----------------------------------------------------------------
               sigma_u |          0
               sigma_e |  .01577453
                   rho |          0   (fraction of variance due to u_i)
          ------------------------------------------------------------------------------



          • #6
            Rather than interpret this output, which requires some calculations that are easy to get wrong, go back and re-do the regression using factor variable notation instead of using your calculated interaction term d1d2. See -help fvvarlist-. Then run
            Code:
            margins d1#d2 // EXPECTED VALUE OF CSRI IN EACH GROUP PRE- AND POST-
            margins d1, dydx(d2) // MARGINAL EFFECT OF TREATMENT IN EACH TIME PERIOD
            The clearest explanation of the margins command is the excellent Richard Williams' https://www3.nd.edu/~rwilliam/stats/Margins01.pdf. It contains several worked examples, including some interaction models similar to yours.

            In the event you need assistance with that, when showing the output, please put it between code delimiters so it will be more readable and better aligned. See Forum FAQ #12 for instructions on using code delimiters if you are not familiar with them.



            • #7
              Hi, thanks for your input. I am pasting the output of margins below. My outcome variable is CSR intensity, which is CSR spend divided by the previous year's sales. I am studying a policy introduced in 2013 that mandates CSR spending. Companies that were already spending on CSR before the policy are my control group (the unaffected group), and those that started spending on CSR only after the policy are my treatment group (the affected group). d1 is the time variable (0 before 2013, 1 after 2013) and d2 is the treatment variable (0 for control, 1 for treatment). I have data from 2010 to 2018. Since I am new to Stata, and specifically to the DID technique, I am finding the output difficult to interpret. Your help would be valuable.
              Code:
               margins d1#d2
              
              Predictive margins                                Number of obs   =       3726
              Model VCE    : Conventional
              
              Expression   : Linear prediction, predict()
              
              ------------------------------------------------------------------------------
                           |            Delta-method
                           |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                     d1#d2 |
                      0 0  |   .0097701   .1694388     0.06   0.954    -.3223238    .3418641
                      0 1  |   .0005745   .1694388     0.00   0.997    -.3315194    .3326685
                      1 0  |    .005136   .1694388     0.03   0.976    -.3269579    .3372299
                      1 1  |    .002192   .1678176     0.01   0.990    -.3267245    .3311085
              ------------------------------------------------------------------------------
              
              margins d1 , dydx(d2)
              
              Average marginal effects                          Number of obs   =       3726
              Model VCE    : Conventional
              
              Expression   : Linear prediction, predict()
              dy/dx w.r.t. : 1.d2
              
              ------------------------------------------------------------------------------
                           |            Delta-method
                           |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
              1.d2         |
                        d1 |
                        0  |  -.0091956   1.50e-11 -6.1e+08   0.000    -.0091956   -.0091956
                        1  |   -.002944    .002066    -1.42   0.154    -.0069933    .0011053
              ------------------------------------------------------------------------------
              Note: dy/dx for factor levels is the discrete change from the base level.
              Thanks and Regards,
              Asha



              • #8
                The first table gives the expected values of CSR intensity in the four conditions, listed in row order: control-before, treatment-before, control-after, and treatment-after. For example, in your results, the expected CSR intensity in the control group before 2013 was 0.0097701, with a 95% CI from -.3223238 to .3418641. For the control group after 2013 the expected CSR intensity was 0.005136, with a 95% CI from -.3269579 to .3372299. The second and fourth rows of that table show the corresponding values for the treatment group.

                The second table shows the differences between the treatment and control groups in the before and after periods. So, before 2013, the difference between treatment and control groups was -.0091956, 95% CI -.0091956 to -.0091956. And after 2013 the difference between treatment and control groups was -.002944, 95% CI -.0069933 to .0011053. The negative signs here mean that CSR intensity was lower in the treatment group than in the control group in both periods (though in the post-2013 period we are not so sure of that, because the confidence interval extends up to positive numbers).

                The DID estimator of the treatment effect is not directly shown in the -margins- output. Instead, you can read that from the regression output: it is the coefficient of 1.d1#1.d2 in that table.
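
                As a quick check on how the pieces fit together, the DID estimate can also be recovered by hand from the numbers posted in #7. This is only a sketch of the arithmetic, and both lines should agree with the interaction coefficient up to rounding:

                ```stata
                * Sketch only: the numbers are copied from the -margins- tables in #7.
                * Difference of the two marginal effects of d2 (after minus before):
                display -.002944 - (-.0091956)

                * Same quantity from the four cell means
                * (change in the treatment group minus change in the control group):
                display (.002192 - .005136) - (.0005745 - .0097701)
                ```

                With the factor-variable regression still in memory, -lincom 1.d1#1.d2- should report the same estimate together with its standard error and confidence interval.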



                • #9
                  Dear Clyde,

                  Thanks, the picture is getting clearer from your inputs. I got .0062516 as the coefficient of 1.d1#1.d2. What does this signify?

                  Asha



                  • #10
                    It means that the change in CSR intensity from before to after 2013 was 0.0062516 greater in the treatment group than the change in the control group at the same time was. That is the DID estimator of the effect of the intervention on CSR intensity.

                    One way to understand it is that we look at how CSR intensity changed in the intervention group after 2013. Part of that change is due to the policy intervention. But some of it would have happened anyway. Our best way to estimate how much would have happened anyway is to see what did happen in the control group. So we subtract the observed change in the control group from the observed change in the group affected by the policy and we attribute that difference to the intervention. That's the theory behind DID estimation. The coefficient of the 1.d1#1.d2 interaction term actually makes that calculation.



                    • #11
                      Thanks a lot for your assistance. My concepts are now clear on DID. Can you share any published article that uses DID for data analysis?

                      Thanks and Regards,
                      Asha



                      • #12
                        Can somebody help me plot the trend graph for the control and treatment groups for DID estimation in Stata? Thank you in advance.
                        Asha



                        • #13
                          Hello everyone, I have panel data on more than 150 countries for 20 years, with some observed variables that I want to study using DID estimation. Of the 150 countries, 60 have implemented a new policy over time, while the rest would be considered the control group. In particular, the treatment happened in different years in different countries: some countries started implementing the new policy in early 2000, and by 2008 almost all of the treatment group had changed the policy. I am not sure what "time" dummy I should create for this. I am new to research work and I am not sure whether we pick a single year as the year the treatment was implemented for the whole treatment group. Can I use 2008 as the treatment year, even though the actual year of intervention differed across countries?



                          • #14
                            If I understand your explanation correctly, there is no single year in which the policy was started at all 60 countries: it was a variable process with some starting as early as 2000 and others as late as 2008. So this setup is not compatible with classical DID estimation. Rather you have to use generalized DID here. The key variable you need to set up is a dichotomous variable that is 1 when the year is equal to or later than the year in which the country began implementing the new policy, and 0 otherwise. That means that, among other things, it is also 0 in all observations of the control group. For the intervention group, it is 0 before the particular year in which the particular country implemented the policy, but 1 in all years after that.

                            You use that variable as a predictor in your fixed-effects regression. The model must also include country and individual year fixed effects. Also include whatever other time-varying covariates might be appropriate based on the substance of the problem. Note that in this model there is no variable that indicates intervention vs control group, and there is no variable that indicates pre- vs post- intervention.
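
                            A minimal sketch of that setup on made-up data may help. Everything here is hypothetical (the variable names country, year, policy_year, policy_on and y, the adoption years, and the simulated outcome); only the construction of the dichotomous variable and the shape of the regression follow the description above:

                            ```stata
                            * Sketch only: toy data with 6 hypothetical countries, 3 of which adopt a policy.
                            clear
                            set seed 12345
                            set obs 6
                            generate country = _n
                            generate policy_year = .                    // missing = never adopts (control)
                            replace policy_year = 2003 if country == 1
                            replace policy_year = 2006 if country == 2
                            replace policy_year = 2008 if country == 3
                            expand 20                                   // 20 yearly observations per country
                            bysort country: generate year = 1999 + _n   // years 2000 to 2019

                            * 1 from the adoption year onward, 0 before it, and 0 throughout for controls.
                            generate byte policy_on = !missing(policy_year) & year >= policy_year

                            * Simulated outcome with a true policy effect of 2 (purely illustrative).
                            generate y = 0.1*(year - 2000) + country + 2*policy_on + rnormal()

                            * Country fixed effects come from -fe-; year fixed effects from i.year.
                            xtset country year
                            xtreg y i.policy_on i.year, fe
                            ```

                            The coefficient on 1.policy_on is the generalized DID estimate of the policy effect. With real data you would add the time-varying covariates and might also consider standard errors clustered by country.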



                            • #15
                              Clyde Schechter, thank you for the reply. I understand the explanation you gave for my data, but the last line creates some confusion: "Note that in this model there is no variable that indicates intervention vs control group, and there is no variable that indicates pre- vs post- intervention".

                              What does that line mean? Does it mean that the model would not be able to capture the effect of the intervention on the treatment group in the pre and post periods? Is it supposed to mean that the real essence of DID cannot be achieved with this model using the dichotomous variable?

