  • Basic Difference in Difference Panel Regression

    Hi everyone,

    I am trying to fit a basic DiD model (I am interested only in the difference in firms' energy usage before and after a policy intervention). I am working with panel data on many firms, with energy consumption in kWh recorded before and after the intervention. Unfortunately, I do not have any other X variables, such as the number of people in the firm or the size of the firm. My control group is another set of firms in a nearby community where the policy intervention had not yet been implemented during the time periods I am looking at.

    As such, I thought of using a two-way fixed-effects model to account for the many uncertainties in the treatment and control groups. I understand that energy consumption can be influenced by seasonal factors, such as weather patterns, across the different time periods, so I want to control for time effects as well.

    I have tried using the following code for fixed effect and also the random effects model:

    xtreg ln_energy_consumption treatedt t treated time_dum, fe

    where ln_energy_consumption = natural log of energy consumption (kWh)
    treated = treatment group indicator, which also identifies the community each firm belongs to
    t = indicator for the time period following treatment
    treatedt = the interaction of these two variables (the DiD estimator)
    time_dum = time dummy variables to control for time effects


    1) However, my treated variable keeps getting dropped by the fixed effect model, and I understand why after combing through this forum. But I would need to know the coefficient and p-value of treated variable for my DiD analysis. Am I right? How can I keep this variable in the fixed effect model?

    2) Since treatedt is the difference before and after the intervention in the treatment group, deducting away the difference in control group (which also includes seasonal trend), do I still have to add in time dummy variables in the coding to account for seasonal time trend?

    P.S. I am a beginner, and any help is appreciated. Please correct me if I am wrong.

    Thank you for your time.

  • #2
    But I would need to know the coefficient and p-value of treated variable for my DiD analysis. Am I right? How can I keep this variable in the fixed effect model?
    No, that's not right. You don't need to know anything about that variable--it's irrelevant. If you had an estimate of that coefficient, what it would represent is the mean difference in log energy consumption between the two treatment groups before the policy intervention went into effect. It has no bearing on the effectiveness of the policy intervention. It's just a curiosity. And it happens to be a curiosity that cannot be estimated in a fixed-effects model for reasons that it appears you already understand. You can't keep it in the model if you use fixed-effects estimation. If you are really curious about the value of this curiosity, then run -xtreg, be- to get it.

    The coefficient that matters in your DID model is the coefficient of the interaction term: it represents the DID estimate of the intervention's effect.

    2) Since treatedt is the difference before and after the intervention in the treatment group, deducting away the difference in control group (which also includes seasonal trend), do I still have to add in time dummy variables in the coding to account for seasonal time trend?
    If there are important seasonal trends, then, yes, you should include some representation of time in your model. Using time period-indicator variables as you have done is one way to do this, and it will completely adjust out any time-dependent trends, at the risk of perhaps overfitting the model to noise in the data. If it is possible to characterize seasonal variation in some more refined way, that might be a better approach. For example, instead of having an indicator variable for each different time, if the relevant variation is that energy use goes up in the summer, say, then just having an indicator for summer should capture that effect, with less overfitting. Or if the time trend is simply one of a general increase as time goes on, using the date variable itself would be the way to go. It really all depends on what the nature of the time-dependent effects is.
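As a rough sketch of those three ways of representing time, the following simulates a small monthly panel and fits each version. Everything here is an assumption made for illustration: the variable names (firm_id, mdate, summer), the January 2016 policy start, and the simulated effect sizes are hypothetical and do not come from the original data.

```stata
* Hypothetical simulated monthly panel; every name and number is an assumption.
clear
set seed 2016
set obs 40                                   // 40 made-up firms
gen firm_id = _n
gen byte treated = firm_id > 20              // firms 21-40 sit in the treated community
expand 24                                    // 24 monthly observations per firm
bysort firm_id: gen mdate = tm(2015m1) + _n - 1
format mdate %tm
gen byte t = mdate >= tm(2016m1)             // policy assumed to start in January 2016
gen byte summer = inlist(month(dofm(mdate)), 6, 7, 8)
gen ln_energy_consumption = 5 + 0.2*summer - 0.1*treated*t + rnormal(0, 0.05)
xtset firm_id mdate

* (a) one indicator per month: most flexible, most parameters
xtreg ln_energy_consumption i.treated##i.t i.mdate, fe

* (b) a single summer indicator
xtreg ln_energy_consumption i.treated##i.t i.summer, fe

* (c) a linear trend in the date variable itself
xtreg ln_energy_consumption i.treated##i.t c.mdate, fe
```

In each version the DiD estimate is the coefficient on 1.treated#1.t; only the adjustment for time differs.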

    By the way, in setting up your model, I advise you to use factor-variable notation rather than calculating your own interaction term. So -xtreg ln_energy_consumption i.treated##i.t, fe- (adding some other representation of time as needed). See -help fvvarlist- for more information about factor-variable notation. In addition to producing a nicely labeled regression output, this will enable you to use the -margins- command afterward to quickly and easily calculate predicted values and marginal effects.
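A minimal sketch comparing the hand-built interaction with factor-variable notation, run on simulated data. The panel identifier firm_id, the period variable, and all simulated values are hypothetical. The -noestimcheck- option on -margins- is included on the assumption that -margins- may otherwise flag the predictions as not estimable, because the main effect of treated is omitted under fixed effects.

```stata
* Hypothetical simulated panel; all names and numbers are assumptions.
clear
set seed 12345
set obs 40                                   // 40 made-up firms
gen firm_id = _n
gen byte treated = firm_id > 20
expand 12                                    // 12 periods per firm
bysort firm_id: gen period = _n
gen byte t = period > 6                      // intervention assumed after period 6
gen ln_energy_consumption = 5 + 0.3*treated + 0.1*t - 0.2*treated*t + rnormal(0, 0.05)
xtset firm_id period

* Hand-built interaction, as in the first post (treated is omitted by -fe-)
gen byte treatedt = treated*t
xtreg ln_energy_consumption treatedt t treated, fe

* Factor-variable notation: same model, labeled output, usable with -margins-
xtreg ln_energy_consumption i.treated##i.t, fe
margins treated#t, noestimcheck
```

The coefficient on treatedt in the first model and on 1.treated#1.t in the second are the same DiD estimate.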



    • #3
      Originally posted by Clyde Schechter View Post
      Hi Clyde,

      Thanks for your reply. When you mentioned that the treated variable and its coefficient are not of interest, does that mean that I can simply run a xtreg command which includes only ln(energy) and my interaction variable (DiD) without the t variable, treated variable etc? I have come across some research papers which reported these coefficients. Also, I have been experimenting around and found out that my interaction variable tends to have a significant p-value if i only regress on the interaction variable.

      Appreciate your help



      • #4
        does that mean that I can simply run a xtreg command which includes only ln(energy) and my interaction variable (DiD) without the t variable, treated variable etc?
        No. You do need the t variable. You can only leave out the treated variable (and if you don't leave it out, Stata will omit it for you anyway). If you were not in a fixed effects model, you would need both the treated and t variables in addition to the interaction term in order for the interaction term to properly represent the intervention effect. The reason you don't need the treated variable here is that its information is captured by the fixed effects themselves, so the interaction term can still represent the intervention effect. But if you drop the t variable, there is nothing else there to represent that information and the interaction term would not be interpretable as the intervention effect.

        Added: Depending on how you represent time in your model, you may end up in a situation where t is omitted due to collinearity with your time indicators. If that happens, it is not a problem: as long as that information is there, the interaction term is interpretable. So leave t in the model. If Stata omits it for you, that's fine. If Stata doesn't omit it for you, then it needs to be there. Let Stata handle this for you.
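To illustrate the collinearity point, here is a small sketch on simulated data in which t is an exact function of the period indicators. The names firm_id and period and all simulated values are hypothetical. Stata flags whichever term is redundant as omitted in the output, and the interaction coefficient is still estimated.

```stata
* Hypothetical simulated panel; all names and numbers are assumptions.
clear
set seed 777
set obs 40                                   // 40 made-up firms
gen firm_id = _n
gen byte treated = firm_id > 20
expand 12                                    // 12 periods per firm
bysort firm_id: gen period = _n
gen byte t = period > 6                      // t is fully determined by period
gen ln_energy_consumption = 5 + 0.02*period - 0.2*treated*t + rnormal(0, 0.05)
xtset firm_id period

* t and the full set of period indicators are collinear; Stata sorts it out
xtreg ln_energy_consumption i.treated##i.t i.period, fe

* The DiD coefficient remains available
display "DiD estimate: " _b[1.treated#1.t]
```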

        Also, I have been experimenting around and found out that my interaction variable tends to have a significant p-value if i only regress on the interaction variable.
        No, no, no, no, no!!! "Experimenting around" to look for a "significant p-value" is statistically invalid and just leads to cherry-picking type I errors out of the data. You have to use a model that is developed prior to having the data based on its scientific plausibility and then go with that. Dredging the data for p-values just pollutes the scientific literature with garbage. Do read Ronald L. Wasserstein & Nicole A. Lazar (2016): The ASA's statement on p-values: context, process, and purpose, The American Statistician, which you can get at http://dx.doi.org/10.1080/00031305.2016.1154108.
        Last edited by Clyde Schechter; 12 Dec 2016, 10:05.



        • #5
          Thank you, Clyde! Not to worry, by "experimenting around", I was just trying to learn how to use the software by trying different equations and noticed this trend.

          May I know how the code would change if I were to have two post-treatment periods (i.e., post1 and post2) and one pre period? I will have two interaction variables: treatedpost1 and treatedpost2. Do I include them in the same equation or estimate them separately?

          Sorry for the influx of questions.

          Thanks for your time and help!





          • #6
            All in one equation. So let's say your treatment period variable is coded t = 0 for pre-intervention, t = 1 for post-intervention 1, and t = 2 for post-intervention 2. Then the code would be
            Code:
            xtreg ln_energy_consumption i.treated##i.t , fe // ADD OTHER COVARIATES AS APPROPRIATE
            This assumes that by two post intervention periods you mean that the intervention was carried out in two stages, so 1 denotes the first stage in place and 2 denotes the second stage in place. If by two post-intervention periods you mean something else, then I think you should explain it in detail, as it might be different.
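A small sketch of that three-level coding on simulated data, assuming the two-stage reading described above. The identifier firm_id and the simulated stage effects are hypothetical. The -lincom- and -testparm- lines are optional follow-ups for comparing the two stage effects, not part of the model itself.

```stata
* Hypothetical simulated panel; all names and numbers are assumptions.
clear
set seed 4242
set obs 40                                   // 40 made-up firms
gen firm_id = _n
gen byte treated = firm_id > 20
expand 3                                     // one observation per period
bysort firm_id: gen byte t = _n - 1          // 0 = pre, 1 = stage 1, 2 = stage 2
gen ln_energy_consumption = 5 - 0.1*treated*(t==1) - 0.25*treated*(t==2) + rnormal(0, 0.05)
xtset firm_id t

* One equation; yields 1.treated#1.t and 1.treated#2.t
xtreg ln_energy_consumption i.treated##i.t, fe

* Difference between the stage 2 and stage 1 effects
lincom 1.treated#2.t - 1.treated#1.t

* Joint test that both interaction coefficients are zero
testparm i.treated#i.t
```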



            • #7
              Hi Clyde,

              Just to be sure, the treatment time variable refers to the t variable in my first post, right (1 for the post-treatment period, 0 for the pre-treatment period)? However, my intervention was carried out in one shot before post1 and is still ongoing beyond post2. Post2 simply indicates a later time period than post1. Does this mean my t variable should be coded as 0 for pre-treatment, 1 for post1, and 1 for post2 instead?

              Thanks!
