  • Diff-in-diff collinearity problem with leads and lags

    Dear Statalist users,

    This is my first post, so I hope I've done everything according to Statalist practice. If not, please point out any mistakes so that I can improve future posts. Thanks!

    I'm currently writing my master's thesis in economics, in which I use a difference-in-differences (DID) strategy. I use export data from 2016M1-2016M12 for one treatment group and 195 control groups. Treatment occurs in month 7.

    I have primarily followed this guide http://www.princeton.edu/~otorres/DID101.pdf but I have run into a collinearity problem when I try to test the parallel trends assumption and estimate the dynamic effects in the same regression (it works fine when I include the leads and the lags separately). As I have 195 control groups, my intention was to instead use leads and lags, as recommended by https://stats.stackexchange.com/ques...mon-trend-betw and as done by Autor (2003) (http://economics.mit.edu/files/589; see p. 24 for the regression output and p. 26 for the graph).

    However, when I do so, my leads and lags are omitted due to collinearity. This is my code:

    Code:
    *Make data into a timeseries
    xtset Country Month, monthly
    
    *Create lnValue
    gen lnValue = ln(Value)
    
    *Create DID components
    *1. Timedummy
    gen time = (Month>=7) & !missing(Month)
    
    *2. Treatmentdummy
    gen treated = (Country==165) & !missing(Country)
    
    *3. Create interaction
    gen did = time*treated
    
    *Create leads for parallel trends assumption
    gen did1=f.did
    gen did2=ff.did
    gen did3=fff.did
    gen did4=ffff.did
    gen did5=fffff.did
    
    *Treatment lags for dynamic effects
    gen did11=l.did
    gen did22=ll.did
    gen did33=lll.did
    gen did44=llll.did
    gen did55=lllll.did
    
    reg lnValue time treated did1 did2 did3 did4 did5 did did11 did22 did33 did44 did55 i.Country i.Month, cluster(Country)
    which results in
    Code:
                               (Std. Err. adjusted for 197 clusters in Country)
    ------------------------------------------------------------------------------
                 |               Robust
         lnValue |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            time |   -.105447   .1623735    -0.65   0.517    -.4256704    .2147765
         treated |    7.64953   .0811867    94.22   0.000     7.489418    7.809641
            did1 |          0  (omitted)
            did2 |          0  (omitted)
            did3 |          0  (omitted)
            did4 |          0  (omitted)
            did5 |          0  (omitted)
             did |   .0272714   .1623735     0.17   0.867     -.292952    .3474949
           did11 |          0  (omitted)
           did22 |          0  (omitted)
           did33 |          0  (omitted)
           did44 |          0  (omitted)
           did55 |          0  (omitted)
              .
              .
              /Tons of FE estimates/
              .
              .
           _cons |   7.953271   .0811867    97.96   0.000     7.793159    8.113382
    I noticed that Autor (2003) doesn't include a constant, but the problem remains when I specify noconstant. It also remains when I remove the time and country fixed effects.

    Any help would be very much appreciated, thanks!

    //Sebastian

  • #2
    You can't have 11 lags and leads (including the 0 lag = original variable) in a data set where there are only 12 time periods. Remember that when you include, for example, three lags and three leads in the data set, you lose the first and last three observations for each country because there is not enough data forward or behind to calculate lags or leads for those observations. So as you add more lags and leads, you whittle down your sample. By the time you get up to 5 lags and leads, only months 6 and 7 remain in the estimation sample. At that point, if you know time and treated, you already can calculate both did and all of its lags and leads for those two months, hence the collinearity. You are simply attempting to do the impossible here. Either expand the time range at both ends, or drop down to just one or two lags and leads.

    By the way, you don't need to explicitly create those lag and lead variables. You can save yourself typing time (and opportunities for errors) by taking advantage of Stata's time series operators:

    Code:
    tsset Country Month
    regress lnValue time treated L(-5/5).did // etc.
    will automatically incorporate all those lags and leads, as well as the original did. See -help tsvarlist- for details.
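
    To see the whittling directly, one can inspect the estimation sample after running the regression. With T = 12 months and k = 5 leads and lags, at most T - 2k = 2 months per country have every term non-missing. The following is only a minimal sketch, assuming the variable names from post #1 and a balanced panel with no gaps; adjust it to your own data:

    Code:
    xtset Country Month
    regress lnValue time treated L(-5/5).did i.Country i.Month, cluster(Country)
    * List the months that actually enter the estimation sample
    tabulate Month if e(sample)
    * Count sample observations where the first lead differs from treated
    count if e(sample) & F1.did != treated
    If the tabulation shows only two months and the count is zero, the leads carry no information beyond treated in that sample, which would explain why Stata omits them.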


    • #3
      Thank you Clyde!

      Best regards,
      Sebastian


      • #4
        Originally posted by Clyde Schechter
        You can't have 11 lags and leads (including the 0 lag = original variable) in a data set where there are only 12 time periods...
        Hi again,

        I have a follow-up question related to the same problem.

        I've extended my dataset to run from 2016m1 to 2017m12 and also added product data to the panel. I want to evaluate the dynamic effects of treatment as before, but this time on a half-year basis, so that I would have lags for the 6th and 12th periods after treatment. However, they get omitted for the reasons Clyde mentioned above. My questions are:

        a) Is adding more observations my only option, or is there another way to do what I want without using the DID lags?

        b) If I add more observations, wouldn't the estimates become less precise, since they would be based on observations further away from the exogenous treatment?

        For those interested, my code is now
        Code:
        *Make data into timeseries
        egen id = group(country product)
        tsset id month
        
        *Create lnvalue
        gen lnvalue = ln(value)
        
        *Create DID components
        gen time = (month>=7) & !missing(month)
        
        gen treated = (country==25) & !missing(country)
        
        gen did = time*treated
        
        *Regress
        reg lnvalue time treated did L(6).did L(12).did i.country i.month i.product, r cluster(country)
        Thanks,
        Sebastian




        • #5
          Dear Statalist users,

          I hope you are doing well. First of all, this is my second post on Statalist, so I apologize if this is not the correct way to do it.

          I need your help. I am writing a master's thesis that involves multiple time periods and multiple groups. Basically, I want to study a change in public transport fare policy in some states in Switzerland, taking other states as controls.

          My setup is quite similar to the one presented by David Autor (2003). This is why I am trying to replicate this figure from his paper, in order to check my DiD assumption.

          I want to replicate this figure and then apply the approach in my master's thesis. Does anyone know how I can replicate this figure in Stata?

          I have tried several approaches, including "coefplot", but since yesterday I have been unable to reproduce this figure.

          I really hope that one of you can help me. Thank you so much in advance for any help.

          Best,

          Michael Duarte
          [Attached image: Capture.PNG]

