  • Difference in Difference with three treatment groups, Problem of Collinearity

    Hello all,

    I have been working with a balanced panel dataset. I have two treatment groups and one control group, i.e., three groups of countries, for which I generated three dummies ("D1", "D2" and "D3") to use in the DiD regression. Each dummy equals 1 for the specified group of countries and 0 otherwise, and no country belongs to more than one of the three groups. There is also a time dummy, "postevent", which equals 1 after a specific date (the data are monthly) and 0 before that date. I am studying the effect of the treatment on a dependent variable "y" using the following regression, where countries are indexed by "c" and both time and country (cross-section) fixed effects are included:

    Code:
    reg  y  i.D1##i.postevent  i.D2##i.postevent i.D3##i.postevent  i.c  i.date, cluster(c)
    The problem is that Stata then omits the last dummy (D3) and its interaction term because of collinearity:

    Code:
    note: D3 omitted because of collinearity.
    note: D3#1.postevent omitted because of collinearity.
    ...
    Initially I had a DiD with one control group and one treatment group, which worked fine. I then wanted to be more specific, so I defined a new country group and separated it from the other countries as explained above. However, it does not work this way, and I don't know what I have missed or what mistake I have made. Unfortunately, I cannot include my data in the post. I hope the question is clear enough; otherwise, please go ahead and ask me. I would appreciate your suggestions. How can I add the third group of countries and still get a sensible result?

    Best,
    Shadi

  • #2
    The fact that you have a DID model doesn't alter the principles behind what is often called the "dummy variable trap." If you have three groups, you must represent them with two indicators. Your variables D1 D2 and D3 suffer from the colinearity that D1 + D2 + D3 = 1 (= _cons). A model containing colinear variables is unidentifiable and Stata (and all other statistical packages) breaks such colinearities in order to identify the model. In this case Stata chose to eliminate D3. It could just as well have eliminated D1 or D2, or the constant term. But something had to go.

    Similar reasoning applies to the D*#postevent interaction terms.
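
    The dependence can be checked directly. This is a minimal sketch, assuming the indicators D1, D2 and D3 from post #1 are in memory; dsum is a hypothetical scratch variable introduced only for the check.

    ```stata
    * Every country belongs to exactly one group, so the three
    * indicators should sum to 1 in every observation.
    * dsum is a hypothetical scratch variable.
    gen byte dsum = D1 + D2 + D3
    assert dsum == 1
    drop dsum
    ```

    If the assert passes silently, the three indicators add up to the constant term, which is the collinearity described above.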

    I should also point out that you have another colinearity in your model which, I'm confident, Stata noticed and warned you about, but you seem to have overlooked: the colinearity between i.postevent and the i.date variables. As you are doing a DID analysis, your postevent variable is defined as 1 for all dates after a certain point and 0 otherwise. This variable will necessarily be colinear with the i.date variables. And if you look carefully at your output, you will find that Stata has omitted one of them in addition to the usual baseline category that would be omitted with i.date in the absence of postevent.
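
    A quick way to confirm this in the data is sketched below, assuming the variables date and postevent from post #1. Note that bysort re-sorts the dataset.

    ```stata
    * postevent should take a single value within each date,
    * i.e. it is fully determined by date.
    bysort date (postevent): assert postevent[1] == postevent[_N]
    ```

    When postevent is constant within every date, it is an exact linear combination of the date indicators, so it cannot be estimated separately from them.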

    As popular as the "two way fixed effects" model is, it is not compatible with the classic DID analysis because of these colinearities. You really have to choose whether you want to use a classic DID analysis, without the two-way fixed effects, or whether you want to use a generalized DID model with two-way fixed effects. In other words, you can do:
    Code:
    regress y i.treatment##i.postevent, cluster(country) // CLASSIC DID, N.B. ##, NOT #
    
    OR
    regress y i.treatment#i.postevent i.country i.date, cluster(country) // GENERALIZED DID, N.B. #, NOT ##
    In the above code, the variable treatment would be a three level variable, coded 0/1/2 designating the three treatment groups in your data. (You could also represent treatment by two indicator ("dummy") variables for two of the groups, but the single variable approach is more convenient.) The generalized DID is usually used in a slightly different situation where the treatment is initiated at different times in different units, so that a simple postevent variable cannot be defined.
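
    One possible way to build such a variable from the existing indicators is sketched here. Which indicator marks the control group is not stated in the thread, so treating D1 as the control is an assumption, and the label names are hypothetical.

    ```stata
    * ASSUMPTION: D1 marks the control group; D2 and D3 mark the two
    * treated groups. Swap the codes if your control group differs.
    gen byte treatment = .
    replace treatment = 0 if D1 == 1
    replace treatment = 1 if D2 == 1
    replace treatment = 2 if D3 == 1
    assert !missing(treatment)

    * Hypothetical value labels, only to make the output easier to read.
    label define treatlbl 0 "control" 1 "treated A" 2 "treated B"
    label values treatment treatlbl
    tabulate treatment
    ```

    Note that the original post names the country identifier c, while the regressions above use country; use whichever name exists in your data.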

    By the way, in the presence of balanced data and all treated units initiating treatment at the same time, these two analyses will give you the same estimate for the treatment effects.
    Last edited by Clyde Schechter; 19 Jan 2023, 14:58.

    • #3
      Thank you very much for your reply. It works now that I have defined the dummies as you explained. I am having difficulty interpreting the coefficients, however. When there are two groups, treatment and control, one can always interpret the 1 1 coefficient in the output as the DiD coefficient, i.e., the causal effect of interest. But when we include three groups in the model, which are here categorized partly by the level of treatment and partly by country specifics, there is no longer a control group. Is it then correct to take the group with dummy value zero as the reference group and interpret the 1 1 and 1 2 coefficients as the differences between the groups with dummy values 1 and 2 and this reference group? Otherwise, what interpretation would you suggest?

      Looking forward to your answer,
      Shadi

      • #4
        How would you suggest that I do an event study on the same setting now that I have a three level dummy? I did one when I had a two-level dummy, but now I expect to see two sets of coefficients on the event study graph. However, when I run the following, I see only one graph:

        Code:
        tabulate date, gen(dummydate)   // generate date dummies to define leads and lags; date is monthly

        // leads and lags: each date dummy from the previous step times D,
        // the treatment variable with three levels 0, 1, and 2
        gen lead10 = dummydate71*D
        gen lead9 = dummydate72*D
        gen lead8 = dummydate73*D
        gen lead7 = dummydate74*D
        gen lead6 = dummydate75*D
        gen lead5 = dummydate76*D
        gen lead4 = dummydate77*D
        gen lead3 = dummydate78*D
        gen lead2 = dummydate79*D
        gen lead1 = dummydate80*D

        gen lag1 = dummydate82*D
        gen lag2 = dummydate83*D
        gen lag3 = dummydate84*D
        gen lag4 = dummydate85*D
        gen lag5 = dummydate86*D
        gen lag6 = dummydate87*D
        gen lag7 = dummydate88*D
        gen lag8 = dummydate89*D
        gen lag9 = dummydate90*D
        gen lag10 = dummydate91*D

        reg y lead* lag* i.country i.date, cluster(country)

        estimates store thr_D

        coefplot thr_D, keep(lead* lag*)  // basic event study plot
        I would appreciate your comments.

        Best,
        Shadi

        • #5
          Is it then correct to take the group with dummy value zero as the reference group and interpret the 1 1 and 1 2 coefficients as the differences between the groups with dummy values 1 and 2 and this reference group? Otherwise, what interpretation would you suggest?
          Yes.
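
          As a minimal sketch of how that reading maps onto Stata output, assuming the classic DID specification from #2 and a treatment variable coded 0/1/2 with 0 as the reference group, each interaction coefficient can be requested by name with lincom:

          ```stata
          * Classic DID with a three-level treatment variable (0 = reference group).
          regress y i.treatment##i.postevent, vce(cluster country)

          * DID estimate for group 1 relative to the reference group
          lincom 1.treatment#1.postevent

          * DID estimate for group 2 relative to the reference group
          lincom 2.treatment#1.postevent

          * Difference between the two DID estimates (group 2 vs. group 1)
          lincom 2.treatment#1.postevent - 1.treatment#1.postevent
          ```

          The first two lincom calls reproduce the interaction rows of the regression table; the third compares the two non-reference groups directly, which the table itself does not report.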

          How would you suggest that I do an event study on the same setting now that I have a three level dummy?
          Sorry, but my knowledge of event studies is limited to what I read about them here on Statalist. We don't use this technique in my field. I'm afraid I can't advise you on this.
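
          On the coding alone, one mechanical point may explain the single set of coefficients. Multiplying each date dummy by D (coded 0/1/2) creates only one regressor per period, which forces the effect for the D = 2 group to be exactly twice the effect for the D = 1 group. The following is a rough sketch of one possible alternative, not a vetted event-study specification. It assumes dummydate71 through dummydate91 from #4 exist and that period 81 is the omitted reference month as in #4; the g1_ and g2_ prefixes are hypothetical names.

          ```stata
          * Build separate lead/lag regressors for each non-reference group
          * (D == 1 and D == 2). g1_ and g2_ are hypothetical name prefixes;
          * D == 0 remains the reference group.
          forvalues g = 1/2 {
              * leads: lead10 uses dummydate71, ..., lead1 uses dummydate80
              forvalues k = 10(-1)1 {
                  local j = 81 - `k'
                  gen byte g`g'_lead`k' = dummydate`j' * (D == `g')
              }
              * lags: lag1 uses dummydate82, ..., lag10 uses dummydate91
              forvalues k = 1/10 {
                  local j = 81 + `k'
                  gen byte g`g'_lag`k' = dummydate`j' * (D == `g')
              }
          }

          regress y g1_lead* g1_lag* g2_lead* g2_lag* i.country i.date, vce(cluster country)
          estimates store thr_D

          * One plot per non-reference group
          coefplot thr_D, keep(g1_*) name(es_g1, replace)
          coefplot thr_D, keep(g2_*) name(es_g2, replace)
          ```

          Each group then gets its own set of lead and lag coefficients, measured relative to the D == 0 group, and the two graphs can be compared side by side.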
