  • Difference-in-Difference model with continuous treatment variable and multiple treatment periods

    Hello everyone!

    I am new to Statalist and apologize if my question has already been answered here. I looked through some posts, but they did not really answer my question.

    I want to investigate the effect of labor policy on capital structure in 15 EU countries. I would like to apply a difference-in-differences (DID) model, as is common when identifying the effect of state policy. However, my treatment variable is continuous: it is a country-level index that can take any value between 0 and 6 (not only integer values but also e.g. 2.75, 0.37, etc.). This index measures the strictness of labor regulation, and it changes in a given country whenever that country's labor law changes. As I understand it, the standard DID with a binary treatment variable and a dummy for the pre- and post-intervention periods is not really applicable here. But after studying some papers I found a setting very similar to mine, and the authors say they employ a DID research design. They estimate the effect of World War II on female labor supply in the US and describe their model as follows:
    $y_{ist} = \delta_s + \gamma \, d_{1950} + X'_{ist}\beta + \varphi \, (d_{1950} \cdot m_s) + \epsilon_{ist}$

    Here y is weeks worked by woman i in state s in year t. There are two periods, 1940 and 1950, where $d_{1950}$ is a dummy for the latter year, X is a vector of individual characteristics, $\delta_s$ are state dummies, and $m_s$ is the mobilization rate of men in each state (a proxy for the WWII effect). The interaction estimates whether states with higher mobilization rates during WWII saw a stronger rise in women's weeks worked from 1940 to 1950. This is given by the coefficient $\varphi$.

    What I do not really understand is why one has to take the latest year of the sample period. Does it mean that the dummy takes the value 1 for the year 1950 here? It seems more plausible to me to take the sample's beginning date.
    The next question: can I just use the xtreg command: xtreg outcome controls_variables i.Y2015_dummy##c.Index, fe vce(r)
    And why is this a DID analysis, or rather, what does "the generalized DID strategy" mean?

    Sorry for such a long post, I just wanted to be clear and I would be thankful for any help!

    Best regards
    Marina

  • #2
    why one has to take the latest year of the sample period. Does it mean that the dummy takes the value 1 for the year 1950 here? It seems more plausible to me to take the sample's beginning date.
    It is plausible, and you can do it either way. The key thing is that the variable distinguishes the period where the treatment is in use among the treatment group from the period when it is not. Whether you code that 0/1 or 1/0 doesn't matter; you just have to interpret your results accordingly.

    Can I just use xtreg command: xtreg outcome controls_variables i.Y2015_dummy##c.Index, fe vce(r)
    So there are several aspects to this. The use of -xtreg, fe- is probably appropriate, assuming you have panel data. You may want to consider a random-effects model (and, as it sounds like you are working in economics, you will want to check that with a Hausman test). Bear in mind that with -xtreg-, the -vce(robust)- option is interpreted as -vce(cluster panelvar)-, that is, as clustering on the panel variable. If your panel identifier is country, and you have only 15 of them, you may be better off not specifying the cluster-robust estimator. The cluster-robust option is helpful when the number of panels is large; when the number is small it can actually make things worse. There is no consensus on how many is large enough, and I suspect that it actually depends on other aspects of the data. But at least some, though probably not all, people would say that 15 is not sufficient for cluster-robust estimation. Finally, there is the question of your key term i.Y2015_dummy##c.Index. That is appropriate if the labor policy you refer to is in effect only in year 2015 and not in the other years in your data set, or if it is in effect in all of the years other than 2015 and not in 2015. It's not a question of picking out a single year. The variable you use here must distinguish all the years when the labor policy is in effect from all the years when it is not in effect (among the countries that use it).

    And why is this a DID analysis, or rather, what does "the generalized DID strategy" mean?
    It is still DID analysis because although it does not conform to the original DID model of dichotomous treatment group and dichotomous treatment time periods represented as i.treatment_group##i.time_period, the logic remains the same. You are estimating the difference over time of the treatment vs control group difference in outcome. As it uses the same underlying logic, but different kinds of variables, it is a "generalized" DID approach.
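
    To make the generalized DID logic above concrete, here is a minimal sketch on simulated two-period data. Everything in it is an assumption made for illustration: the variable names (country, post, intensity, y), the 15 made-up countries, and the true interaction effect of 0.3 are not taken from the original question.

    ```stata
    clear
    set seed 12345
    set obs 15                                   // 15 hypothetical countries
    gen country = _n
    gen intensity = 6*runiform()                 // invented continuous treatment intensity between 0 and 6
    expand 2                                     // two periods per country
    bysort country: gen byte post = (_n == 2)    // 0 = before, 1 = after
    gen y = 1 + 0.5*post + 0.3*post*intensity + rnormal(0, 0.1)
    xtset country post
    * intensity is constant within country, so its main effect is absorbed by the fixed effects
    xtreg y i.post##c.intensity, fe
    display "interaction (generalized DID) estimate: " _b[1.post#c.intensity]
    ```

    The coefficient on 1.post#c.intensity plays the role that the treatment-by-period interaction plays in the classic two-group design: it measures how much more the outcome changed between the periods per unit of intensity.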

    • #3
      Dear Mr. Schechter,

      thank you very much for your detailed response! It has helped me! But I have a few more questions and would be very grateful if you could help me.

      The key thing is that the variable distinguishes the period where the treatment is in use among the treatment group, from the period when it is not.
      I understand how this works in the standard DID, but I don't really understand what it means in my case, because there is no single time period (one year) that separates pre- and post-intervention. My sample covers 1994-2015, and the index of labor protection strictness changes in each country at different points in time, and in most countries several times, depending on whether there was a change in the law in the respective year and country. E.g. the index for Germany is 2.75 from 1994 to 2001; in 2002 it jumps to 3.25 and stays at 3.25 for the next four years. So, first, there is no unique "year of policy change" for all countries and, second, some countries have more than one period in which the policy changes, i.e. in which they are "treated".

      I don't really understand what I have to do in Stata to take into account the fact that I don't have a dichotomous time period. I have read that for a generalized DID one just has to add dummies for countries and dummies for years. Is this correct, and what does it mean in the language of Stata: xtreg controls_var i.countries i.year##c.index ?

      Thank you very much in advance!

      • #4
        So, you don't really have a DID design, and you have to do something slightly different. There has to be a dichotomous variable distinguishing pre-treatment from post-treatment periods. When there is not a single time-point that defines that, then you have to create that variable in a different way. For those in the treatment group it is easy. You set that variable to 0 for all years preceding the year that panel began treatment and 1 for those years thereafter. The question then is how to set this variable for the controls. There are three approaches that can be used.

        1. There may be some theoretical basis for identifying a year "in which the panel would have started treatment" if it were in the treatment group. For example, if the treatment is a matter of certain legislation being passed, and if all of the controls had considered but rejected such legislation at some time, then the time at which the panel rejected that legislation would be the time to separate pre- and post-treatment eras for that panel.

        2. If #1 does not apply, you can attempt to create matched pairs between treatment panels and control panels. Use whatever variables you have that are predictive of the outcome and create a matching. (The propensity-score matching routines in Stata may be helpful here, though they are not the only way to do this.) Then assign to each control panel the same date dividing pre- and post-treatment that its matched treatment panel has.

        3. If there are no reasonable models of the outcome that would enable you to do #2, then the last approach is to do it at random. You match each treatment panel with a randomly selected control panel and again use the treatment panel's date for the control panel.

        Note that in both #2 and #3, if you have an excess of controls, you can do matched triples, or, more generally, matched tuples if you like.

        Added: Bear in mind that #1 and #2 have a much firmer foundation for the kind of causal inference desired in DID. So think hard about a way to do one of these and only lapse to #3 if there is no viable alternative.
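
        As a rough sketch of approach #3 only (random pairing), the following assigns each control panel the start date of a randomly chosen treated panel. The data set is made up, and the variable names (country, treated, start_year, pair, pseudo_start) are hypothetical; it also assumes equal numbers of treated and control panels.

        ```stata
        clear
        set seed 2017
        input country treated start_year
        1 1 1997
        2 1 2002
        3 1 2005
        4 0 .
        5 0 .
        6 0 .
        end
        gen double u = runiform()
        * random order within the treated group and within the control group
        bysort treated (u): gen pair = _n
        * within each pair the treated panel sorts last, so [_N] picks up its start year
        bysort pair (treated): gen pseudo_start = start_year[_N]
        list country treated pair pseudo_start, sepby(pair)
        ```

        For approach #2, the random number u would be replaced by whatever matching criterion is used; the step that copies the date from the treated panel to its matched control stays the same.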

        • #5
          Thank you very much for your response!

          I have already thought about propensity score matching; that would in any case be a possible solution to my "problem".
          However, I read a paper whose authors investigate the same research question as I do and use a very similar index for the strictness of labor protection legislation. I am confused because they adopt a DID research design and describe it as follows:

          "Consider t = 0 to be the starting period in our sample. From t = 1 to t = 2, country B initially serves as a control group for legal change; after that it serves as a treated group for subsequent years. Therefore, most countries belong to both treated and control groups at different points in time. This specification is robust to the fact that some groups might not be treated at all or that other groups were treated prior to 1985, which is our sample’s start date. It is also robust to using a continuous index.
          And their regression specification is:

          $y_{it} = \lambda_i + \delta \cdot EPL_{k,t-1} + \beta \cdot X_{i,t-1} + \alpha_j \cdot \gamma_t + e_{it}$,

          where i denotes a firm, t denotes a year, j is an industry, and k is a country. The dependent variable $y_{it}$ is our measure of leverage; EPL is Indicator for Employment Protection Legislation; $X_{i,t-1}$ is the vector of control variables; the $\lambda_i$ is a firm fixed effect; $\alpha_j \cdot \gamma_t$ is an industry/year fixed effect; and $e_{it}$ is the error term."

          This is exactly what I have, but I don't understand how to implement it in Stata, given that different countries can be treated and controls at different points in time. And why is it robust to using a continuous index?

          I have also found the following explanation of how to handle multiple interventions:
          "Estimate the following equation: $y_{ict} = \beta d_{ict} + p_t + m_c + u$
          $y_{ict}$ is the outcome for unit i (e.g. firm) in period t (e.g. year) and cohort c, where “cohort” indexes the different sets of firms treated by each event. E.g. different firms might be affected by a change in regulation at different points in time; firms affected at one point in time are a ‘cohort’."
          [Attached image: DID.jpg]
          I don't really understand how I should define "d" (the indicator of whether a cohort is affected at a given time) in Stata; how do I specify "d"?

          Thank you for any help!

          • #6
            No matter what they say,
            $y_{it} = \lambda_i + \delta \cdot EPL_{k,t-1} + \beta \cdot X_{i,t-1} + \alpha_j \cdot \gamma_t + e_{it}$,
            is not a difference in differences design. It's a cohort design where the entities enter the treatment state at different times. There's nothing wrong with it, but calling it difference-in-differences is beyond a stretch. Whether the treatment is continuous or discrete, it still works the same way. Actually, it is a bit like a simple observational comparison of treated and untreated entities, the only control for other differences between them being the covariates X.

            If your data contain variables y (outcome), when_treatment_begun, covariates X, and a chronological time variable time (e.g. calendar year, or quarter, or whatever), then you can fit this model with a command like:

            Code:
            * 1 from the period in which treatment starts onward, 0 before (and 0 throughout when when_treatment_begun is missing)
            gen byte in_treatment = (time >= when_treatment_begun)
            * X stands for your list of covariates
            regress y i.in_treatment i.time X
            If your data has a panel structure, then use -xtreg- instead of regress. If treatment is continuous rather than discrete, enter the treatment variable with the c. prefix instead of i.in_treatment. Similarly, if time is continuous, use c.time instead of i.time.
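
            For readers who want to try this first model, here is a minimal sketch on simulated panel data. The panel identifier id, the start years, and the true treatment effect of 0.5 are invented for the example, and it assumes that an entity counts as treated from its start year onward (>=).

            ```stata
            clear
            set seed 42
            set obs 30                                   // 30 hypothetical panels
            gen id = _n
            * staggered start: panels 1-10 in 1998, 11-20 in 2002, 21-30 never treated
            gen when_treatment_begun = cond(id <= 10, 1998, cond(id <= 20, 2002, .))
            expand 10
            bysort id: gen time = 1994 + _n              // years 1995-2004
            * a missing start date compares as larger than any year, so never-treated panels stay at 0
            gen byte in_treatment = (time >= when_treatment_begun)
            gen y = 0.1*id + 0.02*(time - 1995) + 0.5*in_treatment + rnormal(0, 0.02)
            xtset id time
            xtreg y i.in_treatment i.time, fe
            ```

            The coefficient on 1.in_treatment should come out close to the simulated value of 0.5; covariates are left out here only to keep the sketch short.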


            The second model that you show is something closer to a difference-in-differences model. They partition the data into cohorts: each cohort is defined by the time it starts the treatment. (For the control cohort, this is never.) This cohort variable takes the place of the treatment variable in the classic DID design. The variable d is the interaction between cohort and time (instead of the interaction between treatment and time). To implement this in Stata, you need another variable, when_treatment_begun, which shows the time that each firm/country/entity begins treatment, and is set to missing for those that never experience the treatment.

            Code:
            gen byte in_treatment = (time >= when_treatment_begun) // same as in_treatment in the previous model
            egen cohort = group(when_treatment_begun), missing // entities with a missing start date form their own cohort
            regress y i.cohort##i.in_treatment
            With panel data, again, use -xtreg- instead: since cohort will be constant within country, the uninteracted cohort term will drop from the model, but this is of no importance. In this model the in_treatment term corresponds to $p_t$, and the cohort term corresponds to $m_c$, while the cohort#in_treatment interaction term is $d_{ict}$. If there were only one treatment and all treated entities began treatment at the same time, and there were also a never-treated group, then this approach would be identical to the classical DID design.

            Again, I wouldn't really call this a DID model, although it is very much in the same spirit, and, as noted, includes the DID as a special case. I would be more inclined to call this an observational version of the stepped-wedge design. It's really a matter of semantics, I suppose.
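
            A small sketch of how the cohort variable in the second model behaves, using made-up data (four hypothetical entities, invented start years and outcome values). It is meant only to show what -egen ..., group() missing- produces and which cohort-by-treatment cells end up empty.

            ```stata
            clear
            set seed 7
            input id when_treatment_begun
            1 1998
            2 1998
            3 2002
            4 .
            end
            expand 10
            bysort id: gen time = 1994 + _n              // years 1995-2004
            gen byte in_treatment = (time >= when_treatment_begun)
            * missing sorts last, so the never-treated entities become the highest-numbered cohort
            egen cohort = group(when_treatment_begun), missing
            tabulate cohort in_treatment
            gen y = 0.2*cohort + 0.5*in_treatment + rnormal(0, 0.1)
            regress y i.cohort##i.in_treatment
            ```

            In the tabulation the never-treated cohort has no observations with in_treatment equal to 1, so Stata reports that interaction cell as empty rather than estimating it.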

            • #7
              Thank you very much for your explanation and the additional insight! It is much clearer to me now. I still have just a couple of questions to be sure that I have understood it correctly:

              I have panel data that cover the period from 1994 to 2015 and include all listed firms in 15 EU countries. My outcome variable "y" (leverage ratio) is measured at the firm level for each year, and all control variables are also at the firm level. My variable of interest (strictness of employment protection legislation) is a continuous index measured at the country level (because the policy is applied at the country level) for each year.
              For the first model:
              I am not sure what you mean by the variable "when_treatment_begun", because different entities, countries in my case (or firms in different countries), enter the treatment state at different times. Does it mean I have to create a dummy which is 1 for each year in which there was a policy change in one of the countries (e.g. for Germany this is 1997, for France 2002, for other countries probably other years, so my dummy would be 1 for 1997, 2002, etc.)? I think I have misunderstood it. In addition, some countries had two changes of policy in two different years.
              With a continuous index it can be handled just with c.index instead of i.index. But in the regression line "regress y i.in_treatment i.time X", do you mean that my index variable is part of "X"? I thought "X" contains only control variables.

              For the second model:
              The same question about the "when_treatment_begun" variable. And again, I don't understand where my index variable is in the regression specification.
              and set to missing for those that never experience the treatment.
              Do you mean that these have to be countries where my index is equal to "0" for all years?

              Sorry for the many questions! I have never worked with DID, and in addition my case is not really a DID design, and honestly the different notation in different papers confuses me a lot.

              Thank you in advance!!!

              • #8
                I am not sure what you mean by the variable "when_treatment_begun", because different entities, countries in my case (or firms in different countries), enter the treatment state at different times. Does it mean I have to create a dummy which is 1 for each year in which there was a policy change in one of the countries (e.g. for Germany this is 1997, for France 2002, for other countries probably other years, so my dummy would be 1 for 1997, 2002, etc.)? I think I have misunderstood it. In addition, some countries had two changes of policy in two different years.
                So the variable when_treatment_begun would take the value 1997 in all observations for Germany, 2002 in all observations for France, etc. Its value would be the actual year in which they began enhanced labor protections. It is a single variable, not a sequence of dummies. Note that it does not directly appear in the model: it is used to calculate the in_treatment variable (and in the second model, also to calculate the cohort variable.)
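
                If the start year is not already in the data, one possible way to derive it from a country-year index is sketched below. The countries A, B and C and their index values are invented, and the sketch assumes that "treatment begins" means the first year in which the index differs from the previous year's value; later changes in the same country are ignored.

                ```stata
                clear
                input str1 country year index
                "A" 2000 2.75
                "A" 2001 2.75
                "A" 2002 3.25
                "A" 2003 3.25
                "B" 2000 3
                "B" 2001 3.5
                "B" 2002 3.5
                "B" 2003 3.5
                "C" 2000 2
                "C" 2001 2
                "C" 2002 2
                "C" 2003 2
                end
                * flag the years in which the index differs from the previous year's value
                bysort country (year): gen byte changed = (index != index[_n-1]) if _n > 1
                * first such year per country; missing when the index never changes
                egen when_treatment_begun = min(cond(changed == 1, year, .)), by(country)
                list country year index when_treatment_begun, sepby(country)
                ```

                The result is one value per country, repeated in all of that country's observations, and missing for a country whose index never changes, which is the layout the commands in #6 expect.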

                With a continuous index it can be handled just with c.index instead of i.index. But in the regression line "regress y i.in_treatment i.time X", do you mean that my index variable is part of "X"? I thought "X" contains only control variables.
                So for a continuous index like this, it would be -regress y c.index i.time X-. The index would not be among the X variables (although this is really just the way we think about it, to Stata there is no difference.)

                The same question with "when_treatment_begun" variable.
                And the same answer.

                And again, I don't understand where my index variable is in the regression specification.
                It isn't. That model does not use continuous treatment indices: you are either treated or you aren't in that model. If you wanted to modify that model to accommodate different treatment intensities, assuming there are only a small number of values to the index, you could probably do:

                Code:
                gen byte in_treatment = (time >= when_treatment_begun) // same as in_treatment in the previous model
                egen cohort = group(when_treatment_begun index), missing
                regress y i.cohort##i.in_treatment
                This would be similar to the second model in #5 and #6, but it makes the cohorts smaller: a cohort here consists not just of the countries that adopted EPL in the same year, but of those that also adopted it to the same extent (value of index). A drawback to this approach is that it would not support any estimation of the relationship between the value of the index and the effect on the outcome. If you need to estimate that relationship, then the first model is more suited to your purpose.

                and set to missing for those that never experience the treatment.
                No, that was for the when_treatment_begun variable. For the index, it should be set to zero for those countries that didn't experience the treatment.

                Do you mean that these have to be countries where my index is equal to "0" for all years?
                You can use both of these models even if that is not the case. But it is not a true difference in differences model unless this is true.


                • #9
                  Thank you so so much for detailed explanations!!!

                  The last question about the "first model":
                  Is my regression command -xtreg y c.index i.in_treatment i.time X, fe cluster(country)- (this is probably wrong?), or is it either -xtreg y c.index i.time X, fe cluster(country)- or -xtreg y i.in_treatment i.time X, fe cluster(country)-? (I still have to give some thought to cluster(country); for now I am more interested in the specification of the independent variables, that is, my EPL index and time.)

                  And the last question for the "second model":
                  A drawback to this approach is that it would not support any estimation of the relationship between the value of the index and the effect on the outcome.
                  Do you mean that this model does not take into account that the index has different values, but considers it only as "treatment" or "no treatment"?

                  Thank you very much for your effort and your time!

                  • #10
                    For the first model it would be -xtreg y c.index i.time X, fe cluster(country)-.
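
                    A minimal simulated sketch of that command, for a firm-level panel nested in countries. All names and numbers here are assumptions made for illustration (15 countries, 20 firms each, a single invented index change per country, a true index effect of -0.05); -vce(cluster country)- is used as the documented spelling of the cluster option, and covariates X are omitted for brevity.

                    ```stata
                    clear
                    set seed 1
                    set obs 15                                       // 15 hypothetical countries
                    gen country = _n
                    gen base = 1 + 0.5*mod(country, 4)               // invented starting value of the index
                    gen change_year = 1996 + 2*mod(country, 5)       // invented year of a legal change
                    expand 20                                        // 20 firms per country
                    bysort country: gen firm = (country - 1)*20 + _n
                    gen firm_effect = rnormal(0, 0.1)
                    expand 12                                        // 12 years per firm
                    bysort firm: gen time = 1993 + _n                // years 1994-2005
                    gen index = base + 0.5*(time >= change_year)     // country-level index with one jump
                    gen y = firm_effect - 0.05*index + 0.01*(time - 1994) + rnormal(0, 0.02)
                    xtset firm time
                    xtreg y c.index i.time, fe vce(cluster country)
                    ```

                    The coefficient on index is identified from within-firm changes in the index over time, net of the common year effects, and should land close to the simulated -0.05.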

                    As for the drawback of the second model, the index would be implicitly taken into account because each "cohort" would be defined, in part, by the value of the index. Therefore the cohort term of the model would incorporate that information into the analysis. But because the cohort term is also affected by the time when the EPL was started, the time of start and the index value are combined, and it is not possible to separately identify their effects.

                    • #11
                      Thank you very much for your response!

                      I have estimated the "second model"; you can see the output below. But I am not sure how to interpret the results. In a "standard" DID we look at the coefficient of the interaction term and that of the main-effect variable (cohort in this case), but I am not sure what I have to look at here. What is my coefficient of interest? The interaction term has a different coefficient for each cohort. I am also confused that some cohorts are omitted (because of collinearity); is that all right?

                      And regarding your answer for the first model: the regression -xtreg y c.index i.time X, fe cluster(country)- seems to me to be a "classical" panel-data regression with firm fixed effects and year fixed effects (i.time). Why can we consider it a
                      ...cohort design where the entities enter the treatment state at different times
                      ? (your post #6)

                      Thank you in advance!


                      . xtreg Book_Debt i.cohort##i.in_treatment, fe
                      note: 3.cohort omitted because of collinearity
                      note: 5.cohort omitted because of collinearity
                      note: 6.cohort omitted because of collinearity
                      note: 7.cohort omitted because of collinearity
                      note: 8.cohort omitted because of collinearity
                      note: 1b.cohort#0b.in_treatment identifies no observations in the sample
                      note: 9.cohort#1.in_treatment omitted because of collinearity
                      note: 10.cohort#0b.in_treatment identifies no observations in the sample
                      note: 10.cohort#1.in_treatment omitted because of collinearity
                      note: 11.cohort#0b.in_treatment identifies no observations in the sample
                      note: 11.cohort#1.in_treatment omitted because of collinearity

                      Fixed-effects (within) regression               Number of obs     =     88,217
                      Group variable: id                              Number of groups  =      8,559

                      R-sq:                                           Obs per group:
                           within  = 0.0103                                         min =          1
                           between = 0.0139                                         avg =       10.3
                           overall = 0.0126                                         max =         22

                                                                      F(13,79645)       =      64.06
                      corr(u_i, Xb)  = -0.1388                        Prob > F          =     0.0000

                      -------------------------------------------------------------------------------------
                                Book_Debt |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
                      --------------------+----------------------------------------------------------------
                                   cohort |
                                        2 |   .0866117   .0131824     6.57   0.000     .0607743     .112449
                                        3 |          0  (omitted)
                                        4 |   .0725415   .0060336    12.02   0.000     .0607156    .0843674
                                        5 |          0  (omitted)
                                        6 |          0  (omitted)
                                        7 |          0  (omitted)
                                        8 |          0  (omitted)
                                        9 |  -.0080083   .0029496    -2.72   0.007    -.0137896    -.002227
                                       10 |  -.0302602   .0069598    -4.35   0.000    -.0439013    -.016619
                                       11 |   .0098384    .010035     0.98   0.327    -.0098302    .0295069
                                          |
                           1.in_treatment |   .0871863   .0034509    25.26   0.000     .0804225      .09395
                                          |
                      cohort#in_treatment |
                                      1 0 |          0  (empty)
                                      2 1 |  -.0575974   .0101846    -5.66   0.000    -.0775593   -.0376356
                                      3 1 |  -.0593895   .0057971   -10.24   0.000    -.0707517   -.0480272
                                      4 1 |  -.0757916   .0053881   -14.07   0.000    -.0863522    -.065231
                                      5 1 |  -.0666237   .0062055   -10.74   0.000    -.0787865   -.0544609
                                      6 1 |  -.0718331   .0039304   -18.28   0.000    -.0795368   -.0641295
                                      7 1 |  -.0673856   .0083044    -8.11   0.000    -.0836623    -.051109
                                      8 1 |   -.093119   .0041208   -22.60   0.000    -.1011958   -.0850422
                                      9 1 |          0  (omitted)
                                     10 0 |          0  (empty)
                                     10 1 |          0  (omitted)
                                     11 0 |          0  (empty)
                                     11 1 |          0  (omitted)
                                          |
                                    _cons |   .1609398   .0016104    99.94   0.000     .1577834    .1640963
                      --------------------+----------------------------------------------------------------
                                  sigma_u |  .14879353
                                  sigma_e |  .10942364
                                      rho |  .64900428   (fraction of variance due to u_i)
                      -------------------------------------------------------------------------------------
                      F test that all u_i=0: F(8558, 79645) = 16.02                Prob > F = 0.0000

                      • #12
                        The -xtreg- output says that the -xtset- was done using a variable called id as the panel variable. What does that represent, and why are there so many (8,559) of them? Since you began this thread talking about 15 countries, I was expecting to see 15 groups, and a total sample of perhaps several hundred observations (perhaps 15 groups * 10 years of data for each group, perhaps observed as often as quarterly). Either this is an error or there is more structure to this data than you have mentioned previously.

                        • #13
                          Hi Marina, I have an essentially identical problem- an index of the strength of a particular type of employment law that differs between US states. Before I enter the discussion in more detail, would you be willing to provide the citations for the articles you refer to so I can read them first? Thanks, Deborah

                          • #14
                            Originally posted by Clyde Schechter (#6):
                            To implement this in Stata, you need another variable, when_treatment_begun, which shows the time that each firm/country/entity begins treatment, and set to missing for those that never experience the treatment.
                            Sorry, I have a question:
                            Code:
                             when_treatment_begun
                            Is this part of the syntax, or is it a variable that I need to work out myself? I am working on a DiD with multiple events. I read your post but I could not understand it. In my case, too, in each year there are treated firms that belong to the same industry subject to the treatment, and in each year some industries are under treatment.

                            • #15
                              In the code that you cite, when_treatment_begun is something that you need to have in your data: it is a variable that, in each observation, gives the time (year, date, month, whatever it is) when the firm in that observation began treatment (and missing value if the firm in the observation never gets treatment). Whether it is something you need to calculate from other information you have or already present in your data as a variable, only you would know. But either way, you must have that information, and if it isn't already there as a variable, you need to create that variable.
