  • Matched Difference in Differences

    Dear Statalist users,

    I am investigating value creation for start-ups that have received government support (treatment group) versus start-ups that have not.
    I started by performing propensity score matching (PSM) on the variables size, geography, industry, and year of application for treatment.

    Because each observation contained value creation for both year 0 and year 3 (these were not separate observations), I used -expand 2- to duplicate the observations so that value creation in year 0 and in year 3 appear as separate observations.

    Now I want to use the matched sample for the differences in differences estimation strategy.

    My model looks like this:

    Code:
    reg ValCre3 treated_match##Postvalcre, cluster(Kunde1)
    Where ValCre3 is value creation in year 3, Postvalcre is a variable that takes the value 1 if post treatment and 0 otherwise. treated_match takes the value 1 if the observation is treated and within the matched sample, and 0 otherwise. And treated_match##Postvalcre is the interaction variable indicating treated post treatment. I use clustered standard errors to correct for the fact that some companies (denoted by the variable Kunde1) receive several treatments.

    The output is as shown below:


    Linear regression                               Number of obs     =      2,224
                                                    F(1, 906)         =          .
                                                    Prob > F          =          .
                                                    R-squared         =     0.0001
                                                    Root MSE          =     3150.1

                                          (Std. Err. adjusted for 907 clusters in Kunde1)
    -------------------------------------------------------------------------------------
                        |               Robust
                ValCre3 |      Coef.  Std. Err.      t     P>|t|     [95% Conf. Interval]
    --------------------+----------------------------------------------------------------
        1.treated_match |    74.7199   229.8976     0.33   0.745    -376.4739    525.9137
                1.year3 |   5.62e-12          .        .       .            .           .
                        |
    treated_match#year3 |
                    1 1 |  -5.55e-12          .        .       .            .           .
                        |
                  _cons |   816.3444   202.2573     4.04   0.000     419.3972    1213.292
    -------------------------------------------------------------------------------------


    Could anybody tell me why no results are shown for the interaction variable treated_match#year3? Is the model wrongly specified, or have I made other mistakes?

  • #2
    I realized my value creation variable should contain value creation for both year 0 and year 3, so I re-specified it. I think it now works. Here is my code for the difference-in-differences estimation after matching with psmatch2:

    Code:
    expand 2, gen(year3)
    
    gen treated_match=0 if _nn==0 & treatment==0
    replace treated_match=1 if _nn==1 | _nn==2 | _nn==3 
    
    gen post_treat=treated_match*year3
    
    gen Valuecreation=0
    replace Valuecreation=ValCre0 if year3==0
    replace Valuecreation=ValCre3 if year3==1
    
    reg Valuecreation treated_match##year3, cluster(Kunde1)
    _nn refers to the nearest neighbors from psmatch2 and is a way to separate the matched sample from the total sample. year3 indicates three years after treatment.

    Results are shown below:

    Linear regression                               Number of obs     =      2,224
                                                    F(3, 906)         =       7.64
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.0060
                                                    Root MSE          =     2672.8

                                          (Std. Err. adjusted for 907 clusters in Kunde1)
    -------------------------------------------------------------------------------------
                        |               Robust
          Valuecreation |      Coef.  Std. Err.      t     P>|t|     [95% Conf. Interval]
    --------------------+----------------------------------------------------------------
        1.treated_match |   110.5728   136.2616     0.81   0.417    -156.8522    377.9979
                1.year3 |   435.5602   160.3344     2.72   0.007     120.8901    750.2302
                        |
    treated_match#year3 |
                    1 1 |  -35.85293   191.3455    -0.19   0.851    -411.3849    339.6791
                        |
                  _cons |   380.7842   122.1138     3.12   0.002     141.1253    620.4431
    -------------------------------------------------------------------------------------


    If anyone spots any mistakes in this, please tell me.


    • #3
      I don't understand your explanation, but to the extent I do, it seems to have several serious errors.

      1. Without knowing how you used propensity score matching, it is hard to know, but it looks as if your treated_match variable does not correctly distinguish between the firms that received subsidies and those that did not. Any firm that received a subsidy should be coded 1, and any firm that did not should be coded 0--and then, if you want to replace that with missing value for any firm that didn't match, that would be fine. But as I read your code it looks as if untreated firms are coded 1 if they didn't match any treated ones, and they are still included in your analysis.

      2. You say that some companies received several treatments. For a diff-in-diff analysis that's not allowable. Each company must be, in all its observations, either treated or untreated. No variation is permissible.

      3. You don't show the command you used for the regression analysis in #2, but I'm guessing it's just like the one you showed in #1, with Valuecreation instead of ValCre3. The problem is that your use of -regress- does not allow you to properly account for the fact that this is matched data. -regress- will treat all of these observations as independent. You need to incorporate the matching into the analysis. Probably the simplest way to do that is to generate a match-group variable that identifies the matched pairs (or triples or whatever). Then you can -xtset- on that and run -xtreg, fe-, or you can do -areg- and absorb the match-group variable.

      In your future posts, please be sure to show the commands along with the output. It probably will also help if you show a small representative sample of your data as well. Please use the -dataex- command (-ssc install dataex- if you don't have it) to do that.
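
      As a minimal sketch of that request (the variable list and the 1/20 range are only illustrative, taken from names mentioned earlier in this thread; substitute whichever variables matter for your question):

      Code:
      ssc install dataex
      dataex Kunde1 treatment ValCre0 ValCre3 in 1/20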


      • #4
        Dear Clyde, thank you for your response. My propensity score matching is performed as follows, and I have tested common support and balance using psgraph and pstest:
        Code:
        *PERFORM EXACT MATCHING ON INDUSTRY
        probit treatment Nordnorge Soerlandet Vestlandet Oestviken Vest_Viken Innlandet ///
            Troendelag size søknadsår
        *size is the size of the company, "søknadsår" is the application year, and the remaining
        *matching variables are dummies indicating geographic location
        
        predict double ps
        gen pscore2=.
        replace pscore2=bransje*2+ps
        *bransje is an industry variable containing values from 1-14 depending on which of 14 industries the company is in.
        
        *Sort data randomly
        set seed 5053
        sort u
        psmatch2 treatment, outcome(ValCre0 ValCre3) pscore(pscore2) neighbor(3) ///
            caliper(0.2) ai(3)
        *Match within caliper to have exact matching on industry, and I use customized pscore2 for PSM.
        *Outcome variables are Value creation in year0 and value creation in year 3.
        *option ai to calculate the heteroskedasticity-consistent analytical standard errors.
        "it looks as if your treated_match variable does not correctly distinguish between the firms that received subsidies and those that did not. Any firm that received a subsidy should be coded 1, and any firm that did not should be coded 0--and then, if you want to replace that with missing value for any firm that didn't match, that would be fine."
        All firms that received a subsidy and were matched are coded 1 in the treated_match variable. All firms in the control group that were matched have _nn==0, and by including & treatment==0 I exclude treated firms that were not matched. The number of firms in the matched treatment and control groups equals the number of matched firms on common support in the psmatch2 output.

        "2. You say that some companies received several treatments. For a diff-in-diff analysis that's not allowable. Each company must be, in all its observations, either treated or untreated. No variation is permissible."
        No treated firm also appears in the control group (i.e., a firm is removed if it was treated in one year but not in another), and no non-treated firm appears in the treatment group, so there is no variation. I do see, however, that allowing several treatments could contaminate the results with the effects of other treatments. We are considering including only one treatment per firm, but the choice of which one to include could then introduce bias.

        Are several treatments admissible in diff-in-diff if there is no variation?

        "You need to incorporate the matching into the analysis."
        I was under the impression that including treated_match did this. I also thought that performing diff-in-diff on the means of the matched treatment and control groups would be equivalent to performing diff-in-diff on the matched pairs.


        • #5
          So most of my concerns, you had dealt with.

          Several treatments are, indeed, admissible in diff-in-diff if each firm receives only one of the treatments (or none, in the control group).

          The inclusion of treated_match does not identify which treatment firm is matched with which control firm(s). The matched pair (tuple) is the unit of analysis. Failing to represent this will not affect your point estimate of the treatment effect, but your standard errors, confidence intervals, and p-values will be wrong. Worse still, they won't even be wrong in any predictable way: they could be too big, or too small depending on details of the data. Leaving out this information is equivalent to using an unpaired t-test on matched pair data.
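
          To make the t-test analogy concrete, here is a minimal sketch on simulated data. Everything in it (the variable names pair, treated, and y, the 50 pairs, the effect size of 1) is hypothetical and not taken from this thread:

          Code:
          clear
          set seed 12345
          set obs 50
          gen pair = _n
          gen double pair_effect = rnormal(0, 5)   // shared by both members of a pair
          expand 2
          bysort pair: gen byte treated = _n - 1   // one control (0) and one treated (1) per pair
          gen double y = pair_effect + 1*treated + rnormal()

          // IGNORES THE PAIRING (ANALOGOUS TO AN UNPAIRED T-TEST)
          regress y i.treated

          // RESPECTS THE PAIRING (ANALOGOUS TO A PAIRED T-TEST)
          xtset pair
          xtreg y i.treated, fe
          In this balanced setup both commands should give the same coefficient on treated, but the standard errors should differ because only the second one removes the pair-level variation.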


          • #6
            "You need to incorporate the matching into the analysis. Probably the simplest way to do that is to generate a match-group variable that identifies the matched pairs (or triples or whatever). Then you can -xtset- on that and run -xtreg, fe-, or you can do -areg- and absorb the match-group variable."
            Thanks again for your insights Clyde. The results I have been getting have demonstrated very high standard errors, so I tried to generate a matched group variable to include in the analysis, but I did not really understand how to do it. Could you suggest a way to do it?


            • #7
              I managed to make a matched group variable by multiplying the _id and _n1(id of nearest neighbour) from my psmatch2 output. The data used looks something like the table below, and I have 871 treated observations and 243 control observations in total:
              _treated     _id   _n1   _nn   match_group
              Untreated      1     0     0             0
              Untreated      2     0     0             0
              Untreated      3     0     0             0
              Treated        4     1     1             4
              Treated        5     1     1             5
              Treated        6     2     1            12
              Treated        7     3     1            21
              The code I used looks like this:
              Code:
              gen match_group=_id*_n1
              This creates a unique number in match_group for each pair. I then sort the data by match_group and generate a new variable, pair_id, which denotes the pair and numbers the pairs from 1 onwards:
              Code:
              gen pair_id=_n

              This gives me a variable assigning a number to each pair of treatment and control observations. However, I am not sure how to proceed if I want to contrast the outcome within each pair. I suspect that simply running xtset on pair_id does not correctly incorporate the matching into the difference-in-differences.

              "You need to incorporate the matching into the analysis. Probably the simplest way to do that is to generate a match-group variable that identifies the matched pairs (or triples or whatever). Then you can -xtset- on that and run -xtreg, fe-, or you can do -areg- and absorb the match-group variable."
              I ran the xtset and xtreg that you recommended, with the following output:
              Code:
              xtset pair_id
              xtreg lValuecreation treated_match##year3, fe vce(cluster Kunde1)

              note: 1.treated_match omitted because of collinearity

              Fixed-effects (within) regression               Number of obs     =       1310
              Group variable: pair_id                         Number of groups  =        831

              R-sq:                                           Obs per group:
                   within  = 0.1000                                         min =          1
                   between = 0.0107                                         avg =        1.6
                   overall = 0.0209                                         max =          2

                                                              F(2,683)          =      19.17
              corr(u_i, Xb)  = -0.0085                        Prob > F          =     0.0000

                                                    (Std. Err. adjusted for 684 clusters in Kunde1)
              -------------------------------------------------------------------------------------
                                  |               Robust
                   lValuecreation |      Coef.  Std. Err.      t     P>|t|     [95% Conf. Interval]
              --------------------+----------------------------------------------------------------
                  1.treated_match |          0  (omitted)
                          1.year3 |   .6486902   .1840161     3.53   0.000      .287385    1.009995
                                  |
              treated_match#year3 |
                              1 1 |  -.1665359   .2059466    -0.81   0.419    -.5709005    .2378286
                                  |
                            _cons |   5.893595   .0435493   135.33   0.000     5.808089    5.979102
              --------------------+----------------------------------------------------------------
                          sigma_u |  1.6482927
                          sigma_e |  1.1129615
                              rho |  .68684945   (fraction of variance due to u_i)
              -------------------------------------------------------------------------------------


              I also tried the areg command that you suggested, and I see they yield the same coefficients, although slightly different standard errors.

              Code:
              areg lValuecreation treated_match##year3, absorb(pair_id)
              note: 1.treated_match omitted because of collinearity

              Linear regression, absorbing indicators         Number of obs     =      1,310
                                                              F(2, 477)         =      26.50
                                                              Prob > F          =     0.0000
                                                              R-squared         =     0.8511
                                                              Adj R-squared     =     0.5914
                                                              Root MSE          =     1.1130

              -------------------------------------------------------------------------------------
                   lValuecreation |      Coef.  Std. Err.      t     P>|t|     [95% Conf. Interval]
              --------------------+----------------------------------------------------------------
                  1.treated_match |          0  (omitted)
                          1.year3 |   .6486902    .152877     4.24   0.000     .3482946    .9490859
                                  |
              treated_match#year3 |
                              1 1 |  -.1665359   .1732429    -0.96   0.337    -.5069496    .1738778
                                  |
                            _cons |   5.893595   .0482445   122.16   0.000     5.798797    5.988393
              --------------------+----------------------------------------------------------------
                          pair_id |  F(830, 477) =      3.174   0.000         (831 categories)


              Any suggestions on how to proceed?


              • #8
                I think your calculation of the matched group identifier is wrong. I haven't used -psmatch2- in a while, but a quick perusal of its help file suggests that I correctly remember that the variable _id that it creates is, itself, the matched group identifier. Even if I'm wrong about that, multiplying _id by _n1 is not something I can make any sense out of.

                I think you just need to -xtset _id- and then run your -xtreg-.

                No guarantees, of course, that you will get narrower standard errors. Effective matching does have that effect, but whether the matching you are doing is fine enough to achieve that, is a matter of how your data look.
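
                In case _id turns out to be a per-observation identifier rather than a group identifier (the listing in #7 shows a distinct _id on every row, with _n1 holding the _id of the matched control), one possible way to build the group variable is sketched below. It assumes one-nearest-neighbor matching, a 0/1 variable named treatment, and the outcome and period variables used earlier in this thread; match_group and any_treated are hypothetical names. With several neighbors per treated firm the groups can overlap, and this simple construction would not be enough.

                Code:
                // GROUP = THE MATCHED CONTROL'S _id, SHARED BY THE CONTROL AND ITS TREATED FIRM(S)
                gen long match_group = _n1 if treatment == 1
                replace match_group = _id if treatment == 0

                // DROP UNMATCHED TREATED FIRMS AND CONTROLS NEVER USED AS A MATCH
                bysort match_group: egen byte any_treated = max(treatment)
                drop if missing(match_group) | any_treated == 0

                xtset match_group
                xtreg Valuecreation i.treatment##i.year3, fe vce(cluster match_group)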


                • #9
                  I am looking at treatment effects on value creation, and I have now merged the value creation numbers for nearest neighbors 1-5 into the same observation as the treated firm. My data now look like this, where ValCre0 and ValCre3 are the value creation of the treated firms:
                  Firm ID   ValCre0   ValCre3   Mean VC neighbor 1-5 year 0   Mean VC neighbor 1-5 year 3
                  A             100       110                           150                           160
                  B             200       200                           180                           150

                  I then create a variable for post treatment called year3. And create a new variable Valuecreation to be the outcome variable.
                  Code:
                  expand 2, gen(year3)
                  gen Valuecreation=0
                  replace Valuecreation=ValCre0 if year3==0
                  replace Valuecreation=ValCre3 if year3==1
                  Now I'm struggling to make a dummy variable treatment, to finalize my difference in difference. I want it to look something like this
                  Code:
                  reg Valuecreation year3##treatment, vce(cluster customerID)
                  Any ideas on how to create the treatment variable when the control group variables are in the same observation as the treated?


                  • #10
                    You went wrong with -expand 2-. You need to use -reshape-. The tableau describing your data at the top of #9 is not very helpful: those clearly are not legal Stata variable names. To help those who want to help you, post example data using -dataex- so that it is easy to quickly get to code that is suitable for your data. Install -dataex- (-ssc install dataex-) and, next time, use it when you show data examples.

                    So let's suppose that the variable names are firm_id, valcre0, valcre3, valcre0_neighbor, and valcre3_neighbor.

                    Code:
                    // MAKE VARIABLE NAMES PARALLEL FOR -reshape-
                    rename (valcre?) =_1
                    rename *_neighbor *_0
                    
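                    // RESHAPE LONG (STUBS INCLUDE THE TRAILING UNDERSCORE SO THAT j IS NUMERIC)
                    reshape long valcre0_ valcre3_, i(firm_id) j(treatment)
                    rename (valcre0_ valcre3_) (valcre0 valcre3)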
                    label define treatment 0 "Matched Control" 1 "Treatment"
                    label values treatment treatment
                    Now you have a pair of observations for each original observation. One of the new observations contains the treatment firm's data, and the other the control firm's data, and they are indicated by the variable treatment.
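
                    One possible continuation, sketched under the same assumed variable names: the outcome is still wide in time (valcre0 and valcre3), so a second -reshape- puts the year in the long direction, after which the diff-in-diff can be run with firm_id (which now identifies a treated firm together with its matched control) as the group. The names valcre, year, and post are hypothetical.

                    Code:
                    // RESHAPE LONG AGAIN, THIS TIME OVER YEAR (0 AND 3)
                    reshape long valcre, i(firm_id treatment) j(year)
                    gen byte post = (year == 3)

                    // DIFF-IN-DIFF WITHIN MATCHED SETS
                    xtset firm_id
                    xtreg valcre i.treatment##i.post, fe vce(cluster firm_id)
                    The coefficient on the treatment#post interaction would then be the diff-in-diff estimate.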


                    • #11
                      Dear Tobias and fellows,

                      I am so happy to read this topic, since I am now facing the same problem. I want to use only the matched sample when running the DID. I used -psmatch2-, imposing 2-nearest-neighbor matching and a caliper of 0.01. I know that Stata creates some new variables: _weight (the frequency with which a particular observation is used as a match), _pscore (the estimated propensity score), _n1 (the id number of the first nearest neighbor), _n2 (the id of the second nearest neighbor), and _id (a new id Stata creates for the matched sample).

                      My questions to everyone who knows better than I do are:

                      1. What should I do after getting this information from the -psmatch2- output in order to proceed to the next step (running the DID)? I am hoping for a step-by-step answer in Stata, i.e., first do this, second do this, and so on.

                      2. Many control units have no weight (_weight=.) and some others have _weight=.5. Do you know what this means?

                      3. I found many id numbers in _n1 and _n2, which means they are the nearest neighbors of a particular treated unit, but when I look up those ids, I find that they have no weight. How can a matched observation have no weight (_weight=.)? Should I use only matched observations with a weight, or should I include all matched observations?

                      4. When running the DID after the matching, should I include the weight of each observation, i.e., xtreg outcome yeardummy##treatmentgroupdummy [aw=_weight], fe vce(robust)?

                      Thank you,
