
  • Diff-in-Diff, panel data, and Instrumental Variable, multiple time periods

    Dear Statalist community,

    I'm examining the impact of an agricultural technology adoption program and I'd like your advice on part of the analysis. Sorry for the long post, I'm trying to be as clear as possible about the data and analysis.

    PART 1: Brief summary of the program, data, and analysis
    • Data comes from a two-stage randomized experiment. The country is divided into regions (n=8) and sub-regions (n=129). Sub-regions are randomly assigned to the treatment group in the first stage, and then farmers within treatment sub-regions are randomly assigned to receive the treatment in the second stage. The design allows us to estimate direct and spillover effects.
    • The treatment is a voucher that covers a percentage of the total cost of an irrigation technology + technical assistance. We're looking at smallholder farmers.
    • Baseline data (year = 2011) was collected from a representative sample of farmers across regions/sub-regions. Unfortunately, the program was ultimately implemented in only two (2) of the regions due to budgetary restrictions.
    • The treatment groups (treatment_group) are divided as follows:
      • Pure controls (treatment_group = 0): Farmers in sub-regions assigned to the control group in the first-stage.
      • Contaminated controls (treatment_group = 1): Farmers in sub-regions assigned to the treatment group in the first-stage and to the control group in the second-stage.
      • Direct Beneficiaries-Treated (treatment_group = 2): Farmers assigned to the treatment group in the second stage who used the voucher to buy the technology (compliers).
      • Direct Beneficiaries-Intended (treatment_group = 3): Farmers assigned to the treatment group in the second stage who did not use the voucher to buy the technology (non-compliers).
    • Midline data (year = 2014) was collected approximately a year after program implementation from a representative sample of pure controls, contaminated controls, and direct beneficiaries (treated + intended) in the two (2) regions that implemented the program. This dataset also includes additional pure controls from control sub-regions (i.e. assigned to the control group in the first-stage) within non-treated regions (n=4) that share a geographic border with the two (2) treated regions.
    • Only a sub-sample of the observations in the Midline data set have baseline.
    • I've estimated the impact on program compliers using an IV approach and Midline data as follows:
    Code:
    ivreg2 tvp_log $controls_region (treated = treatmentIV) if treatment_group != 1, cl(subregion)
    where tvp_log is a continuous variable (the log of the value of crop production), $controls_region includes two dummies (one for each treated region), treated is a dummy variable that takes the value of 1 for "Direct Beneficiaries-Treated", and treatmentIV is a dummy variable encoding the original randomization (=1 for "Direct Beneficiaries", both Treated and Intended). Following Abadie et al. (2017), I'm clustering at the subregion level because of the sampling design.

    HTML Code:
    IV (2SLS) estimation
    --------------------
    
    Estimates efficient for homoskedasticity only
    Statistics robust to heteroskedasticity and clustering on subregion
    
    Number of clusters (subregion) =     41               Number of obs =      569
                                                          F(  3,    40) =     5.38
                                                          Prob > F      =   0.0033
    Total (centered) SS     =  7485.021236                Centered R2   =  -0.1018
    Total (uncentered) SS   =  34295.37716                Uncentered R2 =   0.7595
    Residual SS             =  8246.950603                Root MSE      =    3.807
    
    ---------------------------------------------------------------------------------
                    |               Robust
            tvp_log |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ----------------+----------------------------------------------------------------
            treated |  -3.324562   .9876374    -3.37   0.001    -5.260296   -1.388828
       region_norte |   2.570444   .9290954     2.77   0.006     .7494499    4.391437
    region_suroeste |   1.364327   .9735643     1.40   0.161    -.5438235    3.272479
              _cons |   6.290601   .8743133     7.19   0.000     4.576979    8.004224
    ---------------------------------------------------------------------------------
    Underidentification test (Kleibergen-Paap rk LM statistic):              4.447
                                                       Chi-sq(1) P-val =    0.0350
    ------------------------------------------------------------------------------
    Weak identification test (Cragg-Donald Wald F statistic):               56.857
                             (Kleibergen-Paap rk Wald F statistic):         18.409
    Stock-Yogo weak ID test critical values: 10% maximal IV size             16.38
                                             15% maximal IV size              8.96
                                             20% maximal IV size              6.66
                                             25% maximal IV size              5.53
    Source: Stock-Yogo (2005).  Reproduced by permission.
    NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
    ------------------------------------------------------------------------------
    Hansen J statistic (overidentification test of all instruments):         0.000
                                                     (equation exactly identified)
    ------------------------------------------------------------------------------
    Instrumented:         treated
    Included instruments: region_norte region_suroeste
    Excluded instruments: treatmentIV
    ------------------------------------------------------------------------------
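    As a cross-check on a just-identified IV estimate like the one above, the 2SLS coefficient should equal the reduced-form (ITT) coefficient divided by the first-stage coefficient. The following is only a minimal sketch on simulated data: the variables y, d, and z and all parameter values are hypothetical, and it uses official ivregress rather than ivreg2 so that it runs without user-written packages.

```stata
* Simulated data only: y, d, z and all parameter values are hypothetical
clear
set seed 2014
set obs 600
generate byte z = runiform() < 0.5            // random assignment (instrument)
generate byte d = z & (runiform() < 0.6)      // take-up: only assigned units can be treated
generate double y = 6 + 0.5*d + rnormal()     // outcome with a true effect of 0.5

* Reduced form (ITT): effect of assignment on the outcome
regress y z
scalar itt = _b[z]

* First stage: effect of assignment on take-up
regress d z
scalar fs = _b[z]

* Wald ratio = ITT / first stage
scalar wald = itt / fs

* 2SLS with a single binary instrument should reproduce the Wald ratio
ivregress 2sls y (d = z)
scalar b_iv = _b[d]

display "Wald ratio = " wald "   2SLS = " b_iv
```

    With the real data, the same comparison would use tvp_log, treated, and treatmentIV with the same controls and sample restriction in all three regressions.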
    Question 1: I'm including dummies for each of the regions where the program was implemented (n=2). Since we have additional pure controls from regions where the program was not implemented (n=4), the reference "region" in this specification is effectively a combination of these 4 regions, treated as if they were a single region. If I try to control for all regions using i.region, ivreg2 issues a warning (shown below). I'm not 100% sure what the appropriate thing to do is here. Should I be using the partial option? Any advice?

    HTML Code:
    IV (2SLS) estimation
    --------------------
    .
    .
    skipped to save space
    .
    .
    ------------------------------------------------------------------------------
    Weak identification test (Cragg-Donald Wald F statistic):               56.555
                             (Kleibergen-Paap rk Wald F statistic):         18.312
    Stock-Yogo weak ID test critical values: 10% maximal IV size             16.38
                                             15% maximal IV size              8.96
                                             20% maximal IV size              6.66
                                             25% maximal IV size              5.53
    Source: Stock-Yogo (2005).  Reproduced by permission.
    NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
    ------------------------------------------------------------------------------
    Warning: estimated covariance matrix of moment conditions not of full rank.
             overidentification statistic not reported, and standard errors and
             model tests should be interpreted with caution.
    Possible causes:
             number of clusters insufficient to calculate robust covariance matrix
             singleton dummy variable (dummy with one 1 and N-1 0s or vice versa)
    partial option may address problem.
    ------------------------------------------------------------------------------
    Instrumented:         treated
    Included instruments: 3.region 5.region 6.region 7.region 8.region
    Excluded instruments: treatmentIV
    ------------------------------------------------------------------------------

    Results when Partialled-out:

    Code:
    ivreg2 tvp_log i.region (treated = treatmentIV) if treatment_group != 1, cl(subregion) partial(i.region)
    HTML Code:
    IV (2SLS) estimation
    --------------------
    
    Estimates efficient for homoskedasticity only
    Statistics robust to heteroskedasticity and clustering on subregion
    
    Number of clusters (subregion) =     41               Number of obs =      569
                                                          F(  1,    40) =    10.94
                                                          Prob > F      =   0.0020
    Total (centered) SS     =  7092.746986                Centered R2   =  -0.1372
    Total (uncentered) SS   =  7092.746986                Uncentered R2 =  -0.1372
    Residual SS             =  8066.186106                Root MSE      =    3.765
    
    ------------------------------------------------------------------------------------
                       |               Robust
              tvp_log  |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
               treated |  -3.324562   .9876374    -3.37   0.001    -5.260296   -1.388828
    ------------------------------------------------------------------------------------
    Underidentification test (Kleibergen-Paap rk LM statistic):              4.447
                                                       Chi-sq(1) P-val =    0.0350
    ------------------------------------------------------------------------------
    Weak identification test (Cragg-Donald Wald F statistic):               56.555
                             (Kleibergen-Paap rk Wald F statistic):         18.312
    Stock-Yogo weak ID test critical values: 10% maximal IV size             16.38
                                             15% maximal IV size              8.96
                                             20% maximal IV size              6.66
                                             25% maximal IV size              5.53
    Source: Stock-Yogo (2005).  Reproduced by permission.
    NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
    ------------------------------------------------------------------------------
    Hansen J statistic (overidentification test of all instruments):         0.000
                                                     (equation exactly identified)
    ------------------------------------------------------------------------------
    Instrumented:         treated
    Excluded instruments: treatmentIV
    Partialled-out:       3.region 5.region 6.region 7.region 8.region
                          _cons
                          nb: total SS, model F and R2s are after partialling-out;
                              any small-sample adjustments include partialled-out
                              variables in regressor count K
    ------------------------------------------------------------------------------
    I have different outcomes, including continuous (e.g., labor expenditures) and binary (e.g., =1 if used organic fertilizer, 0 otherwise) variables. You would expect access to irrigation to have a positive impact on total crop production (e.g., through multiple cropping cycles). However, the data was collected approximately a year after program implementation, so we're looking at short-term effects. After looking at the results for all of our outcome variables and exploring the mechanisms, we believe that farmers are going through a learning-by-doing process, so the effects are likely to take some time to become visible (e.g., if they are switching production from staple crops to fruits and veggies). Further, we find no spillover effects, which makes sense, since part of the reason the program was implemented is that we believe liquidity constraints are partly limiting the adoption of this technology.

    PART 2: Panel data - Baseline + Midline

    Out of the n=569 observations in the Midline data set, a subset (n=241) has baseline data, so I have a balanced panel of 241 individuals (482 observations). Since I already have this extra data, I want to explore the possibility of implementing diff-in-diff combined with IV in a panel data framework (using Chaisemartin (2010) as a reference), as follows:

    Code:
    xtset id year
    xtivreg tvp_log i.time i.treatmentIV (treated = time#treatmentIV), vce(cluster subregion) fe
    where id is the panel variable, year is the time variable (i.e., 2011 and 2014), time takes the value of 1 in 2014, and 0 in 2011.

    Code:
    Fixed-effects (within) IV regression            Number of obs     =        482
    Group variable: id                    Number of groups  =        241
    
    R-sq:                                           Obs per group:
         within  = 0.0214                                         min =          2
         between = 0.0022                                         avg =        2.0
         overall = 0.0036                                         max =          2
    
    
                                                    Wald chi2(2)      =       6.04
    corr(u_i, Xb)  = -0.0528                        Prob > chi2       =     0.0487
    
                                         (Std. Err. adjusted for 31 clusters in subregion)
    ------------------------------------------------------------------------------------
                       |               Robust
              tvp_log  |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
               treated |   -1.08324   1.096092    -0.99   0.323    -3.231542    1.065062
                1.time |   1.001031   .4776189     2.10   0.036     .0649155    1.937147
             1.treated |          0  (omitted)
                 _cons |   6.415951   .1569457    40.88   0.000     6.108344    6.723559
    -------------------+----------------------------------------------------------------
               sigma_u |  3.1655572
               sigma_e |  2.9455329
                   rho |  .53595748   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------------
    Instrumented:   treated
    Instruments:    1.time 1.treatmentIV 0b.time#1.treatmentIV 1.time#0b.treatmentIV
                    1.time#1.treatmentIV
    ------------------------------------------------------------------------------------
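    For reference, with two periods and a balanced panel, a fixed-effects IV specification of this kind should reduce to a ratio of two difference-in-differences (a Wald-DID). The notation below is assumed for illustration and does not come from the output: Z = 1 for farmers randomized to treatment (Treated or Intended) and 0 for pure controls, D is the treated dummy, Y is tvp_log, and bars denote sample means within each assignment group and year.

```latex
\hat{\beta}_{\text{Wald-DID}}
  = \frac{\left(\bar{Y}_{Z=1,\,2014} - \bar{Y}_{Z=1,\,2011}\right)
        - \left(\bar{Y}_{Z=0,\,2014} - \bar{Y}_{Z=0,\,2011}\right)}
         {\left(\bar{D}_{Z=1,\,2014} - \bar{D}_{Z=1,\,2011}\right)
        - \left(\bar{D}_{Z=0,\,2014} - \bar{D}_{Z=0,\,2011}\right)}
```

    The numerator is the ITT difference-in-differences and the denominator is the difference-in-differences in take-up. If nobody is treated in 2011 and pure controls are never treated, as the design in Part 1 suggests, the denominator is simply the share of compliers among assigned farmers in 2014.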
    Question 2: Based on the Imbens and Wooldridge (2007) lecture notes (pp. 2-3), I am wondering if I could use the set of "Contaminated controls" to implement a difference-in-difference-in-differences (DDD) method. Do you think that this extra analysis makes sense? I find this idea attractive since it would increase the number of observations in the panel while potentially controlling for two types of confounding trends: (i) changes in production outcomes across all farmers in all regions that are not associated with the program, and (ii) changes in the production of farmers in treated regions that are likely due to other local policies/shocks specific to these regions.


    PART 3: Panel data - Baseline + Midline + Endline

    We wanted to test the "learning-by-doing" hypothesis, so another round of data (Endline) was collected in 2019 (5 years after the Midline data was collected). This time, we collected data only on the set of farmers that had both Baseline and Midline data, so we have a balanced panel of 241 farmers (723 observations). Unfortunately, 16 of these farmers report not planting anything during the 2019 cycle, so for simplicity the following code uses the 225 farmers that have production at Baseline, Midline, and Endline (675 observations total). To summarize, we have baseline data (year = 2011), midline data (year = 2014; a year after the program was implemented), and endline data (year = 2019).

    I've been exploring several earlier posts across the Statalist forum, and so I know I can do the following:

    Code:
    xtset id year
    xtreg tvp_log i.treatmentIV##i.pre_post, cluster(subregion) fe
    where treatmentIV takes a value of 1 for all observations randomly assigned to treatment (both "Intended" and "Treated") but only in years 2014 and 2019 (0 for 2011), and 0 for "Pure Controls"; pre_post is a polychotomous variable encoding the time period (0=2011=baseline; 1=2014=midline; 2=2019=endline).

    HTML Code:
    note: 1.treatmentIV omitted because of collinearity
    
    Fixed-effects (within) regression               Number of obs      =       675
    Group variable: id                              Number of groups   =       225
    
    R-sq:  within  = 0.0662                         Obs per group: min =         3
           between = 0.0157                                        avg =       3.0
           overall = 0.0341                                        max =         3
    
                                                    F(4,30)            =      1.90
    corr(u_i, Xb)  = -0.0015                        Prob > F           =    0.1363
    
                                       (Std. Err. adjusted for 31 clusters in subregion)
    ----------------------------------------------------------------------------------
                     |               Robust
            tvcp_w_l |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -----------------+----------------------------------------------------------------
       1.treatmentIV |          0  (omitted)
                     |
            pre_post |
                  1  |   1.079436   .4839763     2.23   0.033     .0910244    2.067847
                  2  |  -1.103136    1.32543    -0.83   0.412    -3.810025    1.603753
                     |
    treated#treatmentIV |
                1 1  |  -.6645698   .5984921    -1.11   0.276    -1.886854    .5577142
                1 2  |   .6120616   1.428334     0.43   0.671    -2.304985    3.529109
                     |
               _cons |   6.431933   .2885257    22.29   0.000     5.842685    7.021181
    -----------------+----------------------------------------------------------------
             sigma_u |  2.7233863
             sigma_e |  3.3667456
                 rho |  .39552628   (fraction of variance due to u_i)
    ----------------------------------------------------------------------------------
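    Written out, the fixed-effects ITT specification above corresponds to something like the following equation. The symbols are assumed for illustration: Z_i = 1 for farmers randomized to treatment and 0 for pure controls, alpha_i is the farmer fixed effect, and 2011 is the omitted period.

```latex
Y_{it} = \alpha_i
       + \lambda_1 \,\mathbf{1}[t = 2014] + \lambda_2 \,\mathbf{1}[t = 2019]
       + \delta_1 \, Z_i \,\mathbf{1}[t = 2014] + \delta_2 \, Z_i \,\mathbf{1}[t = 2019]
       + \varepsilon_{it}
```

    Under this reading, the pre_post rows in the output would be lambda_1 (1.079) and lambda_2 (-1.103), and the two interaction rows would be the midline and endline ITT effects, delta_1 (-0.665) and delta_2 (0.612).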
    Question 3: The results shown above are based on the variable treatmentIV, so I'm estimating the intention-to-treat (ITT) effect. What is the appropriate way of coding the IV in this case? I've tried the following:

    Code:
    xtivreg tvp_log i.treatment_groupD##i.pre_post (treated = treatmentIV), vce(cluster subregion) fe
    where treatment_groupD is a dummy that takes the value of 1 for observations in the treatment group ("Intended" and "Treated") during all three years, and 0 for "pure controls"; pre_post is a polychotomous variable (0=2011=baseline; 1=2014=midline; 2=2019=endline); treated is a dummy variable that takes the value of 1 for "Direct Beneficiaries-Treated" but only in years 2014 and 2019 (0 in 2011), and 0 for "pure controls" and "Intended"; and treatmentIV takes the value of 1 for the treatment group ("Intended" and "Treated") in the years 2014 and 2019 only (0 in 2011), and 0 for the "pure controls" across all years.

    HTML Code:
    Fixed-effects (within) IV regression            Number of obs     =        675
    Group variable: id                              Number of groups  =        225
    
    R-sq:                                           Obs per group:
         within  = 0.0662                                         min =          3
         between = 0.0157                                         avg =        3.0
         overall = 0.0341                                         max =          3
    
    
                                                    Wald chi2(4)      =       7.60
    corr(u_i, Xb)  = -0.0015                        Prob > chi2       =     0.1074
    
                                         (Std. Err. adjusted for 31 clusters in subregion)
    ------------------------------------------------------------------------------------
                       |               Robust
              tvcp_w_l |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
               treated |          0  (omitted)
    1.treatment_groupD |          0  (omitted)
                       |
              pre_post |
                    1  |   1.079436   .4839763     2.23   0.026     .1308597    2.028012
                    2  |  -1.103136    1.32543    -0.83   0.405    -3.700931    1.494658
                       |
    treatment_groupD#pre_post |
                  1 1  |  -.6645698   .5984921    -1.11   0.267    -1.837593    .5084532
                  1 2  |   .6120616   1.428334     0.43   0.668    -2.187421    3.411545
                       |
                 _cons |   6.431933   .2885257    22.29   0.000     5.866433    6.997433
    -------------------+----------------------------------------------------------------
               sigma_u |  2.7233863
               sigma_e |  3.3705263
                   rho |  .39498974   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------------
    Instrumented:   officially_treated
    Instruments:    1.treatment_groupD 1.pre_post 2.pre_post 1.treatment_groupD#1.pre_post
                    1.treatment_groupD#2.pre_post treatmentIV
    ------------------------------------------------------------------------------------
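    To make the variable definitions above concrete, here is a minimal sketch that builds treatment_groupD, treated, and treatmentIV on a toy three-period panel. The six farmers and their group assignments are hypothetical; only the coding rules follow the description in this post, and treatment_group is assumed to be time-invariant and coded as in Part 1.

```stata
* Toy panel only: the six ids and their assignments are hypothetical
clear
set obs 6
generate id = _n
* ids 1-2: pure controls (0); ids 3-4: Treated (2); ids 5-6: Intended (3)
generate byte treatment_group = cond(id <= 2, 0, cond(id <= 4, 2, 3))
expand 3
bysort id: generate byte pre_post = _n - 1    // 0 = 2011, 1 = 2014, 2 = 2019

* Time-invariant group indicator: 1 for Intended and Treated, 0 for pure controls
generate byte treatment_groupD = inlist(treatment_group, 2, 3)

* Take-up: 1 for compliers, but only after the program started
generate byte treated = (treatment_group == 2) & (pre_post > 0)

* Assignment: 1 for the whole treatment group, but only after the program started
generate byte treatmentIV = treatment_groupD & (pre_post > 0)

* treatmentIV coded this way equals the sum of the two group-by-period interactions
assert treatmentIV == treatment_groupD*(pre_post == 1) + treatment_groupD*(pre_post == 2)

tabulate treatment_group treatmentIV if pre_post > 0
```

    Because treatmentIV built this way is an exact linear combination of the treatment_groupD#pre_post interactions already in the model, it may not provide separate identifying variation, which could be related to treated being reported as omitted. This is a conjecture to be checked against the real data.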


    Again, I'm sorry for the long post. This is my first time working with panel data and I'm exploring the most appropriate way of getting the correct treatment effect estimates.

    Thank you in advance for your time!

    Respectfully,

    Cesar






