Dear Statalist community,
I'm examining the impact of an agricultural technology adoption program and I'd like your advice on part of the analysis. Sorry for the long post; I'm trying to be as clear as possible about the data and analysis.
In the Part 1 IV specification, tvp_log is a continuous variable (the log of the value of crop production), $controls_region includes two dummies (one for each treated region), treated is a dummy variable that takes the value of 1 for "Direct Beneficiaries-Treated", and treatmentIV is a dummy variable for the original randomization (=1 for "Direct Beneficiaries", both Treated and Intended). Following Abadie et al. (2017), I'm clustering at the subregion level because of the sampling design.
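A minimal sketch for checking that the two dummies are coded as described, assuming the Midline data are in memory and that treatment_group uses the 0/1/2/3 coding of the four groups (pure controls, contaminated controls, treated, intended):

```stata
* Cross-tabulate the design groups against the two dummies
tabulate treatment_group treated, missing
tabulate treatment_group treatmentIV, missing

* treated should flag only group 2 (compliers);
* treatmentIV should flag groups 2 and 3 (everyone assigned in the second stage)
assert treated == (treatment_group == 2) if !missing(treatment_group)
assert treatmentIV == inlist(treatment_group, 2, 3) if !missing(treatment_group)
```

If either assert fails, Stata stops with an error, which makes a coding mismatch visible before any estimation.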
Question 1: I'm including dummies for each of the two regions where the program was implemented. Since we have additional pure controls from four regions where the program was not implemented, the reference "region" in this specification is effectively those four regions pooled as if they were a single region. If I try to control for all regions using i.region, ivreg2 issues a warning (shown below) that the estimated covariance matrix of moment conditions is not of full rank. I'm not sure what the appropriate approach is here. Should I be using the partial option? Any advice?
I have different outcomes, including continuous (e.g., labor expenditures) and binary (e.g., =1 if used organic fertilizer, 0 otherwise) variables. You would expect access to irrigation to have a positive impact on total crop production (e.g., through multiple cropping cycles). However, the data were collected approximately a year after program implementation, so we're looking at short-term effects. After looking at the results for all of our outcome variables and exploring the mechanisms, we believe that farmers are going through a learning-by-doing process, so the effects are likely to take some time to become visible (e.g., if they are switching production from staple crops to fruits and vegetables). Further, we find no spillover effects, which makes sense: part of the reason the program was implemented is that we believe liquidity constraints are limiting the adoption of this technology.
In the Part 2 panel specification, id is the panel variable, year is the time variable (2011 and 2014), and time takes the value of 1 in 2014 and 0 in 2011.
Question 2: Based on the Imbens and Wooldridge (2007) lecture notes (pg. 2-3), I am wondering if I could use the set of "Contaminated controls" to implement a difference-in-difference-in-differences (DDD) method. Do you think that this extra analysis makes sense? I find this idea attractive since it would increase the number of observations in the panel while potentially controlling for two types of confounding trends: changes in production outcomes across all farmers in all regions that are not associated with the program, and changes in production among farmers in treated regions that are likely due to other local policies or shocks specific to those regions.
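For reference, a generic triple-difference regression could be sketched as follows. The variables in_treated_region and elig are hypothetical (they do not appear in the data described here), and the sketch assumes the second grouping can be defined in both treated and non-treated regions, which may not hold for this design:

```stata
* Generic DDD sketch under the assumptions stated above
* in_treated_region: 1 if the farmer lives in one of the two program regions (hypothetical)
* elig:              1 for the group targeted by the program, 0 otherwise (hypothetical)
* time:              1 in 2014, 0 in 2011
regress tvp_log i.in_treated_region##i.elig##i.time, vce(cluster subregion)

* The DDD estimate is the coefficient on the triple interaction
lincom 1.in_treated_region#1.elig#1.time
```

If elig has no variation outside the treated regions, the triple interaction is dropped for collinearity, which would itself be informative about whether DDD is feasible here.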
In the Part 3 ITT specification, treatmentIV takes a value of 1 for all observations randomly assigned to treatment (both "Intended" and "Treated") but only in years 2014 and 2019 (0 in 2011), and 0 for "Pure Controls"; pre_post is a polychotomous variable encoding the time period (0=2011=baseline; 1=2014=midline; 2=2019=endline).
Question 3: The fixed-effects results from the xtreg specification in Part 3 are based on the variable treatmentIV, so I'm estimating the intention-to-treat (ITT) effect. What is the appropriate way of coding the IV in this case? I've tried the following:
where treatment_groupD is a dummy that takes the value of 1 for observations in the treatment group ("Intended" and "Treated") during all three years, and 0 for "pure controls"; pre_post is a polychotomous variable (0=2011=baseline; 1=2014=midline; 2=2019=endline); treated is a dummy variable that takes the value of 1 for "Direct Beneficiaries-Treated" but only in years 2014 and 2019 (0 in 2011), and 0 for "pure controls" and "Intended"; and treatmentIV takes the value of 1 for the treatment group ("Intended" and "Treated") in the years 2014 and 2019 only (0 in 2011), and 0 for the "pure controls" across all years.
PART 1: Brief summary of the program, data, and analysis
- Data come from a two-stage randomized experiment. The country is divided into regions (n=8) and sub-regions (n=129). Sub-regions are randomly assigned to the treatment group in the first stage, and then farmers within treatment sub-regions are randomly assigned to receive the treatment in the second stage. The design allows us to estimate direct and spillover effects.
- The treatment is a voucher that covers a percentage of the total cost of an irrigation technology + technical assistance. We're looking at smallholder farmers.
- Baseline data (year = 2011) was collected from a representative sample of farmers across regions/sub-regions. Unfortunately, the program was ultimately implemented in only two (2) of the regions due to budgetary restrictions.
- The treatment groups (treatment_group) are divided as follows:
- Pure controls (treatment_group = 0): Farmers in sub-regions assigned to the control group in the first-stage.
- Contaminated controls (treatment_group = 1): Farmers in sub-regions assigned to the treatment group in the first-stage and to the control group in the second-stage.
- Direct Beneficiaries-Treated (treatment_group = 2): Farmers assigned to the treatment group in the second-stage who used the voucher to buy the technology (compliers).
- Direct Beneficiaries-Intended (treatment_group = 3): Farmers assigned to the treatment group in the second-stage who did not use the voucher to buy the technology (non-compliers).
- Midline data (year = 2014) was collected approximately a year after program implementation from a representative sample of pure controls, contaminated controls, and direct beneficiaries (treated + intended) in the two (2) regions that implemented the program. This dataset also includes additional pure controls from control sub-regions (i.e. assigned to the control group in the first-stage) within non-treated regions (n=4) that share a geographic border with the two (2) treated regions.
- Only a sub-sample of the observations in the Midline data set have baseline data.
- I've estimated the impact on program compliers using an IV approach and Midline data as follows:
Code:
ivreg2 tvp_log $controls_region (treated = treatmentIV) if treatment_group != 1, cl(subregion)
HTML Code:
IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity and clustering on subregion

Number of clusters (subregion) =     41               Number of obs =      569
                                                      F(  3,    40) =     5.38
                                                      Prob > F      =   0.0033
Total (centered) SS     =  7485.021236                Centered R2   =  -0.1018
Total (uncentered) SS   =  34295.37716                Uncentered R2 =   0.7595
Residual SS             =  8246.950603                Root MSE      =    3.807

---------------------------------------------------------------------------------
                |               Robust
        tvp_log |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
        treated |  -3.324562   .9876374    -3.37   0.001    -5.260296   -1.388828
   region_norte |   2.570444   .9290954     2.77   0.006     .7494499    4.391437
region_suroeste |   1.364327   .9735643     1.40   0.161    -.5438235    3.272479
          _cons |   6.290601   .8743133     7.19   0.000     4.576979    8.004224
---------------------------------------------------------------------------------
Underidentification test (Kleibergen-Paap rk LM statistic):              4.447
                                                   Chi-sq(1) P-val =    0.0350
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):               56.857
                         (Kleibergen-Paap rk Wald F statistic):         18.409
Stock-Yogo weak ID test critical values: 10% maximal IV size             16.38
                                         15% maximal IV size              8.96
                                         20% maximal IV size              6.66
                                         25% maximal IV size              5.53
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Hansen J statistic (overidentification test of all instruments):         0.000
                                                 (equation exactly identified)
------------------------------------------------------------------------------
Instrumented:         treated
Included instruments: region_norte region_suroeste
Excluded instruments: treatmentIV
------------------------------------------------------------------------------
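Because the equation is exactly identified (one instrument, one endogenous regressor), the 2SLS coefficient on treated equals the reduced-form (ITT) coefficient divided by the first-stage coefficient. A minimal sketch of that check, assuming the same sample restriction and the same $controls_region global as in the ivreg2 call:

```stata
* First stage: effect of random assignment on actual take-up
regress treated treatmentIV $controls_region if treatment_group != 1, vce(cluster subregion)
scalar fs = _b[treatmentIV]

* Reduced form: effect of random assignment on the outcome (ITT)
regress tvp_log treatmentIV $controls_region if treatment_group != 1, vce(cluster subregion)
scalar itt = _b[treatmentIV]

* Wald ratio; should match the ivreg2 coefficient on treated
display "ITT = " scalar(itt) "   first stage = " scalar(fs)
display "ITT / first stage = " scalar(itt)/scalar(fs)
```

Looking at the two pieces separately can help judge whether a large IV coefficient comes from a large ITT or from a small first stage.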
HTML Code:
IV (2SLS) estimation
--------------------
.
.
skipped to save space
.
.
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):               56.555
                         (Kleibergen-Paap rk Wald F statistic):         18.312
Stock-Yogo weak ID test critical values: 10% maximal IV size             16.38
                                         15% maximal IV size              8.96
                                         20% maximal IV size              6.66
                                         25% maximal IV size              5.53
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Warning: estimated covariance matrix of moment conditions not of full rank.
         overidentification statistic not reported, and standard errors and
         model tests should be interpreted with caution.
Possible causes:
         number of clusters insufficient to calculate robust covariance matrix
         singleton dummy variable (dummy with one 1 and N-1 0s or vice versa)
partial option may address problem.
------------------------------------------------------------------------------
Instrumented:         treated
Included instruments: 3.region 5.region 6.region 7.region 8.region
Excluded instruments: treatmentIV
------------------------------------------------------------------------------
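One way to explore the "possible causes" listed in the warning is to count how many subregion clusters each region contributes to the estimation sample. This is only a diagnostic sketch; tag_sub is a hypothetical helper variable, and the sample condition is assumed to match the ivreg2 call:

```stata
* Tag one observation per region-subregion pair within the estimation sample
egen tag_sub = tag(region subregion) if treatment_group != 1 & !missing(tvp_log)

* Number of distinct subregion clusters per region
tabulate region if tag_sub == 1

* Variation in the instrument within each region
tabulate region treatmentIV if treatment_group != 1 & !missing(tvp_log)

drop tag_sub
```

A region dummy that is non-zero in only one cluster behaves much like the singleton dummy mentioned in the warning when standard errors are clustered.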
Results when Partialled-out:
Code:
ivreg2 tvp_log i.region (treated = treatmentIV) if treatment_group != 1, cl(subregion) partial(i.region)
HTML Code:
IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity and clustering on subregion

Number of clusters (subregion) =     41               Number of obs =      569
                                                      F(  1,    40) =    10.94
                                                      Prob > F      =   0.0020
Total (centered) SS     =  7092.746986                Centered R2   =  -0.1372
Total (uncentered) SS   =  7092.746986                Uncentered R2 =  -0.1372
Residual SS             =  8066.186106                Root MSE      =    3.765

------------------------------------------------------------------------------------
                   |               Robust
           tvp_log |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
           treated |  -3.324562   .9876374    -3.37   0.001    -5.260296   -1.388828
------------------------------------------------------------------------------------
Underidentification test (Kleibergen-Paap rk LM statistic):              4.447
                                                   Chi-sq(1) P-val =    0.0350
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):               56.555
                         (Kleibergen-Paap rk Wald F statistic):         18.312
Stock-Yogo weak ID test critical values: 10% maximal IV size             16.38
                                         15% maximal IV size              8.96
                                         20% maximal IV size              6.66
                                         25% maximal IV size              5.53
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Hansen J statistic (overidentification test of all instruments):         0.000
                                                 (equation exactly identified)
------------------------------------------------------------------------------
Instrumented:         treated
Excluded instruments: treatmentIV
Partialled-out:       3.region 5.region 6.region 7.region 8.region _cons
                      nb: total SS, model F and R2s are after partialling-out;
                          any small-sample adjustments include partialled-out
                          variables in regressor count K
------------------------------------------------------------------------------
PART 2: Panel data - Baseline + Midline
Out of the n=569 obs. in the Midline data set, a subset (n=241) have baseline data, so I have a balanced panel of 241 individuals (482 observations). Since I already have this extra data, I want to explore the possibility of implementing diff-in-diff combined with IV in a panel data framework (using as a reference Chaisemartin (2010)), as follows:
Code:
xtset id year
xtivreg tvp_log i.time i.treatmentIV (treated = time#treatmentIV), vce(cluster subregion) fe
Code:
Fixed-effects (within) IV regression            Number of obs     =        482
Group variable: id                              Number of groups  =        241

R-sq:                                           Obs per group:
     within  = 0.0214                                         min =          2
     between = 0.0022                                         avg =        2.0
     overall = 0.0036                                         max =          2

                                                Wald chi2(2)      =       6.04
corr(u_i, Xb)  = -0.0528                        Prob > chi2       =     0.0487

                             (Std. Err. adjusted for 31 clusters in subregion)
------------------------------------------------------------------------------------
                   |               Robust
           tvp_log |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
           treated |   -1.08324   1.096092    -0.99   0.323    -3.231542    1.065062
            1.time |   1.001031   .4776189     2.10   0.036     .0649155    1.937147
         1.treated |          0  (omitted)
             _cons |   6.415951   .1569457    40.88   0.000     6.108344    6.723559
-------------------+----------------------------------------------------------------
           sigma_u |  3.1655572
           sigma_e |  2.9455329
               rho |  .53595748   (fraction of variance due to u_i)
------------------------------------------------------------------------------------
Instrumented:   treated
Instruments:    1.time 1.treatmentIV 0b.time#1.treatmentIV 1.time#0b.treatmentIV
                1.time#1.treatmentIV
------------------------------------------------------------------------------------
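With individual fixed effects, any time-invariant assignment dummy is absorbed, so the only usable instrument is the assignment-by-post interaction. A minimal sketch of the same idea with a single explicit instrument, assuming that in this two-period panel treatmentIV is the time-invariant assignment dummy and treated equals 0 for everyone at baseline (z_post is a hypothetical name):

```stata
xtset id year

* Instrument: assigned to treatment AND observed in the post period
generate byte z_post = time * treatmentIV

* FE-IV: the time dummy is an included exogenous regressor, z_post is the excluded instrument
xtivreg tvp_log i.time (treated = z_post), fe vce(cluster subregion)
```

With two periods this is equivalent to a first-difference IV, i.e. the change in the outcome by assignment status divided by the change in take-up by assignment status.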
PART 3: Panel data - Baseline + Midline + Endline
We wanted to test the "learning-by-doing" hypothesis, so another round of data (Endline) was collected in 2019 (5 years after the Midline data was collected). This time, we collected data only on the set of farmers that had Baseline & Midline. So we have a balanced panel of 241 farmers (723 observations). Unfortunately, 16 of these farmers report not planting anything during the 2019 cycle, so for simplicity the following code takes into account 225 farmers that have production at Baseline, Midline, and Endline (675 observations total). To summarize, we have baseline data (year = 2011), midline data (year = 2014; a year after program was implemented), and endline data (year = 2019).
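The restriction to farmers with production in all three rounds can be made explicit with a flag rather than by dropping observations. A minimal sketch, assuming tvp_log is missing whenever a farmer reports no production in a round (n_rounds and balanced3 are hypothetical names):

```stata
* Count non-missing outcome observations per farmer across the three rounds
bysort id: egen n_rounds = count(tvp_log)

* Flag farmers observed with production in 2011, 2014, and 2019
generate byte balanced3 = (n_rounds == 3)

* One row per farmer: how many are in the balanced sample
egen tag_id = tag(id)
tabulate balanced3 if tag_id == 1
```

Adding "if balanced3 == 1" to the estimation commands then keeps the sample definition visible in the code.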
I've been exploring different posts across the Statalist forum, so I know I can do the following:
Code:
xtset id year
xtreg tvp_log i.treatmentIV##i.pre_post, cluster(subregion) fe
HTML Code:
note: 1.treatmentIV omitted because of collinearity

Fixed-effects (within) regression               Number of obs      =       675
Group variable: id                              Number of groups   =       225

R-sq:  within  = 0.0662                         Obs per group: min =         3
       between = 0.0157                                        avg =       3.0
       overall = 0.0341                                        max =         3

                                                F(4,30)            =      1.90
corr(u_i, Xb)  = -0.0015                        Prob > F           =    0.1363

                             (Std. Err. adjusted for 31 clusters in subregion)
-------------------------------------------------------------------------------------
                    |               Robust
           tvcp_w_l |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------------+----------------------------------------------------------------
      1.treatmentIV |          0  (omitted)
                    |
           pre_post |
                 1  |   1.079436   .4839763     2.23   0.033     .0910244    2.067847
                 2  |  -1.103136    1.32543    -0.83   0.412    -3.810025    1.603753
                    |
treated#treatmentIV |
               1 1  |  -.6645698   .5984921    -1.11   0.276    -1.886854    .5577142
               1 2  |   .6120616   1.428334     0.43   0.671    -2.304985    3.529109
                    |
              _cons |   6.431933   .2885257    22.29   0.000     5.842685    7.021181
--------------------+----------------------------------------------------------------
            sigma_u |  2.7233863
            sigma_e |  3.3667456
                rho |  .39552628   (fraction of variance due to u_i)
-------------------------------------------------------------------------------------
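To relate this regression to the learning-by-doing hypothesis, the midline and endline ITT coefficients can be compared directly. A minimal sketch, assuming it is run immediately after the xtreg command above so that the interaction coefficients carry the names Stata gives them for i.treatmentIV##i.pre_post:

```stata
* ITT at midline (2014) and at endline (2019), each relative to baseline
lincom 1.treatmentIV#1.pre_post
lincom 1.treatmentIV#2.pre_post

* Change in the ITT between midline and endline
lincom 1.treatmentIV#2.pre_post - 1.treatmentIV#1.pre_post

* Joint test that both post-period ITT effects are zero
testparm i.treatmentIV#i.pre_post
```

If the names differ, running "xtreg, coeflegend" after estimation lists the exact coefficient names to use in lincom.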
Code:
xtivreg tvp_log i.treatment_groupD##i.pre_post (treated = treatmentIV), vce(cluster subregion) fe
HTML Code:
Fixed-effects (within) IV regression            Number of obs     =        675
Group variable: id                              Number of groups  =        225

R-sq:                                           Obs per group:
     within  = 0.0662                                         min =          3
     between = 0.0157                                         avg =        3.0
     overall = 0.0341                                         max =          3

                                                Wald chi2(4)      =       7.60
corr(u_i, Xb)  = -0.0015                        Prob > chi2       =     0.1074

                             (Std. Err. adjusted for 31 clusters in subregion)
-------------------------------------------------------------------------------------------
                          |               Robust
                 tvcp_w_l |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------------------+----------------------------------------------------------------
                  treated |          0  (omitted)
       1.treatment_groupD |          0  (omitted)
                          |
                 pre_post |
                       1  |   1.079436   .4839763     2.23   0.026     .1308597    2.028012
                       2  |  -1.103136    1.32543    -0.83   0.405    -3.700931    1.494658
                          |
treatment_groupD#pre_post |
                     1 1  |  -.6645698   .5984921    -1.11   0.267    -1.837593    .5084532
                     1 2  |   .6120616   1.428334     0.43   0.668    -2.187421    3.411545
                          |
                    _cons |   6.431933   .2885257    22.29   0.000     5.866433    6.997433
--------------------------+----------------------------------------------------------------
                  sigma_u |  2.7233863
                  sigma_e |  3.3705263
                      rho |  .39498974   (fraction of variance due to u_i)
-------------------------------------------------------------------------------------------
Instrumented:   officially_treated
Instruments:    1.treatment_groupD 1.pre_post 2.pre_post
                1.treatment_groupD#1.pre_post 1.treatment_groupD#2.pre_post
                treatmentIV
-------------------------------------------------------------------------------------------
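In the output above, treated is omitted. A likely reason is that treatmentIV equals the sum of the two treatment_groupD-by-period interactions, which are already included as exogenous regressors, so no excluded instrument remains. One possible alternative coding, offered only as a sketch, uses the assignment-by-period interactions as the excluded instruments for period-specific take-up. The names treated_mid, treated_end, z_mid, and z_end are hypothetical:

```stata
xtset id year

* Period-specific take-up (endogenous) and assignment (instruments)
generate byte treated_mid = treated * (pre_post == 1)
generate byte treated_end = treated * (pre_post == 2)
generate byte z_mid = treatment_groupD * (pre_post == 1)
generate byte z_end = treatment_groupD * (pre_post == 2)

* FE-IV with period dummies as the only included exogenous regressors
xtivreg tvp_log i.pre_post (treated_mid treated_end = z_mid z_end), fe vce(cluster subregion)

* Does the effect on compliers change between midline and endline?
lincom treated_end - treated_mid
```

Replacing the two endogenous variables with the single variable treated, while keeping both instruments, would instead give one pooled effect across the two post periods.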
Again, I'm sorry for the long post. This is my first time working with panel data and I'm exploring the most appropriate way of getting the correct treatment effect estimates.
Thank you in advance for your time!
Respectfully,
Cesar