Hello,
I am attempting to replicate a series of doubly robust diff in diff analyses. The original analysis was done in R and I am trying to replicate in Stata. However, I cannot get the estimates to match and so I am looking for any advice on why this mismatch may be occurring and what I might do to resolve the issue.
In terms of my initial process, I took the R code used by the original researchers and drafted equivalent syntax for Stata. I then ran the Stata code on a .dta file provided by the original researchers. This Stata code ran successfully but the estimated ATET produced by Stata tended to be larger in magnitude and have smaller standard errors. For example, for a model of stunting prevalence, the original paper finds an ATET of -0.0785 (0.0511) and my Stata code returns an ATET of -0.1011 (0.0223). The original R code and the proposed Stata code are pasted below.
To resolve the difference I have attempted the following:
Thanks!
Chris
Original R syntax
##
## Set up data ##
##
# Assign time indicator (pre- or post- treatment)
d_haz <- d_haz %>%
mutate(time = ifelse(year == 2012, 0, 1),
wealth_index=as_factor(wealth_index)) %>%
select(-year,)
d_stunting <- d_stunting %>%
mutate(time = ifelse(year == 2012, 0, 1),
wealth_index=as_factor(wealth_index)) %>%
select(-year)
# Create dummy variables
dummy_haz <- recipe(zscore_ha ~ ., data = d_haz) %>%
step_dummy(all_nominal_predictors(),one_hot = FALSE)
dummy_stunting <- recipe(stunted ~ ., data = d_stunting) %>%
step_dummy(all_nominal_predictors(), one_hot = FALSE)
# Prepare and apply transformations
d_haz_1 <- dummy_haz %>%
prep() %>%
bake(new_data = NULL)
d_stunting_1 <- dummy_stunting %>%
prep() %>%
bake(new_data = NULL)
# Set up DRDID code for stunting
# create list of covariates
stunting_covariates <- c("child_age","mother_bmi", "area_urban",
"altitude_meters", "livestock", "hhead_female",
"hh_floor_finished","km100_min_dist_Viotociv", "km100_min_dist_Explosions",
"km100_min_dist_Battles",
"wealth_index_poorer", "wealth_index_middle",
"wealth_index_richer", "wealth_index_richest"
)
drdid_result_stunted<- drdid_rc(
y = d_stunting_1$stunted, # Outcome variable
post = d_stunting_1$time, # Time indicator (1 = post-treatment, 0 = pre-treatment)
D = d_stunting_1$treatment, # Treatment indicator (1 = treated, 0 = control)
covariates = as.matrix(d_stunting_1[, stunting_covariates, drop = FALSE]), # Covariates (ensure it's a matrix)
i.weights = NULL, # Optional weights (set NULL for equal weights)
boot = FALSE, # Set TRUE if you want bootstrap inference
inffunc = FALSE # Set TRUE if you need influence functions
)
# Control mean for stunting, pre-treatment
control_mean_stunted <- mean(d_stunting_1$stunted[d_stunting_1$treatment == 0 & d_stunting_1$time == 0], na.rm = TRUE)
cat("Control mean (Stunted):", control_mean_stunted, "\n")
Stata replication syntax
* -----------------------------------------------------------
* DRDID for Stunting
* -----------------------------------------------------------
use "$input/stunted_analysis_km100.dta", clear
drop if missing(stunted)
gen time = (year > 2012)
tabulate wealth_index, generate(wealth_index_)
* Run DRDID for stunting
drdid stunted child_age mother_bmi area_urban altitude_meters i.livestock i.hhead_female i.hh_floor_finished i.km100_min_dist_Viotociv i.km100_min_dist_Explosions i.km100_min_dist_Battles wealth_cat1 wealth_cat2 wealth_cat3 wealth_cat4 wealth_cat5, time(year) tr(treatment) all
I am attempting to replicate a series of doubly robust diff in diff analyses. The original analysis was done in R and I am trying to replicate in Stata. However, I cannot get the estimates to match and so I am looking for any advice on why this mismatch may be occurring and what I might do to resolve the issue.
In terms of my initial process, I took the R code used by the original researchers and drafted equivalent syntax for Stata. I then ran the Stata code on a .dta file provided by the original researchers. This Stata code ran successfully but the estimated ATET produced by Stata tended to be larger in magnitude and have smaller standard errors. For example, for a model of stunting prevalence, the original paper finds an ATET of -0.0785 (0.0511) and my Stata code returns an ATET of -0.1011 (0.0223). The original R code and the proposed Stata code are pasted below.
To resolve the difference I have attempted the following:
- I have spoken to my colleague who is an R user. He is able to run the original R code and replicate the original results precisely.
- In case there was an issue with the provided data file, I used R to open the datafile used by the original group and export a .csv file. I then reran the analysis in Stata using this data file. This did not change the results.
- I confirmed that the number of observations and the pre-treatment means for the outcome variables are identical between the original analysis and my attempt at replication.
- I confirmed that R and Stata were using identical time and treatment variables and that dummy variables were leaving out the same reference category.
- I emailed Francesco Rios-Avila, one of the authors of the Stata drdid package and asked if he had any idea why I might be getting different results. His only suggestion was related to propensity score trimming. R automatically drops observations with propensity scores greater than 0.995 or less than 0.005, while the current version of the Stata drdid package does not. I manually generated the propensity scores, dropped those greater than 0.995 and less than 0.005 and reran the model. However, this did not change any results because no observations were dropped with this trimming.
- I experimented with more restrictive p-score limits. When we restrict the drdid model to observations with propensity scores less than 0.9 and greater than 0.1, the resulting ATET is larger and more significant (ATET: -0.1237 SE: 0.0214), moving further away form the original results..
Thanks!
Chris
Original R syntax
##
## Set up data ##
##
# Assign time indicator (pre- or post- treatment)
d_haz <- d_haz %>%
mutate(time = ifelse(year == 2012, 0, 1),
wealth_index=as_factor(wealth_index)) %>%
select(-year,)
d_stunting <- d_stunting %>%
mutate(time = ifelse(year == 2012, 0, 1),
wealth_index=as_factor(wealth_index)) %>%
select(-year)
# Create dummy variables
dummy_haz <- recipe(zscore_ha ~ ., data = d_haz) %>%
step_dummy(all_nominal_predictors(),one_hot = FALSE)
dummy_stunting <- recipe(stunted ~ ., data = d_stunting) %>%
step_dummy(all_nominal_predictors(), one_hot = FALSE)
# Prepare and apply transformations
d_haz_1 <- dummy_haz %>%
prep() %>%
bake(new_data = NULL)
d_stunting_1 <- dummy_stunting %>%
prep() %>%
bake(new_data = NULL)
# Set up DRDID code for stunting
# create list of covariates
stunting_covariates <- c("child_age","mother_bmi", "area_urban",
"altitude_meters", "livestock", "hhead_female",
"hh_floor_finished","km100_min_dist_Viotociv", "km100_min_dist_Explosions",
"km100_min_dist_Battles",
"wealth_index_poorer", "wealth_index_middle",
"wealth_index_richer", "wealth_index_richest"
)
drdid_result_stunted<- drdid_rc(
y = d_stunting_1$stunted, # Outcome variable
post = d_stunting_1$time, # Time indicator (1 = post-treatment, 0 = pre-treatment)
D = d_stunting_1$treatment, # Treatment indicator (1 = treated, 0 = control)
covariates = as.matrix(d_stunting_1[, stunting_covariates, drop = FALSE]), # Covariates (ensure it's a matrix)
i.weights = NULL, # Optional weights (set NULL for equal weights)
boot = FALSE, # Set TRUE if you want bootstrap inference
inffunc = FALSE # Set TRUE if you need influence functions
)
# Control mean for stunting, pre-treatment
control_mean_stunted <- mean(d_stunting_1$stunted[d_stunting_1$treatment == 0 & d_stunting_1$time == 0], na.rm = TRUE)
cat("Control mean (Stunted):", control_mean_stunted, "\n")
Stata replication syntax
* -----------------------------------------------------------
* DRDID for Stunting
* -----------------------------------------------------------
use "$input/stunted_analysis_km100.dta", clear
drop if missing(stunted)
gen time = (year > 2012)
tabulate wealth_index, generate(wealth_index_)
* Run DRDID for stunting
drdid stunted child_age mother_bmi area_urban altitude_meters i.livestock i.hhead_female i.hh_floor_finished i.km100_min_dist_Viotociv i.km100_min_dist_Explosions i.km100_min_dist_Battles wealth_cat1 wealth_cat2 wealth_cat3 wealth_cat4 wealth_cat5, time(year) tr(treatment) all
Comment