Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replicating drdid analysis from R in Stata

    Hello,

    I am attempting to replicate a series of doubly robust diff in diff analyses. The original analysis was done in R and I am trying to replicate in Stata. However, I cannot get the estimates to match and so I am looking for any advice on why this mismatch may be occurring and what I might do to resolve the issue.

    In terms of my initial process, I took the R code used by the original researchers and drafted equivalent syntax for Stata. I then ran the Stata code on a .dta file provided by the original researchers. This Stata code ran successfully but the estimated ATET produced by Stata tended to be larger in magnitude and have smaller standard errors. For example, for a model of stunting prevalence, the original paper finds an ATET of -0.0785 (0.0511) and my Stata code returns an ATET of -0.1011 (0.0223). The original R code and the proposed Stata code are pasted below.

    To resolve the difference I have attempted the following:
    • I have spoken to my colleague who is an R user. He is able to run the original R code and replicate the original results precisely.
    • In case there was an issue with the provided data file, I used R to open the datafile used by the original group and export a .csv file. I then reran the analysis in Stata using this data file. This did not change the results.
    • I confirmed that the number of observations and the pre-treatment means for the outcome variables are identical between the original analysis and my attempt at replication.
    • I confirmed that R and Stata were using identical time and treatment variables and that dummy variables were leaving out the same reference category.
    • I emailed Francesco Rios-Avila, one of the authors of the Stata drdid package and asked if he had any idea why I might be getting different results. His only suggestion was related to propensity score trimming. R automatically drops observations with propensity scores greater than 0.995 or less than 0.005, while the current version of the Stata drdid package does not. I manually generated the propensity scores, dropped those greater than 0.995 and less than 0.005 and reran the model. However, this did not change any results because no observations were dropped with this trimming.
    • I experimented with more restrictive p-score limits. When we restrict the drdid model to observations with propensity scores less than 0.9 and greater than 0.1, the resulting ATET is larger and more significant (ATET: -0.1237 SE: 0.0214), moving further away form the original results..
    Not sure what else to do at this point but please let me know if you have any suggestions.

    Thanks!

    Chris


    Original R syntax

    ##
    ## Set up data ##
    ##
    # Assign time indicator (pre- or post- treatment)
    d_haz <- d_haz %>%
    mutate(time = ifelse(year == 2012, 0, 1),
    wealth_index=as_factor(wealth_index)) %>%
    select(-year,)

    d_stunting <- d_stunting %>%
    mutate(time = ifelse(year == 2012, 0, 1),
    wealth_index=as_factor(wealth_index)) %>%
    select(-year)

    # Create dummy variables
    dummy_haz <- recipe(zscore_ha ~ ., data = d_haz) %>%
    step_dummy(all_nominal_predictors(),one_hot = FALSE)

    dummy_stunting <- recipe(stunted ~ ., data = d_stunting) %>%
    step_dummy(all_nominal_predictors(), one_hot = FALSE)


    # Prepare and apply transformations
    d_haz_1 <- dummy_haz %>%
    prep() %>%
    bake(new_data = NULL)

    d_stunting_1 <- dummy_stunting %>%
    prep() %>%
    bake(new_data = NULL)

    # Set up DRDID code for stunting

    # create list of covariates
    stunting_covariates <- c("child_age","mother_bmi", "area_urban",
    "altitude_meters", "livestock", "hhead_female",
    "hh_floor_finished","km100_min_dist_Viotociv", "km100_min_dist_Explosions",
    "km100_min_dist_Battles",
    "wealth_index_poorer", "wealth_index_middle",
    "wealth_index_richer", "wealth_index_richest"
    )


    drdid_result_stunted<- drdid_rc(
    y = d_stunting_1$stunted, # Outcome variable
    post = d_stunting_1$time, # Time indicator (1 = post-treatment, 0 = pre-treatment)
    D = d_stunting_1$treatment, # Treatment indicator (1 = treated, 0 = control)
    covariates = as.matrix(d_stunting_1[, stunting_covariates, drop = FALSE]), # Covariates (ensure it's a matrix)
    i.weights = NULL, # Optional weights (set NULL for equal weights)
    boot = FALSE, # Set TRUE if you want bootstrap inference
    inffunc = FALSE # Set TRUE if you need influence functions
    )

    # Control mean for stunting, pre-treatment
    control_mean_stunted <- mean(d_stunting_1$stunted[d_stunting_1$treatment == 0 & d_stunting_1$time == 0], na.rm = TRUE)
    cat("Control mean (Stunted):", control_mean_stunted, "\n")


    Stata replication syntax

    * -----------------------------------------------------------
    * DRDID for Stunting
    * -----------------------------------------------------------
    use "$input/stunted_analysis_km100.dta", clear

    drop if missing(stunted)
    gen time = (year > 2012)
    tabulate wealth_index, generate(wealth_index_)

    * Run DRDID for stunting

    drdid stunted child_age mother_bmi area_urban altitude_meters i.livestock i.hhead_female i.hh_floor_finished i.km100_min_dist_Viotociv i.km100_min_dist_Explosions i.km100_min_dist_Battles wealth_cat1 wealth_cat2 wealth_cat3 wealth_cat4 wealth_cat5, time(year) tr(treatment) all

  • #2
    I work in both R and Stata (but not on drdid problems) and know that you rarely get exact agreement, especially with user-written programs. That said, it looks like the R data-munging is pretty significant as compared to the setup for Stata. Can you confirm that the stunted_analysis_km100.dta dataset has already gone through either the same data management process as is shown in the R code?

    Comment


    • #3
      Hi Erik

      I have attempted to replicate the same data process in Stata as is reflected in the R code. This includes dropping any observations missing the outcome variable, generating a time variable where 2012=0 and 2018=1, and turning the categorical wealth index variable into 5 dummy variables. There are some other issues referenced in the R code that are handled in a similar way by default in Stata so I made no adjustments. For example, by leaving out the ivar option in my code, Stata defaults to cross sectional data whereas in R you have to specifically indicate that your data is cross sectional.

      All that said, I am not an R user. So if you see some other adjustment to the data in the R code, please let me know.

      Chris

      Comment


      • #4
        Is the data publicly available? That would help a lot.

        Comment


        • #5
          Yes, its DHS data. I have attached the file.

          Chris
          Attached Files

          Comment


          • #6
            Thank you. To be clear, is this the data file you have already done the manipulation on or is it the base data file that the R code above was applied to?

            Comment


            • #7
              This is the file that the R code and the Stata code are applied to

              Comment


              • #8
                Hi Chris,

                Looking through the R code and using the data you provided, I think that your Stata code is indeed doing the same thing. The one thing I am unsure of is that in the R code you provided, two data files are mentioned (d_haz and d_stunting). They both seem to contain the same variables except for the outcome (zscor_ha vs stunted), and I am guessing you provided the d_stunting version.

                The DRDID authors provide simulated data in the package (see the example code for the sim_rc data). You could try estimating the ATT in R and Stata's drdid. I did so and go really different ATT estimates. I'm not sure why.

                Comment


                • #9
                  Thanks for taking a look Erik. I guess the mystery continues! I will take a look at the documentation you shared.

                  Comment

                  Working...
                  X