
  • Principal Component Regression and Synthetic Controls

    For a project I'm doing, I want to use a synthetic control estimator that is robust to noise and missing data. Recent papers have argued that one way of doing this is via principal component analysis (PCA). In fact, they call it principal component regression (PCR), which ostensibly denoises and debiases the outcomes matrix in the pre-intervention period while imputing the potential outcome.
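
    To fix ideas, here is my reading of the setup, in my own notation rather than necessarily the papers': stack the donor outcomes into a T x N matrix Y and assume Y = M + E, where M is a low-rank signal matrix and E is noise. Keeping only the top k principal components of Y (a truncated SVD) is the denoising step; regressing the treated unit's pre-intervention outcomes on the denoised donors and carrying the fitted coefficients forward is what imputes the post-intervention potential outcome.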

    Well, I know there's no Stata command to do this (or if there is, I'd almost pay to find out about it). So I looked at similar papers that imputed counterfactuals via PCA. After contacting the authors, the lead author kindly sent me their code, and I tried it on the Smoking Dataset for Proposition 99, the canonical SCM dataset installed with the user-written Synth package for Stata. I also tried it on the dataset from the original SCM paper, the Basque Country and terrorism study. Below is my code, adapted from Li et al.
    Code:
    import delim "https://raw.githubusercontent.com/synth-inference/synthdid/master/data/california_prop99.csv", clear
    
    /* Robust SCM/PCR doesn't really need covariates to
    approximate the counterfactual produced by classic SCM.
    This one only includes the outcomes.
    
    Here, California (unit id 3 after the grouping below) is the treated unit.
    
    Treatment begins in 1989.
    */
    
    egen id = group(state) // makes a unique numeric ID for each state
    
    xtset id year, y // we now have yearly panel data
    
    drop state treated // irrelevant for our purposes
    
    rename packs sale // I didn't like the original variable name
    
    reshape wide sale, j(id) i(year) // greshape (from gtools) is a faster alternative
    
    loc stub sale
    
    cls
    
    * Gets the principal components; here's the code from Li et al.
    pca `stub'1-`stub'3 `stub'4-`stub'39
    
    egen `stub'_d = rowmean(`stub'1-`stub'39) // average across all 39 units (California included)
    qui gen `stub'_dd = `stub'3 - `stub'_d // California minus that average
    qui sum `stub'_dd if year < 1989, mean // pre-treatment years only
    dis "pre-1989 average difference is " r(mean)
    qui g `stub'_da = `stub'_d + r(mean) // shift the average by the pre-period gap
    cls
    tw (line `stub'3 year, lcol(black)) (line `stub'_da year, lcol(red)), ///
     xli(1989) ///
     legend(off) ///
     text(100 1980 "Observed California", color(black)) ///
     text(90 1980 "Synthetic California", color(red))
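    (As far as I can tell, nothing after the pca call actually uses the principal components: the plotted counterfactual is just the equal-weighted average of all units, shifted by the pre-period gap. That is part of what prompts my question below.)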
    Okay, so I must be honest: I had never heard of PCR before. I've never used it (or seen it used) in empirical work until quite recently. I'd never really needed PCA either, though I'm familiar with what it does.

    My question is essentially for anyone who's more familiar with PCA/PCR than I am: is this pretty much all there is to PCA in this context? I ask because the pre-intervention fit is nowhere near as good as classic SCM's, yet the counterfactual predictions are about the same. Might there be a better way to estimate causal effects using PCA/PCR here? The code approximates the SCM estimator well; I just want to make sure I'm implementing the method correctly, since there's no Stata command for it.

  • #2
    Hey, if anyone cares, I think I have a starting solution that anyone else is free to extend.
    Code:
    import delim "https://raw.githubusercontent.com/synth-inference/synthdid/master/data/california_prop99.csv", clear
    
    /* Robust SCM/PCR doesn't really need covariates to
    approximate the counterfactual produced by classic SCM.
    This one only includes the outcomes.
    
    Here, California (unit id 3 after the grouping below) is the treated unit.
    
    Treatment begins in 1989.
    */
    
    egen id = group(state) // makes a unique numeric ID for each state
    
    xtset id year, y // we now have yearly panel data
    
    drop state treated // irrelevant for our purposes
    
    rename packs sale // I didn't like the original variable name
    
    greshape wide sale, j(id) i(year) // I use greshape; the built-in reshape works too
    tsset year, y // after reshaping, year indexes the observations
    loc stub sale
    
    cls
    
    * Gets the principal components of the donor pool; the treated unit (sale3) is excluded
    qui: pca `stub'1-`stub'2 `stub'4-`stub'39
    
    qui {
    cap drop pred*
    predict double pred* // principal-component scores
    }
    
    cls
    
    ssc inst elasticregress, replace // lassoregress lives in elasticregress, not lassopack
    
    lassoregress `stub'3 pred* if year < 1989 // LASSO of California on the scores, pre-period only
    
    predict cf, xb // linear prediction over the full sample: the counterfactual
    
    tw (line `stub'3 year, lcol(black)) ///
        (line cf year, lcol(red)) ///
        if year < 2020, ///
    xli(1989) legend(off) ylabel(0(20)120)
    Essentially, I reshape the dataset so that time indexes the observations and the units become the variables. We extract the principal components of the donor-pool outcomes, fit a LASSO of the treated unit on the component scores using only the pre-intervention years, and then linearly predict the post-intervention outcomes (feel free to use lasso linear or cvlasso instead; they produce similar results, and I sketch the lasso linear version below). Notice how this approach, with no covariates at all, predicts a counterfactual similar to the classic SCM that uses covariates. Not perfectly, but pretty darn close. If anyone has ideas about how I could improve this method (say, with k-means clustering in high-dimensional settings), I'd love to hear them. Either way, it's an interesting alternative to SCM estimation.
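
    For anyone on Stata 16 or newer, here is a minimal sketch of the built-in lasso alternative I mentioned. It assumes you have already run the code above (so the pred* score variables exist); the rseed value is arbitrary and only pins down the cross-validation folds:
    Code:
    * built-in lasso: cross-validated penalty chosen on the pre-period only
    lasso linear sale3 pred* if year < 1989, rseed(1234)
    predict double cf2 // penalized linear prediction (the default) over all years
    tw (line sale3 cf2 year), xli(1989) legend(off)
    The selected penalty, and hence which components survive, can differ across commands, so small differences between this counterfactual and the lassoregress one are expected.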



    • #3
      One day soon, I'll update this post with the Python code to do this. If I can translate it to Stata, I will.



      • #4
        That is fantastic news! Looking forward to it, and I'm quite sure I'm not the only one.



        • #5
          Honestly, the Python code is public, so I'm not exactly saving it for myself.


          However, the devil is in the details. The Python code by itself assumes a lot about your data structure (for instance, that you're working with a wide dataset), so my code really just automates away the boring details that many authors won't bother to handle.
