
  • Principal Component Regression and Synthetic Controls

    For a project I'm doing, I want to use a synthetic control estimator that is robust to noise and missing data. Recent papers have argued that one way of doing this is via principal component analysis (PCA). In fact, they call it principal component regression (PCR), which ostensibly denoises and debiases the outcomes matrix in the pre-intervention period while imputing the potential outcome.
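
    To fix ideas, here is my reading of the setup, in my own notation rather than necessarily the papers': stack the donor outcomes into a T x N matrix Y and assume Y = M + E, where M is a low-rank signal matrix and E is noise. Keeping only the top k principal components of Y (a truncated SVD) is the denoising step; regressing the treated unit's pre-intervention outcomes on the denoised donors and carrying the fitted coefficients forward is what imputes the post-intervention potential outcome.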

    Well, I know there's no Stata command to do this (or if there is, I'd almost pay to find out about it). So I looked at similar papers that imputed counterfactuals via PCA. After contacting the authors, the lead author kindly sent me their code, and I tried it on the Smoking Dataset for Proposition 99, the canonical SCM dataset installed with the user-written Synth package for Stata. I also tried it on the dataset from the original SCM paper, the Basque Country and terrorism study. Below is my code, adapted from Li et al.
    Code:
    import delim "https://raw.githubusercontent.com/synth-inference/synthdid/master/data/california_prop99.csv", clear
    
    /* Robust SCM/PCR doesn't really need covariates to
    approximate the counterfactual produced by classic SCM.
    This one only includes the outcomes.
    
    Here, California (unit id 3 after the grouping below) is the treated unit.
    
    Treatment begins in 1989.
    */
    
    egen id = group(state) // makes a unique numeric ID for each state
    
    xtset id year, y // we now have yearly panel data
    
    drop state treated // irrelevant for our purposes
    
    rename packs sale // I didn't like the original variable name
    
    reshape wide sale, j(id) i(year) // greshape (from gtools) is a faster alternative
    
    loc stub sale
    
    cls
    
    * Gets the principal components; here's the code from Li et al.
    pca `stub'1-`stub'3 `stub'4-`stub'39
    
    egen `stub'_d = rowmean(`stub'1-`stub'39) // average across all 39 units (California included)
    qui gen `stub'_dd = `stub'3 - `stub'_d // California minus that average
    qui sum `stub'_dd if year < 1989, mean // pre-treatment years only
    dis "pre-1989 average difference is " r(mean)
    qui g `stub'_da = `stub'_d + r(mean) // shift the average by the pre-period gap
    cls
    tw (line `stub'3 year, lcol(black)) (line `stub'_da year, lcol(red)), ///
     xli(1989) ///
     legend(off) ///
     text(100 1980 "Observed California", color(black)) ///
     text(90 1980 "Synthetic California", color(red))
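    (As far as I can tell, nothing after the pca call actually uses the principal components: the plotted counterfactual is just the equal-weighted average of all units, shifted by the pre-period gap. That is part of what prompts my question below.)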
    Okay, so I must be honest: I had never heard of PCR before. I've never used it (or seen it used) in empirical work until quite recently. I'd never really needed PCA either, though I'm familiar with what it does.

    My question is essentially for anyone who's more familiar with PCA/PCR than I am: is this pretty much all there is to PCA in this context? I ask because the pre-intervention fit is nowhere near as good as classic SCM's, yet the counterfactual predictions are about the same. Might there be a better way to estimate causal effects using PCA/PCR here? The code approximates the SCM estimator well; I just want to make sure I'm implementing the method correctly, since there's no Stata command for it.

  • #2
    Hey, if anyone cares, I think I have a starting solution that anyone else is free to extend.
    Code:
    import delim "https://raw.githubusercontent.com/synth-inference/synthdid/master/data/california_prop99.csv", clear
    
    /* Robust SCM/PCR doesn't really need covariates to
    approximate the counterfactual produced by classic SCM.
    This one only includes the outcomes.
    
    Here, California (unit id 3 after the grouping below) is the treated unit.
    
    Treatment begins in 1989.
    */
    
    egen id = group(state) // makes a unique numeric ID for each state
    
    xtset id year, y // we now have yearly panel data
    
    drop state treated // irrelevant for our purposes
    
    rename packs sale // I didn't like the original variable name
    
    greshape wide sale, j(id) i(year) // I use greshape; the built-in reshape works too
    tsset year, y // after reshaping, year indexes the observations
    loc stub sale
    
    cls
    
    * Gets the principal components of the donor pool; the treated unit (sale3) is excluded
    qui: pca `stub'1-`stub'2 `stub'4-`stub'39
    
    qui {
    cap drop pred*
    predict double pred* // principal-component scores
    }
    
    cls
    
    ssc inst elasticregress, replace // lassoregress lives in elasticregress, not lassopack
    
    lassoregress `stub'3 pred* if year < 1989 // LASSO of California on the scores, pre-period only
    
    predict cf, xb // linear prediction over the full sample: the counterfactual
    
    tw (line `stub'3 year, lcol(black)) ///
        (line cf year, lcol(red)) ///
        if year < 2020, ///
    xli(1989) legend(off) ylabel(0(20)120)
    Essentially, I reshape the dataset so that time indexes the observations and the units become the variables. We extract the principal components of the donor-pool outcomes, fit a LASSO of the treated unit on the component scores using only the pre-intervention years, and then linearly predict the post-intervention outcomes (feel free to use lasso linear or cvlasso instead; they produce similar results, and I sketch the lasso linear version below). Notice how this approach, with no covariates at all, predicts a counterfactual similar to the classic SCM that uses covariates. Not perfectly, but pretty darn close. If anyone has ideas about how I could improve this method (say, with k-means clustering in high-dimensional settings), I'd love to hear them. Either way, it's an interesting alternative to SCM estimation.
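
    For anyone on Stata 16 or newer, here is a minimal sketch of the built-in lasso alternative I mentioned. It assumes you have already run the code above (so the pred* score variables exist); the rseed value is arbitrary and only pins down the cross-validation folds:
    Code:
    * built-in lasso: cross-validated penalty chosen on the pre-period only
    lasso linear sale3 pred* if year < 1989, rseed(1234)
    predict double cf2 // penalized linear prediction (the default) over all years
    tw (line sale3 cf2 year), xli(1989) legend(off)
    The selected penalty, and hence which components survive, can differ across commands, so small differences between this counterfactual and the lassoregress one are expected.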



    • #3
      One day soon, I'll update this post with the Python code to do this. If I can translate it to Stata, I will.



      • #4
        That is fantastic news! Looking forward to it, and I'm quite sure I'm not the only one.



        • #5
          Honestly, the Python code is public, so I'm not exactly saving it for myself.


          However, the devil is in the details. The Python code by itself assumes a lot about your data structure (for instance, that you're working with a wide dataset), so my code really just automates away the boring details that many authors won't bother to handle.
