  • Calculate Pairwise Standard Deviation

    Hey everyone. I would appreciate some assistance with writing the calculation stage of an algorithm that should be quite useful to Stata users. In a recent paper, the authors repeatedly compute the pairwise standard deviation of the difference in outcome values between each treated unit i and each control unit, for every time period t prior to treatment. We then sort the resulting vector of standard deviations and look for the group of smallest values before treatment year T, so that we minimise the amount of variation or dispersion in the difference in outcome values prior to treatment (see equation 5, page 24). Alright, enough abstraction; let's put some meat on these bones with the Basque dataset. Here we have one treated unit (the Basque Country) and 16 donor-pool units. The unit I calculate the score for below is Catalunya (unit 10), an autonomous community very similar to the Basque Country.
    Code:
    clear *
    u "http://fmwww.bc.edu/repec/bocode/s/scul_basque.dta", clear
    
    qui xtset
    local lbl: value label `r(panelvar)'
    
    loc unit ="Basque Country (Pais Vasco)":`lbl'
    
    
    loc int_time = 1975
    
    qui xtset
    cls
    
    g treat = cond(`r(panelvar)'==`unit' & `r(timevar)' >= `int_time',1,0)
    
    keep treat gdp year id
    
    drop treat
    cls
    
    reshape wide gdpcap, j(id) i(year)
    
    order gdpcap`unit', a(year)
    
    /* Donor Selection Algorithm */
    
    keep gdpcap5 gdpcap10 year
    
    g diff = gdpcap5-gdpcap10 if year < 1975
    
    g diffsq = (gdpcap5-gdpcap10)^2 if year < 1975
    
    
    egen sumdiff = total(diff-diffsq) if year < 1975
    
    g sdscore10 = sqrt(sumdiff/year) if year < 1975
    cls
    l if year < 1975
    I think this is the basic code for it... but even if so, how would I extend this to the rest of the 16 control units (i.e., gdpcap1-gdpcap17)? My goal would be to create a score for each control unit. How might I begin tackling this challenge?

  • #2
    I noticed an error in my code: the paper says the formula is the difference between the treated unit and the control unit in the pre-intervention period, minus the expected value of that difference. They don't define the expected value operator, and leaving it as is means the numerator = 0. So I've revised the example above to:
    Code:
    clear *
    u "http://fmwww.bc.edu/repec/bocode/s/scul_basque.dta", clear
    
    qui xtset
    local lbl: value label `r(panelvar)'
    
    loc unit ="Basque Country (Pais Vasco)":`lbl'
    
    
    loc int_time = 1975
    
    qui xtset
    cls
    
    g treat = cond(`r(panelvar)'==`unit' & `r(timevar)' >= `int_time',1,0)
    
    keep treat gdp year id
    
    drop treat
    cls
    
    reshape wide gdpcap, j(id) i(year)
    
    order gdpcap`unit', a(year)
    
    /* Donor Selection Algorithm */
    
    keep gdpcap5 gdpcap10 year
    
    egen sumdiff = total((gdpcap5-gdpcap10)^2) if year < `int_time'
    
    g sdscore10 = sqrt(sumdiff/(`int_time'-year)) if year < `int_time'
    cls
    l if year < `int_time'
    I believe this is the correct idea, but I'd still be curious how to calculate this for the rest of my untreated units.
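    For reference, my reading of what the revised code computes (my notation, not a quote from the paper): with intervention year $T = 1975$ and Catalunya as donor 10,

    $$\textit{sdscore10}_{t} \;=\; \sqrt{\frac{\sum_{s<T}\bigl(\textit{gdpcap5}_{s}-\textit{gdpcap10}_{s}\bigr)^{2}}{T-t}}, \qquad t < T,$$

    so the sum of squared gaps in the numerator is the same in every pre-treatment row and only the denominator changes with $t$.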



    • #3
      I think I solved it... Here's the code.
      Code:
      clear *
      u "http://fmwww.bc.edu/repec/bocode/s/scul_basque.dta", clear
      
      qui xtset
      local lbl: value label `r(panelvar)'
      
      loc unit ="Basque Country (Pais Vasco)":`lbl'
      
      
      loc int_time = 1975
      
      qui xtset
      cls
      
      g treat = cond(`r(panelvar)'==`unit' & `r(timevar)' >= `int_time',1,0)
      
      keep treat gdp year id
      
      drop treat
      cls
      
      reshape wide gdpcap, j(id) i(year)
      
      order gdpcap`unit', a(year)
      cls
      
      /* Donor Selection Algorithm */
      
      //keep gdpcap5 gdpcap10 year
      
      qui foreach var of var gdpcap1-gdpcap17 {
          local varlab: var lab `var'  //extract the whole variable label string
          local vl = ustrregexrf("`varlab'", "^.*#" ,"")
          local vl = ustrregexrf("`vl'", ",", " ")
          macro list _var _varlab _vl
          local varlabm : word 1 of `vl'
          
          egen sumdiff`varlabm' = total((gdpcap5-gdpcap`varlabm')^2) if year < `int_time'
      
          //replace sumdiff`varlabm' = sumdiff`varlabm'^2 if year < `int_time'
      
          g sdscore`varlabm' = sqrt(sumdiff`varlabm'/(`int_time'-year)) if year < `int_time'
      
          drop sumdiff`varlabm'
          
          
      }
      cls
      greshape long gdpcap sdscore, i(year) j(id)
      
      xtset id year, y
      
      bys year: egen rank = rank(sds) if year < 1975
      cls
      xtset
      In THEORY, these (say, the top 10 ranking) are "the most comparable" units to the treated unit in the donor pool. If we had 500 donors, we could use a method like this to select the top x optimal donors, which is useful for any causal analysis. Any other comments on this would be most appreciated.
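      If it helps anyone reading along, here is one way to eyeball the ranking (a minimal sketch that assumes the variables created above, i.e. id, year, sdscore, and rank, are still in memory):
      Code:
      * list the ten best-ranked (lowest-score) donors in one pre-treatment year;
      * id 5 is the Basque Country itself, which scores 0 against itself, so drop it
      sort year rank
      l id sdscore rank if year == 1960 & rank <= 11 & id != 5, sep(0) noobs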



      • #4
        Hi Jared,

        I don't totally understand what you are doing here, but as long as we are putting meat on these theoretical bones, I wonder how you plan to move forward with testing and evaluating this algorithm to determine its correctness. Could you, for example, simulate a dataset (or, better, a set of datasets) where you know the top x optimal donors ahead of time? If so, you could test this code on that data to make sure that it correctly identifies those optimal donors.

        Some other questions: Are there any hidden "edge cases" which violate the assumptions of the algorithm? Is the next step to write a new and more general command?
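        (To make that suggestion concrete, here is a minimal simulation sketch under assumptions I'm inventing for illustration: donors 1-3 are built to track the treated series closely and donors 4-10 are not, so a correct scoring algorithm should rank donors 1-3 at the top.)
        Code:
        * simulated test bed: one treated series (y0) plus 10 donors over 20 pre-treatment years
        clear
        set seed 2023
        set obs 20
        g year = 1954 + _n
        g y0 = 3 + 0.15*_n + rnormal(0, 0.05)            // treated unit's outcome
        forvalues j = 1/10 {
            if `j' <= 3 {
                g y`j' = y0 + rnormal(0, 0.05)           // "good" donors: track the treated unit
            }
            else {
                g y`j' = 1 + 0.05*_n + rnormal(0, 0.5)   // "bad" donors: different level and trend
            }
        }
        * run the scoring loop on y1-y10 against y0; donors 1-3 should get the lowest scores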



        • #5
          Some substantive questions: what is going on here anyway? Usually, when you're trying to optimize a system, you have a set of constraints, and you're trying to maximize or minimize some outcome based on those constraints. So what are you trying to maximize, or what are you trying to minimize?

          It almost feels like you have units in a multidimensional space, and you are trying to find the closest set of units in the control group to a given unit in the treatment group. Almost like a nearest-neighbors problem? Am I on the right track here, or am I way off base?

          Edit: Can you also say a bit about why this is useful? I'm sure it's obvious to you, but I'm not familiar with this line of research.



          • #6
            Hey Daniel Schaefer. I'd honestly have to talk to the authors (I think they got it from some other published paper which I've not read yet). I can talk a little about the empirical case I give here, since it's one I can explain conceptually. I'll ramble a bit, but I think it's important, at least for other folks who may be concerned about causal inference. Good papers that talk about the general method we're using are here, here, and here; read them when boredom strikes. To summarize, this is useful when you've got a set of treated units and want to pre-process your untreated units so that the treated and untreated units are more comparable (think difference-in-differences or synthetic controls).

            This is the data from the Basque study, the famous case (in causal-inference econometrics, anyhow) where Abadie and his advisor study the causal impact of terrorism in the Basque Country on GDP per capita. They developed the first synthetic control methodology, a method which is ostensibly designed to select the "best" comparison units for a given treated unit in the pre-intervention period. Precisely, SC is a weighted average of comparison units which is designed to replicate the observed trajectory of a treated unit in the pre-period. In this case, we have many donors/untreated units and only one treated unit. To reduce interpolation biases, it is common to restrict our donor pool of untreated units to units we believe already share much in common with the treated unit.

            But what about cases where we have 100 untreated units? Or 1000? What if we have panel data in high dimensions, where the selection of a control group might not be obvious? Normal SCM works in moderate to low dimensions, but it runs into problems when we have many, many donors due to overfitting. So a slew of methods have been developed in recent years for settings with one target unit and many, many donors. One such method, developed by my coworker, solves this problem with machine learning/functional principal component analysis. Other methods have also been proposed.

            Anyway, enough rambling: the original SCM selected Catalunya and Madrid as donors. This makes sense because Catalunya is super close to the Basque Country, and they share similar political and cultural backgrounds. Madrid also makes sense because it is the only area of the nation wealthier than the Basque Country at the time. So we can be confident that the synthetic control is like the real Basque Country not just on observed covariates, but on unobserved covariates too (see my under-review Stata Journal paper for more on this point).
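            In textbook notation (this is the standard setup, nothing specific to my algorithm here), the synthetic control for treated unit 1 is

            $$\hat{y}^{\,SC}_{1t} \;=\; \sum_{j=2}^{J+1} w_j\, y_{jt}, \qquad w_j \ge 0, \quad \sum_{j=2}^{J+1} w_j = 1,$$

            with the weights chosen so that the weighted average tracks $y_{1t}$ as closely as possible over the pre-intervention periods $t < T$.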

            However, the authors match on 13 covariates to come up with these donors. What if we didn't have any covariates; how would we do causal inference in this setting then? Well, if we believe that two curves which track each other closely over a long period of time belong to units that are more similar to each other, it stands to reason that if we can use a method to come up with the optimal donors in advance, we may not need covariates to do causal inference in econometrics. To illustrate the point, consider the graph:
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input long id double(year gdpcap)
             5 1955  3.853184630005267
             5 1956 3.9456582961508766
             5 1957  4.033561734872626
             5 1958  4.023421896896646
             5 1959  4.013781968405232
             5 1960  4.285918396222732
             5 1961  4.574336095797406
             5 1962  4.898957353563045
             5 1963  5.197014981629133
             5 1964 5.3389029787527225
             5 1965  5.465153005251848
             5 1966  5.545915627064143
             5 1967  5.614895726639487
             5 1968 5.8521849330715785
             5 1969 6.0814054173695915
             5 1970   6.17009424134957
             5 1971  6.283633404546246
             5 1972 6.5555553986528405
             5 1973  6.810768561103078
             5 1974  7.105184302810804
            10 1955 3.5466296303037304
            10 1956 3.6904455695415153
            10 1957 3.8268349981757446
            10 1958 3.8756783776063983
            10 1959 3.9217367338405547
            10 1960  4.241788200023208
            10 1961  4.575335478925664
            10 1962  4.838046411962654
            10 1963  5.081334096368672
            10 1964  5.158097878145307
            10 1965  5.223650525073194
            10 1966 5.3324765050387395
            10 1967 5.4294489207045755
            10 1968  5.674378853530688
            10 1969  5.915523944191169
            10 1970  6.066837871936143
            10 1971  6.227649208206138
            10 1972  6.539060129025636
            10 1973  6.837975056094464
            10 1974  6.987360823804806
            12 1955  1.243430483310572
            12 1956  1.332547847831702
            12 1957  1.422450706571281
            12 1958 1.4402313799015014
            12 1959 1.4580834221751375
            12 1960 1.5358469139606876
            12 1961 1.5962581639969495
            12 1962  1.705584162407281
            12 1963  1.817694966093672
            12 1964 1.8828192904008454
            12 1965  1.948871846813299
            12 1966 2.0326334896921194
            12 1967 2.1176091673305395
            12 1968 2.2455012243293213
            12 1969 2.3819623487876234
            12 1970 2.5184947332291223
            12 1971  2.654027516621925
            12 1972 2.8464009236758248
            12 1973  3.045629889785954
            12 1974 3.1030419008352528
            end
            format %ty year
            label values id regionname
            label def regionname 5 "Basque Country (Pais Vasco)", modify
            label def regionname 10 "Cataluna", modify
            label def regionname 12 "Extremadura", modify
            
            xtset id year, y
            
            twoway (tsline gdp if id==5, lcol(black) lwidth(thick)) ///
            (tsline gdp if id==10, lwidth(medium)) ///
            (tsline gdp if id==12, lwidth(medium)), ///
            legend(order(1 "Basque Country" 2 "Catalunya" 3 "Extremadura") ///
            pos(0) ring(0) region(fcol(none))) ///
            yti("GDP per Capita/1000") ///
            tti(Year)
            We can see that Catalunya is an obvious choice that a reasonable algorithm would select. Why? They're close geographically, culturally, and politically, and they also have almost exactly the same GDPs in many cases. Extremadura, by comparison, is super poor and very far geographically from the Basque Country, so there's no reason why it should be a good match for the Basque Country. There are others (La Rioja and Asturias, for example), but the central goal is selecting the optimal donors. Why? Well, to impute a good counterfactual, we need to ensure that the treated group is as comparable as possible to the control group. And if we select the "right" donors, the "synthetic" control is unbiased, and we can say that our assignment of intervention is as good as random.

            Edge cases: my algorithm here ranks units based on their scores. As we can see, Catalonia is the first (for all years, surprisingly), meaning it is the most like the Basque Country. I originally worried about "What if there's a tie in the rankings", but the only way for that to happen would be for the outcomes for two units to be exactly the same, and that pretty much doesn't happen in real life. The next step would be to use this algorithm in my scul estimator or other estimators I'm currently writing to make parallel trends more viable or synthetic controls more trustworthy.
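            (If anyone does hit a tie, a minimal guard is to ask egen's rank() for unique ranks, which breaks ties arbitrarily; this assumes the sdscore variable from the code above.)
            Code:
            * break exact ties arbitrarily instead of assigning averaged ranks
            bys year: egen rank2 = rank(sdscore) if year < 1975, unique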
            Some substantive questions: what is going on here anyway? Usually, when you're trying to optimize a system, you have a set of constraints, and you're trying to maximize or minimize some outcome based on those constraints. So what are you trying to maximize, or what are you trying to minimize?
            I agree! The way the authors present it is a little odd, since we usually have constraints here... They say
            The purpose of this algorithm is to minimise the amount of variation or dispersion in the difference in outcome values prior to treatment
            So you're quite right: it is very much like a nearest-neighbors algorithm. I actually hadn't thought about it like that until you mentioned it, so this is a great way to think about it. The authors in their case literally have thousands of donors; they experiment with the "best" subsets of 100, 200, 300, and 500 donors, so we can immediately see why an algorithm to reduce dimensionality can be useful when selecting your "optimal" control group is tricky. I'm working on a paper about this exact subject now with my coworker Mani (I cited his paper above); we'll be writing about a host of these algorithms/panel-data approaches in econometrics/computer science to see which algorithms select the "best" donors under high dimensionality, high volume/velocity, and missing data. Simulations will be a core aspect of our work, and when we're done (or along the way) I'll integrate some of the Python code we use into a Stata Journal paper.



            • #7
              Thanks for the thorough explanation and great resources. I've been slowly trying to absorb this stuff over the last week or so. This is some very cool work, and I look forward to hearing more about it. Do you see yourself developing any user contributed commands to do this kind of thing?



              • #8
                SC is sensitive to the donor pool, as it should be, but sometimes in seemingly strange ways. Culling the pool is helpful, and this looks like a useful and purely mechanical way to do so. I've tried several methods but not this one (but close, e.g., RMSE).
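                For concreteness, a minimal sketch of the RMSE flavour I have in mind, for a single treated/donor pair in the pre-period (y_treated and y_donor are placeholder variable names, and 1975 stands in for the intervention year):
                Code:
                * root-mean-squared pre-period gap between one treated and one donor series
                qui count if year < 1975
                loc npre = r(N)
                egen ssd = total((y_treated - y_donor)^2) if year < 1975
                g rmse_score = sqrt(ssd/`npre') if year < 1975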



                • #9
                  Daniel Schaefer

                  Do you see yourself developing any user contributed commands to do this kind of thing?
                  Indeed. Arguably, my SCUL estimator is already a step in this direction (my paper describes how it picks 3 donors from a pool of 650), but it is not good enough yet because it relies on the LASSO to do the donor selection, which, while not bad, also isn't ideal. I mentioned above that I'm working with a man named Mani Bayani (well, I cited his paper). His is one of the few papers that does donor selection prior to estimation. He and I are working together on a paper right now that compares lots of these SCM algorithms to each other in settings of high dimensionality, high volume/velocity, missing data, and noise.

                  Mani is an R/Python user (the dude is a beast: bachelor's in math, master's in computer science and engineering, Ph.D. in econ), so much of our work will be in Python... but we'll also be developing new algorithms of our own, experimenting with different computer-science/operations-research algorithms to pre-process donors prior to estimation. Lots of machine learning and tensor math. What I'll do on my end is take these Python algorithms and situate them in Stata commands by running them under the hood. It would be a really important step for causal inference in Stata, in my opinion. I could send you the bare-bones outline of what we've got planned, if you want. Lots of empirical applications, lots of synthetic studies with different DGPs.

                  SC is sensitive to the donor pool, as it should be but sometimes in seemingly strange ways.
                  Agreed, unfortunately. As I said above, part of my work is to decrease this sensitivity by selecting the optimal donors. The approach I outline above is sort of crude, and there are much better ways to do it, but as soon as we develop the algorithms in Python, I'll get them into Stata and send them to the SJ so the world can play with them.
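                  To give a flavour of what "running these under the hood" could look like, here is a bare-bones sketch (requires Stata 16+ with Python configured; the command and function names are hypothetical, not our actual estimator):
                  Code:
                  * skeleton of a Stata command that hands a variable to Python and prints a statistic
                  program define pyscore
                      version 16
                      syntax varname(numeric)
                      python: py_mean("`varlist'")
                  end

                  python:
                  from sfi import Data
                  def py_mean(varname):
                      # pull the Stata variable into Python; assumes no missing values
                      vals = Data.get(varname)
                      print("mean of", varname, "=", sum(vals) / len(vals))
                  end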



                  • #10
                    This may be of interest to anyone who uses synthetic control methods, so I'll answer the question I originally posed. A coworker and I are working on a paper about donor selection in SCM, and he helped explain the formula to me. Here is an example of how we'd apply it to a real dataset.
                    Code:
                    clear *
                    
                    cls
                    
                    
                    import delim "https://docs.google.com/spreadsheets/d/e/2PACX-1vQIxhDcQAU0-yTPHLFulP36bYQYTQBHFk0b8DGwefZ7c4KXuwvM3wcbO2IYJ5izM3AIFVcCDdDlf_FY/pub?gid=732338764&single=true&output=csv", clear
                    
                    su city_id if is_barce==1
                    g date = date(yyyy_mm_dd, "YMD")
                    
                    
                    g treat = cond(is_barce==1 & date >=`=td(07jul2015)',1,0)
                    
                    format date %td
                    
                    cls
                    tempvar obs_count
                    bys city_id (date): g `obs_count' = _N
                    qui su `obs_count'
                    
                    
                    levelsof city_id if `obs_count' ~=`r(max)', l(city)
                    
                    drop if city_id == `city' | date >= td(01aug2017)
                    
                    g year = year(date)
                    
                    g weeknum = week(date)
                    
                    g week = yw(year, weeknum)
                    
                    
                    
                    format %tw week 
                    //2013-06-05  2013-22
                    
                    
                    drop date
                    drop yyy
                    cls
                    
                    destring index, dpcomma replace
                    
                    egen id = group(city)
                    replace id=id-1
                    drop city
                    *keep if mediterr ==1 | is_barce ==1
                    cls
                    
                    gcollapse (mean) index, by(mediterr is_barce id week)
                    *drop if quar == .
                    
                    tempvar city
                    
                    g `city' = cond(is_barce==1,"Barcelona","Donor ")
                    
                    egen id2 = concat(`city' id) if is_barce==0
                    
                    replace id2 = "Barcelona" if id2==""
                    
                    
                    labmask id, values(id2)
                    drop id2
                    
                    cls
                    
                    //-52
                    
                    drop is
                    
                    rename (week ind) (date price)
                    
                    
                    g treat = cond(id==29 & date >=`=tw(2015w28)',1,0)
                    
                    
                    xtset id date, w
                    
                    local lbl: value label `r(panelvar)'
                    
                    
                    loc unit ="Barcelona":`lbl'
                    So I'm just cleaning the dataset. Now we begin to choose the donors which are "most" similar in trends to Barcelona in the pre-intervention period:
                    Code:
                    g score = .
                     
                     
                    qui levelsof id if id != `unit', loc(ids)
                    
                    qui foreach x of loc ids {
                        
                    cap frame drop minframe
                    
                    frame put id date price if (id ==`unit' | id ==`x') & date < `=tw(2015w28)', into(minframe)
                    
                    
                    
                    frame minframe {
                    greshape wide price, j(id) i(date)
                    
                    order price`unit', a(date)
                    
                    
                    qui ds // Gets a list of all variables
                    
                    loc t: word 2 of `r(varlist)'
                    
                    loc treated_unit: disp "`t'" // T for treated
                    
                    loc a: word 3 of `r(varlist)'
                    
                    loc donor_unit: disp "`a'" // First donor unit...
                    
                    
                    g diff = `treated_unit'-`donor_unit'
                    
                    egen meandiff = mean(`treated_unit'-`donor_unit')
                    
                    g diffmdsq = (diff - meandiff)^2
                    
                    egen summed = total(diffmdsq)
                    
                    qui su date
                    
                    g division= summed/`=r(N)'
                    
                    g score = division^.5
                    
                    su score, mean
                    
                    loc score = r(mean)
                    
                    frame default: replace score = `score' if id ==`x'
                    }
                    }
                    cls
                    //sort score
                    //br if date == `=tw(2015w25)'
                    egen rank = rank(score) if date == `=tw(2011w01)'
                    distinct id if medi==1 & date == `=tw(2011w01)' & inrange(rank,1,16)
                    
                    qui xtset
                    local lbl: value label `r(panelvar)'
                    
                    
                    loc unit ="Barcelona":`lbl'
                    
                    levelsof `r(panelvar)' if inrange(rank,1,16), loc(units) sep(",")
                    cls
                    
                    qui xtset
                    
                    line price date if inlist(id,`units'), connect(L) /// <-- the key option
                        lcolor("gs12") || line price date if id ==29, lcol(black) || line price date ///
                        if !inlist(id,`units',29), connect(L) lcol(red%5),, ///
                        legend(order(1 "Donors" 2 "Barcelona" 3 "Unselected Donors") ring(0) pos(11)) xli(`=tw(2015w28)')
                    The graph tells the story much better (or so I try, anyway!). The 16 greyscale donors are compared against the 66 donors that were not chosen. The 16 selected donors sit in a similar band of the time-series plot and also show less fluctuation than the noisier donors with more volatile price trajectories (or at least, their fluctuations are more similar to Barcelona's). The fact that the algorithm selects 10 donors on the Mediterranean Sea also reassures us that the donors are alike on unobserved factors that might drive the evolution of the weekly price data.
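                    Written out (my notation, not the paper's), the score the loop assigns to each donor $j$ is just the population standard deviation of the pre-period gaps:

                    $$s_j \;=\; \sqrt{\frac{1}{T_0}\sum_{t<T}\bigl(D_{jt}-\bar{D}_j\bigr)^{2}}, \qquad D_{jt}=y_{\text{Barcelona},t}-y_{jt},$$

                    where $T_0$ is the number of pre-intervention weeks and $\bar{D}_j$ is the mean gap; a smaller $s_j$ means donor $j$'s price series moves more like Barcelona's before the intervention.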

                    Someday I'll likely write a simple command for this in Stata and try to publish it as a Stata Tip for anyone who might find it useful.

