  • Calculate Pairwise Standard Deviation

    Hey everyone. I would appreciate some assistance with writing the calculation stage of an algorithm that should be quite useful to Stata users. In a recent paper, the authors repeatedly compute the pairwise standard deviation of the difference in outcome values between each treated unit i and each control unit, for every time period t prior to treatment. We then sort the resulting vector of standard deviations and look for the group of smallest values before treatment year T, so that we minimise the amount of variation or dispersion in the difference in outcome values prior to treatment (see equation 5, page 24). Alright, enough abstraction; let's put some meat on these bones with the Basque dataset. Here we have one treated unit (the Basque Country) and 16 donor-pool units. The unit I calculate the score for below is Catalunya (unit 10), an autonomous community very similar to the Basque Country.
    Code:
    clear *
    u "http://fmwww.bc.edu/repec/bocode/s/scul_basque.dta", clear
    
    qui xtset
    local lbl: value label `r(panelvar)'
    
    loc unit ="Basque Country (Pais Vasco)":`lbl'
    
    
    loc int_time = 1975
    
    qui xtset
    cls
    
    g treat = cond(`r(panelvar)'==`unit' & `r(timevar)' >= `int_time',1,0)
    
    keep treat gdp year id
    
    drop treat
    cls
    
    reshape wide gdpcap, j(id) i(year)
    
    order gdpcap`unit', a(year)
    
    /* Donor Selection Algorithm */
    
    keep gdpcap5 gdpcap10 year
    
    g diff = gdpcap5-gdpcap10 if year < 1975
    
    g diffsq = (gdpcap5-gdpcap10)^2 if year < 1975
    
    
    egen sumdiff = total(diff-diffsq) if year < 1975
    
    g sdscore10 = sqrt(sumdiff/year) if year < 1975
    cls
    l if year < 1975
    I think this is the basic code for it... but even if so, how would I extend this to the rest of the 16 control units (i.e., gdpcap1-gdpcap17)? My goal would be to create a score for each control unit. How might I begin tackling this challenge?

  • #2
    I noticed an error in my code: the paper says the formula is the difference between the treated unit and the control unit in the pre-intervention period, minus the expected value of that difference. They don't define the expected value operator, and leaving it as is means the numerator = 0. So I've revised the example above to:
    Code:
    clear *
    u "http://fmwww.bc.edu/repec/bocode/s/scul_basque.dta", clear
    
    qui xtset
    local lbl: value label `r(panelvar)'
    
    loc unit ="Basque Country (Pais Vasco)":`lbl'
    
    
    loc int_time = 1975
    
    qui xtset
    cls
    
    g treat = cond(`r(panelvar)'==`unit' & `r(timevar)' >= `int_time',1,0)
    
    keep treat gdp year id
    
    drop treat
    cls
    
    reshape wide gdpcap, j(id) i(year)
    
    order gdpcap`unit', a(year)
    
    /* Donor Selection Algorithm */
    
    keep gdpcap5 gdpcap10 year
    
    egen sumdiff = total((gdpcap5-gdpcap10)^2) if year < `int_time'
    
    g sdscore10 = sqrt(sumdiff/(`int_time'-year)) if year < `int_time'
    cls
    l if year < `int_time'
    I believe this is the correct idea, but I'd still be curious how to calculate this for the rest of my untreated units.
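    For reference, my reading of what the revised code computes (my notation, not a quote from the paper): with intervention year $T = 1975$ and Catalunya as donor 10,

    $$\textit{sdscore10}_{t} \;=\; \sqrt{\frac{\sum_{s<T}\bigl(\textit{gdpcap5}_{s}-\textit{gdpcap10}_{s}\bigr)^{2}}{T-t}}, \qquad t < T,$$

    so the sum of squared gaps in the numerator is the same in every pre-treatment row and only the denominator changes with $t$.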



    • #3
      I think I solved it... Here's the code.
      Code:
      clear *
      u "http://fmwww.bc.edu/repec/bocode/s/scul_basque.dta", clear
      
      qui xtset
      local lbl: value label `r(panelvar)'
      
      loc unit ="Basque Country (Pais Vasco)":`lbl'
      
      
      loc int_time = 1975
      
      qui xtset
      cls
      
      g treat = cond(`r(panelvar)'==`unit' & `r(timevar)' >= `int_time',1,0)
      
      keep treat gdp year id
      
      drop treat
      cls
      
      reshape wide gdpcap, j(id) i(year)
      
      order gdpcap`unit', a(year)
      cls
      
      /* Donor Selection Algorithm */
      
      //keep gdpcap5 gdpcap10 year
      
      qui foreach var of var gdpcap1-gdpcap17 {
          local varlab: var lab `var'  //extract the whole variable label string
          local vl = ustrregexrf("`varlab'", "^.*#" ,"")
          local vl = ustrregexrf("`vl'", ",", " ")
          macro list _var _varlab _vl
          local varlabm : word 1 of `vl'
          
          egen sumdiff`varlabm' = total((gdpcap5-gdpcap`varlabm')^2) if year < `int_time'
      
          //replace sumdiff`varlabm' = sumdiff`varlabm'^2 if year < `int_time'
      
          g sdscore`varlabm' = sqrt(sumdiff`varlabm'/(`int_time'-year)) if year < `int_time'
      
          drop sumdiff`varlabm'
          
          
      }
      cls
      greshape long gdpcap sdscore, i(year) j(id)
      
      xtset id year, y
      
      bys year: egen rank = rank(sds) if year < 1975
      cls
      xtset
      In THEORY, these (say, the top 10 ranking) are "the most comparable" units to the treated unit in the donor pool. If we had 500 donors, we could use a method like this to select the top x optimal donors, which is useful for any causal analysis. Any other comments on this would be most appreciated.
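      If it helps anyone reading along, here is one way to eyeball the ranking (a minimal sketch that assumes the variables created above, i.e. id, year, sdscore, and rank, are still in memory):
      Code:
      * list the ten best-ranked (lowest-score) donors in one pre-treatment year;
      * id 5 is the Basque Country itself, which scores 0 against itself, so drop it
      sort year rank
      l id sdscore rank if year == 1960 & rank <= 11 & id != 5, sep(0) noobs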



      • #4
        Hi Jared,

        I don't totally understand what you are doing here, but as long as we are putting meat on these theoretical bones, I wonder how you plan to move forward with testing and evaluating this algorithm to determine its correctness. Could you, for example, simulate a dataset (or, better, a set of datasets) where you know the top x optimal donors ahead of time? If so, you could test this code on that data to make sure that it correctly identifies those optimal donors.

        Some other questions: Are there any hidden "edge cases" which violate the assumptions of the algorithm? Is the next step to write a new and more general command?
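        (To make that suggestion concrete, here is a minimal simulation sketch under assumptions I'm inventing for illustration: donors 1-3 are built to track the treated series closely and donors 4-10 are not, so a correct scoring algorithm should rank donors 1-3 at the top.)
        Code:
        * simulated test bed: one treated series (y0) plus 10 donors over 20 pre-treatment years
        clear
        set seed 2023
        set obs 20
        g year = 1954 + _n
        g y0 = 3 + 0.15*_n + rnormal(0, 0.05)            // treated unit's outcome
        forvalues j = 1/10 {
            if `j' <= 3 {
                g y`j' = y0 + rnormal(0, 0.05)           // "good" donors: track the treated unit
            }
            else {
                g y`j' = 1 + 0.05*_n + rnormal(0, 0.5)   // "bad" donors: different level and trend
            }
        }
        * run the scoring loop on y1-y10 against y0; donors 1-3 should get the lowest scores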



        • #5
          Some substantive questions: what is going on here anyway? Usually, when you're trying to optimize a system, you have a set of constraints, and you're trying to maximize or minimize some outcome based on those constraints. So what are you trying to maximize, or what are you trying to minimize?

          It almost feels like you have units in a multidimensional space, and you are trying to find the closest set of units in the control group to a given unit in the treatment group. Almost like a nearest-neighbors problem? Am I on the right track here, or am I way off base?

          Edit: Can you also say a bit about why this is useful? I'm sure it's obvious to you, but I'm not familiar with this line of research.



          • #6
            Hey Daniel Schaefer. I'd honestly have to talk to the authors (I think they got it from some other published paper which I've not read yet). I can talk a little about the empirical case I give here, since it's one I can explain conceptually. I'll ramble a bit, but I think it's important, at least for other folks who may be concerned about causal inference. Good papers that talk about the general method we're using are here, here, and here; read them when boredom strikes. To summarize, this is useful when you've got a set of treated units and want to pre-process your untreated units so that the treated and untreated units are more comparable (think difference-in-differences or synthetic controls).

            This is the data from the Basque study, the famous case (in causal-inference econometrics, anyhow) where Abadie and his advisor study the causal impact of terrorism in the Basque Country on GDP per capita. They developed the first synthetic control methodology, a method which is ostensibly designed to select the "best" comparison units for a given treated unit in the pre-intervention period. Precisely, SC is a weighted average of comparison units which is designed to replicate the observed trajectory of a treated unit in the pre-period. In this case, we have many donors/untreated units and only one treated unit. To reduce interpolation biases, it is common to restrict our donor pool of untreated units to units we believe already share much in common with the treated unit.

            But what about cases where we have 100 untreated units? Or 1000? What if we have panel data in high dimensions, where the selection of a control group might not be obvious? Normal SCM works in moderate to low dimensions, but it runs into problems when we have many, many donors due to overfitting. So a slew of methods have been developed in recent years for settings with one target unit and many, many donors. One such method, developed by my coworker, solves this problem with machine learning/functional principal component analysis. Other methods have also been proposed.

            Anyway, enough rambling: the original SCM selected Catalunya and Madrid as donors. This makes sense because Catalunya is super close to the Basque Country, and they share similar political and cultural backgrounds. Madrid also makes sense because it is the only area of the nation wealthier than the Basque Country at the time. So we can be confident that the synthetic control is like the real Basque Country not just on observed covariates, but on unobserved covariates too (see my under-review Stata Journal paper for more on this point).
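            In textbook notation (this is the standard setup, nothing specific to my algorithm here), the synthetic control for treated unit 1 is

            $$\hat{y}^{\,SC}_{1t} \;=\; \sum_{j=2}^{J+1} w_j\, y_{jt}, \qquad w_j \ge 0, \quad \sum_{j=2}^{J+1} w_j = 1,$$

            with the weights chosen so that the weighted average tracks $y_{1t}$ as closely as possible over the pre-intervention periods $t < T$.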

            However, the authors match on 13 covariates to come up with these donors. What if we didn't have any covariates; how would we do causal inference in this setting then? Well, if we believe that two curves which track each other closely over a long period of time belong to units that are more similar to each other, it stands to reason that if we can use a method to come up with the optimal donors in advance, we may not need covariates to do causal inference in econometrics. To illustrate the point, consider the graph:
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input long id double(year gdpcap)
             5 1955  3.853184630005267
             5 1956 3.9456582961508766
             5 1957  4.033561734872626
             5 1958  4.023421896896646
             5 1959  4.013781968405232
             5 1960  4.285918396222732
             5 1961  4.574336095797406
             5 1962  4.898957353563045
             5 1963  5.197014981629133
             5 1964 5.3389029787527225
             5 1965  5.465153005251848
             5 1966  5.545915627064143
             5 1967  5.614895726639487
             5 1968 5.8521849330715785
             5 1969 6.0814054173695915
             5 1970   6.17009424134957
             5 1971  6.283633404546246
             5 1972 6.5555553986528405
             5 1973  6.810768561103078
             5 1974  7.105184302810804
            10 1955 3.5466296303037304
            10 1956 3.6904455695415153
            10 1957 3.8268349981757446
            10 1958 3.8756783776063983
            10 1959 3.9217367338405547
            10 1960  4.241788200023208
            10 1961  4.575335478925664
            10 1962  4.838046411962654
            10 1963  5.081334096368672
            10 1964  5.158097878145307
            10 1965  5.223650525073194
            10 1966 5.3324765050387395
            10 1967 5.4294489207045755
            10 1968  5.674378853530688
            10 1969  5.915523944191169
            10 1970  6.066837871936143
            10 1971  6.227649208206138
            10 1972  6.539060129025636
            10 1973  6.837975056094464
            10 1974  6.987360823804806
            12 1955  1.243430483310572
            12 1956  1.332547847831702
            12 1957  1.422450706571281
            12 1958 1.4402313799015014
            12 1959 1.4580834221751375
            12 1960 1.5358469139606876
            12 1961 1.5962581639969495
            12 1962  1.705584162407281
            12 1963  1.817694966093672
            12 1964 1.8828192904008454
            12 1965  1.948871846813299
            12 1966 2.0326334896921194
            12 1967 2.1176091673305395
            12 1968 2.2455012243293213
            12 1969 2.3819623487876234
            12 1970 2.5184947332291223
            12 1971  2.654027516621925
            12 1972 2.8464009236758248
            12 1973  3.045629889785954
            12 1974 3.1030419008352528
            end
            format %ty year
            label values id regionname
            label def regionname 5 "Basque Country (Pais Vasco)", modify
            label def regionname 10 "Cataluna", modify
            label def regionname 12 "Extremadura", modify
            
            xtset id year, y
            
            twoway (tsline gdp if id==5, lcol(black) lwidth(thick)) ///
            (tsline gdp if id==10, lwidth(medium)) ///
            (tsline gdp if id==12, lwidth(medium)), ///
            legend(order(1 "Basque Country" 2 "Catalunya" 3 "Extremadura") ///
            pos(0) ring(0) region(fcol(none))) ///
            yti("GDP per Capita/1000") ///
            tti(Year)
            We can see that Catalunya is an obvious choice that a reasonable algorithm would select. Why? They're close geographically, culturally, and politically, and they also have almost exactly the same GDPs in many cases. Extremadura, by comparison, is super poor and very far geographically from the Basque Country, so there's no reason why it should be a good match for the Basque Country. There are others (La Rioja and Asturias, for example), but the central goal is selecting the optimal donors. Why? Well, to impute a good counterfactual, we need to ensure that the treated group is as comparable as possible to the control group. And if we select the "right" donors, the "synthetic" control is unbiased, and we can say that our assignment of intervention is as good as random.

            Edge cases: my algorithm here ranks units based on their scores. As we can see, Catalonia is the first (for all years, surprisingly), meaning it is the most like the Basque Country. I originally worried about "What if there's a tie in the rankings", but the only way for that to happen would be for the outcomes for two units to be exactly the same, and that pretty much doesn't happen in real life. The next step would be to use this algorithm in my scul estimator or other estimators I'm currently writing to make parallel trends more viable or synthetic controls more trustworthy.
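            (If anyone does hit a tie, a minimal guard is to ask egen's rank() for unique ranks, which breaks ties arbitrarily; this assumes the sdscore variable from the code above.)
            Code:
            * break exact ties arbitrarily instead of assigning averaged ranks
            bys year: egen rank2 = rank(sdscore) if year < 1975, unique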
            Some substantive questions: what is going on here anyway? Usually, when you're trying to optimize a system, you have a set of constraints, and you're trying to maximize or minimize some outcome based on those constraints. So what are you trying to maximize, or what are you trying to minimize?
            I agree! The way the authors present it is a little odd, since we usually have constraints here... They say
            The purpose of this algorithm is to minimise the amount of variation or dispersion in the difference in outcome values prior to treatment
            So you're quite right: it is very much like a nearest-neighbors algorithm. I actually hadn't thought about it like that until you mentioned it, so this is a great way to think about it. The authors in their case literally have thousands of donors; they experiment with the "best" subsets of 100, 200, 300, and 500 donors, so we can immediately see why an algorithm to reduce dimensionality can be useful when selecting your "optimal" control group is tricky. I'm working on a paper about this exact subject now with my coworker Mani (I cited his paper above); we'll be writing about a host of these algorithms/panel-data approaches in econometrics/computer science to see which algorithms select the "best" donors under high dimensionality, high volume/velocity, and missing data. Simulations will be a core aspect of our work, and when we're done (or along the way) I'll integrate some of the Python code we use into a Stata Journal paper.



            • #7
              Thanks for the thorough explanation and great resources. I've been slowly trying to absorb this stuff over the last week or so. This is some very cool work, and I look forward to hearing more about it. Do you see yourself developing any user contributed commands to do this kind of thing?



              • #8
                SC is sensitive to the donor pool, as it should be, but sometimes in seemingly strange ways. Culling the pool is helpful, and this looks like a useful and purely mechanical way to do so. I've tried several methods but not this one (but close, e.g., RMSE).
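                For concreteness, a minimal sketch of the RMSE flavour I have in mind, for a single treated/donor pair in the pre-period (y_treated and y_donor are placeholder variable names, and 1975 stands in for the intervention year):
                Code:
                * root-mean-squared pre-period gap between one treated and one donor series
                qui count if year < 1975
                loc npre = r(N)
                egen ssd = total((y_treated - y_donor)^2) if year < 1975
                g rmse_score = sqrt(ssd/`npre') if year < 1975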



                • #9
                  Daniel Schaefer

                  Do you see yourself developing any user contributed commands to do this kind of thing?
                  Indeed. Arguably, my SCUL estimator is already a step in this direction (my paper describes how it picks 3 donors from a pool of 650), but it is not good enough yet because it relies on the LASSO to do the donor selection, which, while not bad, also isn't ideal. I mentioned above that I'm working with a man named Mani Bayani (well, I cited his paper). His is one of the few papers that does donor selection prior to estimation. He and I are working together on a paper right now that compares lots of these SCM algorithms to each other in settings of high dimensionality, high volume/velocity, missing data, and noise.

                  Mani is an R/Python user (the dude is a beast: bachelor's in math, master's in computer science and engineering, Ph.D. in econ), so much of our work will be in Python... but we'll also be developing new algorithms of our own, experimenting with different computer-science/operations-research algorithms to pre-process donors prior to estimation. Lots of machine learning and tensor math. What I'll do on my end is take these Python algorithms and situate them in Stata commands by running them under the hood. It would be a really important step for causal inference in Stata, in my opinion. I could send you the bare-bones outline of what we've got planned, if you want. Lots of empirical applications, lots of synthetic studies with different DGPs.

                  SC is sensitive to the donor pool, as it should be but sometimes in seemingly strange ways.
                  Agreed, unfortunately. As I said above, part of my work is to decrease this sensitivity by selecting the optimal donors. The approach I outline above is sort of crude, and there are much better ways to do it, but as soon as we develop the algorithms in Python, I'll get them into Stata and send them to the SJ so the world can play with them.
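                  To give a flavour of what "running these under the hood" could look like, here is a bare-bones sketch (requires Stata 16+ with Python configured; the command and function names are hypothetical, not our actual estimator):
                  Code:
                  * skeleton of a Stata command that hands a variable to Python and prints a statistic
                  program define pyscore
                      version 16
                      syntax varname(numeric)
                      python: py_mean("`varlist'")
                  end

                  python:
                  from sfi import Data
                  def py_mean(varname):
                      # pull the Stata variable into Python; assumes no missing values
                      vals = Data.get(varname)
                      print("mean of", varname, "=", sum(vals) / len(vals))
                  end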



                  • #10
                    This may be of interest to anyone who uses synthetic control methods, so I'll answer the question I originally posed. A coworker and I are working on a paper about donor selection in SCM, and he helped explain the formula to me. Here is an example of how we'd apply it to a real dataset.
                    Code:
                    clear *
                    
                    cls
                    
                    
                    import delim "https://docs.google.com/spreadsheets/d/e/2PACX-1vQIxhDcQAU0-yTPHLFulP36bYQYTQBHFk0b8DGwefZ7c4KXuwvM3wcbO2IYJ5izM3AIFVcCDdDlf_FY/pub?gid=732338764&single=true&output=csv", clear
                    
                    su city_id if is_barce==1
                    g date = date(yyyy_mm_dd, "YMD")
                    
                    
                    g treat = cond(is_barce==1 & date >=`=td(07jul2015)',1,0)
                    
                    format date %td
                    
                    cls
                    tempvar obs_count
                    bys city_id (date): g `obs_count' = _N
                    qui su `obs_count'
                    
                    
                    levelsof city_id if `obs_count' ~=`r(max)', l(city)
                    
                    drop if city_id == `city' | date >= td(01aug2017)
                    
                    g year = year(date)
                    
                    g weeknum = week(date)
                    
                    g week = yw(year, weeknum)
                    
                    
                    
                    format %tw week 
                    //2013-06-05  2013-22
                    
                    
                    drop date
                    drop yyy
                    cls
                    
                    destring index, dpcomma replace
                    
                    egen id = group(city)
                    replace id=id-1
                    drop city
                    *keep if mediterr ==1 | is_barce ==1
                    cls
                    
                    gcollapse (mean) index, by(mediterr is_barce id week)
                    *drop if quar == .
                    
                    tempvar city
                    
                    g `city' = cond(is_barce==1,"Barcelona","Donor ")
                    
                    egen id2 = concat(`city' id) if is_barce==0
                    
                    replace id2 = "Barcelona" if id2==""
                    
                    
                    labmask id, values(id2)
                    drop id2
                    
                    cls
                    
                    //-52
                    
                    drop is
                    
                    rename (week ind) (date price)
                    
                    
                    g treat = cond(id==29 & date >=`=tw(2015w28)',1,0)
                    
                    
                    xtset id date, w
                    
                    local lbl: value label `r(panelvar)'
                    
                    
                    loc unit ="Barcelona":`lbl'
                    So I'm just cleaning the dataset. Now we begin to choose the donors which are "most" similar in trends to Barcelona in the pre-intervention period:
                    Code:
                    g score = .
                     
                     
                    qui levelsof id if id != `unit', loc(ids)
                    
                    qui foreach x of loc ids {
                        
                    cap frame drop minframe
                    
                    frame put id date price if (id ==`unit' | id ==`x') & date < `=tw(2015w28)', into(minframe)
                    
                    
                    
                    frame minframe {
                    greshape wide price, j(id) i(date)
                    
                    order price`unit', a(date)
                    
                    
                    qui ds // Gets a list of all variables
                    
                    loc t: word 2 of `r(varlist)'
                    
                    loc treated_unit: disp "`t'" // T for treated
                    
                    loc a: word 3 of `r(varlist)'
                    
                    loc donor_unit: disp "`a'" // First donor unit...
                    
                    
                    g diff = `treated_unit'-`donor_unit'
                    
                    egen meandiff = mean(`treated_unit'-`donor_unit')
                    
                    g diffmdsq = (diff - meandiff)^2
                    
                    egen summed = total(diffmdsq)
                    
                    qui su date
                    
                    g division= summed/`=r(N)'
                    
                    g score = division^.5
                    
                    su score, mean
                    
                    loc score = r(mean)
                    
                    frame default: replace score = `score' if id ==`x'
                    }
                    }
                    cls
                    //sort score
                    //br if date == `=tw(2015w25)'
                    egen rank = rank(score) if date == `=tw(2011w01)'
                    distinct id if medi==1 & date == `=tw(2011w01)' & inrange(rank,1,16)
                    
                    qui xtset
                    local lbl: value label `r(panelvar)'
                    
                    
                    loc unit ="Barcelona":`lbl'
                    
                    levelsof `r(panelvar)' if inrange(rank,1,16), loc(units) sep(",")
                    cls
                    
                    qui xtset
                    
                    line price date if inlist(id,`units'), connect(L) /// <-- the key option
                        lcolor("gs12") || line price date if id ==29, lcol(black) || line price date ///
                        if !inlist(id,`units',29), connect(L) lcol(red%5),, ///
                        legend(order(1 "Donors" 2 "Barcelona" 3 "Unselected Donors") ring(0) pos(11)) xli(`=tw(2015w28)')
                    The graph tells the story much better (or so I try, anyway!). The 16 greyscale donors are compared against the 66 donors that were not chosen. The 16 selected donors sit in a similar band of the time-series plot and also show less fluctuation than the noisier donors with more volatile price trajectories (or at least, their fluctuations are more similar to Barcelona's). The fact that the algorithm selects 10 donors on the Mediterranean Sea also reassures us that the donors are alike on unobserved factors that might drive the evolution of the weekly price data.
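                    Written out (my notation, not the paper's), the score the loop assigns to each donor $j$ is just the population standard deviation of the pre-period gaps:

                    $$s_j \;=\; \sqrt{\frac{1}{T_0}\sum_{t<T}\bigl(D_{jt}-\bar{D}_j\bigr)^{2}}, \qquad D_{jt}=y_{\text{Barcelona},t}-y_{jt},$$

                    where $T_0$ is the number of pre-intervention weeks and $\bar{D}_j$ is the mean gap; a smaller $s_j$ means donor $j$'s price series moves more like Barcelona's before the intervention.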

                    Someday I'll likely write a simple command for this in Stata and try to publish it as a Stata Tip for anyone who might find it useful.

