Simple "synth" Command Usage with Monthly Data

Samuel Van Gorden

Join Date: Jul 2022

Posts: 16
#1

23 Dec 2022, 11:42

I am trying to run a simple synthetic control experiment using the "synth" command on a dataset containing monthly data. From https://www.stata.com/statalist/arch.../msg01164.html I was able to determine a (very unintuitive) way to express monthly dates. My command is:

Code:

synth ra(`=tm(1980m12)'(3)`=tm(1990m11)') ra, trunit(44) trperiod(`=tm(1990m12)')

However, I then get the error

ra(251(3)370) does not exist as a (numeric) variable in dataset

I then tried:

Code:

synth ra ra, trunit(44) trperiod(`=tm(1990m12)') xperiod(`=tm(1980m12)'(3)`=tm(1990m11)')

and I got

expression too long

What am I doing wrong? This should be a very simple command which uses the same variable (ra) for the independent and dependent variables, with 44 as the treated unit, and a pre-treatment period of 12/1980 to 11/1990.

As a side note, this method of expressing monthly dates seems needlessly complicated to me. I find that there are many such instances in Stata where convoluted expressions are needed to obtain simple outputs. I would be interested in seeing an explanation of why this difficult-to-use syntax is necessary, if one exists.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

23 Dec 2022, 12:31

I think you have reversed the order of your two arguments - the first argument is the dependent variable, and the second argument is the specification of the predictor variables, as the output of

Code:

help synth

tells us. Try

Code:

synth ra ra(`=tm(1980m12)'(3)`=tm(1990m11)'), trunit(44) trperiod(`=tm(1990m12)')

With regard to Stata syntax being unintuitive, that is true of all languages. When I visit Switzerland, for example, I find German, French, and Italian uniformly unintuitive (no experience with Romansh, but I don't hold out high hopes for my intuition working there, either). This reflects more on my failure to accumulate enough experience speaking these languages to build up the familiarity that makes constructing and understanding sentences intuitive.

Stata is just another language. With more experience with Stata, you'll find it intuitive that in modeling commands the variable being modeled typically precedes other variables.
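For readers puzzled by the tm() pieces in the command above, here is a minimal sketch of what they expand to. It uses only built-in Stata functions; the local macro name n is introduced purely for illustration and is not part of the thread.

```stata
* Stata monthly dates are integers counting months since 1960m1 (which is 0)
display tm(1980m12)        // 251
display tm(1990m11)        // 370
display tm(1990m12)        // 371

* Inside macro quotes, a leading equals sign makes Stata evaluate the
* expression and paste the result, so the predictor period becomes 251(3)370
numlist "`=tm(1980m12)'(3)`=tm(1990m11)'"
local n : word count `r(numlist)'
display "`n' periods: `r(numlist)'"

* Apply the %tm format to read an integer back as a month
display %tm 251            // 1980m12
```

This is why the error message in post #1 shows ra(251(3)370): synth only ever sees the integers, never the 1980m12 notation.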
Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#3

23 Dec 2022, 15:59

William Lisowski is right. Synth is expressed like

Code:

synth y x x1(10/12)

and so on. However, I don't recommend you use synth. Its lag specification is super arbitrary. My own command, scul, does better.
Samuel Van Gorden

Join Date: Jul 2022

Posts: 16
#4

23 Dec 2022, 16:38

William Lisowski thanks for your response - I tried it and got the same outcome (expression too long).

What I meant by Stata being unintuitive was that there seems to be a different way to accomplish each separate task, whereas in other programming languages things are much more consistent. I understand `' as the dereferencing operator (which is a bit of a strange operator to begin with), but then also needing to precede the date with "=" and "tm" seems overly complicated. In most OO programming languages the command would know to parse out a datetime object without any additional modifiers. This is just one of the many examples of Stata being needlessly messy that I have run into that does not exist in other languages such as R, Python, Java, C, etc. Stata is hands down the most frustrating language I have programmed in, which really says something considering I programmed in Tcl for several years!

Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#5

23 Dec 2022, 17:02

Let's prove it, shall we? Here I simulate a Prop 99-like dataset, with 37 control units and 1 treated unit, from 1970 to 2000. I simulate a linear factor model with 3 factor loadings and a single time-varying factor. If you don't wanna run it, cool, I have the main results at the bottom.

Code:

clear *
loc int_time = 1989

set obs 38
set seed 1011


egen id = seq(), f(1) t(38)

cls


generate u_i1 = 6 // latent factor loadings

generate u_i2 = 9 // latent factor loadings

g u_i3 = 12 // latent factor loadings

expand 40

qbys id: g time = _n+1969

keep if inrange(time,1970,2000)

bys id: g u_t = runiform()


xtset id time, g

su `r(timevar)', mean

loc yearmin =r(min)

// Generate population data

bys id: g population = runiformint(5000000,10000000) if time ==`yearmin'

replace pop=L1.pop+rnormal(20000,2000) if time>`yearmin'


// Generate income data


bys id: g income = runiformint(20000,40000) if time ==`yearmin'

replace income=L1.income+rnormal(1000,500) if time>`yearmin'


replace income = ln((income/pop)*100000)

// Generate proportion of alcohol drinkers data

bys id: g growth = runiformint(10000,40000) if time ==`yearmin'

replace growth=L1.growth+rnormal(2,10) if time>`yearmin'


replace growth = (growth/pop)*100

replace pop = ln(population)

g pop2 = exp(pop)

// Generate price data


bys id: g price = runiformint(27.3,42.2) if time ==`yearmin'

replace price=L1.price+rnormal(4,1.5) if time>`yearmin'

cls

//!! Generate cigarette sales per capita data


bys id: g cigsale = abs(floor(((0.0000005*pop2) ///
    -(price*200)- ///
    (income*5)- ///
    (growth*(10))) - ((u_i1*u_i2*u_i3)*u_t) + ///
    rnormal(25000,2000))) if time == `yearmin'
            
bys id: replace cigsale = ((cigsale/pop2)*100000)

loc ymp1 = `yearmin'+1

loc ymp5 = `yearmin'+5



bys id: replace cig=L1.cig+rnormal(2,2) if inrange(time,`ymp1',`ymp5')



bys id: replace cig=L1.cig-rnormal(2,2) if time >`ymp5'
su cigsale
as cig > 0

tempvar storename id2

loc entity State


g `storename' = "`entity' "

egen `id2' = concat(`storename' id)

labmask id, values(`id2')

cls
qui xtset
local lbl: value label `r(panelvar)'

loc unit ="`entity' 13":`lbl'

g treated = cond(id==`unit' & time >= `int_time',1,0)


as cigsale > 0

cls

lab var treated "Anti-Smoking Policy"


clonevar cigte = cig

replace cigte = cigsale - (cigsale * .098)-rnormal(.5,.1) if id ==`unit' & time >=`int_time'

Okay, now that's done. Now we can begin with estimation.

Code:

synth cigte ///
    growth(1984(1)1988) ///
    income(1984(1)1988) price(1980(1)1988) ///
    cigte(1988) ///
    cigte(1980) ///
    cigte(1975) cigte(1985), ///
    trunit(`unit') ///
    trperiod(`int_time') fig keep(`scmdata', replace) //

qui scul cigte, scheme(sj) ///
    treated(treated) ///

Okay, so now we've estimated our effects. The real effect size is a reduction of 17 packs per capita from 1989 to 2000, which is actually roughly the canonical SCM estimate for Prop 99. Here we have, side by side, the true values, the vanilla SC estimates, and the SCUL estimates.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double(_Y_treated _Y_synthetic time cf)
206.48312377929688 209.04665116882325 1970  207.1403304413866
215.04507446289063  212.0257825012207 1971  210.8180016702924
213.47695922851563 214.09341960144042 1972 213.54415174864374
 214.0126495361328 216.98536473083496 1973 214.84077814381493
216.96914672851563 217.03967070007326 1974  215.7230054723854
   218.45751953125 218.22244160461426 1975  217.1172069723441
212.71701049804688 213.78982250976563 1976 212.35967931071275
 210.5804443359375 209.22987266540525 1977  209.5400094965006
207.19996643066406  205.0433042449951 1978 207.03636342577863
 202.8429718017578  202.9904515686035 1979 203.70977504780052
199.79644775390625 200.90006056213377 1980 200.38360180138136
197.68443298339844  200.0681079864502 1981 198.16394207175836
 196.0567169189453  198.4637028503418 1982 195.96361483090368
191.76785278320313  195.7755709075928 1983  192.7931614892522
189.84266662597656 193.46030746459962 1984 192.29683713193197
190.04014587402344  190.6295277252197 1985 190.90778854782195
188.07794189453125 187.59325134277344 1986  188.2586321803584
 186.0728302001953  185.6151879119873 1987  186.0312328212708
 184.0740509033203 183.43889569091797 1988 184.56983966616917
 163.0240020751953  180.2573909301758 1989  181.8762558019235
163.62014770507813 178.17645106506347 1990 179.71627423494397
 160.3173065185547 176.84363258361816 1991 178.05761044477856
 156.1579132080078 175.76256706237794 1992 175.09231190699842
151.41085815429688  174.0351475982666 1993 172.45316114293382
 149.6656036376953 173.59551377868655 1994 169.05994904151652
150.42247009277344 172.71237896728516 1995 166.27544596961803
 147.8294677734375   170.686692779541 1996 163.41465035667738
146.91152954101563 169.97707914733888 1997 161.70240222874207
143.92422485351563 168.77022714233397 1998 160.81592445107748
141.20570373535156  165.1271237487793 1999 157.35897062725581
141.25685119628906 162.95198188781737 2000  154.5442173226469
end

cls
tsset time

foreach v of var _Y_synthetic cf {
    
    g diff`v' = _Y_treated-`v'
}

mean diff* if time >= 1989

twoway (tsline _Y_tr, lcolor(black) lwidth(medthick) lpattern(solid)) (tsline _Y_synthetic, lcolor(red) lwidth(medium) lpattern(dash)) (tsline cf, lcolor(blue) lwidth(medium) lpattern(shortdash)), tline(1989) legend(order(1 "Treated" 2 "ADH" 3 "SCUL") position(7) ring(0))

Maybe one could quibble with this simulation (I know I wanna improve it at some point), but either way, the true effect size is -17, vanilla SC says the effect is -21, mine is -17, so I think mine is better than the classical method.

EDIT: Samuel Van Gorden I kinda agree that Stata's datetime aspects can be hard to work with sometimes, but as William Lisowski will remind you (as he's reminded me!), Stata's datetime functions are extremely versatile, and while daunting at first, are really quite powerful. But as I sort of demonstrate above, my command doesn't demand you work with this at all. All you need is a panel dataset, an outcome, a treatment variable and a nice time series. You likely won't even need additional predictors.

Last edited by Jared Greathouse; 23 Dec 2022, 17:10.


Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#6

24 Dec 2022, 00:24

`' as the dereferencing operator (which is a bit of a strange operator to begin with)

Technically `' is not dereference, since the 'dereference' operation implies that you have a pointer (or reference) to a location in heap memory: the "dereference" operation accesses the object itself given the reference. Macros are not like this. Macros make a whole lot more sense when you hold on to the general intuition that a macro in Stata is a piece of executable code that you can dynamically inject into your syntax. People often treat macros as if they were value types, which you might store in stack memory. Macros are actually injectable syntax that can be directly interpreted by the Stata interpreter. I happen to think this is very cool. Makes me want to learn Lisp.
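A minimal sketch of the "injectable syntax" view of macros, using only built-in commands. The macro names cmd, text and five are made up for illustration.

```stata
* A local macro holds text; the macro quotes paste that text into the command
local cmd  "display"
local text "2 + 3"
`cmd' `text'               // Stata sees: display 2 + 3

* With an equals sign the expression is evaluated first, then stored
local five = 2 + 3

display `text' * 2         // Stata sees: display 2 + 3 * 2, giving 8
display `five' * 2         // Stata sees: display 5 * 2, giving 10

* A leading equals sign inside the quotes evaluates inline, no named macro
display "`=2 + 3'"         // 5
```

The difference between the 8 and the 10 is the point: the macro is substituted as text before the command is interpreted, which is what the =tm(1990m12) construction in post #1 relies on.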

In most OO programming languages the command would know to parse out a datetime object without any additional modifiers.

Java is not dynamically typed, and would absolutely have you explicitly create a new datetime object. And (frankly) I don't think Python's dynamic type system is all that great. I find that Python 3 often makes incorrect type inferences. Neither R nor C is an object-oriented language, C is not dynamically typed, and regardless, all of the languages you mention will require you to make explicit type transformations from time to time. R is maybe the best of the bunch at dynamically determining the type of an object, but dynamic typing in R is responsible for many of the memory inefficiencies characteristic of R as a language. Julia might actually come close to solving many of these problems with its polymorphic type system, but you won't get close to the same number of features or support as with Stata. More to the point, I reject the premise that dynamic typing is a good thing in the first place. Strong typing makes for clearer syntax, code that is easier to understand and reason about, and better compilers.

Stata is hands down the most frustrating language I have programmed in

Every language has its drawbacks. I could tell you horror stories about R, C, Java and Python. As a statistical programming language, Stata's ado has clean, readable, easy syntax. Give it some time and I'm sure you'll come to love it too.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

24 Dec 2022, 09:34

Let's return from philosophy to the problem of running synth, a community-contributed command installed from SSC, which is apparently no longer under active development and support by the authors. From post #4

@William Lisowski thanks for your response - I tried it and got the same outcome (expression too long).

The "expression too long" error message that was displayed was not in reaction to the command as you typed it. Instead, it was thrown by Stata after synth constructed an expression that exceeded Stata's limits for expressions.

My intuition suggests that this is a consequence of synth trying to deal with a sequence of 40 enumerated quarters of data. You can test this hypothesis by trying to specify a shorter pre-treatment period - just a year or two.

If that is the case, then perhaps you would be better served by constructing your dataset so that you do not need to specify the training period - in which case it will use all the observations prior to the intervention.

Perhaps you tried that and got some sort of error. I note that you apparently have quarterly data but described using a Stata monthly date rather than a Stata quarterly date. If indeed you have just four observations in each year - March, June, September, and December - then perhaps (pretending your date variable is named mdate)

Code:

tsset mdate, delta(3)

would solve the problem, by telling Stata that your monthly date is recorded in three month increments.

Code:

. list mdate if inlist(_n,1,_N), clean

         mdate
  1.   1980m12
161.   2020m12

. * monthly
. tsset mdate

Time variable: mdate, 1980m12 to 2020m12, but with gaps
        Delta: 1 month

. * every three months
. tsset mdate, delta(3)

Time variable: mdate, 1980m12 to 2020m12
        Delta: 3 months

.

Try that and then see if synth runs without specifying the training period.

Alternatively, and perhaps preferable in the long run, you could create an actual Stata quarterly date variable from your monthly date variable

Code:

generate qdate = qofd(dofm(mdate))
format %tq qdate
tsset qdate

And again, synth should be able to proceed without specifying the training period.

Last edited by William Lisowski; 24 Dec 2022, 09:37.
Samuel Van Gorden

Join Date: Jul 2022

Posts: 16
#8

27 Dec 2022, 15:17

William Lisowski I tried running the command with just a single year of pre-treatment and also without specifying any pre-treatment period (i.e. synth ra ra, trunit(44)...) and still get the same "expression too long" error. My dataset includes data points at a monthly granularity, so I am unable to specify delta(3) and I'm not sure using qofd() would work because I'm not using months 1, 4, 7, and 10 (I'm using 3, 6, 9, 12). I tried it out and I'm getting an error for "repeated time values within panel" (I am using state fips code as the panel variable - tsset state_fips q_date).
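A small sketch of how qofd() treats the months in question, on a made-up six-month series (the variable names mdate and qdate follow post #7). Two things it is meant to show: March, June, September and December each fall in a different calendar quarter, so they map cleanly to Q1 through Q4; and with monthly rows, three consecutive months share one quarter, which is one way to end up with "repeated time values". The keep step is an assumption about what is wanted (quarter-end months only), not something stated in the thread.

```stata
clear
set obs 6
generate mdate = tm(1980m12) + _n - 1      // 1980m12 through 1981m5
format %tm mdate

* qofd() returns the calendar quarter that contains the date
generate qdate = qofd(dofm(mdate))
format %tq qdate
list, clean

* Assumed fix: keep only quarter-end months (3, 6, 9, 12)
keep if mod(month(dofm(mdate)), 3) == 0
isid qdate                                  // errors if any quarter repeats
tsset qdate
```

With real panel data the uniqueness check would be on the panel variable and qdate together, for example isid state_fips qdate before tsset state_fips qdate.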
Dimitriy V. Masterov

Join Date: Mar 2014

Posts: 609
#9

27 Dec 2022, 15:41

Take a look here, and the link in my last comment.
Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#10

27 Dec 2022, 15:50

Hey Dimitriy V. Masterov , yeah I agree that the errors for this are super cryptic. I could be guilty of this too, as I'm not finished with foolproofing scul, but I wish authors would code all the basic errors, e.g., "unit needs at least two donors" or "x covariate is completely constant" instead of "command can't run". Same with the inlist error, "too many T0 periods".
Samuel Van Gorden

Join Date: Jul 2022

Posts: 16
#11

28 Dec 2022, 11:44

Dimitriy V. Masterov I enabled the trace and it appears that my results period was too long. Thanks!
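The exact commands used are not shown in the thread; the following is only a sketch of how tracing is typically switched on around a failing command, reusing the variable and unit from post #1. The tracedepth value of 2 is an arbitrary choice.

```stata
* Limit how deep into nested programs the trace goes, to keep output readable
set tracedepth 2
set trace on

* capture noisily shows the output but lets the do-file continue on error
capture noisily synth ra ra, trunit(44) trperiod(`=tm(1990m12)')
display "return code: " _rc

set trace off
```

The last traced line before the error is usually the expression that synth built internally, which is where an overly long list of periods would show up.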