  • Simple "synth" Command Usage with Monthly Data

    I am trying to run a simple synthetic control experiment using the "synth" command on a dataset containing monthly data. From https://www.stata.com/statalist/arch.../msg01164.html I was able to determine a (very unintuitive) way to express monthly dates. My command is:

    Code:
    synth ra(`=tm(1980m12)'(3)`=tm(1990m11)') ra, trunit(44) trperiod(`=tm(1990m12)')
    However, I then get the error
    ra(251(3)370) does not exist as a (numeric) variable in dataset
    I then tried:

    Code:
    synth ra ra, trunit(44) trperiod(`=tm(1990m12)') xperiod(`=tm(1980m12)'(3)`=tm(1990m11)')
    and I got
    expression too long
    .

    What am I doing wrong? This should be a very simple command which uses the same variable (ra) for the independent and dependent variables, with 44 as the treated unit, and a pre-treatment period of 12/1980 to 11/1990.

    As a side note, this method of expressing monthly dates seems needlessly complicated to me. I find that there are many such instances in Stata where convoluted expressions are needed to obtain simple outputs. I would be interested in seeing an explanation of why this difficult-to-use syntax is necessary, if one exists.

  • #2
    I think you have reversed the order of your two arguments - the first argument is the dependent variable, and the second argument is the specification of the predictor variables, as the output of
    Code:
    help synth
    tells us. Try
    Code:
    synth ra ra(`=tm(1980m12)'(3)`=tm(1990m11)'), trunit(44) trperiod(`=tm(1990m12)')
    With regard to Stata syntax being unintuitive, that is true of all languages. When I visit Switzerland, for example, I find German, French, and Italian uniformly unintuitive (no experience with Romansh, but I don't hold out high hopes for my intuition working there, either). This reflects more on my failure to accumulate enough experience speaking these languages to build up the familiarity that makes constructing and understanding sentences intuitive.

    Stata is just another language. With more experience with Stata, you'll find it intuitive that in modeling commands the variable being modeled typically precedes other variables.
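
    As a hedged aside on the `=tm(...)' pieces: tm() returns the number of months elapsed since 1960m1, and `=exp' makes Stata evaluate the expression and paste the result into the command line before the command itself is parsed. A minimal sketch (the macro names first and last are made up for illustration):

    ```stata
    * tm() returns months elapsed since 1960m1, so these are plain integers
    display tm(1980m12)          // 251
    display tm(1990m11)          // 370
    display %tm 251              // 1980m12

    * `=exp' is evaluated first, so a command only ever sees the resulting numbers
    local first = tm(1980m12)
    local last  = tm(1990m11)
    display "`first'(3)`last'"   // 251(3)370
    ```

    That is why the error message in #1 shows ra(251(3)370) rather than the tm() calls.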

    • #3
      William Lisowski is right. Synth is expressed like
      Code:
      synth y x x1(10/12)
      and so on. However, I don't recommend you use synth. Synth is super arbitrary in terms of lag specification. My own command, scul, does better.

      • #4
        William Lisowski thanks for your response - I tried it and got the same outcome (expression too long).

        What I meant by Stata being unintuitive was that there seems to be a different way to accomplish each separate task, whereas in other programming languages things are much more consistent. I understand `' as the dereferencing operator (which is a bit of a strange operator to begin with), but then also needing to precede the date with "=" and "tm" seems overly complicated. In most OO programming languages the command would know to parse out a datetime object without any additional modifiers. This is just one of the many examples of Stata being needlessly messy that I have run into that does not exist in other languages such as R, Python, Java, C, etc. Stata is hands down the most frustrating language I have programmed in, which really says something considering I programmed in Tcl for several years!

        • #5
          Let's prove it, shall we? Here I simulate a Prop 99-like dataset, with 37 control units and 1 treated unit, from 1970 to 2000. I simulate a linear factor model with 3 factor loadings and a single time-varying factor. If you don't wanna run it, cool, I have the main results at the bottom.
          Code:
          clear *
          loc int_time = 1989
          
          set obs 38
          set seed 1011
          
          
          egen id = seq(), f(1) t(38)
          
          cls
          
          
          generate u_i1 = 6 // latent factor loadings
          
          generate u_i2 = 9 // latent factor loadings
          
          g u_i3 = 12 // latent factor loadings
          
          expand 40
          
          qbys id: g time = _n+1969
          
          keep if inrange(time,1970,2000)
          
          bys id: g u_t = runiform()
          
          
          xtset id time, g
          
          su `r(timevar)', mean
          
          loc yearmin =r(min)
          
          // Generate population data
          
          bys id: g population = runiformint(5000000,10000000) if time ==`yearmin'
          
          replace pop=L1.pop+rnormal(20000,2000) if time>`yearmin'
          
          
          // Generate income data
          
          
          bys id: g income = runiformint(20000,40000) if time ==`yearmin'
          
          replace income=L1.income+rnormal(1000,500) if time>`yearmin'
          
          
          replace income = ln((income/pop)*100000)
          
          // Generate proportion of alcohol drinkers data
          
          bys id: g growth = runiformint(10000,40000) if time ==`yearmin'
          
          replace growth=L1.growth+rnormal(2,10) if time>`yearmin'
          
          
          replace growth = (growth/pop)*100
          
          replace pop = ln(population)
          
          g pop2 = exp(pop)
          
          // Generate price data
          
          
          bys id: g price = runiformint(27.3,42.2) if time ==`yearmin'
          
          replace price=L1.price+rnormal(4,1.5) if time>`yearmin'
          
          cls
          
          //!! Generate cigarette sales per capita data
          
          
          bys id: g cigsale = abs(floor(((0.0000005*pop2) ///
              -(price*200)- ///
              (income*5)- ///
              (growth*(10))) - ((u_i1*u_i2*u_i3)*u_t) + ///
              rnormal(25000,2000))) if time == `yearmin'
                      
          bys id: replace cigsale = ((cigsale/pop2)*100000)
          
          loc ymp1 = `yearmin'+1
          
          loc ymp5 = `yearmin'+5
          
          
          
          bys id: replace cig=L1.cig+rnormal(2,2) if inrange(time,`ymp1',`ymp5')
          
          
          
          bys id: replace cig=L1.cig-rnormal(2,2) if time >`ymp5'
          su cigsale
          as cig > 0
          
          tempvar storename id2
          
          loc entity State
          
          
          g `storename' = "`entity' "
          
          egen `id2' = concat(`storename' id)
          
          labmask id, values(`id2') // labmask comes from the labutil package on SSC (ssc install labutil)
          
          cls
          qui xtset
          local lbl: value label `r(panelvar)'
          
          loc unit ="`entity' 13":`lbl'
          
          g treated = cond(id==`unit' & time >= `int_time',1,0)
          
          
          as cigsale > 0
          
          cls
          
          lab var treated "Anti-Smoking Policy"
          
          
          clonevar cigte = cig
          
          replace cigte = cigsale - (cigsale * .098)-rnormal(.5,.1) if id ==`unit' & time >=`int_time'
          Okay, now that's done, we can begin with estimation.
          Code:
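          tempfile scmdata // assumption: `scmdata' is not defined in the code shown; a tempfile gives keep() a valid file name
          synth cigte ///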
              growth(1984(1)1988) ///
              income(1984(1)1988) price(1980(1)1988) ///
              cigte(1988) ///
              cigte(1980) ///
              cigte(1975) cigte(1985), ///
              trunit(`unit') ///
              trperiod(`int_time') fig keep(`scmdata', replace) //
          
          qui scul cigte, scheme(sj) ///
              treated(treated) ///
          Okay, so now we've estimated our effects. The real effect size is a reduction of 17 packs per capita from 1989 to 2000, which is actually roughly the canonical estimate of SCM for Prop 99. Here we have side by side the true values, the vanilla SC estimates and the SCUL estimates.
          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input double(_Y_treated _Y_synthetic time cf)
          206.48312377929688 209.04665116882325 1970  207.1403304413866
          215.04507446289063  212.0257825012207 1971  210.8180016702924
          213.47695922851563 214.09341960144042 1972 213.54415174864374
           214.0126495361328 216.98536473083496 1973 214.84077814381493
          216.96914672851563 217.03967070007326 1974  215.7230054723854
             218.45751953125 218.22244160461426 1975  217.1172069723441
          212.71701049804688 213.78982250976563 1976 212.35967931071275
           210.5804443359375 209.22987266540525 1977  209.5400094965006
          207.19996643066406  205.0433042449951 1978 207.03636342577863
           202.8429718017578  202.9904515686035 1979 203.70977504780052
          199.79644775390625 200.90006056213377 1980 200.38360180138136
          197.68443298339844  200.0681079864502 1981 198.16394207175836
           196.0567169189453  198.4637028503418 1982 195.96361483090368
          191.76785278320313  195.7755709075928 1983  192.7931614892522
          189.84266662597656 193.46030746459962 1984 192.29683713193197
          190.04014587402344  190.6295277252197 1985 190.90778854782195
          188.07794189453125 187.59325134277344 1986  188.2586321803584
           186.0728302001953  185.6151879119873 1987  186.0312328212708
           184.0740509033203 183.43889569091797 1988 184.56983966616917
           163.0240020751953  180.2573909301758 1989  181.8762558019235
          163.62014770507813 178.17645106506347 1990 179.71627423494397
           160.3173065185547 176.84363258361816 1991 178.05761044477856
           156.1579132080078 175.76256706237794 1992 175.09231190699842
          151.41085815429688  174.0351475982666 1993 172.45316114293382
           149.6656036376953 173.59551377868655 1994 169.05994904151652
          150.42247009277344 172.71237896728516 1995 166.27544596961803
           147.8294677734375   170.686692779541 1996 163.41465035667738
          146.91152954101563 169.97707914733888 1997 161.70240222874207
          143.92422485351563 168.77022714233397 1998 160.81592445107748
          141.20570373535156  165.1271237487793 1999 157.35897062725581
          141.25685119628906 162.95198188781737 2000  154.5442173226469
          end
          
          cls
          tsset time
          
          foreach v of var _Y_synthetic cf {
              
              g diff`v' = _Y_treated-`v'
          }
          
          mean diff* if time >= 1989
          
          twoway (tsline _Y_tr, lcolor(black) lwidth(medthick) lpattern(solid)) (tsline _Y_synthetic, lcolor(red) lwidth(medium) lpattern(dash)) (tsline cf, lcolor(blue) lwidth(medium) lpattern(shortdash)), tline(1989) legend(order(1 "Treated" 2 "ADH" 3 "SCUL") position(7) ring(0))
          Maybe one could quibble with this simulation (I know I wanna improve it at some point), but either way, the true effect size is -17, vanilla SC says the effect is -21, and mine says -17, so I think mine is better than the classical method.

          EDIT: Samuel Van Gorden I kinda agree that Stata's datetime aspects can be hard to work with sometimes, but as William Lisowski will remind you (as he's reminded me!), Stata's datetime functions are extremely versatile and, while daunting at first, are really quite powerful. But as I sort of demonstrate above, my command doesn't demand you work with this at all. All you need is a panel dataset, an outcome, a treatment variable and a nice time series. You likely won't even need additional predictors.
          Last edited by Jared Greathouse; 23 Dec 2022, 17:10.

          • #6
            `' as the dereferencing operator (which is a bit of a strange operator to begin with)
            Technically `' is not dereference, since the 'dereference' operation implies that you have a pointer (or reference) to a location in heap memory: the "dereference" operation accesses the object itself given the reference. Macros are not like this. Macros make a whole lot more sense when you hold on to the general intuition that a macro in Stata is a piece of executable code that you can dynamically inject into your syntax. People often treat macros as if they were value types, which you might store in stack memory. Macros are actually injectable syntax that can be directly interpreted by the Stata interpreter. I happen to think this is very cool. Makes me want to learn Lisp.
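
            To make the "injectable syntax" idea concrete, here is a minimal sketch (the macro names cmd and expr are made up for illustration). The macro contents are pasted into the command line as text, and only then does Stata parse and run the line:

            ```stata
            * A local macro holds text; `name' pastes that text into the command line
            local cmd  "display"
            local expr "2 + 2"
            `cmd' `expr'                 // Stata executes: display 2 + 2

            * Contrast: `=exp' evaluates the expression first and pastes the result
            display "`=2 + 2'"           // 4
            ```

            Nothing is looked up in memory at run time here; by the time Stata executes the line, the macro reference has already been replaced by its contents.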

            In most OO programming languages the command would know to parse out a datetime object without any additional modifiers.
            Java is not dynamically typed, and would absolutely have you explicitly create a new datetime object. And (frankly) I don't think Python's dynamic type system is all that great. I find that Python 3 often makes incorrect type inferences. Neither R nor C is an object-oriented language, C is not dynamically typed, and regardless, all of the languages you mention will require you to make explicit type transformations from time to time. R is maybe the best of the bunch at dynamically determining the type of an object, but dynamic typing in R is responsible for many of the memory inefficiencies characteristic of R as a language. Julia might actually come close to solving many of these problems with its polymorphic type system, but you won't get close to the same number of features or support as with Stata. More to the point, I reject the premise that dynamic typing is a good thing in the first place. Strong typing makes for clearer syntax, code that is easier to understand and reason about, and better compilers.

            Stata is hands down the most frustrating language I have programmed in
            Every language has its drawbacks. I could tell you horror stories about R, C, Java and Python. As a statistical programming language, Stata's ado has clean, readable, easy syntax. Give it some time and I'm sure you'll come to love it too.

            • #7
              Let's return from philosophy to the problem of running synth, a community-contributed command installed from SSC, which is apparently no longer under active development and support by the authors. From post #4

              @William Lisowski thanks for your response - I tried it and got the same outcome (expression too long).
              The "expression too long" error message that was displayed was not a reaction to the command as you typed it. Instead, it was thrown by Stata after synth constructed an expression that exceeded Stata's limits for expressions.

              My intuition suggests that this is a consequence of synth trying to deal with a sequence of 40 enumerated quarters of data. You can test this hypothesis by trying to specify a shorter pre-treatment period - just a year or two.
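
              One way to see the expression a command builds internally is to turn on tracing. A minimal sketch with a made-up program, demo_wrapper, standing in for a community-contributed command (it is hypothetical and has nothing to do with synth's actual code):

              ```stata
              capture program drop demo_wrapper
              program define demo_wrapper
                  // hypothetical stand-in: builds an expression piece by piece, then uses it
                  local e "1"
                  forvalues i = 2/5 {
                      local e "`e' + `i'"
                  }
                  display `e'
              end

              set tracedepth 2     // limit how deep tracing follows nested programs
              set trace on
              demo_wrapper         // lines starting with "=" show commands after macro expansion
              set trace off
              ```

              With real data, the same set trace on / set trace off pair would go around the synth call, and the last expanded line before the error shows the expression Stata rejected.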

              If that is the case, then perhaps you would be better served by constructing your dataset so that you do not need to specify the training period - in which case it will use all the observations prior to the intervention.

              Perhaps you tried that and got some sort of error. I note that you apparently have quarterly data but described using a Stata monthly date rather than a Stata quarterly date. If indeed you have just four observations in each year - March, June, September, and December - then perhaps (pretending your date variable is named mdate)
              Code:
              tsset mdate, delta(3)
              would solve the problem, by telling Stata that your monthly date is recorded in three month increments.
              Code:
              . list mdate if inlist(_n,1,_N), clean
              
                       mdate  
                1.   1980m12  
              161.   2020m12  
              
              . * monthly
              . tsset mdate
              
              Time variable: mdate, 1980m12 to 2020m12, but with gaps
                      Delta: 1 month
              
              . * every three months
              . tsset mdate, delta(3)
              
              Time variable: mdate, 1980m12 to 2020m12
                      Delta: 3 months
              
              .
              Try that and then see if synth runs without specifying the training period.

              Alternatively, and perhaps preferable in the long run, you could create an actual Stata quarterly date variable from your monthly date variable
              Code:
              generate qdate = qofd(dofm(mdate))
              format %tq qdate
              tsset qdate
              And again, synth should be able to proceed without specifying the training period.
              Last edited by William Lisowski; 24 Dec 2022, 09:37.

              • #8
                William Lisowski I tried running the command with just a single year of pre-treatment and also without specifying any pre-treatment period (i.e. synth ra ra, trunit(44)...) and still get the same "expression too long" error. My dataset includes data points at a monthly granularity, so I am unable to specify delta(3) and I'm not sure using qofd() would work because I'm not using months 1, 4, 7, and 10 (I'm using 3, 6, 9, 12). I tried it out and I'm getting an error for "repeated time values within panel" (I am using state fips code as the panel variable - tsset state_fips q_date).
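
                A possible way around the "repeated time values within panel" error, sketched on made-up data rather than the actual dataset: with monthly rows, qofd() maps three months onto each quarter, so one row per quarter has to be kept before tsset will accept the quarterly date. The variable names mirror the ones mentioned in this thread, and keeping months 3, 6, 9 and 12 is an assumption about which rows are wanted:

                ```stata
                * Toy panel: 2 states, 12 monthly observations each (made-up data)
                clear
                set obs 24
                generate state_fips = cond(_n <= 12, 1, 2)
                bysort state_fips: generate mdate = tm(1980m1) + _n - 1
                format %tm mdate

                * Three monthly rows fall in each quarter, hence "repeated time values"
                generate q_date = qofd(dofm(mdate))
                format %tq q_date

                * Keep one row per quarter (assumption: the quarter-end months 3, 6, 9, 12)
                keep if mod(month(dofm(mdate)), 3) == 0
                isid state_fips q_date      // confirms one row per state and quarter
                tsset state_fips q_date
                ```

                Note that qofd() does not care which month of the quarter is kept: March maps to q1, June to q2, and so on, so using months 3, 6, 9, 12 instead of 1, 4, 7, 10 is not a problem in itself.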

                • #9
                  Take a look here, and the link in my last comment.

                  • #10
                    Hey Dimitriy V. Masterov , yeah I agree that the errors for this are super cryptic. I could be guilty of this too, as I'm not finished with foolproofing scul, but I wish authors would code all the basic errors, e.g., "unit needs at least two donors" or "x covariate is completely constant" instead of "command can't run". Same with the inlist error, "too many T0 periods".

                    • #11
                      Dimitriy V. Masterov I enabled the trace and it appears that my results period was too long. Thanks!
