  • New on SSC: -tsegen- for computations over a rolling window using time-series operators with egen functions

    Thanks to Kit Baum, a new package tsegen (with Nick Cox) is now available from SSC.

    With tsegen, you can invoke any egen function that requires a varlist using a time-series varlist (tsvarlist) instead. tsegen converts the tsvarlist to a varlist by substituting equivalent temporary variables as necessary and then invokes the specified egen function.
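
    As a rough sketch of the idea (not the actual tsegen code), the same substitution can be done by hand with official Stata's tsrevar, which creates temporary variables for the operated variables and leaves their names in r(varlist). The sketch assumes tsrevar is the conversion step, and the variable name inv_m5_manual is only illustrative:

    ```stata
    * Sketch only: mimic by hand what tsegen automates
    * (assumption: tsrevar is the conversion step; inv_m5_manual is a made-up name)
    webuse grunfeld, clear
    * Replace each time-series operated variable with a temporary variable
    tsrevar L(0/4).invest
    * r(varlist) now holds an ordinary varlist that egen can accept
    egen inv_m5_manual = rowmean(`r(varlist)')
    ```

    tsegen saves you from handling the temporary variables yourself.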

    tsegen requires Stata version 10 or higher. To install, type:

    Code:
    ssc install tsegen
    tsegen is particularly useful for computing descriptive statistics over a rolling window of time using egen row functions. By using a set of lagged variables, tsegen can compute the statistic directly and is orders of magnitude faster than other approaches that require looping over subsets of observations. For example, you can calculate the mean over a 5-year window that includes the current observation:

    Code:
    webuse grunfeld, clear
    tsegen inv_m5 = rowmean(L(0/4).invest)
    See help tsvarlist for more information on the various ways to specify a set of operated variables. Since rowmean() ignores missing values, a non-missing mean may be based on a single observation. With tsegen, you can require, for example, a minimum of 3 non-missing observations:

    Code:
    tsegen inv_m5m3 = rowmean(L(0/4).invest, 3)
    Here's another example that calculates the standard deviation over a 3-year moving window, with and without the current observation:

    Code:
    tsegen inv_sd3 = rowsd(L(0/2).invest)
    tsegen inv_sd3L = rowsd(L(1/3).invest)
    tsegen can also be used with windows of time that span periods before and after the current observation. For example, you can apply a smoothing filter over 5 periods centered on the current observation using:

    Code:
    webuse sales1, clear
    tsegen sm = rowmean(L(0/2).sales F(1/2).sales)
    tssmooth ma sm1=sales, window(2 1 2)
    The two commands perform exactly the same computation, but preliminary testing suggests that tsegen is much faster.
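
    One way to check the agreement yourself, assuming tsegen is installed from SSC, is official Stata's compare command, which tabulates where two variables agree and differ. This is only a quick check on the example data, not a general proof:

    ```stata
    * Rebuild both smoothed series, then compare them observation by observation
    webuse sales1, clear
    tsegen sm = rowmean(L(0/2).sales F(1/2).sales)   // requires tsegen (SSC)
    tssmooth ma sm1 = sales, window(2 1 2)
    * compare reports counts of sm<sm1, sm=sm1, sm>sm1 and the size of any differences
    compare sm sm1
    ```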

    tsegen is not limited to egen functions that ship with Stata. If you have egenmore installed (from SSC), you can, for example, tag observations that reflect 3 consecutive years of positive growth in market value:

    Code:
    webuse grunfeld, clear
    gen double diff = D.mvalue
    tsegen pg = rall(L(0/2).diff,3) , c(@ > 0)

  • #2
    Bravo to you and Nick for doing this! It is clear from all the posts on the forum asking for these kinds of calculations that there is enormous need for this.



    • #3
      Very helpful work. Thanks, Robert and Nick.

      A suggestion for further improvement: It would be nice if tsegen would also be byable for those egen functions that are. For example, somebody may want to compute the mean for a one-period lagged variable, say:
      Code:
      webuse grunfeld, clear
      by company: tsegen inv_ml1 = mean(L.invest)
      Unfortunately, this results in an error message because using by does not work (yet?). Making tsegen byable would also be helpful to compute the mean (or other byable egen functions) within group only for groups with a certain minimum number of nonmissing observations, which is a very nice feature of tsegen.

      Another minor comment about your last example: a particularly nice property of tsegen is that one does not need to generate the intermediate variable. Instead, the following combination of time-series operators with tsegen works:
      Code:
      webuse grunfeld, clear
      tsegen pg2 = rall(L(0/2).D.mvalue,3) , c(@ > 0)
      https://twitter.com/Kripfganz



      • #4
        Thanks very much for your comments. As it happens, your first example can be achieved by

        Code:
        tsegen inv_ml_2 = mean(L.invest), by(company)
        although that is by virtue of an undocumented feature of egen rather than anything we have introduced.

        Robert as first author may well have a better take, but my misgiving about supporting by: as a prefix is that it may well clash with the sort order needed to make time series calculations work at all. Perhaps we could implement it conditional on the one variable specified being the panel identifier. Sometimes, however, there is virtue in a program doing only what is claimed. This is just an interim reaction.



        • #5
          Thanks for pointing to this workaround.

          I do not think that the by prefix interferes with the time-series calculations because the temporary variables for the time-series varlist are substituted before the specific egen functions are invoked, but I might be wrong.
          https://twitter.com/Kripfganz



          • #6
            This is indeed interesting and goes beyond the original concept behind tsegen, which was to unroll the loops required to perform computations on a rolling/moving window of time and instead make the calculations by observation across a set of time-series operated variables. Since, as far as I know, no byable egen function accepts a varlist (as opposed to a single varname), it seemed natural not to support the by: prefix.

            The egen mean() function accepts an expression, not just a varname. So while your example can be made to work as Nick has shown, the following will not work:

            Code:
            tsegen inv_ml_2 = mean(invest - L.invest), by(company)
            and there is no magic that can easily be retrofitted to tsegen that would parse an expression with time-series operators and substitute equivalent temporary variables before passing the argument to the egen function. So I'm inclined to leave tsegen as is, sorry.




            • #7
              Sebastian, re #5, you are correct. tsegen calls tsrevar to substitute equivalent temporary variables and those are then passed along to the egen function. Unfortunately, tsrevar can only be used with a varlist that may contain time-series operators and not an expression. A workaround would require parsing the expression and that would not be trivial.



              • #8
                That's a fair point. It is probably not worth the effort to make tsegen work for general expressions. Yet, this is not really an issue of the by prefix but more of allowing an expression versus only a varlist because the same example without by(company) also does not work.

                As far as I understand, the only thing that egen does with the by prefix is to translate it into the by option that is then parsed by the specific egen functions:
                Code:
                ...
                        if _by() {
                                local byopt "by(`_byvars')"
                                local cma ","
                        }
                ...
                https://twitter.com/Kripfganz



                • #9
                  You are correct that all egen does is pass along the by option as an argument to the function, which is exactly the workaround that Nick suggested in #4. But as noted in #6, even if tsegen were modified to pass along the by option and you could write

                  Code:
                  by company: tsegen inv_ml_2 = mean(invest - L.invest)
                  you would still have the following error:

                  Code:
                  L.invest invalid name
                  r(198);
                  because tsrevar does not take an expression. I still think that it is simpler to limit tsegen to functions that accept a varlist.



                  • #10
                    I have a follow-up issue on the use of expressions with tsegen:

                    While this was not the original purpose of developing tsegen, I personally find it very helpful that I can in general also use it with egen functions other than row*(). Yet, when using those egen functions with expressions (instead of just a variable name) this might lead to unexpected error messages. For example:
                    Code:
                    . webuse grunfeld, clear
                    
                    . tsegen stock_Lval = total(kstock * L.mvalue)
                    kstockcompanyyearinvestmvaluekstocktime__000000__000001 invalid name
                    r(198);
                    
                    . tsegen inv_Linv = total(invest + L.invest)
                    + invalid name
                    r(198);
                    I would be very happy if you could provide a workaround for such situations in a future update of tsegen. In any case, the displayed error messages are not very intuitive (particularly in the first case).

                    P.S. I am using the latest version 1.1.2 (30may2015) of tsegen.

                    Edit: Maybe the easiest "solution" is to check somewhere in the code with the confirm command whether the arguments are a varlist and if not to issue an appropriate error message.
                    Last edited by Sebastian Kripfganz; 15 Jun 2015, 10:43.
                    https://twitter.com/Kripfganz



                    • #11
                      The reason why you get this puzzling error message is that the arguments you provided can be interpreted as a time-series variable list:

                      Code:
                      . tsunab vlist : kstock * L.mvalue
                      
                      . dis "`vlist'"
                      kstock company year invest mvalue kstock time __000000 L.mvalue



                      • #12
                        Good point. I did not think about * as a wildcard symbol for varlists in this context. My suggestion then would be to add the line
                        Code:
                        conf v `args'
                        in the tsegen.ado file just before
                        Code:
                        cap tsrevar `args'
                        to turn this puzzling error message into a more meaningful one.

                        Anyway, tsegen is a great command!
                        https://twitter.com/Kripfganz



                        • #13
                          Expanding Robert's reply a little, what is biting here is that tsegen feeds on a tsvarlist (and not, in particular, on arbitrary expressions).

                          When the operator * is used (intended here to mean multiplication), it is interpreted as a wildcard matching all variables, so the argument is at first a legal tsvarlist, but problems arise later.

                          Conversely, when the operator + is used (addition, naturally), the problem is that it has no interpretation as an operator for varlists.

                          I think Sebastian realises this is what is happening, so this is for anyone otherwise bemused.
                          Last edited by Nick Cox; 15 Jun 2015, 13:00.



                          • #14
                            I had a second thought about my suggestion in the previous post. My proposal might actually not be a good idea because wildcards probably should be allowed in general but the confirm command does not accept them. Sorry for all the confusing posts.
                            https://twitter.com/Kripfganz



                            • #15
                              tsegen is not limited to egen functions that ship with Stata. If you have egenmore installed (from SSC), you can, for example, tag observations that reflect 3 consecutive years of positive growth in market value:

                              Code:
                              webuse grunfeld, clear
                              gen double diff = D.mvalue
                              tsegen pg = rall(L(0/2).diff,3) , c(@ > 0)
                              How would I modify this last example from post #1 to tag observations that reflect more than 3 consecutive years of positive growth in market value?
