
  • New on SSC: rangestat - a program to generate statistics using observations within range

    Thanks to Kit Baum, a new program called rangestat (with Roberto Ferrer and Nick Cox) is now available on SSC. Stata 11 is required.

    rangestat calculates statistics for each observation using all observations where a numeric key variable is within the low and high bounds defined for the current observation. For panel data and time-series, rangestat can generate statistics over a rolling window of time. In addition to its built-in statistics, rangestat can apply user-supplied Mata functions.

    To install, type in Stata's command window:

    Code:
    ssc install rangestat
    Once installed, type

    Code:
    help rangestat
    to get more information.

    rangestat offers an efficient solution to a type of Stata problem that appears simple but remains vexing to solve in Stata: you need to calculate something that is specific to each observation but the calculations use values from other observations and there's no way to group observations using the by: prefix to perform the task directly. This type of problem typically requires some form of looping. The brute force approach is to loop over each observation and make calculations on the desired subset of observations using an if condition. For example, say you want to calculate the mean wage of other people of similar age. A brute force solution could look like:

    Code:
    sysuse nlsw88, clear
    gen double mwage = .
    quietly forvalues i = 1/`=_N' {
     summarize wage if inrange(age[`i'], age-1, age+1) & _n != `i', meanonly
     replace mwage = r(mean) in `i'
    }
    With rangestat, you can get the same using:

    Code:
    rangestat (mean) rmwage = wage, interval(age -1 1) excludeself
    The syntax of rangestat is very similar to that of Stata's collapse command except that instead of reducing the number of observations, you create new variables with the desired statistics.
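    As a minimal sketch of that collapse-like syntax, several statistics can be requested in one call. The default variable names mentioned in the comment are an assumption based on the naming that appears in the describe output of the grunfeld example in this post.

    ```stata
    sysuse nlsw88, clear

    * mean, sd and count of wage among people within one year of age;
    * the new variables should be named wage_mean, wage_sd and wage_count
    rangestat (mean) wage (sd) wage (count) wage, interval(age -1 1) describe
    ```

    Unlike collapse, the dataset keeps all of its observations; each one simply gains the new variables.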

    Another example would be rolling windows of time. tsegen (from SSC, with Nick Cox) also handles such problems and remains the most efficient solution in terms of execution time, as long as the time window is manageable. tsegen is fast because Stata is very efficient at creating temporary variables that hold the values of the lag/lead observations and the statistic is calculated using all observations at the same time. The downside of tsegen is that all these temporary variables require more memory. On the other hand, rangestat is frugal in terms of memory and more flexible in that it can calculate more than one statistic at a time. For example,

    Code:
    . webuse grunfeld, clear
    
    . tsegen double inv_m5b = rowmean(L(0/4).invest)
    
    . rangestat (mean) invest (sd) sd_inv=invest kstock (count) invest kstock, interval(year -4 0) by(company) describe
    
                  storage   display    value
    variable name   type    format     label      variable label
    ------------------------------------------------------------------------------------
    invest_mean     double  %10.0g                mean of invest
    sd_inv          double  %10.0g                sd of invest
    kstock_sd       double  %10.0g                sd of kstock
    invest_count    double  %10.0g                count of invest
    kstock_count    double  %10.0g                count of kstock
    
    . assert inv_m5b == invest_mean
    
    .
    Preliminary testing suggests that rangestat is faster than tsegen when the time window spans more than 50 periods, less if memory is constrained or if tsegen needs to be called repeatedly to generate more statistics.

    Finally, an exciting and powerful feature of rangestat is its ability to call a user-written Mata function to perform calculations. rangestat performs all of its tasks in Mata and has an extremely efficient engine to identify which observations are in the specified range. For each observation, rangestat prepares a single real matrix that contains the values to use for the calculations. A user-supplied Mata function needs only to accept that matrix and return results in a real rowvector. The size of the rowvector does not matter: rangestat will create as many variables as needed to store the results. Here is a quick example of how to calculate the correlation between two variables on a rolling window:

    Code:
    clear all
    webuse grunfeld
    
    mata:
    mata set matastrict on
    
    real rowvector N_corr(real matrix X)
    {
      real matrix R
    
      R = correlation(X)
      return(rows(X), R[2,1])
    
    }
    
    end
    
    rangestat (N_corr) invest mvalue, interval(year -5 0) by(company) casewise describe
    The Mata function N_corr() returns two values. The first contains the number of rows in X, in other words the number of observations that were in range. The second value is the correlation's rho. rangestat creates two variables to store these values, N_corr1 and N_corr2 respectively.

    As long as your Mata function accepts a single real matrix and returns a real rowvector, you can do anything you want. You could even program a regression, and rangestat will handle all the details of running this regression for each observation over a rolling window of time.
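    A rolling regression along these lines could be sketched as follows. The function name myreg() is hypothetical, and the sketch assumes that the first variable listed is the dependent variable and that a constant should be added; it computes plain OLS coefficients and nothing else.

    ```stata
    clear all
    webuse grunfeld

    mata:
    real rowvector myreg(real matrix Xall)
    {
        real colvector y, b
        real matrix X

        y = Xall[., 1]                                   // first column: dependent variable
        X = Xall[., 2..cols(Xall)], J(rows(Xall), 1, 1)  // predictors plus a constant
        b = invsym(quadcross(X, X)) * quadcross(X, y)    // OLS coefficients
        return(rows(X), b')                              // N, slopes, constant
    }
    end

    * 6-year rolling window within each company; casewise omits
    * observations with a missing value in any of the variables
    rangestat (myreg) invest mvalue kstock, interval(year -5 0) by(company) casewise
    ```

    Following the naming in the N_corr() example, the results should land in myreg1 (number of observations), myreg2 and myreg3 (slopes for mvalue and kstock) and myreg4 (constant). Windows with too few observations still produce output, so it is worth checking myreg1 before using the coefficients.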

  • #2
    Very cool, Robert! Thank you so much.



    • #3
      Is there a way to let the outcome be 'error' or 'not enough values' if the range given with rangestat is incomplete? For example, I would like the result to be the maximum value over a given year only if the range is complete (observations going back at least one year).



      • #4
        #3 is already being discussed at http://www.statalist.org/forums/foru...e-52-week-high

        Everyone: Please follow discussion there.

        Gilles: Please don't post the same question in concurrent threads.



        • #5
          cool, it's a useful program, thank you so much, Robert



          • #6
            Hi everyone,
            I am a new member and just joined the forum today. I am working on my thesis, which uses an unbalanced panel dataset of 4000 farms over 13 years. I am trying to calculate the rolling skewness and rolling semi-kurtosis of farms' margins over a 2-year window (i.e. the current year and the previous year). For rolling kurtosis, I am interested in the left tail of the distribution, which I would define using a certain percentile of the program margin. I followed related posts on the forum and can calculate a rolling standard deviation with this command by Clyde Schechter. Many thanks to Clyde Schechter for that.

            xtset farm_id year
            foreach v of varlist prefr1 prefr2 {
            tsegen rolling_sd_`v' = rowsd(L(0/1).`v', 2)
            }

            I'd like to ask whether there is a corresponding row function for skewness and kurtosis. And how can I restrict my calculation of rolling kurtosis to the left tail only?
            Or can I just narrow down the observations first by creating a subset and then calculate rolling kurtosis for that subset?

            I'd much appreciate any advice or help!
            Thank you very much,
            Truc Phan
            Last edited by Truc Phan; 06 Apr 2018, 11:22.



            • #7
              Truc Phan: See rangestat (SSC) for skewness and kurtosis.

              For your own analogue of kurtosis based on one tail, you'll need (I think) to write your own small program and use rangerun (SSC). You don't give a formula or a precise reference, so the recipe ("based on a certain percentile") is not clear to me.

              Please note that I explain the provenance of community-contributed (user-written) programs I cite. You are asked to do that too: here the citation needed is tsegen (SSC). See also FAQ Advice #12.

              Although skewness and kurtosis have enjoyed a modest resurgence in some circles, their pitfalls remain underestimated. For one cautionary tale, see

              SJ-10-3 st0204 . . Speaking Stata: The limits of sample skewness and kurtosis
              . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
              Q3/10 SJ 10(3):482--495 (no commands)
              uses Stata and Mata to show that sample skewness and
              kurtosis are limited by sample size and that these limits
              impart bias to estimation

              https://www.stata-journal.com/sjpdf....iclenum=st0204
              Last edited by Nick Cox; 06 Apr 2018, 11:38.



              • #8
                Thanks very much, Nick, for your prompt feedback. I am sorry for not citing the reference. I will read all the related posts you suggested and try writing the commands. I think I will use the 25th percentile to define the left tail of the farm margin.



                • #9
                  The help for rangestat includes a worked example of moving quantiles.



                  • #10
                    I suppose if I wanted to use frequency weights to compute mean/var/skewness, I cannot use rangestat?
                    Last edited by Sven Johnsson; 02 May 2018, 02:24.



                    • #11
                      rangestat doesn't support weights, if that is what you are asking. summarize supports the calculation you want, so you could use rangerun (SSC) if you have a problem with similar flavour to rangestat.



                      • #12
                        EDIT: This is a panel dataset where the panel identifier is id and the time variable is monthly (format %tm).

                        I am trying to run 2-year rolling regressions using rangestat with the following command:
                        Code:
                        rangestat (reg) var1 var2 var3 var4, interval(year 0 2) by(id)
                        but it seems that I am not understanding the "interval" quite right, as the output variable reg_nobs exceeds 24, which should not happen as I am using monthly data.

                        Does anyone have any idea how I should change my code in order to run 2-year regressions?

                        Thanks.
                        Last edited by Christian Nydal; 18 Jun 2018, 07:06. Reason: Additional info on dataset



                        • #13
                          Your requested interval is from year + 0 to year + 2. That is the set {year + 0, year + 1, year + 2} and each set includes up to 3 years.
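                          For a 24-month window on monthly data, the interval would instead be defined on the monthly date variable. A minimal sketch, assuming that variable is called mdate (a hypothetical name; #12 does not give it) and that the window should look backward:

                          ```stata
                          * mdate is a hypothetical name for the monthly (%tm) time variable;
                          * the window covers the current month and the 23 months before it
                          rangestat (reg) var1 var2 var3 var4, interval(mdate -23 0) by(id)
                          ```

                          With this interval, reg_nobs cannot exceed 24 as long as each id has at most one observation per month. For a forward-looking window, interval(mdate 0 23) would play the same role.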



                          • #14
                            Dear Stata Users

                            I am working with daily panel data (id, date). I need to run a regression for each id and month and collect the residuals. statsby proved extremely inefficient for this problem. I hope to draw your attention to the problem below:

                            Originally posted by Olena Onishchenko
                            Nick

                            Thank you. I think what I need is a regression by id, month.

                            This one seems to be working:

                            Code:
                            rangestat (reg) BuyInst sentiment_volume_1d_n_lag1, interval(month 0 0) by(id)
                            The output gives coefficients, standard errors but no residuals. Is there any chance rangestat can produce residuals?
                            Thank you.
                            Last edited by Olena Onishchenko; 02 Nov 2018, 01:26.



                            • #15
                              After running rangestat, one line suffices of the form

                              Code:
                              gen double residual = y - b_x * x - b_cons
                              where you should write your own variable names instead of y and x and subtract a product coefficient * predictor for each predictor.
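                              With the variable names used in #14, that line might look like the sketch below. It assumes that rangestat (reg) stored the slope as b_ followed by the predictor's name and the intercept as b_cons, in line with the b_x and b_cons pattern above.

                              ```stata
                              * residual = outcome - slope * predictor - intercept, evaluated
                              * with the coefficients stored in each observation
                              gen double residual = BuyInst ///
                                  - b_sentiment_volume_1d_n_lag1 * sentiment_volume_1d_n_lag1 ///
                                  - b_cons
                              ```

                              Because interval(month 0 0) gives every observation in the same id and month the same coefficients, these are the within-month regression residuals.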
