Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Rolling percentiles for large dataset

    Hi,

    I have a large dataset containing more than 30m observations.
    The dataset aggregates daily trading data for about 35000 individuals (variables "date" and "ind").
    I need to compute a rolling 1-, 5-, 95-, and 99-percentile of a certain variable X for each of those individuals over the respective past 60 days.
    So, something like

    Code:
    xtset ind date
    mvsumm X, stat(p1) win(60) gen(p1)  end
    would do the job over a 60-day window. However, this command is relatively slow. It runs for more than a day. Since I need this multiple times, I am in need of a quicker option.
    Is there any other, faster way to do this?
    My impression is that rangestat is faster.
    Unfortunately, there seems to be no option to compute percentiles with rangestat.

    Thanks for any input!
    Last edited by Rolf Miller; 07 Apr 2019, 16:27. Reason: Rolling percentile

  • #2
    In the help for rangestat (SSC, as you were asked to explain) there is a detailed worked example showing how moving quantiles can be calculated.

    Comment


    • #3
      You can also use the perc(k) option of asrol (available from SSC) for rolling window percentiles. You shall find that asrol is blinking fast. To download asrol

      Code:
      ssc install asrol
      The option perc(k) returns the k-th percentile of values in a range. This option must be used in combination with the option stat(median). Without using perc(k) option, stat(median) finds the median value or the 50th percentile of the values in a given window. However, if option perc(k) is specified, then the stat(median) will find k-th percentile of the values in range. For example, if we are interested in finding the 75th percentiles of the values in our desired rolling window, then we have to invoke the option perc(.75) along with using the option stat(median).

      See the following example where we shall find the 75th percentile of the variable profitability in a rolling window of 5 years for each industry in each country.

      Code:
      bys country industry : asrol profitability, window(year 5) stat(median) perc(.75)

      Note :
      The calculation of percentiles follows a similar method as used in summarize and _pctile. Therefore, the percentile values might be slightly different from the values calculated with centile. For details related to different definitions of percentiles, see Hyndman and Fan (1996).

      Hyndman, R., & Fan, Y. (1996). Sample Quantiles in Statistical Packages. The American Statistician, 50(4), 361-365. doi:10.2307/2684934

      Regards
      --------------------------------------------------
      Attaullah Shah, PhD.
      Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
      FinTechProfessor.com
      https://asdocx.com
      Check out my asdoc program, which sends outputs to MS Word.
      For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

      Comment


      • #4
        Here's a worked example for rangestat you can test. Note that all desired percentiles are obtainable in one run.

        Code:
        webuse nlswork, clear
        
        * ssc inst moremata needed for -mm_quantile()-
        mata:  
        mata clear
        real rowvector myquantile(real colvector X) {
            return(mm_quantile(X, 1, (0.01, 0.05, 0.95, 0.99)))
        }
        end
        
        rangestat (myquantile) ln_wage, interval(age -2 2)
        
        label var myquantile1 "p1"
        label var myquantile2 "p5"
        label var myquantile3 "p95"
        label var myquantile4 "p99"
        From #1 the syntax you need is something like -- with the Mata code above as preamble ---

        Code:
         rangestat (myquantile) X, by(ind) int(-60 -1)
        or

        Code:
         rangestat (myquantile) X, by(ind) int(-59 0)
        depending on whether the last 60 days excludes the present day or not.

        Note that with a sample size of 60 (or fewer if there are gaps or missing values), the 1% and 99% percentiles will just be the minimum and maximum.
        Last edited by Nick Cox; 08 Apr 2019, 01:36.

        Comment

        Working...
        X