  • Creating a moving percentile rank based on a look back window of 252 days.

    Hello everybody, I was hoping somebody could help me with my following problem:

    I have an unbalanced panel dataset consisting of firms (secid) and daily dates (date). For each secid and date there is a variable (iv) that corresponds to the implied volatility of an option for that firm on that date. I want to create a moving percentile rank that describes each iv relative to its historical values over the last year. That is, for each secid, each daily iv should be ranked as a percentile within a look-back window of 252 days. All firms have more than 252 days of data, but they don't all have the same dates or number of observations. I tried the following commands, but they ran for several hours without finishing, so I had to stop them:

    tsset secid date

    bysort secid (date):

    rolling i, window(252) clear:

    egen i = rank(iv), track

    gen pcrank = (i-1) / (252-1)

    Appreciate any feedback, thank you very much.

  • #2
    Making assumptions about your data structure and how you want the ranking to work, this loop does what you describe. I don't know how to (or if it's possible to) get -rolling- to do what you are asking -- AFAIK, it's commonly used for statistical procedures (e.g., moving window / rolling regressions).



    Code:
    *create fake data since no example data were provided*
    clear
    set obs 1000
    g secid  = 1000+mod(_n, 5)
    g date = int(11323+runiform()*10000)
    format date %td
    bys secid date: drop if _N!=1 //drop duplicate secid-date pairs so -tsset- works
    g iv = int(runiform()*1000)
    su
    tsset secid date
    
    
    
    
    *loop through each panel / window of dates of 252 days each*
    
    g rank = .  //rank compared with all id's in that 252 day window
    g totalcompared = .  //contains total considered in ranking
    loc d = 252 //set window length here
    
    qui su date, d
    forval n = `r(min)'(1)`r(max)' {
        loc spanel = `n'-`d'
            tempvar i
            tempvar j
        qui egen `i' = rank(iv) if inrange(date, `spanel', `n'), field
            qui replace rank = `i' if mi(rank) & !mi(`i') & date ==`n'
        qui egen `j' = max(`i')
            qui replace totalcompared = `j' if mi(totalcompared) & !mi(`j') & date==`n'
        }
        
    
        
        
        
    *see how this works with a subset:
    sort date secid
    cap drop check*
    *using a range 252 prior to observation 100*
    qui egen check1 = rank(iv) if inrange(date, `=`=date[100]'-252', `=date[100]'), field
    qui egen check2 = max(check1) if inrange(date, `=`=date[100]'-252', `=date[100]')
    l if inrange(date, `=`=date[100]'-252', `=date[100]')
    Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX



    • #3
      There's probably a lot of inefficiency in this part that I didn't consider, since I only simulated 1,000 observations:
      Code:
       qui su date, d
      forval n = `r(min)'(1)`r(max)' {
      since it will potentially loop over many days/periods with no observations, depending on the date ranges of your data. It may make more sense to store all the possible dates in a local (assuming the number of items doesn't push past the local macro limit; see c(macrolen)) with something like -levelsof-, and then loop over each date that is in your data in order to calculate the rank in the 252 days prior to that. There are probably other ways to better loop over the values of date by secid that I haven't thought of.
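
      The gain from looping only over dates that actually occur in the data, rather than every integer between the minimum and maximum date, can be pictured with a tiny sketch (Python, made-up dates, purely illustrative):

```python
# dates actually observed (made up); many calendar days carry no data
obs_dates = [3, 3, 7, 9, 9, 42]

every_day = range(min(obs_dates), max(obs_dates) + 1)  # 40 candidate iterations
distinct = sorted(set(obs_dates))                      # only 4 dates carry data

print(len(list(every_day)), distinct)  # 40 [3, 7, 9, 42]
```
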
      Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX



      • #4
        The fastest way to do this is to use rangestat from SSC.

        With panel data, a rolling window problem means that you must calculate results separately for each observation in the data because the set of observations within the window changes for each observation. What rolling does is to set up a loop over each observation in the dataset and call your function using an if condition that selects the observations within the window for the current observation. There's also lots of extra overhead required to save each result in a separate dataset.

        rangestat handles all the mechanics of selecting the set of observations within the window in Mata and is super efficient at it. There's no back and forth to save results in a separate file as all results are tabulated in Mata and transferred back to Stata at the end of the run. Unfortunately, there's no built in function to calculate the rank so you have to supply your own Mata function to get there. But in this case you do not have to reinvent the wheel since there's such a function in the moremata package (also from SSC).
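
        For readers who want to see the mechanics outside Stata, here is a brute-force sketch of the same per-observation computation (Python, made-up data; this is the naive O(n²) loop that rangestat avoids by doing the window selection in Mata):

```python
from datetime import date, timedelta

def rolling_rank(records, window_days=252):
    """Rank each observation's iv among the non-missing iv values for the
    same secid in the trailing window (ties take the lowest rank, as with
    egen rank(), track).  Returns one (nobs, rank) pair per record."""
    out = []
    for secid, d, iv in records:
        start = d - timedelta(days=window_days - 1)
        window = [v for s, dd, v in records
                  if s == secid and start <= dd <= d and v is not None]
        if iv is None:
            out.append((len(window), None))
        else:
            out.append((len(window), 1 + sum(1 for v in window if v < iv)))
    return out

# one firm, five days, one missing value, 3-day window
recs = [(1, date(2000, 1, 1), 10), (1, date(2000, 1, 2), 20),
        (1, date(2000, 1, 3), 15), (1, date(2000, 1, 4), None),
        (1, date(2000, 1, 5), 20)]
print(rolling_rank(recs, window_days=3))
# [(1, 1), (2, 2), (3, 2), (2, None), (2, 2)]
```
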

        When you plan to solve a rolling window problem, the first step is to identify, using standard Stata commands, how you would perform the calculation for individual observations. As you have observed, there's an egen rank() function for that. Since using explicit subscripting is a no-no with egen functions, I think the safest way to calculate results for individual observations is to clone the variable with the values for iv within the desired window.

        For every observation in the data, rangestat calls your Mata function with a real matrix X that contains the values for the observations that fall within the window. The moremata function mm_ranks() has a few options; the arguments I used match what's done with the egen rank(), track function. With all rolling window problems, you will want to know how many observations with non-missing values were used to perform the calculation. Since the rangestat call includes a casewise option, the X matrix contains no missing values. So the function returns rows(X), which is the number of non-missing values for the observations within the window. The function also returns R[rows(X)], which is the rank of the last observation within the window. The last observation within the window is also the current observation, given the interval specified.

        If the value of iv is missing for the current observation, the result is set to missing.

        Code:
        * fake data for 10 secid, each with data for 300 up to 3000 days, with missings
        clear
        set seed 3131
        set obs 10
        gen secid = _n
        expand runiformint(300,3000)
        bysort secid: gen date = mdy(1,1,2000) + _n
        format %td date
        * data has missing values and missing observations
        gen iv = int(runiform() * 100) if runiform() < .95
        drop if runiform() < .05
        sum, format
        tsset secid date
        
        
        * generate results for observation 300
        clonevar x = iv if inrange(date, date[300]-251, date[300]) & secid == secid[300]
        egen i = rank(x), track
        list in 300
        drop x i
        
        * generate results for observation 5000
        clonevar x = iv if inrange(date, date[5000]-251, date[5000]) & secid == secid[5000]
        egen i = rank(x), track
        list in 5000
        drop x i
        
        
        * define Mata function to return number of obs and rank of last obs
        mata:
          mata clear
          real rowvector myrank(real matrix X)
          {
            real matrix R
            R = mm_ranks(X,1,1)
            return(rows(X), R[rows(X)])
          }
        end
        
        rangestat (myrank) iv, interval(date -251 0) by(secid) casewise
        
        * rename results using more descriptive names and compare results
        rename myrank1 nobs
        rename myrank2 rank
        
        * set rank to missing if iv is missing
        replace rank = . if mi(iv)
        
        list in 300
        list in 5000



        • #5
          There are points of interest of many different kinds here.

          Executive summary:

          1. Eric's approach is perfectly consistent, modulo uncontroversial tweaks, with Robert's approach and (e.g.) copes well with missing values etc.

          2. Robert's assertion of speed is backed up: his approach is 30 times faster.

          Let's peel off some important details in comparing approaches.

          1. rolling's idea of a window of length # is that it includes # distinct times and ends with the present. For a window of length 252 that is consistent with Robert's use in #4 of an interval that starts at present date - 251 and ends at present date (- 0, as it were). In contrast Eric's code is inclusive from - 252 to present, so includes one more date. Off-by-one differences ("fencepost errors") are all too easy here.

          2. Daniel in #1 stated, as Robert and I interpret him, that he wanted percentile rank determination for each distinct identifier.

          3. In calling egen Daniel in #1 used track ranks; Eric used field ranks; and Robert used track ranks. I would use standard ranks, in which ties are resolved by assigning the mean rank, but the larger points are simply that in comparing approaches like must be compared with like and that people have a choice.

          4. Daniel's interest in percentile ranks is met tacitly in later posts by providing rank and count as raw ingredients. I have a bias to regarding the discussion at http://www.stata.com/support/faqs/st...ing-positions/ as lucid, scholarly and usefully detailed. Denote rank by i and number of values by n. Daniel's recipe (i - 1) / (252 - 1) will at best only be correct when the number of values ranked n is in fact 252 and could be very wildly wrong otherwise. In the FAQ just cited it is called spreadsheet style and long-term members here will know what to think about that. Although (i - 1) / (n - 1) produces percentile ranks that range from exactly 0 to exactly 1 I assert that statistical experience makes you singularly reluctant to use it, as for example invnormal(0) and invnormal(1) are both indeterminate. (The recent paper http://www.stata-journal.com/article...article=st0465 documents Galton's early use of (i - 0.5) / n.)

          5. Eric's code was flagged as being, in effect, quick to write and likely to be slow to run, and absolutely no criticism there. I probably spent more time trying to tweak it than he did writing it. Below I put together Robert's example, Eric's code made compatible and give some timings. FWIW, some further experiments indicate that some of the difference in timing can be reduced by replacing egen calls with direct Stata commands, but only some. My guess is that Daniel's route in #1 cannot be made to work easily, but I didn't try.

          6. It would be a fair comment that you need to put in some work to identifying and/or writing a Mata function to get rangestat to work here. I got as far as identifying mm_ranks() as helpful and writing some initial code before going to bed late last night after passing the baton to Robert, who then leapt up and sprinted round the bend.
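
          Points 1 and 4 can be checked numerically. This sketch (Python, purely illustrative, with a made-up n) verifies the inclusive window count and the contrasting behaviour of the two percentile-rank recipes:

```python
from statistics import NormalDist

# Point 1: a window ending at day d that covers 252 distinct days is [d - 251, d]
d = 21000
assert d - (d - 251) + 1 == 252  # inclusive count
assert d - (d - 252) + 1 == 253  # starting at -252 covers one day too many

# Point 4: percentile-rank recipes for rank i among n values (n = 6 made up)
n = 6
spreadsheet = [(i - 1) / (n - 1) for i in range(1, n + 1)]  # hits 0 and 1 exactly
galton = [(i - 0.5) / n for i in range(1, n + 1)]           # stays strictly inside (0, 1)

assert spreadsheet[0] == 0.0 and spreadsheet[-1] == 1.0
assert all(0.0 < p < 1.0 for p in galton)

# invnormal(0) and invnormal(1) are indeterminate; every Galton value transforms fine
z = [NormalDist().inv_cdf(p) for p in galton]
```
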

          Code:
          * fake data for 10 secid, each with data for 300 up to 3000 days, with missings
          clear
          timer clear
          
          set seed 3131
          set obs 10
          gen secid = _n
          expand runiformint(300,3000)
          bysort secid: gen date = mdy(1,1,2000) + _n
          format %td date
          * data has missing values and missing observations
          gen iv = int(runiform() * 100) if runiform() < .95
          drop if runiform() < .05
          sum, format
          tsset secid date
          
          timer on 1
          * generate results for observation 300
          clonevar x = iv if inrange(date, date[300]-251, date[300]) & secid == secid[300]
          egen i = rank(x), track
          list in 300
          drop x i
          
          * generate results for observation 5000
          clonevar x = iv if inrange(date, date[5000]-251, date[5000]) & secid == secid[5000]
          egen i = rank(x), track
          list in 5000
          drop x i
          
          * define Mata function to return number of obs and rank of last obs
          mata:
            mata clear
            real rowvector myrank(real matrix X)
            {
              real matrix R
              R = mm_ranks(X,1,1)
              return(rows(X), R[rows(X)])
            }
          end
          
          rangestat (myrank) iv, interval(date -251 0) by(secid) casewise
          
          * rename results using more descriptive names and compare results
          rename myrank1 nobs
          rename myrank2 rank
          
          * set rank to missing if iv is missing
          replace rank = . if mi(iv)
          
          list in 300
          list in 5000
          timer off 1
          
          timer on 2
          * Eric Booth's code, tweaked
          g eb_rank = .  //rank compared with all id's in that 252 day window
          g eb_nobs = .  //contains total considered in ranking
          loc d = 252 //set window length here
          
          su date, meanonly
          quietly forval n = `r(min)'/`r(max)' {
              count if date == `n'
              if r(N) {
                  tempvar i j
                  egen `i' = rank(iv) if inrange(date, `n' - `d' + 1, `n'), track by(secid)
                  replace eb_rank = `i' if date == `n'
                  egen `j' = count(iv) if inrange(date, `n' - `d' + 1, `n'), by(secid)
                  replace eb_nobs = `j' if date == `n'
                  drop `i' `j'
              }    
          }
          
          timer off 2
          
          assert eb_rank == rank
          assert eb_nobs == nobs
          
          timer list
          
          . timer list
             1:      1.92 /        1 =       1.9190
             2:     59.85 /        1 =      59.8480
          Last edited by Nick Cox; 06 Apr 2017, 05:15.



          • #6
            Thank you for your replies Robert Picard and eric_a_booth. Robert, I used your code and it's very helpful! I suppose when you say
            "* generate results for observation 5000" you mean 3000, since that is the maximum number of observations for any secid. Is that correct? Also, when I run the code the rank seems to be computed over a window of 171 days instead of 252, as the maximum value of nobs is 171. I cannot see why it does that or how to change it. I appreciate your feedback.



            • #7
              Robert meant exactly what he said by

              Code:
              in 5000
              It's in his code; the code is entirely reproducible. The argument of in always refers to observation number in the entire dataset in the present sort order and is never interpreted within groups.

              Your other question seems to be about what happens with your real data, which we cannot see. But nobs is just the number of values used in a window and does not indicate the spread of time used. For example, if a window only includes non-missing values 250 and 1 days before the present then the time spread is 250 (counting inclusively) but nobs will be reported as 2. 171 means that you have values to use on 171/252 days in at least one window. Only you will know whether that is a surprise.
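
              A tiny numeric version of that example (Python, illustrative only, with made-up dates):

```python
from datetime import date, timedelta

# window ending "today"; non-missing values were observed only 250 days
# and 1 day before the present
today = date(2017, 4, 6)
observed = [today - timedelta(days=250), today - timedelta(days=1)]

nobs = len(observed)                             # values used in the window
span = (max(observed) - min(observed)).days + 1  # time spread, counted inclusively

print(nobs, span)  # 2 250
```
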



              • #8
                Dear Statalisters,

                I am pretty happy to have found this thread. It covers pretty much what I was searching for.
                Unfortunately, I was not able to completely transfer the ideas from here to my own problem.
                Maybe someone might be able to help me with my very similar issue.

                In contrast to the initial problem, my data is TIME-SERIES data.
                I need a 200 trading day window and need to generate deciles (I know how to get there as soon as I have the ranks) from an INDEX VALUE (DowJones) instead of volatility data, but that should not change the methodology.

                So, in order to get that 200-trading-day window, I would generate a new variable ID (so I do not have any gaps, but always exactly 200-day windows).

                Code:
                gen ID = _n
                tsset ID
                This is what my data looks like:
                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input double date float ID double DowJones_INDEX_VALUE
                16072   1 10409.85
                16075   2 10544.07
                16076   3 10538.66
                16077   4 10529.03
                16078   5 10592.44
                16079   6 10458.89
                16082   7 10485.18
                16083   8 10427.18
                16084   9 10538.37
                16085  10 10553.85
                16086  11 10600.51
                16090  12 10528.66
                16091  13 10623.62
                16092  14 10623.18
                16093  15 10568.29
                16096  16 10702.51
                16097  17 10609.92
                16098  18 10468.37
                16099  19 10510.49
                16100  20 10488.07
                16103  21 10499.18
                16104  22 10505.18
                16105  23 10470.74
                16106  24 10495.55
                16107  25 10593.03
                16110  26 10579.03
                16111  27 10613.85
                16112  28  10737.7
                16113  29 10694.07
                16114  30 10627.85
                16118  31 10714.88
                16119  32 10671.99
                16120  33 10664.73
                16121  34 10619.03
                16124  35 10609.62
                16125  36 10566.37
                16126  37 10601.62
                16127  38 10580.14
                16128  39 10583.92
                16131  40 10678.14
                16132  41 10591.48
                16133  42 10593.11
                16134  43    10588
                16135  44 10595.55
                16138  45 10529.48
                16139  46 10456.96
                16140  47 10296.89
                16141  48 10128.38
                16142  49 10240.08
                16145  50 10102.89
                end
                format %d date

                So referencing the code of Robert, I would adapt his code in the following way:
                1) I essentially alter the reference for the interval from date to ID.
                2) I use DowJones_INDEX_VALUE instead of "iv" as the rank-reference
                3) I dropped the by(secid) option from the rangestat command, as I do not have panel data.

                Code:
                * define Mata function to return number of obs and rank of last obs
                mata: mata clear
                
                real rowvector myrank(real matrix X)
                {
                real matrix R
                R = mm_ranks(X,1,1)
                return(rows(X), R[rows(X)])
                }
                end
                
                rangestat (myrank) DowJones_INDEX_VALUE, interval(ID -200 0) casewise
                
                * rename results using more descriptive names and compare results
                rename myrank1 number_obs
                rename myrank2 rank
                
                * set rank to missing if DowJones_INDEX_VALUE is missing
                replace rank = . if missing(DowJones_INDEX_VALUE)
                However, I always get the following error message:

                myrank(): 3499 mm_ranks() not found
                do_flex_stats(): - function returned error
                <istmt>: - function returned error

                I am sure there must be a rather easy solution for this.

                Unfortunately I have not encountered Mata before, and my skill level regarding loops is still a work in progress.
                I would be very grateful if anybody could help me with this.

                Thank you,
                Carlos



                • #9
                  As explained in #4 in this thread, the function mm_ranks() is from moremata (SSC), which you must install before you can use it. But a problem involving deciles is better tackled directly. The help for rangestat includes a worked example for quantiles.

                  See for a recent example https://www.statalist.org/forums/for...ear-to-a-graph and particularly #7 in that thread for a tweak to trap missing values appropriately.

                  If you are precise about which deciles you want, perhaps even all for cumulative probabilities 0.1(0.1)0.9, then we can suggest code if that is needed.
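
                  If plain ranks and counts are already in hand, one hypothetical way to map rank i of n to deciles 1 to 10 via the (i - 0.5) / n percentile ranks discussed earlier in the thread is this sketch (Python, illustrative only; not rangestat's built-in quantile machinery):

```python
def decile(i, n):
    """Decile 1..10 for rank i among n values, via (i - 0.5) / n."""
    p = (i - 0.5) / n
    return min(int(p * 10) + 1, 10)

# e.g. with n = 200 ranked values:
print(decile(1, 200), decile(100, 200), decile(200, 200))  # 1 5 10
```
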



                  • #10
                    Hi!

                    First of all, thank you very much for all the very helpful content over here! Very much appreciated.

                    I am currently working on a very similar topic.
                    Running the above-mentioned example works totally fine, but I need something slightly different.

                    I need the relative ranking across ALL firms for the previous year. The number of observations when running the code is perfectly fine. However, I want different rankings for different firms.

                    My current code:

                    Code:
                    mata:
                    mata clear

                    real rowvector myrank(real matrix X)
                    {
                    real matrix R
                    R = mm_ranks(X,1,1)
                    return(rows(X), R[rows(X)])
                    }
                    end
                    rangestat (myrank) cheq_deviation, interval (month_id -12 0) casewise


                    Within the current (latest) month of the event window, all ranks are the same across all firms, even though the value of cheq_deviation differs across firms.

                    Thank you for your help.

                    Best,
                    Michael

