  • Calculating percentiles in Stata

    Hi, I am using the following loop to get the 0.5%, 1%, 2%, 5%, 95%, 98%, 99% and 99.5% percentiles of the variable "Return". However, Stata does not recognise 0.5 or 99.5; I believe this is because they are not integers. Could anyone please help? I would much appreciate it, since I have been struggling with this problem for a while. I know I could get the 0.5 percentile and then use the gen command to create my dummy variables, but I am required to use a more efficient way.

    _pctile Return, percentiles( 0.5 1 2 5 95 98 99 99.5)
    return list
    local i = 1
    foreach n of numlist 0.5 1 2 5 95 98 99 99.5 {
    gen byte above`n' = Return >= `r(r`i')'
    local ++i
    }


    Kind Regards,
    Adrian

  • #2
    The most obvious problem here is that

    Code:
    above0.5
    is an illegal variable name. Try

    Code:
    local i = 1
    foreach n of numlist 0.5 1 2 5 95 98 99 99.5 {
        gen byte above`i' = Return >= r(r`i')
        local ++i
    }
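
The naming problem can be checked without generating anything. A minimal sketch, assuming only the built-in confirm names command; the candidate names are the ones discussed in this thread:

```stata
* confirm names exits with an error if an argument is not a legal Stata name;
* capture stores the return code in _rc instead of stopping the do-file
capture confirm names above0.5
display "above0.5: return code " _rc

capture confirm names above1
display "above1:   return code " _rc
```

A nonzero return code marks an illegal name (the period is the culprit here); 0 means the name is fine, which is why numbering the variables with the loop counter works.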

    • #3
      Thank you, Nick, this worked great. Regarding the percentiles: I have to use 0.5%, 1%, 2% and 5% as my lower dummies, so the comparison for them has to be '<=' rather than '>='. Is there any way to write one loop that treats these values as lower dummies and 95, 98, 99 and 99.5 as upper dummies, or should I just write two separate loops with different signs?

      Thank you for your quick response and help.

      • #4
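        This appears to be what you are asking for. Note that the indicators (you say dummies) are not disjoint: any observation at or below the 0.5% percentile is also at or below the 1%, 2% and 5% percentiles, and likewise in the upper tail.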

        Code:
        clear 
        set seed 2803 
        set obs 10000 
        gen y = rnormal()
        sort y 
        
        _pctile y, percentiles(0.5 1 2 5 95 98 99 99.5) 
        
        tokenize 0.5 1 2 5 95 98 99 99.5 
        local op <= 
        
        forval i = 1/8 { 
            if `i' == 5 local op >= 
            gen wanted`i' = y `op' r(r`i')
            label var wanted`i' "`op' ``i'' pctile"
        }
        
        d 
        su wanted* 
        
        
        . d 
        
        Contains data
          obs:        10,000                          
         vars:             9                          
        -----------------------------------------------------------------------------------------------------------------------------------
                      storage   display    value
        variable name   type    format     label      variable label
        -----------------------------------------------------------------------------------------------------------------------------------
        y               float   %9.0g                 
        wanted1         float   %9.0g                 <= 0.5 pctile
        wanted2         float   %9.0g                 <= 1 pctile
        wanted3         float   %9.0g                 <= 2 pctile
        wanted4         float   %9.0g                 <= 5 pctile
        wanted5         float   %9.0g                 >= 95 pctile
        wanted6         float   %9.0g                 >= 98 pctile
        wanted7         float   %9.0g                 >= 99 pctile
        wanted8         float   %9.0g                 >= 99.5 pctile
        -----------------------------------------------------------------------------------------------------------------------------------
        Sorted by: y
             Note: Dataset has changed since last saved.
        
        . su wanted* 
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
             wanted1 |     10,000        .005    .0705372          0          1
             wanted2 |     10,000         .01    .0995037          0          1
             wanted3 |     10,000         .02     .140007          0          1
             wanted4 |     10,000         .05    .2179558          0          1
             wanted5 |     10,000         .05    .2179558          0          1
        -------------+---------------------------------------------------------
             wanted6 |     10,000         .02     .140007          0          1
             wanted7 |     10,000         .01    .0995037          0          1
             wanted8 |     10,000        .005    .0705372          0          1
        
        .
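
The remark about the indicators not being disjoint can be checked directly. A small sketch on simulated data along the same lines as above; the variable names le1 and le2 are made up for this illustration:

```stata
clear
set seed 2803
set obs 10000
gen y = rnormal()

_pctile y, percentiles(1 2)

* indicators for the two lowest cut points
gen byte le1 = y <= r(r1)
gen byte le2 = y <= r(r2)

* every observation flagged at 1% must also be flagged at 2%
assert le2 == 1 if le1 == 1

* cross-tabulation: the cell (le1 = 1, le2 = 0) is empty
tabulate le1 le2
```

If mutually exclusive bands are wanted instead, each indicator would need a lower bound as well as an upper one.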

        • #5
          Here is another way that (I think) achieves what (I think) Original Poster wants (lots of thinking went into this sentence):

          Code:
          . sysuse auto, clear
          (1978 Automobile Data)
          
          . _pctile price, percentiles( 0.5 1 2 5 95 98 99 99.5)
          
          . local i = 1
          
          . foreach n of numlist 0.5 1 2 5 95 98 99 99.5 {
            2. local name = strtoname("cutat`n'")
            3. gen `name' = cond(`n'<50,price<=r(r`i'), price>=r(r`i'))
            4. local ++i
            5. }
          
          . des price cutat*
          
                        storage   display    value
          variable name   type    format     label      variable label
          -----------------------------------------------------------------------------------
          price           int     %8.0gc                Price
          cutat_5         float   %9.0g                 
          cutat1          float   %9.0g                 
          cutat2          float   %9.0g                 
          cutat5          float   %9.0g                 
          cutat95         float   %9.0g                 
          cutat98         float   %9.0g                 
          cutat99         float   %9.0g                 
          cutat99_5       float   %9.0g                 
          
          . summ price cutat*, sep(0)
          
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
                 price |         74    6165.257    2949.496       3291      15906
               cutat_5 |         74    .0135135    .1162476          0          1
                cutat1 |         74    .0135135    .1162476          0          1
                cutat2 |         74     .027027    .1632691          0          1
                cutat5 |         74    .0540541    .2276679          0          1
               cutat95 |         74    .0540541    .2276679          0          1
               cutat98 |         74     .027027    .1632691          0          1
               cutat99 |         74    .0135135    .1162476          0          1
             cutat99_5 |         74    .0135135    .1162476          0          1
          
          .

          • #6
            I'd suggest deleting the cross-post on Stack Overflow. It's not helping anyone.

            • #7
              Dear Nick, I have now deleted the post on Stack Overflow.
              Last edited by Adrian Cernescu; 31 Oct 2020, 12:40.

              • #8
                Originally posted by Joro Kolev
                Here is another way that (I think) achieves what (I think) Original Poster wants (lots of thinking went into this sentence):

                Joro Kolev, would you mind explaining to me how the code works? I am not very familiar with Stata and I do not understand what some of the functions do, such as "cond" and "strtoname". Also, we usually start with i = 0, but here we have to use i = 1. Why is this the case? I know it may be basic, but would you mind walking me through the code?

                Many thanks in advance.
                Last edited by Adrian Cernescu; 31 Oct 2020, 13:02.

                • #9
                  Adrian, if you understand Nick's code in #4, just go with it. My code is in no way better; it just does the same thing in a different way. I thought it would be fun because my code looks more like your initial code.

                  Otherwise my code does exactly what your initial code was trying to do, except that I use the strtoname() function to produce a valid Stata name for the variable, and then the cond() function works as follows:

                  cond(A, B, C)

                  If the condition A is true, the function returns the value of B; otherwise (that is, if A is false) it returns the value of C.
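
A minimal sketch of cond() on plain numbers, just to show which argument comes back:

```stata
* 2 < 50 is true, so cond() returns its second argument
display cond(2 < 50, 111, 222)

* 99.5 < 50 is false, so cond() returns its third argument
display cond(99.5 < 50, 111, 222)
```

In the loop, the second and third arguments are themselves comparisons (price <= r(r`i') and price >= r(r`i')), so the result is a 0/1 indicator either way.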

                  You can see more detail in the help, but basically it is your code with the name issue fixed, plus the thing you seemed to want: for the low percentiles the dummies flag values at or below the percentile, and for the high percentiles they flag values at or above it.
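
On the other two points in the question, a short sketch using the same auto data. strtoname() replaces characters that are not allowed in a name with underscores, and _pctile stores its results as r(r1), r(r2), and so on, in the order the percentiles were requested. That numbering starts at 1, which is why the counter starts at i = 1:

```stata
* the period is not allowed in a name, so it becomes an underscore
display strtoname("cutat.5")
display strtoname("cutat99.5")

sysuse auto, clear
_pctile price, percentiles(5 95)

* first requested percentile (5%) and second requested percentile (95%)
display r(r1)
display r(r2)
```

The first two lines should show cutat_5 and cutat99_5, matching the variable names in the listing in #5.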

