Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating Percentiles

    Hi,

    I have 3,691 observations and I want to create income percentiles for them. However, whenever I use: pctile percentilechildwage = WSEARNHOURLY, nquantiles(10), it creates percentiles for the first 9 observations. Should I change it nquantiles to 3,692?

  • #2
    Not so. They are calculated for all available observations; the results are placed in the first 9.

    Comment


    • #3
      To be honest, I have no idea what -pctile- is doing either.

      Code:
      . sysuse auto, clear
      (1978 Automobile Data)
      
      . pctile percs = price, nquantiles(10)
      
      . xtile perccategories = price, nquantiles(10)
      
      . sort price
      
      . list price percs perccategories, sep(0)
      
           +----------------------------+
           |  price    percs   percca~s |
           |----------------------------|
        1. |  3,291        .          1 |
        2. |  3,299        .          1 |
        3. |  3,667        .          1 |
        4. |  3,748        .          1 |
        5. |  3,798        .          1 |
        6. |  3,799     4425          1 |
        7. |  3,829        .          1 |
        8. |  3,895        .          1 |
        9. |  3,955        .          2 |
       10. |  3,984        .          2 |
       11. |  3,995        .          2 |
       12. |  4,010        .          2 |
       13. |  4,060        .          2 |
       14. |  4,082        .          2 |
       15. |  4,099     3895          2 |
       16. |  4,172        .          3 |
       17. |  4,181        .          3 |
       18. |  4,187        .          3 |
       19. |  4,195        .          3 |
       20. |  4,296        .          3 |
       21. |  4,389        .          3 |
       22. |  4,424        .          3 |
       23. |  4,425        .          3 |
       24. |  4,453     6165          4 |
       25. |  4,482        .          4 |
       26. |  4,499        .          4 |
       27. |  4,504        .          4 |
       28. |  4,516        .          4 |
       29. |  4,589        .          4 |
       30. |  4,647        .          4 |
       31. |  4,697        .          5 |
       32. |  4,723        .          5 |
       33. |  4,733        .          5 |
       34. |  4,749     4099          5 |
       35. |  4,816     4647          5 |
       36. |  4,890        .          5 |
       37. |  4,934        .          5 |
       38. |  5,079        .          6 |
       39. |  5,104        .          6 |
       40. |  5,172        .          6 |
       41. |  5,189     7827          6 |
       42. |  5,222        .          6 |
       43. |  5,379        .          6 |
       44. |  5,397        .          6 |
       45. |  5,705        .          6 |
       46. |  5,719        .          7 |
       47. |  5,788     5705          7 |
       48. |  5,798        .          7 |
       49. |  5,799        .          7 |
       50. |  5,886        .          7 |
       51. |  5,899        .          7 |
       52. |  6,165        .          7 |
       53. |  6,229        .          8 |
       54. |  6,295        .          8 |
       55. |  6,303        .          8 |
       56. |  6,342        .          8 |
       57. |  6,486        .          8 |
       58. |  6,850        .          8 |
       59. |  7,140        .          8 |
       60. |  7,827   5006.5          8 |
       61. |  8,129        .          9 |
       62. |  8,814        .          9 |
       63. |  9,690        .          9 |
       64. |  9,735        .          9 |
       65. | 10,371        .          9 |
       66. | 10,372    11385          9 |
       67. | 11,385        .          9 |
       68. | 11,497        .         10 |
       69. | 11,995        .         10 |
       70. | 12,990        .         10 |
       71. | 13,466        .         10 |
       72. | 13,594        .         10 |
       73. | 14,500        .         10 |
       74. | 15,906        .         10 |
           +----------------------------+
      
      . _pctile price, nquantiles(10)
      
      . return list
      
      scalars:
                       r(r1) =  3895
                       r(r2) =  4099
                       r(r3) =  4425
                       r(r4) =  4647
                       r(r5) =  5006.5
                       r(r6) =  5705
                       r(r7) =  6165
                       r(r8) =  7827
                       r(r9) =  11385

      Comment


      • #4
        As Joro suggest, the xtile command is more likely to be what is wanted in post #1. The pctile command places the 9 cutpoints between the 10 categories in whatever observations happen to be the first 9. Below I arrange for the first 9 observations to be the final observation in each of the first 9 categories.
        Code:
        . sysuse auto, clear
        (1978 Automobile Data)
        
        . xtile perccategories = price, nquantiles(10)
        
        . generate gnxl = 1
        
        . bysort perccategories (price): replace gnxl = 0 if _n==_N
        (10 real changes made)
        
        . sort gnxl price    
        
        . pctile percs = price, nquantiles(10)
        
        . sort price
        
        . list price perccategories percs, sepby(perccategories) abbreviate(16)
        
             +----------------------------------+
             |  price   perccategories    percs |
             |----------------------------------|
          1. |  3,291                1        . |
          2. |  3,299                1        . |
          3. |  3,667                1        . |
          4. |  3,748                1        . |
          5. |  3,798                1        . |
          6. |  3,799                1        . |
          7. |  3,829                1        . |
          8. |  3,895                1     3895 |
             |----------------------------------|
          9. |  3,955                2        . |
         10. |  3,984                2        . |
         11. |  3,995                2        . |
         12. |  4,010                2        . |
         13. |  4,060                2        . |
         14. |  4,082                2        . |
         15. |  4,099                2     4099 |
             |----------------------------------|
         16. |  4,172                3        . |
         17. |  4,181                3        . |
         18. |  4,187                3        . |
         19. |  4,195                3        . |
         20. |  4,296                3        . |
         21. |  4,389                3        . |
         22. |  4,424                3        . |
         23. |  4,425                3     4425 |
             |----------------------------------|
         24. |  4,453                4        . |
         25. |  4,482                4        . |
         26. |  4,499                4        . |
         27. |  4,504                4        . |
         28. |  4,516                4        . |
         29. |  4,589                4        . |
         30. |  4,647                4     4647 |
             |----------------------------------|
         31. |  4,697                5        . |
         32. |  4,723                5        . |
         33. |  4,733                5        . |
         34. |  4,749                5        . |
         35. |  4,816                5        . |
         36. |  4,890                5        . |
         37. |  4,934                5   5006.5 |
             |----------------------------------|
         38. |  5,079                6        . |
         39. |  5,104                6        . |
         40. |  5,172                6        . |
         41. |  5,189                6        . |
         42. |  5,222                6        . |
         43. |  5,379                6        . |
         44. |  5,397                6        . |
         45. |  5,705                6     5705 |
             |----------------------------------|
         46. |  5,719                7        . |
         47. |  5,788                7        . |
         48. |  5,798                7        . |
         49. |  5,799                7        . |
         50. |  5,886                7        . |
         51. |  5,899                7        . |
         52. |  6,165                7     6165 |
             |----------------------------------|
         53. |  6,229                8        . |
         54. |  6,295                8        . |
         55. |  6,303                8        . |
         56. |  6,342                8        . |
         57. |  6,486                8        . |
         58. |  6,850                8        . |
         59. |  7,140                8        . |
         60. |  7,827                8     7827 |
             |----------------------------------|
         61. |  8,129                9        . |
         62. |  8,814                9        . |
         63. |  9,690                9        . |
         64. |  9,735                9        . |
         65. | 10,371                9        . |
         66. | 10,372                9        . |
         67. | 11,385                9    11385 |
             |----------------------------------|
         68. | 11,497               10        . |
         69. | 11,995               10        . |
         70. | 12,990               10        . |
         71. | 13,466               10        . |
         72. | 13,594               10        . |
         73. | 14,500               10        . |
         74. | 15,906               10        . |
             +----------------------------------+

        Comment


        • #5
          I use -xtile- and -_pctile- a lot, and I have never used -pctile- before. To me the behaviour of -pctile- does not make any sense (Why would one place the cutoff points in whichever observations happen to come first in the current sort order? Why would one leave the rest of the observations missing?). I (almost) expected -pctile- to do what William showed above, except for that I did not expect it to generate missings.

          I thought this is what -pctile- is doing:

          Code:
          . xtile perccategories = price, nquantiles(10)
          
          . bysort perccategories (price): gen thoughtperc = price[_N]
          
          . list price perccategories thoughtperc, sepby(perccategories) abbreviate(16)
          
               +---------------------------------------+
               |  price   perccategories   thoughtperc |
               |---------------------------------------|
            1. |  3,291                1          3895 |
            2. |  3,299                1          3895 |
            3. |  3,667                1          3895 |
            4. |  3,748                1          3895 |
            5. |  3,798                1          3895 |
            6. |  3,799                1          3895 |
            7. |  3,829                1          3895 |
            8. |  3,895                1          3895 |
               |---------------------------------------|
            9. |  3,955                2          4099 |
           10. |  3,984                2          4099 |
           11. |  3,995                2          4099 |
           12. |  4,010                2          4099 |
           13. |  4,060                2          4099 |
           14. |  4,082                2          4099 |
           15. |  4,099                2          4099 |
               |---------------------------------------|
           16. |  4,172                3          4425 |
           17. |  4,181                3          4425 |
           18. |  4,187                3          4425 |
           19. |  4,195                3          4425 |
           20. |  4,296                3          4425 |
           21. |  4,389                3          4425 |
           22. |  4,424                3          4425 |
           23. |  4,425                3          4425 |
               |---------------------------------------|
           24. |  4,453                4          4647 |
           25. |  4,482                4          4647 |
           26. |  4,499                4          4647 |
           27. |  4,504                4          4647 |
           28. |  4,516                4          4647 |
           29. |  4,589                4          4647 |
           30. |  4,647                4          4647 |
               |---------------------------------------|
           31. |  4,697                5          4934 |
           32. |  4,723                5          4934 |
           33. |  4,733                5          4934 |
           34. |  4,749                5          4934 |
           35. |  4,816                5          4934 |
           36. |  4,890                5          4934 |
           37. |  4,934                5          4934 |
               |---------------------------------------|
           38. |  5,079                6          5705 |
           39. |  5,104                6          5705 |
           40. |  5,172                6          5705 |
           41. |  5,189                6          5705 |
           42. |  5,222                6          5705 |
           43. |  5,379                6          5705 |
           44. |  5,397                6          5705 |
           45. |  5,705                6          5705 |
               |---------------------------------------|
           46. |  5,719                7          6165 |
           47. |  5,788                7          6165 |
           48. |  5,798                7          6165 |
           49. |  5,799                7          6165 |
           50. |  5,886                7          6165 |
           51. |  5,899                7          6165 |
           52. |  6,165                7          6165 |
               |---------------------------------------|
           53. |  6,229                8          7827 |
           54. |  6,295                8          7827 |
           55. |  6,303                8          7827 |
           56. |  6,342                8          7827 |
           57. |  6,486                8          7827 |
           58. |  6,850                8          7827 |
           59. |  7,140                8          7827 |
           60. |  7,827                8          7827 |
               |---------------------------------------|
           61. |  8,129                9         11385 |
           62. |  8,814                9         11385 |
           63. |  9,690                9         11385 |
           64. |  9,735                9         11385 |
           65. | 10,371                9         11385 |
           66. | 10,372                9         11385 |
           67. | 11,385                9         11385 |
               |---------------------------------------|
           68. | 11,497               10         15906 |
           69. | 11,995               10         15906 |
           70. | 12,990               10         15906 |
           71. | 13,466               10         15906 |
           72. | 13,594               10         15906 |
           73. | 14,500               10         15906 |
           74. | 15,906               10         15906 |
               +---------------------------------------+

          Comment


          • #6
            pctile is easy explained, despite a negative comment that its results may not be what you want, to which the riposte is that you are not obliged to use it. xtile which bins and _pctile which leaves behind r-class results each has a more obvious rationale.

            pctile -- given a variable and a number k of quantile bins -- say 10 for deciles -- yields k - 1 quantiles and puts them in a variable. If putting them in the first k - 1 observations seems arbitrary then sure, but where else would they go? In one sense each quantile is associated with all the observations it summarizes so as logical might be k - 1 new variables, each containing one of the quantiles, but that would be wasteful of memory and in fact not nearly so useful.

            As for what use is the variable, let's turn that round: if you can't think of a use for the variable, then you have no call for the command. As for why anyone would want it, some uses include listing or tabulation on the one hand and graphing on the other. Putting the quantiles consistently in the same place can be useful -- indeed is virtually essential -- for comparison of different groups or variables whenever that is wanted.


            pctile defies a very general but not quite universal principle of Stata, which can be called alignment within observations, namely that what goes in an observation belongs together. It is not quite universal, as you can defy it if you want. As every spreadsheet user learns quickly, you are in charge and can put anything you like anywhere you like if you think that would be useful or it amuses you. Stuff in cell G3 need have no relation to A3 or B3 if you have call to do something different.

            pcile clearly also predates an idea long wanted but only implemented in Stata 16, that you can have multiple frames. It also does not do something that might be wanted, put the results in a vector (matrix with one row or column) but that is one step away.

            Comment


            • #7
              It is certainly educational for me to know that there are applications of the current behaviour of -pctile-. I do not know of these applications, and hence I think that the percentiles that are generated should go like I showed in #5 above.

              I agree that a better design would be if -pctile- or -_pctile- leaves behind a matrix containing the percentiles calculated. Then whoever has applications for the current behaviour of -pctile- can achieve it with one step more converting the matrix into a variable. Whoever does not, still is better off because matrices of numbers are a bit easier to manipulate without loops.

              Finally unrelated to this thread, but as we have started to discuss these commands: Major deficiency that I perceive in these commands -xtile- and -pctile- is that they are not byable.

              Overall Stata Corp might want to rethink the behaviour of -pctile- and the issue of non-byability.

              Comment


              • #8
                It is, indeed, inconvenient that -xtile- is not byable, but there is an -egen- -xtile()- function that is byable. It's not an official Stata -egen- function, but it is available from SSC and, if memory serves, it was written by Nick Cox.

                Another way around this limitation of -xtile-, one which I use fairly often, is to just wrap your -xtile- command in a program and use -runby-. The latter is also available from SSC and is written by Robert Picard and me.

                Comment


                • #9
                  xtile() in egenmore from SSC was written by Ulrich Kohler. There are at least two community-contributed commands on SSC that do the same thing, by Matthieu Gomez and Attaullah Shah. (Michael Stepner's fastxtile I think doesn't support a by() option or by: prefix.)

                  I can't see any reason why pctile should not acquire matrix output as an optional extra. I am not in support of rewriting it to do something different if that is what is being suggested.

                  byable sounds good, but then a command needs to face the awkward practicality that users may ask for so many quantiles when they have in fact have fewer observations in which to put them for a particular group. In principle that could happen already but it's much more likely to happen with byable output.

                  For completeness let us add that
                  egen has a pctile() function, which is very useful.

                  Quantile binning is over-used any way, but that is a different story!

                  Comment


                  • #10
                    My view is that basic calculations should be incorporated in the Stata core, and are best taken care of by the Stata Corp team, because they should know best how to write fast, efficient code.

                    To convince oneself of the issues surrounding the lack of byability of -xtile- one just needs to Google search:
                    site:https://www.statalist.org xtile by
                    and see the abundance of threads on the topic.

                    As recent as 19 Jul 2020, the issue cropped up here
                    https://www.statalist.org/forums/for...191-egen-xtile

                    There are user written solutions--some of them faster, some of them slower, some of them more user friendly, some of them less--and it is easy enough for one to write his own loop. My point is that as basic task as calculating percentiles should be well thought through and well incorporated in the Stata core.

                    Comment


                    • #11
                      In my field of research, I often need to create portfolios that are based on the classification of companies on various characteristics. These portfolios are re-created / re-balanced in each period such as months or years. I felt almost handicapped when came to know that xtile is not byable. It took me almost a year to develop a byable version (astile), most of the time being taken by making it faster. The program is available on the SSC. and its byable version is extremely fast.
                      Code:
                      ssc install astile
                      The primary reason behind this was Stata Corp.'s reluctance to add features / improve the speed of its older programs. I think Stata Corp. is devoting more resources and time to big projects, such as the integration of Python in Stata 16, or maybe enhancing the speed of the SEM, etc.
                      Regards
                      --------------------------------------------------
                      Attaullah Shah, PhD.
                      Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
                      FinTechProfessor.com
                      https://asdocx.com
                      Check out my asdoc program, which sends outputs to MS Word.
                      For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

                      Comment


                      • #12
                        I agree with #10. official support for xtile supporting operations by: (or by()) is overdue. Then community-contributed efforts can be forgotten about!

                        Comment


                        • #13
                          If you are looking to generate a variable containing the percentile score for each observation, you would first want to generate the z-score of each observation, then use the normal function to find the cumulative distribution.

                          net install zscore.pkg
                          zscore WSEARNHOURLY // generates a variable containing the z-scores for each observation called z_WSEARNHOURLY

                          gen p_wsearnhourly = normal(z_WSEARNHOURLY)

                          Comment


                          • #14
                            #13 should never be used unless you can explain what it assumes. It's a completely different approach assuming exact normality of distribution.

                            Comment

                            Working...
                            X