  • Combining groups of percentiles in collapse

    Hi guys,

    I'm trying to collapse a large dataset by year, gender, and economic sector, but instead of the usual percentile groups I want to use quintiles or deciles.

    I know that xtile is a popular command, but it doesn't work with by or bysort, and I'm splitting things by three dimensions. I outline an example using the union dataset below.

    Code:
    webuse union, clear
    tab1 year union black age

    I know how to make this work for a given percentile, so for the tenth percentile I could write this:

    Code:
    collapse ///
        (p10) age (count) N=age, ///
        by(year union black)
    The only problem is that I want to scale up from percentiles to deciles or quintiles, with something that would look roughly like this:

    Code:
    collapse ///
        (p1-p10) age (count) N=age, ///
        by(year union black)
    Does anyone have any ideas? Any help would be appreciated.

  • #2
    I don't understand what you are trying to do. But if you install -egenmore- from SSC, it contains an -egen- function -xtile()- that you can use with -by:-.
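
    A minimal sketch of that suggestion, assuming egenmore has been installed with ssc install egenmore. The names age_decile and mean_age are made up for illustration, and reading post #1 as a request for decile bins (rather than decile cutoffs) is itself an assumption.

    ```stata
    * Assumes egenmore is installed (ssc install egenmore)
    webuse union, clear

    * age_decile (hypothetical name): decile bin of age, computed
    * separately within each year-union-black cell
    egen age_decile = xtile(age), by(year union black) nquantiles(10)

    * One observation per cell and decile bin, holding the bin's
    * mean age and its number of observations
    collapse (mean) mean_age=age (count) N=age, by(year union black age_decile)
    ```

    Because age takes only a few distinct integer values in this dataset, ties will probably leave some bins empty within a cell, so fewer than ten bins per cell is to be expected.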


    • #3
      This example will hopefully start you in a useful direction.
      Code:
      . webuse union
      (NLS Women 14-24 in 1968)
      
      . 
      . local args 
      
      . foreach p in min p10 p20 p30 p40 p50 p60 p70 p80 p90 max {
        2.         local args `args' (`p') age_`p'=age
        3. }
      
      . 
      . collapse (count) N=age `args' , by(year union black)
      
      . list year union black N age_min age_p50 age_max if year==70
      
           +----------------------------------------------------------+
           | year   union   black     N   age_min   age_p50   age_max |
           |----------------------------------------------------------|
        1. |   70       0       0   993        16        22        28 |
        2. |   70       0       1   317        16        22        28 |
        3. |   70       1       0   235        17        22        27 |
        4. |   70       1       1   113        17        22        27 |
           +----------------------------------------------------------+


      • #4
        Accidental duplicate of post #3


        • #5
          Thanks Clyde, I will check it out.

          Thanks William. Looking at the list output, am I right in thinking that age_p50 is the 50th percentile of age for each combination of year, union, and black? The code is really clever and I will definitely use it in the future, but I'm looking for a way to collapse the data by middle decile, rather than by percentile.


          • #6
            Like Clyde, I really don't understand what you seek. The collapse command you show in post #1 as "for the tenth percentile I could write this"
            Code:
            collapse ///
                (p10) age (count) N=age, ///
                by(year union black)
            creates in the collapsed dataset one observation for each combination of year, union, and black, each containing the 10th percentile (upper cutoff of first decile) of the variable age, and the count of the number of observations in that combination of year, union, and black. My code generalizes this to produce all nine cutoffs, and the minimum and the maximum.


            • #7
              Another solution would be something along the lines of:

              Code:
              . webuse union , clear
              (NLS Women 14-24 in 1968)
              
              . for num 10(10)90: egen pctileX = pctile(age), p(X) by(year union black)
              
              ->  egen pctile10 = pctile(age), p(10) by(year union black)
              
              ->  egen pctile20 = pctile(age), p(20) by(year union black)
              
              ->  egen pctile30 = pctile(age), p(30) by(year union black)
              
              ->  egen pctile40 = pctile(age), p(40) by(year union black)
              
              ->  egen pctile50 = pctile(age), p(50) by(year union black)
              
              ->  egen pctile60 = pctile(age), p(60) by(year union black)
              
              ->  egen pctile70 = pctile(age), p(70) by(year union black)
              
              ->  egen pctile80 = pctile(age), p(80) by(year union black)
              
              ->  egen pctile90 = pctile(age), p(90) by(year union black)
              and then if you want to collapse the data:

              Code:
              . collapse (mean) pctile10-pctile90, by(year union black)


              • #8
                What's biting here -- or to change the metaphor, clouding the discussion -- is the ambiguity of *ile terminology. Historically, which means 19th and 20th century, the *ile terms all referred to levels of variables defined by the fraction or probability or percent less than the corresponding values. The most complete list of such terms I know is at

                https://stats.stackexchange.com/ques.../235334#235334

                although further contributions would be received joyfully.

                Then at some point the terminology was variously extended, transferred or muddied, depending on how clear and careful authors were, so that the *iles referred to the intervals, classes or bins delimited by such levels. For example, the three quartiles are

                lower or first quartile, median, upper or third quartile

                which define four intervals, according to whether values are less than, equal to or more than each level. (Whether the inequalities run <= or < we can leave on one side.)

                Some people have tried to use complementary terminology, such as the first quarter being those values less than or equal to the first quartile, but such carefulness may be doomed so long as others more often publish using looser terminology. Alternatively, a more optimistic view is that any ambiguity rarely persists for long.

                In this case, #1 refers to calculation of the 10th percentile. If collapse allowed a syntax such as (p1-p10), it could only refer to simultaneous calculation of percentiles, and as such it has nothing to do with binning unless those percentiles are used in further calculations.

                Otherwise put, using xtile or the related xtile() from egenmore (SSC), as discussed in #1 and #2, is not the same as calculating percentiles, as discussed in #1, #3 and #7.
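
                As a rough illustration of that distinction (a sketch, not part of the original post), the official commands _pctile and xtile can be run side by side on the same variable. The name age_bin is made up for illustration, and the whole sample is used rather than the year-union-black cells.

                ```stata
                webuse union, clear

                * Levels: the nine decile cutoffs of age, returned as
                * the scalars r(r1), ..., r(r9)
                _pctile age, nquantiles(10)
                display "first decile cutoff = " r(r1) ", ninth = " r(r9)

                * Bins: age_bin (hypothetical name) records which of the
                * ten intervals delimited by those cutoffs each observation falls in
                xtile age_bin = age, nquantiles(10)
                tabulate age_bin
                ```

                The first command yields values of age; the second yields a categorical variable running from 1 to 10. Only the second is a binning.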


                • #9
                  Cross-posted at https://www.reddit.com/r/stata/comme...s_in_collapse/
