  • Conditions by frequency deciles

    Hi, I would appreciate help with a small issue. I am working with a dataset of student grades (variable yscore). Each exam observation includes the student's grade and details about the student. I want to write conditions that refer to the grade decile (e.g. if gradedecile==1, if gradedecile==2, etc.). This is the code I am currently using to generate a decile indicator variable:


    Code:
    . xtile decyscore = yscore, nq(10)
    
    . tab decyscore
    
             10 |
      quantiles |
      of yscore |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |    863,921       14.27       14.27
              2 |    465,787        7.69       21.96
              3 |    630,105       10.41       32.37
              4 |    629,772       10.40       42.77
              5 |    769,836       12.71       55.48
              6 |    325,334        5.37       60.86
              7 |    555,690        9.18       70.04
              8 |    713,439       11.78       81.82
              9 |    496,215        8.20       90.02
             10 |    604,557        9.98      100.00
    ------------+-----------------------------------
          Total |  6,054,656      100.00
    Why do the frequency and percentage differ across groups? I want yscore to be divided into deciles by frequency/percent, not by grade, so that each group has the same frequency. I have tried pctile too, with no success.
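
    For the conditions themselves, a minimal sketch of looping over the bins and using each bin in an if qualifier. It is self-contained, so it uses mpg from the auto dataset as a stand-in for yscore; the variable name decmpg is hypothetical.

    ```stata
    * Stand-in data: mpg plays the role of yscore (an assumption)
    sysuse auto, clear
    xtile decmpg = mpg, nq(10)

    * Loop over the ten bins, using the bin number as an if condition
    forvalues d = 1/10 {
        quietly summarize mpg if decmpg == `d'
        display "bin `d': n = " r(N) "  min = " r(min) "  max = " r(max)
    }
    ```

    With the real data, yscore and decyscore would replace mpg and decmpg.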

  • #2
    I don't follow your wish to see frequency/percent deciles rather than grade deciles; the distinction sounds confused to me.

    But I think I understand your problem, which arises from tied values. Students with the same grade must be assigned to the same bin.

    This bites almost everywhere with quantile binning unless the number of distinct outcomes greatly exceeds the number of bins.

    Here is a trivial example from the auto dataset. mpg is conventionally reported as an integer, and a range from 12 to 41 mpg in the data gives 30 distinct plausible values. In practice only 21 of them occur, and there is clumping. So xtile can't get very close to the ideal of 10% in each bin. The bin frequencies range from 3 to 10, and only 5 of the 10 bins match any optimistic expectation that they should be 7 or 8.

    Code:
    . sysuse auto, clear
    (1978 automobile data)
    
    . xtile bin = mpg, nq(10)
    
    . tab bin
    
             10 |
      quantiles |
         of mpg |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |          8       10.81       10.81
              2 |         10       13.51       24.32
              3 |          9       12.16       36.49
              4 |          8       10.81       47.30
              5 |          3        4.05       51.35
              6 |         10       13.51       64.86
              7 |          7        9.46       74.32
              8 |          5        6.76       81.08
              9 |          7        9.46       90.54
             10 |          7        9.46      100.00
    ------------+-----------------------------------
          Total |         74      100.00
    
    . table mpg bin
    
    -----------------------------------------------------------------
                  |                 10 quantiles of mpg              
                  |  1    2   3   4   5    6   7   8   9   10   Total
    --------------+--------------------------------------------------
    Mileage (mpg) |                                                  
      12          |  2                                              2
      14          |  6                                              6
      15          |       2                                         2
      16          |       4                                         4
      17          |       4                                         4
      18          |           9                                     9
      19          |               8                                 8
      20          |                   3                             3
      21          |                        5                        5
      22          |                        5                        5
      23          |                            3                    3
      24          |                            4                    4
      25          |                                5                5
      26          |                                    3            3
      28          |                                    3            3
      29          |                                    1            1
      30          |                                         2       2
      31          |                                         1       1
      34          |                                         1       1
      35          |                                         2       2
      41          |                                         1       1
      Total       |  8   10   9   8   3   10   7   5   7    7      74
    -----------------------------------------------------------------
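
    The clumping can also be checked directly before binning. The following is a minimal sketch, not part of the original log: it counts the distinct values of mpg and finds the largest group of tied values. The variable name ntied is hypothetical.

    ```stata
    sysuse auto, clear

    * Number of distinct values of mpg actually observed
    quietly tabulate mpg
    display "distinct values: " r(r)

    * Size of the largest group of tied values, as a share of the sample
    bysort mpg: generate long ntied = _N
    quietly summarize ntied
    display "largest tie: " r(max) " of " _N " observations (" %4.1f (100*r(max)/_N) "%)"
    ```

    In the table above the largest tie is mpg == 18 with 9 of 74 cars, which is already more than 10% of the sample, so the bin holding that value cannot be a 10% bin.
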
    The problem is flagged in various places, e.g.

    Section 4 in https://www.stata-journal.com/articl...article=pr0054

    Section 6 in https://www.stata-journal.com/articl...article=dm0095

    but there isn't a good remedy for very unequal bins except (often) to use the original data, which do carry more information.
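
    To show why there is no good remedy, here is a sketch of what forcing (nearly) equal bins would involve. This is an illustration only, not a recommendation: it breaks ties at random, so the result depends on the seed, and observations with identical values end up in different bins. The variable names u, eqbin and split are hypothetical.

    ```stata
    sysuse auto, clear

    * Break ties at random (assumption: arbitrary tie-breaking is acceptable)
    set seed 2718
    generate double u = runiform()
    sort mpg u

    * Assign bins by rank position rather than by value
    generate byte eqbin = ceil(10 * _n / _N)
    tabulate eqbin

    * The cost: cars with identical mpg can now sit in different bins
    bysort mpg (eqbin): generate byte split = eqbin[1] != eqbin[_N]
    tabulate mpg if split
    ```

    With 74 observations every bin then holds 7 or 8 cars, but the last table lists the mpg values whose ties were split across bins, which is exactly what xtile refuses to do.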




    • #3
      Thanks, I now understand the problem. Will settle for slightly unequal bins for the time being.
