Conditions by frequency deciles

Nitsan Machlis

Join Date: Jun 2022
Posts: 19

26 Aug 2022, 02:42

Hi, I would appreciate help with a small issue. I am working with a dataset of student grades (the yscore variable). Each exam observation includes the student's grade and details about that student. I want to write conditions that refer to the grade decile (e.g. if gradedecile==1, if gradedecile==2, etc.). This is the code I am currently using to generate a decile indicator variable:

Code:

. xtile decyscore = yscore, nq(10)

. tab decyscore

         10 |
  quantiles |
  of yscore |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |    863,921       14.27       14.27
          2 |    465,787        7.69       21.96
          3 |    630,105       10.41       32.37
          4 |    629,772       10.40       42.77
          5 |    769,836       12.71       55.48
          6 |    325,334        5.37       60.86
          7 |    555,690        9.18       70.04
          8 |    713,439       11.78       81.82
          9 |    496,215        8.20       90.02
         10 |    604,557        9.98      100.00
------------+-----------------------------------
      Total |  6,054,656      100.00

Why do the frequency and percentage differ across groups? I want yscore to be divided by frequency/percent deciles, and not by grade deciles, so the frequency of each group should be identical. I have tried pctile too, with no success.


Nick Cox

Join Date: Mar 2014
Posts: 35685

26 Aug 2022, 03:09

I don't follow your wish to see frequency/percent deciles and not grade deciles; that sounds confused to me.

But I think I understand your problem, which arises from tied values. Students with the same grade must be assigned to the same bin.

This bites almost everywhere with quantile binning unless the number of distinct outcomes greatly exceeds the number of bins.

Here is a trivial example from the auto dataset. mpg is conventionally reported as an integer, so the range from 12 to 41 mpg in the data allows 30 distinct values. In practice only 21 of them occur, and there is clumping. So xtile can't get very close to the ideal of 10% in each bin. The frequencies range from 3 to 10, and only 5 of the 10 bins match any optimistic expectation that they should hold 7 or 8 observations.

Code:

. sysuse auto, clear
(1978 automobile data)

. xtile bin = mpg, nq(10)

. tab bin

         10 |
  quantiles |
     of mpg |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          8       10.81       10.81
          2 |         10       13.51       24.32
          3 |          9       12.16       36.49
          4 |          8       10.81       47.30
          5 |          3        4.05       51.35
          6 |         10       13.51       64.86
          7 |          7        9.46       74.32
          8 |          5        6.76       81.08
          9 |          7        9.46       90.54
         10 |          7        9.46      100.00
------------+-----------------------------------
      Total |         74      100.00

. table mpg bin

-----------------------------------------------------------------
              |                 10 quantiles of mpg              
              |  1    2   3   4   5    6   7   8   9   10   Total
--------------+--------------------------------------------------
Mileage (mpg) |                                                  
  12          |  2                                              2
  14          |  6                                              6
  15          |       2                                         2
  16          |       4                                         4
  17          |       4                                         4
  18          |           9                                     9
  19          |               8                                 8
  20          |                   3                             3
  21          |                        5                        5
  22          |                        5                        5
  23          |                            3                    3
  24          |                            4                    4
  25          |                                5                5
  26          |                                    3            3
  28          |                                    3            3
  29          |                                    1            1
  30          |                                         2       2
  31          |                                         1       1
  34          |                                         1       1
  35          |                                         2       2
  41          |                                         1       1
  Total       |  8   10   9   8   3   10   7   5   7    7      74
-----------------------------------------------------------------
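The tie rule can be checked directly. The following is a minimal sketch on the same auto data, reusing the variable name bin from the session above; it counts the distinct values of mpg and then confirms that no single mpg value is split across bins.

```stata
sysuse auto, clear
xtile bin = mpg, nq(10)

* Count distinct values of mpg: one-way tabulate leaves the number of rows in r(r)
quietly tabulate mpg
display "distinct values of mpg: " r(r)

* Within each mpg value, the smallest and largest bin must coincide;
* assert stops with an error if any tied value straddles two bins
bysort mpg (bin): assert bin[1] == bin[_N]
```

With only 21 distinct values to distribute over 10 bins, and each value moving as a block, equal bin sizes are out of reach.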

The problem is flagged in various places, e.g.

Section 4 in https://www.stata-journal.com/articl...article=pr0054

Section 6 in https://www.stata-journal.com/articl...article=dm0095

but there isn't a good remedy for very unequal bins except (often) to use the original data, which do carry more information.
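For readers who do need bins of (almost) identical size, the only route is to split tied values across bins, which conflicts with the rule stated earlier that equal grades belong in the same bin. The sketch below is not a recommended remedy, only an illustration of what that trade-off looks like. The variable names u and bin_eq and the seed are arbitrary assumptions, and the code presumes mpg has no missing values.

```stata
sysuse auto, clear

* Random tie-breaker; the seed is arbitrary and only makes the result reproducible
set seed 2022
generate double u = runiform()

* Order by mpg, breaking ties at random, then cut the ranks into 10 groups
sort mpg u
generate byte bin_eq = ceil(10 * _n / _N)

tabulate bin_eq
```

With 74 observations every bin then holds 7 or 8 cars, but cars with the same mpg can land in different bins purely by chance, so any analysis by bin_eq depends on the seed.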


Nitsan Machlis

Join Date: Jun 2022

Posts: 19

29 Aug 2022, 23:30

Thanks, I now understand the problem. Will settle for slightly unequal bins for the time being.
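With the decile indicator in place, the conditions from the original question can be written directly against it. A minimal sketch, using the auto data and the variable bin from the earlier reply as stand-ins for yscore and decyscore, and price as an arbitrary outcome:

```stata
sysuse auto, clear
xtile bin = mpg, nq(10)

* A single condition on one decile
summarize price if bin == 1

* The same condition applied to each decile in turn
forvalues d = 1/10 {
    display as text "decile `d'"
    summarize price if bin == `d'
}
```

The number of observations reported for each decile will differ, for the reason explained above.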