
  • What does 'egen newvar=cut(var), group(#)' mean

    Dear Stata users,

    With the -egen- command, we can run -egen newvar=cut(var), group(#)- to generate a new categorical variable. According to the help, the option group(#) specifies the number of equal frequency grouping intervals to be used in the absence of breaks. In practice, however, this does not always give equal-frequency groups. I would also like to know the difference between -egen, cut()- and -xtile-. Thank you very much.

    Code:
    . sysuse auto
    (1978 Automobile Data)
    
    . egen price2=cut(price), group(5)
    
    . tabulate price2
    
         price2 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         14       18.92       18.92
              1 |         15       20.27       39.19
              2 |         15       20.27       59.46
              3 |         15       20.27       79.73
              4 |         15       20.27      100.00
    ------------+-----------------------------------
          Total |         74      100.00
    
    . xtile price3=price, n(5)
    
    . tabulate price3
    
    5 quantiles |
       of price |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |         15       20.27       20.27
              2 |         15       20.27       40.54
              3 |         15       20.27       60.81
              4 |         15       20.27       81.08
              5 |         14       18.92      100.00
    ------------+-----------------------------------
          Total |         74      100.00
    
    . egen mpg2=cut(mpg), group(5)
    
    . tabulate mpg2
    
           mpg2 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         14       18.92       18.92
              1 |         13       17.57       36.49
              2 |         16       21.62       58.11
              3 |         12       16.22       74.32
              4 |         19       25.68      100.00
    ------------+-----------------------------------
          Total |         74      100.00
    
    . xtile mpg3=mpg, n(5)
    
    . tabulate mpg3
    
    5 quantiles |
         of mpg |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |         18       24.32       24.32
              2 |         17       22.97       47.30
              3 |         13       17.57       64.86
              4 |         12       16.22       81.08
              5 |         14       18.92      100.00
    ------------+-----------------------------------
          Total |         74      100.00
    
    .

  • #2
    At a general, and not very useful, level, the difference between the two commands is the algorithm that they use to allocate observations to groups when there are conflicts.

    I have worked extensively with -xtile-, and the Stata manual contains a description of the algorithm (incomplete, in my view). However, the -xtile- algorithm is well accepted, and in circles outside the Stata community it is referred to as Hyndman and Fan algorithm 2 (or maybe 7, I cannot recall; the point is that it is an accepted algorithm for calculating quantiles).
    Hyndman, Rob J., and Yanan Fan. "Sample quantiles in statistical packages." The American Statistician 50, no. 4 (1996): 361-365.

    I do not think that the algorithm of -egen, cut- is documented anywhere. Therefore to figure out what it does, you need to look at the code itself.

    In short I would say just use -xtile- because we know what it does.
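To see which cut points -xtile- works from, one can compute the quantiles directly. This is only a sketch, assuming the auto dataset from the original post; -_pctile- leaves the cut points in r(r1), r(r2), and so on, and the variable name price_q is made up for illustration.

```stata
sysuse auto, clear

* Quintile cut points of price, left behind in r(r1) ... r(r4)
_pctile price, nquantiles(5)
local c1 = r(r1)                      // save before xtile overwrites r()
display "first cut point: `c1'"

* xtile puts values at or below the first cut point into bin 1,
* so the two counts below should match
xtile price_q = price, nquantiles(5)
count if price <= `c1'
count if price_q == 1
```

Values exactly equal to a cut point go into the lower bin under this rule, which is one place where another command may choose differently.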



    • #3
      There is also another mysterious function, since discontinued:

      Code:
      generate groupclassification = group(numberofgroups)
      that splits your data into groups of roughly equal size. You can find attached a Stata Tip I submitted to the Stata Journal in 2007 elaborating on "Dividing ordered data into equal-sized subsets". The editors had a point in rejecting the tip, because nobody knows what the -group- function does. What I do know is that it is as fast as lightning.



      • #4
        Thank you very much, Joro Kolev; your reply is a very helpful guide.
        By the way, there is a community-contributed command named -quantiles- (SSC) that categorizes varname by its quantiles and, what's more, can yield equal-sized categories.
        Last edited by Chen Samulsion; 18 Sep 2021, 02:21.



        • #5
          The claim that "nobody knows what the -group- function does" is exaggerated. The post at https://www.stata.com/statalist/arch.../msg00406.html made comments and referenced earlier discussions which, to my understanding, still apply. Otherwise I agree strongly with Joro Kolev:

          "I do not think that the algorithm of -egen, cut- is documented anywhere. Therefore to figure out what it does, you need to look at the code itself."

          On quantile binning, let's back up. Suppose you want quintile bins (5 bins with equal frequencies in each) and

          1. Your sample size is a multiple of 5 (meaning, an exact multiple).

          2. All values of the variable to be binned are non-missing and distinct.

          Then every competent binning recipe will give the same results. This is also true "for any value of 5". These are the only circumstances in which a claim to produce equal-sized categories can be correct.
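As a minimal sketch of conditions 1 and 2, the toy data below have 100 distinct, non-missing values, so 5 bins of 20 are possible. The dataset and the variable names x, bin_cut and bin_xt are made up for illustration.

```stata
clear
set obs 100
generate x = _n                       // 100 distinct, non-missing values

egen bin_cut = cut(x), group(5)       // codes run 0, 1, ..., 4
xtile bin_xt = x, nquantiles(5)       // codes run 1, 2, ..., 5

* If the two recipes agree, only the diagonal cells are populated
tabulate bin_cut bin_xt
```

The two commands label their bins differently (from 0 and from 1), so agreement here means bin_cut + 1 equals bin_xt in every observation.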

          In practice: what can cause complications?

          a. Missing values. Usually you will want missing values to be ignored. Exceptions that occur to me: you may have reason to believe that all missing values represent extremely high values that should belong in the highest bin, or contrariwise extremely small values that should all belong in the lowest bin. So, you need code that does that, although I suspect you'll need to write it yourself. Henceforth, let's say "valid" for observations that should be binned, including these exceptions.

          b. The number of valid observations may not be a multiple of the number of bins.

          c. Ties. Usually the rule is that observations with the same value of the variable to be binned belong in the same bin, regardless of the unequal frequencies that may ensue. If you are prepared (think you have good reason) to put some 42s in one bin and other 42s in another bin, you're playing a different game. You should still worry mightily about the reproducibility of what you are doing with respect to other variables in the dataset.

          c'. The problem of ties bites hardest when the number of distinct values is less than the number of bins requested.
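Complications a and c can be looked at with the auto dataset from the original post. This is only a sketch: the rule "missing counts as higher than any observed value" is a hypothetical choice, and the variable names rep_q, rep78_hi, rep_q_hi and mpg_q are made up for illustration.

```stata
sysuse auto, clear

* (a) rep78 has missing values; xtile leaves those observations unbinned
xtile rep_q = rep78, nquantiles(3)
count if missing(rep78) & missing(rep_q)

* Hypothetical rule: treat missing as higher than any observed value,
* so that those observations count as valid and are binned with the rest
summarize rep78, meanonly
generate rep78_hi = cond(missing(rep78), r(max) + 1, rep78)
xtile rep_q_hi = rep78_hi, nquantiles(3)
tabulate rep_q_hi, missing

* (c) ties in mpg: minimum and maximum per bin show whether any
* value of mpg has been split across bins
xtile mpg_q = mpg, nquantiles(5)
tabstat mpg, by(mpg_q) statistics(min max n)
```

If the ranges reported by -tabstat- do not overlap, every observation with a given mpg sits in the same bin, and the bin frequencies follow from that rather than from any target of equal sizes.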

          a b c are obvious enough when spelled out, but there's more.

          d. What are the rules when a value is exactly equal to a bin boundary? I played a little with egen, cut() when it was folded into official Stata, but gave up on it because you had to look at the code to see what it did at boundaries. I didn't want the burden of doing that, or to transfer the burden of ambiguity to readers or users of my code. (There were other problems, which may or may not have been fixed or better documented since it was introduced.)

          e. Puzzling though it may seem, the results of quantile binning can vary according to whether the algorithm starts with low values or starts with high values.
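One way to probe point e is to bin a variable and its negation with the same command and compare. This is only a sketch, assuming the auto dataset from the original post; the variable names q_low, negmpg, q_neg and q_high are made up for illustration.

```stata
sysuse auto, clear

* Bins built from the low end
xtile q_low = mpg, nquantiles(5)

* Bins built from the high end: bin the negated variable, then relabel
generate negmpg = -mpg
xtile q_neg = negmpg, nquantiles(5)
generate q_high = 6 - q_neg           // 1 = lowest mpg, 5 = highest

* Off-diagonal cells mark observations whose bin depends on direction
tabulate q_low q_high
```

Any off-diagonal entries would come from ties and from values that fall exactly on a bin boundary, which is where points c and d meet point e.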

          Quantile binning is variously popular or unpopular according to tribal habits. In particular, people working with business or finance data often seem happy with ideas such as the best performing 20% of stocks or firms, and they should be able to explain that better than I can. Conversely, although the practice can be found in the medical statistics literature, it is also widely denounced there as leading to loss or distortion of information.





          • #6
            Thank you, Nick. I am happy that this thread prompted your detailed discussion. Considering that equal-sized categories are possible only in very limited circumstances, the command -quantiles- seems appealing but is misleading to some degree.
