Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Convert continuous variable into categorical groups

    Does anyone know the difference between the following two commands?

    Code:
    egen variable_name = cut(X), group(4)
    xtile variable_name = X, nq(4)
    * Continuous variable X (e.g. income, age)

    Do both commands create exactly the same variable (quartile)?

    Best wishes,
    David

    Version: Stata/IC 13.0
    Last edited by David Silverstein; 21 Feb 2015, 04:07.

  • #2
    This will show that the two commands do not necessarily generate exactly the same groups. I can't tell you what to prefer.
    Code:
    sysuse auto, clear
    egen x1 = cut(price), group(4)
    xtile x2 = price, nq(4)
    tab1 x1 x2

    Comment


    • #3
      Dear Svend,

      Thank you very much for you response!

      As you pointed out, each command creates different variables (not the same).

      From my understanding, the first command creates four ordered categories with equal distances, while the latter creates quartiles (i.e. four equal groups). But I can't see the difference between the two...

      Comment


      • #4
        I don't know what "equal distances" would mean here, but in intent both syntaxes divide on quantiles. That leaves room for differing on exactly how to do it.

        egen, cut produces codings 0-3 and xtile codings 1-4 for quartiles, but the codings are not just 1 different.

        If a value is exactly equal to a quantile then cut() assigns values to the higher bin but xtile doesn't.

        The code is accessible in both cases, but it is easiest (modifying Svend's example) just to create a sandbox large enough to show the point and small enough to be clear what is happening.

        Code:
         
        . sysuse auto, clear
        (1978 Automobile Data)
        
        . keep in 1/10 
        (64 observations deleted)
        
        . egen x1 = cut(price), group(4)
        
        . xtile x2 = price, nq(4)
        
        . _pctile price, nq(4) 
        
        . ret li 
        
        scalars:
                         r(r1) =  4099
                         r(r2) =  4782.5
                         r(r3) =  5788
        
        . sort price 
        
        . l price x1 x2 
        
             +------------------+
             |  price   x1   x2 |
             |------------------|
          1. |  3,799    0    1 |
          2. |  4,082    0    1 |
          3. |  4,099    1    1 |
          4. |  4,453    1    2 |
          5. |  4,749    1    2 |
             |------------------|
          6. |  4,816    2    3 |
          7. |  5,189    2    3 |
          8. |  5,788    3    3 |
          9. |  7,827    3    4 |
         10. | 10,372    3    4 |
             +------------------+
        Apart from only wanting to do quantile binning very rarely, my prejudice is that it is good to be explicit about the definitions you are using. Do-it-yourself binning was discussed in http://www.stata-journal.com/article...article=pr0054

        Comment


        • #5
          Dear Professor Cox,

          I appreciate your very helpful comment!

          I understood as follows. Please correct me, if I'm mistaken.

          If the number of observations here is 12 for example (i.e. multiples of four), then both syntaxes (for quartiles) produce exactly the same variable.
          But if the number of observations is 10 for example (i.e. not multiples of four), then both syntaxes does not produce the same variable. And this is because

          cut() assigns values to the higher bin but xtile doesn't.

          Comment


          • #6
            No; I didn't say that and it's not implied by what I wrote.

            Having the number of values a multiple of the number of bins you want isn't sufficient to guarantee anything. The rule shared by these commands is always that values that are the same are assigned to the same bin. So, it is all too easy with ties to get mismatches between definitions depending on whether the inequality is strict or weak, and that may bite regardless of the number of value. Here's an example with 12 observations where the naive expectation that quartile-based bins will mean exactly equal frequencies is confounded by ties and also where the two methods do not agree (even allowing for the offset of 1, which is clearly something you could easily fix).

            Code:
             
            . clear
            
            . set obs 12
            obs was 0, now 12
            
            . gen y = ceil(_n/2)
            
            . l y, sep(3)
            
                 +---+
                 | y |
                 |---|
              1. | 1 |
              2. | 1 |
              3. | 2 |
                 |---|
              4. | 2 |
              5. | 3 |
              6. | 3 |
                 |---|
              7. | 4 |
              8. | 4 |
              9. | 5 |
                 |---|
             10. | 5 |
             11. | 6 |
             12. | 6 |
                 +---+
            
            . xtile x1=y, nq(4)
            
            . egen x2 = cut(y), group(4)
            
            . l, sep(3)
            
                 +-------------+
                 | y   x1   x2 |
                 |-------------|
              1. | 1    1    0 |
              2. | 1    1    0 |
              3. | 2    1    1 |
                 |-------------|
              4. | 2    1    1 |
              5. | 3    2    1 |
              6. | 3    2    1 |
                 |-------------|
              7. | 4    3    2 |
              8. | 4    3    2 |
              9. | 5    3    3 |
                 |-------------|
             10. | 5    3    3 |
             11. | 6    4    3 |
             12. | 6    4    3 |
                 +-------------+
            If you have large datasets and/or measurements with fractional parts (and so fewer ties), you might expect at most small differences in frequencies that are trivial, but this problem often bites even with thousands of observations, regardless of many Stata users' unwillingness to believe it. The _Stata Journal_ paper cited earlier in the thread gives examples.

            Binning is reproducible if you have totally explicit definitions, but its other supposed virtues are elusive.

            Comment


            • #7
              Unfortunately those of us who do not subscribe to the Stata Journal will have to pay $11.75 to view the cited article, or wait until it becomes available freely for download in or around December of this year. A subscription is starting to look more attractive.

              Comment


              • #8
                Thank you very much again!

                I found your paper very helpful and understand that xtile (regardless the number of value) does not necessarily lead to equal frequencies especially when we consider the issue of ties.

                Comment


                • #9
                  Dear William,

                  I found the paper on Google Scholar, which is accessible for free!

                  Comment


                  • #10
                    David: That copy doesn't appear with the permission of Stata Press.

                    Comment


                    • #11
                      Hello,
                      once the xtile function has been used, how do you know exactly what are the cutpoints used by this function. Let's say you have a really big database and you don't want to list the observations.
                      thanks!

                      Comment


                      • #12
                        Hello All
                        Please I started using the egen cut() yesterday. Initially it worked. But on this data set, I get this curious output for cutting at 2.8, and at 2.9, giving the same results. Any explanations please?
                        Ezeanosike Obum
                        2.5
                        2.5
                        2.6
                        2.7
                        2.8
                        2.8
                        2.8
                        2.9
                        3
                        3
                        3
                        3
                        3
                        3
                        3
                        3
                        3.1
                        3.1
                        3.1
                        3.1
                        3.2
                        3.2
                        3.2
                        3.2
                        3.2
                        3.2
                        3.2
                        3.2
                        3.2
                        3.2
                        3.2
                        3.2
                        3.2
                        3.3
                        3.3
                        3.3
                        3.3
                        3.3
                        3.3
                        3.3
                        3.3
                        3.4
                        3.4
                        3.4
                        3.4
                        3.4
                        3.4
                        3.4
                        3.4
                        3.4
                        3.4
                        3.4
                        3.5
                        3.5
                        3.5
                        3.5
                        3.5
                        3.5
                        3.5
                        3.5
                        3.5
                        3.5
                        3.6
                        3.6
                        3.6
                        3.6
                        3.6
                        3.6
                        3.6
                        3.6
                        3.6
                        3.7
                        3.7
                        3.7
                        3.7
                        3.7
                        3.7
                        3.7
                        3.7
                        3.7
                        3.7
                        3.7
                        3.7
                        3.7
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.8
                        3.9
                        3.9
                        3.9
                        3.9
                        3.9
                        3.9
                        3.9
                        3.9
                        3.9
                        3.9
                        4
                        4
                        4
                        4
                        4
                        4
                        4
                        4
                        4
                        4
                        4
                        4.1
                        4.1
                        4.1
                        4.1
                        4.1
                        4.1
                        4.2
                        4.2
                        4.2
                        4.2
                        4.3
                        4.3
                        4.3
                        4.3
                        4.3
                        4.3
                        4.3
                        4.3
                        4.3
                        4.4
                        4.4
                        4.4
                        4.5
                        4.5
                        4.5
                        4.5
                        4.6
                        4.6
                        4.7
                        4.8
                        4.8
                        4.9
                        5
                        5.1
                        5.6
                        cord_albumin





































                        egen albumin20 = cut(cord_albumin), at(2.5, 2.8, 5.6) label
                        . egen albumin22 = cut(cord_albumin), at(2.5, 2.9, 5.6) label


                        . table albumin20
                        albumin20 | Freq.
                        ----------+-----------
                        2.5- | 7
                        2.8- | 153
                        ----------------------

                        . table albumin22

                        ----------------------
                        albumin22 | Freq.
                        ----------+-----------
                        2.5- | 7
                        2.9- | 153
                        ----------------------

                        . tab albumin20

                        albumin20 | Freq. Percent Cum.
                        ------------+-----------------------------------
                        2.5- | 7 4.38 4.38
                        2.8- | 153 95.63 100.00
                        ------------+-----------------------------------
                        Total | 160 100.00

                        . tab albumin22

                        albumin22 | Freq. Percent Cum.
                        ------------+-----------------------------------
                        2.5- | 7 4.38 4.38
                        2.9- | 153 95.63 100.00
                        ------------+-----------------------------------
                        Total | 160 100.00

                        Comment

                        Working...
                        X