Convert continuous variable into categorical groups

David Silverstein

Join Date: Feb 2015

Posts: 14
#1

Convert continuous variable into categorical groups

21 Feb 2015, 03:10

Does anyone know the difference between the following two commands?

Code:

egen variable_name = cut(X), group(4) xtile variable_name = X, nq(4)

* Continuous variable X (e.g. income, age)

Do both commands create exactly the same variable (quartile)?

Best wishes,
David
Version: Stata/IC 13.0

Last edited by David Silverstein; 21 Feb 2015, 04:07.
Tags: None
Svend Juul

Join Date: Apr 2014

Posts: 515
#2

21 Feb 2015, 05:48

This will show that the two commands do not necessarily generate exactly the same groups. I can't tell you what to prefer.

Code:

sysuse auto, clear egen x1 = cut(price), group(4) xtile x2 = price, nq(4) tab1 x1 x2
Comment
David Silverstein

Join Date: Feb 2015

Posts: 14
#3

21 Feb 2015, 07:14

Dear Svend,

Thank you very much for you response!

As you pointed out, each command creates different variables (not the same).

From my understanding, the first command creates four ordered categories with equal distances, while the latter creates quartiles (i.e. four equal groups). But I can't see the difference between the two...
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35637
#4

21 Feb 2015, 07:34

I don't know what "equal distances" would mean here, but in intent both syntaxes divide on quantiles. That leaves room for differing on exactly how to do it.

egen, cut produces codings 0-3 and xtile codings 1-4 for quartiles, but the codings are not just 1 different.

If a value is exactly equal to a quantile then cut() assigns values to the higher bin but xtile doesn't.

The code is accessible in both cases, but it is easiest (modifying Svend's example) just to create a sandbox large enough to show the point and small enough to be clear what is happening.

Code:

. sysuse auto, clear (1978 Automobile Data) . keep in 1/10 (64 observations deleted) . egen x1 = cut(price), group(4) . xtile x2 = price, nq(4) . _pctile price, nq(4) . ret li scalars: r(r1) = 4099 r(r2) = 4782.5 r(r3) = 5788 . sort price . l price x1 x2 +------------------+ | price x1 x2 | |------------------| 1. | 3,799 0 1 | 2. | 4,082 0 1 | 3. | 4,099 1 1 | 4. | 4,453 1 2 | 5. | 4,749 1 2 | |------------------| 6. | 4,816 2 3 | 7. | 5,189 2 3 | 8. | 5,788 3 3 | 9. | 7,827 3 4 | 10. | 10,372 3 4 | +------------------+

Apart from only wanting to do quantile binning very rarely, my prejudice is that it is good to be explicit about the definitions you are using. Do-it-yourself binning was discussed in http://www.stata-journal.com/article...article=pr0054
Comment
David Silverstein

Join Date: Feb 2015

Posts: 14
#5

21 Feb 2015, 07:59

Dear Professor Cox,

I appreciate your very helpful comment!

I understood as follows. Please correct me, if I'm mistaken.

If the number of observations here is 12 for example (i.e. multiples of four), then both syntaxes (for quartiles) produce exactly the same variable.
But if the number of observations is 10 for example (i.e. not multiples of four), then both syntaxes does not produce the same variable. And this is because

cut() assigns values to the higher bin but xtile doesn't.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35637
#6

21 Feb 2015, 08:19

No; I didn't say that and it's not implied by what I wrote.

Having the number of values a multiple of the number of bins you want isn't sufficient to guarantee anything. The rule shared by these commands is always that values that are the same are assigned to the same bin. So, it is all too easy with ties to get mismatches between definitions depending on whether the inequality is strict or weak, and that may bite regardless of the number of value. Here's an example with 12 observations where the naive expectation that quartile-based bins will mean exactly equal frequencies is confounded by ties and also where the two methods do not agree (even allowing for the offset of 1, which is clearly something you could easily fix).

Code:

. clear . set obs 12 obs was 0, now 12 . gen y = ceil(_n/2) . l y, sep(3) +---+ | y | |---| 1. | 1 | 2. | 1 | 3. | 2 | |---| 4. | 2 | 5. | 3 | 6. | 3 | |---| 7. | 4 | 8. | 4 | 9. | 5 | |---| 10. | 5 | 11. | 6 | 12. | 6 | +---+ . xtile x1=y, nq(4) . egen x2 = cut(y), group(4) . l, sep(3) +-------------+ | y x1 x2 | |-------------| 1. | 1 1 0 | 2. | 1 1 0 | 3. | 2 1 1 | |-------------| 4. | 2 1 1 | 5. | 3 2 1 | 6. | 3 2 1 | |-------------| 7. | 4 3 2 | 8. | 4 3 2 | 9. | 5 3 3 | |-------------| 10. | 5 3 3 | 11. | 6 4 3 | 12. | 6 4 3 | +-------------+

If you have large datasets and/or measurements with fractional parts (and so fewer ties), you might expect at most small differences in frequencies that are trivial, but this problem often bites even with thousands of observations, regardless of many Stata users' unwillingness to believe it. The _Stata Journal_ paper cited earlier in the thread gives examples.

Binning is reproducible if you have totally explicit definitions, but its other supposed virtues are elusive.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

21 Feb 2015, 09:43

Unfortunately those of us who do not subscribe to the Stata Journal will have to pay $11.75 to view the cited article, or wait until it becomes available freely for download in or around December of this year. A subscription is starting to look more attractive.
Comment
David Silverstein

Join Date: Feb 2015

Posts: 14
#8

21 Feb 2015, 09:45

Thank you very much again!

I found your paper very helpful and understand that xtile (regardless the number of value) does not necessarily lead to equal frequencies especially when we consider the issue of ties.
Comment
David Silverstein

Join Date: Feb 2015

Posts: 14
#9

21 Feb 2015, 09:53

Dear William,

I found the paper on Google Scholar, which is accessible for free!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35637
#10

21 Feb 2015, 10:28

David: That copy doesn't appear with the permission of Stata Press.
Comment
nawar han

Join Date: Sep 2015

Posts: 18
#11

12 Oct 2015, 18:40

Hello,
once the xtile function has been used, how do you know exactly what are the cutpoints used by this function. Let's say you have a really big database and you don't want to list the observations.
thanks!
Comment

OBUM EZEANOSIKE

Join Date: Jul 2014
Posts: 21

#12

22 Dec 2017, 09:38

Hello All
Please I started using the egen cut() yesterday. Initially it worked. But on this data set, I get this curious output for cutting at 2.8, and at 2.9, giving the same results. Any explanations please?
Ezeanosike Obum

2.5
2.5
2.6
2.7
2.8
2.8
2.8
2.9
3
3
3
3
3
3
3
3
3.1
3.1
3.1
3.1
3.2
3.2
3.2
3.2
3.2
3.2
3.2
3.2
3.2

3.2
3.2
3.2
3.2
3.3
3.3
3.3
3.3
3.3
3.3
3.3
3.3
3.4
3.4
3.4
3.4
3.4
3.4
3.4
3.4
3.4
3.4
3.4
3.5
3.5
3.5
3.5
3.5
3.5
3.5
3.5
3.5
3.5
3.6
3.6
3.6

3.6
3.6
3.6
3.6
3.6
3.6
3.7
3.7
3.7
3.7
3.7
3.7
3.7
3.7
3.7
3.7
3.7
3.7
3.7
3.8
3.8
3.8
3.8
3.8
3.8
3.8
3.8
3.8
3.8
3.8
3.8
3.8

3.8
3.8
3.8
3.8
3.8
3.8
3.8
3.9
3.9
3.9
3.9
3.9
3.9
3.9
3.9
3.9
3.9
4
4
4
4
4
4
4
4
4
4
4
4.1
4.1
4.1
4.1

4.1
4.1
4.2
4.2
4.2
4.2
4.3
4.3
4.3
4.3
4.3
4.3
4.3
4.3
4.3
4.4
4.4
4.4
4.5
4.5
4.5
4.5
4.6
4.6
4.7
4.8
4.8
4.9
5
5.1
5.6

cord_albumin

egen albumin20 = cut(cord_albumin), at(2.5, 2.8, 5.6) label
. egen albumin22 = cut(cord_albumin), at(2.5, 2.9, 5.6) label

. table albumin20
albumin20 | Freq.
----------+-----------
2.5- | 7
2.8- | 153
----------------------

. table albumin22

----------------------
albumin22 | Freq.
----------+-----------
2.5- | 7
2.9- | 153
----------------------

. tab albumin20

albumin20 | Freq. Percent Cum.
------------+-----------------------------------
2.5- | 7 4.38 4.38
2.8- | 153 95.63 100.00
------------+-----------------------------------
Total | 160 100.00

. tab albumin22

albumin22 | Freq. Percent Cum.
------------+-----------------------------------
2.5- | 7 4.38 4.38
2.9- | 153 95.63 100.00
------------+-----------------------------------
Total | 160 100.00

Announcement

Convert continuous variable into categorical groups

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment