Quantiles

Matt Hewitson

Join Date: Mar 2022

Posts: 9
#1

10 Oct 2022, 15:59

Good morning

I have created new quantile variables for my data using xtile variable_4 = variable, nq(4). This worked for most of my variables; however, for 2 in particular, the code creates only 2 quantile groups.

Is this because there are a large number of variables with the same value?

Thank you for your help.
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#2

10 Oct 2022, 16:07

Is this because there are a large number of variables with the same value?

Yes. When you are calculating quantile groups, all the observations that have the same value must end up in the same group. When there are a large number of ties, this may mean that there simply don't exist as many quantile groups as you were hoping to get.

For example, if we want 3 groups (terciles) and the values of the data are 1, 2, 2, 2, 2, 2, 2, 2, 3, the 1 cannot form a tercile group by itself, because it is only one of 9 observations. So all of the 2's must be lumped in with the 1. That group, the 1 and all of the 2's, now constitutes 8/9ths of the data, so that "tercile" is already overfilled. So then 3 starts the next group (which is tercile group 3 because we are well past the 2/3rds mark) and, being all that is left of the data, is the sole member of that group. So we end up with 2 groups: group 1, consisting of 1 and all the 2's, and group 3, consisting of just 3.
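
The tercile example can be tried directly in Stata. What follows is only a minimal sketch: the variable names x and tercile are made up for illustration, and the data are the nine values listed above.

```stata
* Sketch: rebuild the nine-value example and ask for terciles
clear
input x
1
2
2
2
2
2
2
2
3
end

* nq(3) requests three quantile groups
xtile tercile = x, nq(3)

* Cross-tabulate the data values against the groups xtile assigned
tabulate x tercile
```

If the reasoning above holds, the table should show only groups 1 and 3, with group 2 empty.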

Quantile groups do not play nicely with data that has a large number of ties.

Last edited by Clyde Schechter; 10 Oct 2022, 16:13.
Matt Hewitson

Join Date: Mar 2022

Posts: 9
#3

10 Oct 2022, 17:18

Awesome Clyde, thank you so much for the quick reply!
Nick Cox

Join Date: Mar 2014

Posts: 35696
#4

11 Oct 2022, 02:40

For lengthier discussion see (e.g.) Section 6 in https://www.stata-journal.com/articl...article=dm0095 and Section 4 in https://www.stata-journal.com/articl...article=pr0054

Quantile binning is more or less doomed to disappointment when the number of distinct values is small. Better to use the original variable!
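
One way to see in advance whether binning is likely to disappoint is to count the distinct values before calling xtile. The following is a rough sketch only: it uses the auto dataset shipped with Stata as a stand-in for real data, and the new variable name rep78_q is hypothetical.

```stata
* Sketch: check the number of distinct values before binning
sysuse auto, clear

* tabulate leaves the number of distinct nonmissing values in r(r)
quietly tabulate rep78
display "distinct values of rep78: " r(r)

* Request quartile groups, then see which groups actually appear
xtile rep78_q = rep78, nq(4)
tabulate rep78 rep78_q
```

When the count of distinct values is close to, or below, the number of groups requested, the cross-tabulation will typically show unevenly sized or missing groups.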
ericmelse

Join Date: May 2014

Posts: 434
#5

11 Oct 2022, 02:59

As well as Cox, N. J. (2018). Speaking Stata: Logarithmic Binning and Labeling. The Stata Journal, 18(1), 262–286.

http://publicationslist.org/eric.melse
ericmelse

Join Date: May 2014

Posts: 434
#6

11 Oct 2022, 03:08

Some further cautionary remarks are made by Bennette, C., & Vickers, A. (2012). Against quantiles: categorization of continuous variables in epidemiologic research, and its discontents. BMC Medical Research Methodology, 12, 21.

https://doi.org

Quantiles are a staple of epidemiologic research: in contemporary epidemiologic practice, continuous variables are typically categorized into tertiles, quartiles and quintiles as a means to illustrate the relationship between a continuous exposure and a binary outcome. In this paper we argue that this approach is highly problematic and present several potential alternatives. We also discuss the perceived drawbacks of these newer statistical methods and the possible reasons for their slow adoption by epidemiologists. The use of quantiles is often inadequate for epidemiologic research with continuous variables.

http://publicationslist.org/eric.melse
Nick Cox

Join Date: Mar 2014

Posts: 35696
#7

11 Oct 2022, 05:14

The reference (to another paper of mine) in #5 may be worth your reading but I don't think it touches on this issue.

The reference in #6 is quoted in https://www.stata-journal.com/articl...article=dm0095 and I certainly recommend reading it.
Matt Hewitson

Join Date: Mar 2022

Posts: 9
#8

28 Oct 2022, 17:43

Thank you, Nick Cox, ericmelse, and Clyde!