
  • Quantiles

    Good morning

    I have created new quantile variables for my data using xtile variable_4 = variable, nq(4). This worked for most of my variables; however, for 2 in particular, the code is only creating 2 quantile groups.

    Is this because there are a large number of variables with the same value?

    Thank you for your help.

  • #2
    Is this because there are a large number of variables with the same value?
    Yes. When you are calculating quantile groups, all the observations that have the same value must end up in the same group. When there are a large number of ties, this may mean that there simply don't exist as many quantile groups as you were hoping to get.

    For example, if we want 3 groups (terciles) and the values of the data are 1, 2, 2, 2, 2, 2, 2, 2, 3, the 1 cannot form a tercile group by itself, because it is only one of 9 observations. So, all of the 2's must now be lumped in with the 1. That group, the 1 and all of the 2's, now constitutes 8/9ths of the data, so that "tercile" is already overfilled. So then 3 starts the next group (which is tercile group 3 because we are well past the 2/3rds mark), and, being all that is left of the data, is the sole member of that group. So we end up with 2 groups: group 1, consisting of 1 and all the 2's, and group 3, consisting of just 3.
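
    A minimal sketch to reproduce the tercile example above in Stata. The variable names x and x3 are made up for illustration; the data are the values listed above.

    ```stata
    clear
    input x
    1
    2
    2
    2
    2
    2
    2
    2
    3
    end

    * Ask for 3 quantile groups (terciles)
    xtile x3 = x, nq(3)

    * Only groups 1 and 3 appear: the 1 and all the 2's land in group 1,
    * the 3 lands in group 3, and group 2 is empty
    tabulate x3
    ```

    Tabulating the new variable right after xtile is a quick way to see whether ties have collapsed any of the requested groups.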

    Quantile groups do not play nicely with data that has a large number of ties.
    Last edited by Clyde Schechter; 10 Oct 2022, 16:13.



    • #3
      Awesome Clyde, thank you so much for the quick reply!



      • #4
        For lengthier discussion see (e.g.) Section 6 in https://www.stata-journal.com/articl...article=dm0095 and Section 4 in https://www.stata-journal.com/articl...article=pr0054

        Quantile binning is more or less doomed to disappointment when the number of distinct values is small. Better to use the original variable!
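
        One way to spot the problem before binning is to count the distinct values. This is only a sketch: it uses Stata's shipped auto dataset as a stand-in for the poster's data (rep78 takes just a handful of distinct values), and rep78_4 is a made-up name.

        ```stata
        sysuse auto, clear

        * The Unique column reports the number of distinct values
        codebook rep78, compact

        * Request quartile groups, then check which groups actually exist
        xtile rep78_4 = rep78, nq(4)
        tabulate rep78_4

        * The original variable has few enough levels to tabulate directly
        tabulate rep78
        ```

        If the count of distinct values is close to, or below, the number of groups requested, some groups are likely to come out empty or badly unbalanced.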





        • #5
          As well as Cox, N. J. (2018). Speaking Stata: Logarithmic Binning and Labeling. The Stata Journal, 18(1), 262–286.
          http://publicationslist.org/eric.melse



          • #6
            Some further cautious remarks are made by Bennette, C., & Vickers, A. (2012). Against quantiles: categorization of continuous variables in epidemiologic research, and its discontents. BMC Medical Research Methodology, 12, 21.
            Quantiles are a staple of epidemiologic research: in contemporary epidemiologic practice, continuous variables are typically categorized into tertiles, quartiles and quintiles as a means to illustrate the relationship between a continuous exposure and a binary outcome. In this paper we argue that this approach is highly problematic and present several potential alternatives. We also discuss the perceived drawbacks of these newer statistical methods and the possible reasons for their slow adoption by epidemiologists. The use of quantiles is often inadequate for epidemiologic research with continuous variables.
            http://publicationslist.org/eric.melse



            • #7
              The reference (to another paper of mine) in #5 may be worth your reading but I don't think it touches on this issue.

              The reference in #6 is quoted in https://www.stata-journal.com/articl...article=dm0095 and I certainly recommend reading it.





              • #8
                Thank you, Nick Cox, ericmelse, and @Clyde!

