  • create quintile dummy variables problem

    Let's say we have a dataset a:
    Code:
    use https://stats.idre.ucla.edu/stat/stata/examples/kirk/cr4, clear
    I want to create five dummy variables for quintiles of a, namely a1, a2, a3, a4, a5.
    My code is:
    Code:
    tabdisp order a, cellvar
    summarize a, detail
    gen byte a1=a<=r(p20)
    gen byte a2=a> r(p20) & a<=r(p40)
    ......
    gen byte a5=a> r(p80) & a<=r(p100)
    When I then list a1, a2, a3, a4, a5, a1 equals 1 in every observation and the other variables all equal 0, which is not what I want.
    Can someone tell me where the problem is? Thanks in advance.
    Last edited by Micky Lu; 09 Mar 2023, 07:23.

  • #2
    Using:
    Code:
    sum a, detail
    return list
    will show that this command only produces percentiles for 1, 5, 10, 25, 50, 75, 90, 95, and 99.

    If you'd like customized cuts such as 20, 40, 80, and 100, look into help pctile.
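    As a minimal sketch of that suggestion, the lines below use the thread's own dataset and the official -_pctile- command, which stores the requested percentiles as r(r1), r(r2), and so on. The choice of 20, 40, 60, and 80 as cut points is an assumption about what quintile cuts are wanted.

    ```stata
    use https://stats.idre.ucla.edu/stat/stata/examples/kirk/cr4, clear

    * summarize, detail stores only a fixed set of percentiles in r()
    summarize a, detail
    return list

    * _pctile accepts arbitrary percentiles and stores them as r(r1), r(r2), ...
    _pctile a, percentiles(20 40 60 80)
    return list
    display "20th percentile of a: " r(r1)
    ```

    Comparing the two -return list- outputs shows which names exist after each command, which is worth checking before using any r() result in an expression.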
    Last edited by Ken Chui; 09 Mar 2023, 07:13.



    • #3
      I agree with Ken Chui, except that if you want a categorical variable, you will want -xtile- rather than -pctile-, but this is discussed in the same help file. You can, of course, then use the categorical variable as a set of indicator (dummy) variables in your model, if you have one to estimate, using factor variable notation; see
      Code:
      h fvvarlist
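
      A short sketch of what that could look like. The auto dataset, the outcome mpg, and the variable name price_q are assumptions chosen only for illustration; they are not from the thread.

      ```stata
      sysuse auto, clear

      * one categorical variable holding the quintile group (1 to 5)
      xtile price_q = price, nq(5)

      * i.price_q expands into indicators, with the lowest group as the base
      regress mpg i.price_q
      ```

      With factor variable notation there is no need to create the five dummy variables by hand.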



      • #4
        Originally posted by Ken Chui
        Using:
        Code:
        sum a, detail
        return list
        will show that this command only produces percentiles for 1, 5, 10, 25, 50, 75, 90, 95, and 99.

        If you'd like customized cuts such as 20, 40, 80, and 100, look into help pctile.
        Thank you. But I'm still wondering why my dummy variable a1 only takes the value 1, while the other variables all take the value 0.
        For example:
        Code:
        a1  a2  a3  a4  a5
        ------------------
        1   0   0   0   0
        1   0   0   0   0
        1   0   0   0   0
        1   0   0   0   0
        1   0   0   0   0
        1   0   0   0   0
        1   0   0   0   0
        1   0   0   0   0



        • #5
          Originally posted by Micky Lu

          Thank you. But I'm still wondering why my dummy variable a1 only takes the value 1, while the other variables all take the value 0.
          For example:
          Code:
          a1  a2  a3  a4  a5
          ------------------
          1   0   0   0   0
          1   0   0   0   0
          1   0   0   0   0
          1   0   0   0   0
          1   0   0   0   0
          1   0   0   0   0
          1   0   0   0   0
          1   0   0   0   0
          My guess is that r(p20) does not exist, so it evaluates to missing. Stata treats missing values as larger than any number, so every value of a is smaller than that, and a1 is 1 throughout.

          Moving on to the second line: since missing is very large, no value of a is bigger than it, so a2 is 0 throughout.

          Code:
          use https://stats.idre.ucla.edu/stat/stata/examples/kirk/cr4, clear
          
          summarize a, detail
          gen byte a1 = a <= r(p20)
          gen byte a2 = a > r(p20) & a<=r(p40)
          
          display 4 < .
          display 4 < r(p20)
          display r(p20) == .
          display 4 > r(p20)
          See that the first three all return TRUE (1), and the last one FALSE (0):

          Code:
          . display 4 < .
          1
          
          . display 4 < r(p20)
          1
          
          . display r(p20) == .
          1
          
          . display 4 > r(p20)
          0
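
          One way to guard against this, sketched under the assumption that stopping with an error is preferable to silently comparing against missing. The names cut20 and below20 are hypothetical.

          ```stata
          use https://stats.idre.ucla.edu/stat/stata/examples/kirk/cr4, clear
          summarize a, detail

          * r(p20) is not stored by summarize, so this displays 1
          display missing(r(p20))
          * r(p25) is stored, so this displays 0
          display missing(r(p25))

          * get the 20th percentile from _pctile and copy it to a scalar
          _pctile a, percentiles(20)
          scalar cut20 = r(r1)

          * stop with an error if the cut point is missing
          assert !missing(cut20)
          generate byte below20 = a <= cut20 if !missing(a)
          ```

          The `if !missing(a)` qualifier keeps observations with missing a from being coded 0 or 1.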
          Last edited by Ken Chui; 09 Mar 2023, 07:37.



          • #6
            Thank you Ken Chui !
            Is it correct if I do this?
            Code:
            _pctile a, nq(5)
            gen byte a1=a<=r(r1)
            gen byte a2=a> r(r1) & a<=r(r2)
            ...
            gen byte a5=a> r(r4) & a<=r(r5)
            Last edited by Micky Lu; 09 Mar 2023, 08:01.



            • #7
              Originally posted by Micky Lu
              Thank you Ken Chui !
              Is it correct if I do this?
              Code:
              _pctile a, nq(5)
              gen byte a1=a<=r(r1)
              gen byte a2=a> r(r1) & a<=r(r2)
              ...
              gen byte a5=a> r(r4) & a<=r(r5)
              As #3 clarifies, use xtile, which directly returns the categorical variable. E.g.

              Code:
              xtile pct_a = a, nq(6)
              Then you can create binary indicators using tabulate, with the gen option.
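
              A sketch of those two steps for quintiles. The auto dataset and the name price_q are assumptions for illustration, and nq(5) is used because the original question asks for five groups. Note also that -_pctile- with nq(5) stores only four cut points, r(r1) to r(r4), so r(r5) in the quoted code would be missing.

              ```stata
              sysuse auto, clear

              * quintile group, coded 1 to 5
              xtile price_q = price, nq(5)

              * generate() creates one 0/1 indicator per group: price_q1 ... price_q5
              tabulate price_q, generate(price_q)
              summarize price_q1-price_q5
              ```

              Each indicator's mean in the -summarize- output is the share of observations in that group.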
              Last edited by Ken Chui; 09 Mar 2023, 09:04.



              • #8
                Originally posted by Ken Chui

                As #3 clarifies, use xtile, which directly returns the categorical variable. E.g.

                Code:
                xtile pct_a = a, nq(6)
                Then you can create binary indicators using tabulate, with the gen option.
                When I use xtile, the same problem occurs as before.
                What if I use
                Code:
                _pctile a, p(20(20)80)
                ?



                • #9
                  If the problem is either sparse data or lots of ties, then anything will be problematic. It would help us help you if you provided a data example using -dataex- and CODE blocks, as described in the FAQ.



                  • #10
                    Originally posted by Micky Lu

                    When I use xtile, the same problem occurs as before.
                    What if I use
                    Code:
                    _pctile a, p(20(20)80)
                    ?
                    You're trying to cut a variable that has only 4 distinct values (1, 2, 3, 4) into 5 groups based on its percentiles, so of course it won't work. Try practicing on a larger data set (e.g. webuse nhanes2).
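
                    A quick sketch of that practice run. The choice of bmi as the variable to bin and the name bmi_q are assumptions.

                    ```stata
                    webuse nhanes2, clear

                    * bmi has many distinct values, so five groups can all be populated
                    xtile bmi_q = bmi, nq(5)
                    tabulate bmi_q
                    ```

                    The -tabulate- output shows how many observations fall in each of the five groups.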



                    • #11
                      The original example data

                      Code:
                       use https://stats.idre.ucla.edu/stat/stata/examples/kirk/cr4, clear
                      points up several issues which have been touched on in the thread but not pushed as far as they can go.

                      This post takes Ken Chui's post a little further.

                      Stata's rules for quantile binning include (here a deliberate echo of Richard Feynman in a different riff)

                      ** the same values must go in the same bin **

                      When thinking and talking about quantile binning I like to fire up a quantile plot, even though often the same information is as easily visible in a simple table.
                      [Figure: quintiles.png, quantile plot]

                      That makes it clear that there are four distinct values, and four only, and they have equal frequency. So, any way you try it, quantile binning can't populate all 5 quintile bins.

                      The simple point is thus to fire up a graph of the distribution if quantile binning results are puzzling or disappointing.
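
                      A minimal sketch of that check on the thread's data, using the official -quantile- command for the plot (an assumption; the figure in this post may have been produced with a different command). The name a_q is hypothetical.

                      ```stata
                      use https://stats.idre.ucla.edu/stat/stata/examples/kirk/cr4, clear

                      * table and quantile plot of the distribution
                      tabulate a
                      quantile a

                      * see which bins xtile actually populates
                      xtile a_q = a, nq(5)
                      tabulate a_q
                      ```

                      Because tied values must share a bin, the last table cannot show all five groups for this variable.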
                      Last edited by Nick Cox; 09 Mar 2023, 12:27.



                      • #12
                        Thank you all for your clarification!
                        Nick Cox Your graph makes it very clear.

