Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • generating cataegorical variable based on quartiles

    Hello, I have a variable "Corporate Governance" for 7395 firm year observations. I want to create a categorical variable on 4 quartiles in a way that:
    1 represent Corporate Governance value less than 25th quartile
    2 represents Corporate Governance value between 25th and 50th quartile
    3 represents Corporate Governance value between 50th and 75th quartile
    4 represents Corporate Governance value between 75th and 100th quartile

    any short way to do that? thanks.

  • #2
    -help xtile-

    Just bear in mind, that depending on the distribution of values of your Corporate Governance variable, it might not be possible to partition the data into four close-to-equal-size groups based on this variable. People are often disappointed in the way their quartiles (or other quantiles) turn out.

    Comment


    • #3
      The -xtile- command is what you need.

      Code:
      xtile newvar = oldvar, nq(4)
      ​​Or equivalently

      Code:
      egen new = xtile(old), nq(4)
      ​​​​​
      Where nq specifies the number of quantiles.

      Comment


      • #4
        im feeling confused right now. My Corporate Index has a value from 0 to 17 with 0 being the lowest CG value and 17 being the highest. Median value of my sample is 8 which actually does not represent the midpoint value. moreover my data is also not being distributed equally in 4 equal quartiles. 1st quartile has 2188 observations, 2nd quartile ==1649, 3rd quartile ==1882, 4th quartile ==1673

        is there a way out in this situation?

        Comment


        • #5
          Probably not. This is what I referred to in my previous post about people typically being disappointed in the results of forming quartiles or other quantiles. The problem is most likely that your data are too clumped to permit an equal division. Here's an example:

          Code:
          clear
          set obs 100
          set seed 1234
          gen score = rnormal(8, 2)
          replace score = round(score, 1)
          tab score
          which produces
          Code:
                score |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    2 |          1        1.00        1.00
                    3 |          1        1.00        2.00
                    4 |          3        3.00        5.00
                    5 |          5        5.00       10.00
                    6 |          6        6.00       16.00
                    7 |         20       20.00       36.00
                    8 |         23       23.00       59.00
                    9 |         19       19.00       78.00
                   10 |         14       14.00       92.00
                   11 |          5        5.00       97.00
                   12 |          2        2.00       99.00
                   13 |          1        1.00      100.00
          ------------+-----------------------------------
                Total |        100      100.00
          So, where do you draw the line between the bottom quartile and the next. If you draw it between 6 and 7, your bottom quartile will only have 16 observations instead of the hoped-for 25. But if you go up a notch and draw the line between 7 and 8, you end up with 36 observations in the bottom "quartile." Similarly with separating the second and third quartiles: no matter where you draw the line, you can't get the groups to be of even close to equal size.

          I'm guessing if you look at the distribution of your Corporate Index score, you will encounter a similar situation.

          My advice is to forget about quartiles. Even when you get four nearly equal size groups, they're generally not a good way to work with data. In any kind of regression analysis you would be better off just using the Composite Index itself as a predictor than any arbitrary division into categories. Cutting up a continuous variable into categories just throws away information. If you weren't thinking of using this in a regression, but just wanted to look at, say, the trend of some other variable(s) across the quartiles of the Corporate Index score, you could accomplish that better with a correlation coefficient or a scatterplot.





          Last edited by Clyde Schechter; 30 Sep 2016, 23:23.

          Comment


          • #6
            Originally posted by Clyde Schechter View Post
            -help xtile-

            Just bear in mind, that depending on the distribution of values of your Corporate Governance variable, it might not be possible to partition the data into four close-to-equal-size groups based on this variable. People are often disappointed in the way their quartiles (or other quantiles) turn out.
            Hi, Clyde,

            I used
            Code:
            xtile y_quart = y, nq(4)
            and got the following output without any 2nd quartile. I am confused why is it happening. Is it because for more than 50% observations (1789 out of 2825), the value is zero. Thanks.

            Code:
            y | Freq. Percent Cum.
            ------------+-----------------------------------
            1 | 1,789 63.33 63.33
            3 | 364 12.88 76.21
            4 | 672 23.79 100.00
            ------------+-----------------------------------
            Total | 2,825 100.00
            
            Variable | Obs      Mean   Std. Dev. Min Max
            -------------+---------------------------------------------------------
                   y | 2,825 23.03929 91.96089 0 1164
            Last edited by Nitin Jain; 11 Aug 2022, 22:35.

            Comment


            • #7
              #6 Short answer is Yes. If your variable has distinct values 1, 3, 4 then at most you can map one-to-one to 3 other distinct values. xtile won't split tied values into different bins. In this case you get out precisely what you put it in and there is (literally) zero gain from quantile binning, which is in my view over-rated any way.

              Code:
              . clear 
              
              . set obs 2825 
              Number of observations (_N) was 0, now 2,825.
              
              . gen y = cond(_n <= 1789, 1, cond(_n <= (1789 + 364), 3, 4))
              
              . 
              . tab y 
              
                        y |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                        1 |      1,789       63.33       63.33
                        3 |        364       12.88       76.21
                        4 |        672       23.79      100.00
              ------------+-----------------------------------
                    Total |      2,825      100.00
              
              . 
              . xtile y_quart=y, nq(4)
              
              . 
              . tab y y_quart 
              
                         |         4 quantiles of y
                       y |         1          3          4 |     Total
              -----------+---------------------------------+----------
                       1 |     1,789          0          0 |     1,789 
                       3 |         0        364          0 |       364 
                       4 |         0          0        672 |       672 
              -----------+---------------------------------+----------
                   Total |     1,789        364        672 |     2,825 
              
              .

              Comment


              • #8
                Originally posted by Nick Cox View Post
                #6 Short answer is Yes. If your variable has distinct values 1, 3, 4 then at most you can map one-to-one to 3 other distinct values. xtile won't split tied values into different bins. In this case you get out precisely what you put it in and there is (literally) zero gain from quantile binning, which is in my view over-rated any way.

                Code:
                . clear
                
                . set obs 2825
                Number of observations (_N) was 0, now 2,825.
                
                . gen y = cond(_n <= 1789, 1, cond(_n <= (1789 + 364), 3, 4))
                
                .
                . tab y
                
                y | Freq. Percent Cum.
                ------------+-----------------------------------
                1 | 1,789 63.33 63.33
                3 | 364 12.88 76.21
                4 | 672 23.79 100.00
                ------------+-----------------------------------
                Total | 2,825 100.00
                
                .
                . xtile y_quart=y, nq(4)
                
                .
                . tab y y_quart
                
                | 4 quantiles of y
                y | 1 3 4 | Total
                -----------+---------------------------------+----------
                1 | 1,789 0 0 | 1,789
                3 | 0 364 0 | 364
                4 | 0 0 672 | 672
                -----------+---------------------------------+----------
                Total | 1,789 364 672 | 2,825
                
                .
                Ok, thanks, Nick.

                Comment

                Working...
                X