generating cataegorical variable based on quartiles

Sanaullah Farooq

Join Date: Mar 2016

Posts: 107
#1

generating cataegorical variable based on quartiles

30 Sep 2016, 21:22

Hello, I have a variable "Corporate Governance" for 7395 firm year observations. I want to create a categorical variable on 4 quartiles in a way that:
1 represent Corporate Governance value less than 25th quartile
2 represents Corporate Governance value between 25th and 50th quartile
3 represents Corporate Governance value between 50th and 75th quartile
4 represents Corporate Governance value between 75th and 100th quartile

any short way to do that? thanks.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

30 Sep 2016, 21:48

-help xtile-

Just bear in mind, that depending on the distribution of values of your Corporate Governance variable, it might not be possible to partition the data into four close-to-equal-size groups based on this variable. People are often disappointed in the way their quartiles (or other quantiles) turn out.
1 like
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#3

30 Sep 2016, 21:48

The -xtile- command is what you need.

Code:

xtile newvar = oldvar, nq(4)

Or equivalently

Code:

egen new = xtile(old), nq(4)

Where nq specifies the number of quantiles.
1 like
Comment
Sanaullah Farooq

Join Date: Mar 2016

Posts: 107
#4

30 Sep 2016, 22:27

im feeling confused right now. My Corporate Index has a value from 0 to 17 with 0 being the lowest CG value and 17 being the highest. Median value of my sample is 8 which actually does not represent the midpoint value. moreover my data is also not being distributed equally in 4 equal quartiles. 1st quartile has 2188 observations, 2nd quartile ==1649, 3rd quartile ==1882, 4th quartile ==1673

is there a way out in this situation?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#5

30 Sep 2016, 23:20

Probably not. This is what I referred to in my previous post about people typically being disappointed in the results of forming quartiles or other quantiles. The problem is most likely that your data are too clumped to permit an equal division. Here's an example:

Code:

clear set obs 100 set seed 1234 gen score = rnormal(8, 2) replace score = round(score, 1) tab score

which produces

Code:

score | Freq. Percent Cum. ------------+----------------------------------- 2 | 1 1.00 1.00 3 | 1 1.00 2.00 4 | 3 3.00 5.00 5 | 5 5.00 10.00 6 | 6 6.00 16.00 7 | 20 20.00 36.00 8 | 23 23.00 59.00 9 | 19 19.00 78.00 10 | 14 14.00 92.00 11 | 5 5.00 97.00 12 | 2 2.00 99.00 13 | 1 1.00 100.00 ------------+----------------------------------- Total | 100 100.00

So, where do you draw the line between the bottom quartile and the next. If you draw it between 6 and 7, your bottom quartile will only have 16 observations instead of the hoped-for 25. But if you go up a notch and draw the line between 7 and 8, you end up with 36 observations in the bottom "quartile." Similarly with separating the second and third quartiles: no matter where you draw the line, you can't get the groups to be of even close to equal size.

I'm guessing if you look at the distribution of your Corporate Index score, you will encounter a similar situation.

My advice is to forget about quartiles. Even when you get four nearly equal size groups, they're generally not a good way to work with data. In any kind of regression analysis you would be better off just using the Composite Index itself as a predictor than any arbitrary division into categories. Cutting up a continuous variable into categories just throws away information. If you weren't thinking of using this in a regression, but just wanted to look at, say, the trend of some other variable(s) across the quartiles of the Corporate Index score, you could accomplish that better with a correlation coefficient or a scatterplot.

Last edited by Clyde Schechter; 30 Sep 2016, 23:23.
2 likes
Comment
Nitin Jain

Join Date: Apr 2022

Posts: 65
#6

11 Aug 2022, 22:31

Originally posted by Clyde Schechter View Post

-help xtile-

Just bear in mind, that depending on the distribution of values of your Corporate Governance variable, it might not be possible to partition the data into four close-to-equal-size groups based on this variable. People are often disappointed in the way their quartiles (or other quantiles) turn out.

Hi, Clyde,

I used

Code:

xtile y_quart = y, nq(4)

and got the following output without any 2nd quartile. I am confused why is it happening. Is it because for more than 50% observations (1789 out of 2825), the value is zero. Thanks.

Code:

y | Freq. Percent Cum. ------------+----------------------------------- 1 | 1,789 63.33 63.33 3 | 364 12.88 76.21 4 | 672 23.79 100.00 ------------+----------------------------------- Total | 2,825 100.00 Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- y | 2,825 23.03929 91.96089 0 1164

Last edited by Nitin Jain; 11 Aug 2022, 22:35.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

12 Aug 2022, 03:27

#6 Short answer is Yes. If your variable has distinct values 1, 3, 4 then at most you can map one-to-one to 3 other distinct values. xtile won't split tied values into different bins. In this case you get out precisely what you put it in and there is (literally) zero gain from quantile binning, which is in my view over-rated any way.

Code:

. clear 

. set obs 2825 
Number of observations (_N) was 0, now 2,825.

. gen y = cond(_n <= 1789, 1, cond(_n <= (1789 + 364), 3, 4))

. 
. tab y 

          y |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,789       63.33       63.33
          3 |        364       12.88       76.21
          4 |        672       23.79      100.00
------------+-----------------------------------
      Total |      2,825      100.00

. 
. xtile y_quart=y, nq(4)

. 
. tab y y_quart 

           |         4 quantiles of y
         y |         1          3          4 |     Total
-----------+---------------------------------+----------
         1 |     1,789          0          0 |     1,789 
         3 |         0        364          0 |       364 
         4 |         0          0        672 |       672 
-----------+---------------------------------+----------
     Total |     1,789        364        672 |     2,825 

.

Comment

Nitin Jain

Join Date: Apr 2022
Posts: 65

12 Aug 2022, 10:21

Originally posted by Nick Cox View Post

Code:

. clear

. set obs 2825
Number of observations (_N) was 0, now 2,825.

. gen y = cond(_n <= 1789, 1, cond(_n <= (1789 + 364), 3, 4))

.
. tab y

y | Freq. Percent Cum.
------------+-----------------------------------
1 | 1,789 63.33 63.33
3 | 364 12.88 76.21
4 | 672 23.79 100.00
------------+-----------------------------------
Total | 2,825 100.00

.
. xtile y_quart=y, nq(4)

.
. tab y y_quart

| 4 quantiles of y
y | 1 3 4 | Total
-----------+---------------------------------+----------
1 | 1,789 0 0 | 1,789
3 | 0 364 0 | 364
4 | 0 0 672 | 672
-----------+---------------------------------+----------
Total | 1,789 364 672 | 2,825

.

Ok, thanks, Nick.

Announcement

generating cataegorical variable based on quartiles

Comment

Comment

Comment

Comment

Comment

Comment

Comment