Why creating deciles results is unsystematic allocation of observations of the ranking variable ?

Ahmed Abdalla

Join Date: Apr 2014

Posts: 82
#1

Why creating deciles results is unsystematic allocation of observations of the ranking variable ?

13 May 2014, 13:31

Hi
I am trying to create deciles based on a certain variable (this variable is negative for all observations). I use the following code to create 10 deciles that are updated each year:
egen decile = xtile(l_op2) , by(yr) p(1/9)

I get 10 deciles each year. The only problem is that all deciles except decile 10 have fairly similar number of observations within each decile. Decile 10 has too many observations comparing to other deciles.For example, I get:

. sum l_op2 if decile==10 & yr==2011

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
l_ op2 | 8 -.3147382 .0837412 -.4147727 -.2305505

. sum l_op2 if decile==10 & yr==2011

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
l_op2 | 698 -.0048349 .0064964 -.0293935 -7.32e-07

This pattern has almost been for all years that is the average across all years is:

. sum l_op2 if decile==1

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
l_op2 | 135 -.2464536 .1394698 -.6817232 -.0689984

. sum l_op2 if decile==10

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
l_op2 | 10361 -.0060971 .0090912 -.0675198 -6.19e-08

The only problem is that decile 10 has many obs comparing to others ?

How can I fix this problem ?
Tags: None
FernandoRios

Join Date: Apr 2014

Posts: 2471
#2

13 May 2014, 13:57

Hi Ahmed,
First, sorry, i think i left this part of the question hanging on a previous post of yours.
I think you are right, and there is something particular on your data.
I would like first to check a simple frequency table of the decile variable. Based on our summary, there are probably no "intermediate" deciles.
If that is not the case, and the "large" amount of observations occur only in decile 10, then my guess is there is a problem with the command you are using, in which case i would opt to use the "old" xtile function.
Lets see first that decile frequency to understand better the underlying problem.
Fernando
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#3

13 May 2014, 13:58

Ahmed,

It looks like you have a lot of observations with l_op2 equal to zero (or very close to zero). These show up in the results of the sum command as 7.32e-07 and -6.19e-08. When you have a lot of observations with the same value, Stata puts them in the same quantile. After all, would you rather have two adjacent quantiles which both have the same values in them? How would it decide which quantile to use for any given value? The solution is to put them all in the same quantile, which results in quantiles that are larger than they should be. You may want to consider having fewer than 10 quantiles if you want to have equal sized quantiles. The exact number of quantiles will depend on how many identical values you have. For example, if 20% of your values are zero, you could have 5 quantiles, one of which would be entirely zeros and would be equal in size to the other four quantiles.

Regards,
Joe
Comment
Ahmed Abdalla

Join Date: Apr 2014

Posts: 82
#4

13 May 2014, 14:34

Joe and Fernardo

1- I tried with 5 deciles but same behavior ( code: egen decile = xtile(l_op2) if l_op2 !=0, by(yr) p(1/4) )
2- I tabulated the deciles to see frequency ( code: tabuale decile )
decile | Freq. Percent Cum.
------------+-----------------------------------
1 | 135 1.18 1.18
2 | 113 0.99 2.18
3 | 105 0.92 3.10
4 | 130 1.14 4.24
5 | 10,915 95.76 100.00
------------+-----------------------------------
Total | 11,398 100.00

3- Joe, you are right I have a lot of zero l_op2 , but when I create my deciles I EXCLUDE them, so my exact code was: ( egen decile = xtile(l_op2) if l_op2 !=0, by(yr) p(1/4) )

I summarized l_op2 if it does not equal to zero and get:

sum l_op2 if l_op2 !=0

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
l_op2 | 11398 -.0140074 .0372738 -.6817232 -6.19e-08

4- the variable l_op2 is mainly a ratio and that is why it has low values ..

This decile formation is critical in my analysis as I make a trading strategy based on that later so I hope I can be as accurate as possible in that.

Hope to get some suggestions,

Ahmed
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2471
#5

13 May 2014, 14:47

Ahmed,
I think that first you might need a "weaker" zero constrain, which should contain the very small (say below 1e-7) values.
It puzzles me, however, that you DO have other deciles. Im not familiar with egen =xtile(), but seems it does something I usually dont do.
So, while not so popular, try creating the xtiles like this:
xtile aux=l_op2 if yr==1990 & abs(l_op2)>1e-7, n(5)

I would do this first to compare the results agains the egen xtile command, and second to use a weaker concept of "zero" on your l_op2 variable.
Hope this helps
Fernando
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#6

13 May 2014, 15:02

Ahmed,

Whatever that value is, even if not precisely zero, there are too many for this analysis. If 95% of your values are the same, there is not much hope of getting any sort of meaningful quantiles. The only other solution I can think of is to make sure that you are storing the l_op2 variable as a double. Perhaps there are some different unique values in there but the precision of the variable prevents xtile from perceiving them as different.

Also, are you calculating the ratio or are you getting it from somewhere else? It might be instructive to look at the numerators and denominators and see why so many of the ratios are the same.

Regards,
Joe
Comment
Ahmed Abdalla

Join Date: Apr 2014

Posts: 82
#7

13 May 2014, 16:12

Thanks to both of you,
1- I can explain how do I calculate the ranking variable. All variables in the regressions are ratios (scaled variables)

gen c2=.

levelsof sic_2, local(levels)

foreach x of local levels {
foreach z of numlist 1990/2012 {

capture reg ce si if sic_2==`x' & yr==`z'

if _rc == 0 {

replace c2=_b[si] if e(sample)

}
}
}

sum c2
sum c2 if c2 <1 & c2>0
replace c2=0 if c2>1
replace c2=0 if c2<0
sum c2

gen op2 =c2*si*-1 // op2 will be negative to indicate their expense nature
replace op2=0 if op2==.

gen l_op2=l.op2

sum op2 if op2 !=0,d

op2
-------------------------------------------------------------
Percentiles Smallest
1% -.1929019 -.6817232
5% -.0713695 -.6445665
10% -.0360368 -.6391318 Obs 14523
25% -.0120815 -.6348315 Sum of Wgt. 14523

50% -.0031527 Mean -.0155736
Largest Std. Dev. .0418924
75% -.000761 -1.59e-07
90% -.000199 -1.08e-07 Variance .001755
95% -.0000809 -1.03e-07 Skewness -7.166473
99% -.0000125 -6.19e-08 Kurtosis 74.69785

2- Fernando, I tried your code xtile aux=l_op2 if yr==1999 & abs(l_op2)>1e-7, n(5)
sum aux

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
aux | 726 2.997245 1.41616 1 5

I thought to do that in aloop for all years:
. levelsof yr, local(levels)
. foreach z of local levels {
2. xtile aux=l_op2 if yr ==`z' & abs(l_op2)>1e-7, n(5)
3. }

I got no observations

3- Do you think that one way to avoid this problem is to multiply l_op2 by 1000,000 for example and then do the rank ?

Ahmed
Comment
Ahmed Abdalla

Join Date: Apr 2014

Posts: 82
#8

13 May 2014, 16:13

Joe,
I did not get the double part from your email. Can we do something to deal with that ?

I also tried to use the same code to rank into deciles using a different variable....but the same results a lot of obs are assigned to the 10 decile

I tried to winsorize the variable l_op2, or use other variables in my dataset but the same issue, Stata assigns a lot of obs to the 10th decile and almost all other deciles have similar number of obs....It is puzzling !!!!

Last edited by Ahmed Abdalla; 13 May 2014, 16:28. Reason: to clarify further
Comment
Ahmed Abdalla

Join Date: Apr 2014

Posts: 82
#9

13 May 2014, 17:39

I have the feeling that something is going wrong with the code itself, as I have tried on ranking using different variables...
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#10

13 May 2014, 18:51

Ahmed,

Looking at the output you displayed, I'm not so sure anymore whether precision is the real problem. But just in case, all of the variables involved in the calculation of l_op2 (si, c2, op2) should be declared as double when they are created (e.g.: gen double c2=.). This will store the numbers with twice as much precision as with float, which is the default.

What is probably more useful would be looking at the code for pctile/xtile (if possible) and seeing what methods are used and how much precision is maintained. I don't have time to do this tonight, but if no one else volunteers, I will look at it tomorrow.

Regards,
Joe
Comment
Ahmed Abdalla

Join Date: Apr 2014

Posts: 82
#11

13 May 2014, 20:21

Joe
I will try what you suggest also tomorrow morning (US time : ), but note that this happens with other variables as well even if they are not created from a regression. for example , ranking on si itself results in the same thing !
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35726
#12

13 May 2014, 23:28

The same values have the same ranks, so ranking will change nothing here. Joe has put his finger on the problem. All programs that assign to quantile-based groups follow the principle that observations which have the same value will end up in the same group: what else could they do without arbitrariness?

In principle, you could be the first to spot a bug. In practice, people do this all the time with this kind of data, so you can be confident that a bug would have been unearthed by now.

I get the impression that you are not really looking at your data. Some graphical examination of its fine structure (e.g. with spikeplot) is likely to underline its quirks.
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#13

14 May 2014, 07:56

Ahmed,

I just realized what the problem is (can't believe none of us spotted it until now)...

Code:

egen decile = xtile(l_op2) , by(yr) p(1/9)

This specification of the egenmore xtile() function creates percentiles 1%, 2%, 3%, ..., 10%. Accordingly, the last group is going to have 90% of the data. The correct way to do this is one of the following:

Code:

egen decile = xtile(l_op2) , by(yr) p(10(10)90) egen decile = xtile(l_op2) , by(yr) nq(10) // Better (more concise) since you are using a standard definition of decile

The attempt to do quintiles failed for the same reason: p(1/4) means 1%, 2%, 3%, 4%.

The attempt to use the xtile command might have worked, but I think there was another problem there: because you are using a lagged variable all of the observations for the first year (1990) are missing, hence "no observations". The egenmore xtile() function works better because by() can handle such things.

Regards,
Joe
1 like
Comment
Ahmed Abdalla

Join Date: Apr 2014

Posts: 82
#14

14 May 2014, 11:38

Dear Joe Canner
I can't believe that ! It seems to be solved , you are great !! , again Big Like for your post !! Thanks also Nick for your hints!

I used this code now : egen decile = xtile(l_op2) if op2 !=0 , by(yr) p(10(10)90)
and this code egen decile = xtile(l_op2) if l_op2W !=0 , by(yr) nq(10)

to exclude firms with zero l_op2 and it worked properly, almost each decile each year and all over the sample years has fairly similar number of observations.... That is awesome !

However, I can't understand why Stata did that before using p(1/9) ? and I didn't get what you suggest by using egenmore xtile () with by () , do you suggest any changes to the code because I am using lagged variable as the ranking variable ?

Best Ahmed
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#15

14 May 2014, 12:03

Ahmed,

As I mentioned at the beginning of my last post, the percentiles() option takes a numlist which corresponds to the percentiles you want. Percentiles range from 1% to 99% and your numlist (1/9) only asks for 1%, 2%, 3%, ... up to 9%. The last group includes everything from 10% to 100%. In contrast, the correct numlist (10(10)90) creates percentiles 10%, 20%, ..., 90%, corresponding to the 10 deciles you want. See the example in the egenmore help:

percentiles() requests percentiles corresponding to numlist: for example, p(25(25)75) is used to
create a variable according to quartiles.

My last comment had to do with why the xtile command (suggested by Fernando) didn't work. The xtile command, used with a loop, doesn't work with lagged variables, but the egenmore xtile() function with the by() option does. You used the latter, so you are fine.

Regards,
Joe
Comment

Announcement

Why creating deciles results is unsystematic allocation of observations of the ranking variable ?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment