egen xtile does not assign observations to 10th decile

Huib Jan Meulenbelt

Join Date: Mar 2017

Posts: 20
#1


20 Mar 2017, 05:41

Dear Statalist,

I am having trouble with the user-written egen function xtile().
My data set contains data on numerous stocks for the period 1963-2016. Each stock is identified by its unique permno.
For each stock on each date (the variable "time") I calculate a measure on which the xtile sorting is based. This measure is called "breakhigh52w2". Its minimum value is 0.007 and its maximum is 1.
The distribution of this measure is shown below:

Distribution.gph

Next, I run the following command:

egen breakhigh52w_a = xtile(breakhigh52w), by(time) nq(10)

A histogram gives a visual representation of this variable.

Histogram.gph

Clearly, there are far fewer observations in the 10th decile. This confuses me: for each date, shouldn't at least one observation be assigned to the 10th decile?
I assume that the large number of observations with a "breakhigh52w2" value of 1, together with the boundary points of the deciles, is responsible for this.

Does anyone have an idea how to overcome this problem?

Thanks in advance,

Huib
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

20 Mar 2017, 06:56

Hello Huib,

Welcome to the Statalist.

You may wish to take a look at the FAQ, particularly the topic on how to share graphics in the Stata Forum.

That said, you may wish to present the output of the 'summarize var' command.

I suspect the values are concentrated well below the last decile.

Last edited by Marcos Almeida; 20 Mar 2017, 07:06.

Best regards,

Marcos
Huib Jan Meulenbelt

Join Date: Mar 2017

Posts: 20
#3

20 Mar 2017, 07:37

Dear Marcos,

Thanks for your reply. I will edit my question and use the correct extension to share graphics. In addition, I added the summary output of the variable.
I am sorry about the previous post!

---------

Dear Statalist,

I am having trouble with the user-written egen function xtile().
My data set contains data on numerous stocks for the period 1963-2016. Each stock is identified by its unique permno.
For each stock on each date (the variable "time") I calculate a measure on which the xtile sorting is based. This measure is called "breakhigh52w2". Its minimum value is 0.007 and its maximum is 1.
The distribution of this measure is shown below:

The command summarize, detail gives the following information about the variable:

Next, I run the following command:

egen breakhigh52w_a = xtile(breakhigh52w), by(time) nq(10)

A histogram gives a visual representation of this variable.

Clearly, there are far fewer observations in the 10th decile. This confuses me: for each date, shouldn't at least one observation be assigned to the 10th decile?
I assume that the large number of observations with a "breakhigh52w2" value of 1, together with the boundary points of the deciles, is responsible for this.

Does anyone have an idea how to overcome this problem?

Thanks in advance,

Huib

Nick Cox

Join Date: Mar 2014
Posts: 35485

20 Mar 2017, 07:43

The use of egen, xtile() (from egenmore on SSC, as you are asked to explain) is incidental here, as a convenience because you are binning groupwise.

This looks like a generic problem and a consequence of binning your data when ties are present.

To make it concrete, suppose that 15% of values are equal to the maximum and you want decile-based bins. What do you expect to happen then?

No Stata program or function known to me will split those tied values, and assign 10% to one (top) decile bin and 5% to another, the next decile bin.

The rule that overrides all others is that the same values must end up in the same bin.

At most your choice here (apart from not binning, usually a very good idea) is to bin on the negation of your variable. Discussion within http://www.stata-journal.com/sjpdf.h...iclenum=pr0054

Here's an example. It's a discrete example, but the principle is the same. Stata's convention implies that the top bin is empty.

Code:

. clear

. matrix freq = [5,10,10,10,10,10,10,10,10,15]

. set obs 10
number of observations (_N) was 0, now 10

. gen y = _n

. gen freq = freq[1, _n]

. expand freq
(90 observations created)

. tab y

          y |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          5        5.00        5.00
          2 |         10       10.00       15.00
          3 |         10       10.00       25.00
          4 |         10       10.00       35.00
          5 |         10       10.00       45.00
          6 |         10       10.00       55.00
          7 |         10       10.00       65.00
          8 |         10       10.00       75.00
          9 |         10       10.00       85.00
         10 |         15       15.00      100.00
------------+-----------------------------------
      Total |        100      100.00

. xtile bin1=y , nq(10)

. tab bin1

         10 |
  quantiles |
      of y  |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         15       15.00       15.00
          2 |         10       10.00       25.00
          3 |         10       10.00       35.00
          4 |         10       10.00       45.00
          5 |         10       10.00       55.00
          6 |         10       10.00       65.00
          7 |         10       10.00       75.00
          8 |         10       10.00       85.00
          9 |         15       15.00      100.00
------------+-----------------------------------
      Total |        100      100.00

. gen negy = -y

. xtile bin2=negy , nq(10)

. tab bin2

         10 |
  quantiles |
   of negy  |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         15       15.00       15.00
          2 |         10       10.00       25.00
          3 |         10       10.00       35.00
          4 |         10       10.00       45.00
          5 |         10       10.00       55.00
          6 |         10       10.00       65.00
          7 |         10       10.00       75.00
          8 |         10       10.00       85.00
          9 |         10       10.00       95.00
         10 |          5        5.00      100.00
------------+-----------------------------------
      Total |        100      100.00
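
Applied groupwise, as in the original question, the negation approach might look like the minimal sketch below. It assumes egenmore is installed from SSC (ssc install egenmore). The toy data and the variable names y, negy, decile_neg and decile_top are hypothetical, chosen only to mimic the example above on two dates. Binning on the negated variable puts the tied maxima in the first bin, and subtracting that bin number from 11 restores the usual orientation, so that the ties should end up in bin 10.

```stata
* Toy data (hypothetical): two dates, 100 observations each,
* with 15% of the values tied at the maximum on every date
clear
set obs 200
gen time = cond(_n <= 100, 1, 2)
bysort time: gen y = min(ceil((_n + 5) / 10), 10)

* Bin on the negated variable, separately for each date
* (xtile() here is the egen function from egenmore)
gen negy = -y
egen decile_neg = xtile(negy), by(time) nq(10)

* Flip the labels so that bin 10 again holds the largest values
gen decile_top = 11 - decile_neg

tab decile_top time
```

With real data, y would be replaced by the variable being sorted on. Tied values are still kept together, so the bins need not each hold exactly 10% of the observations; the unevenness simply shows up at the bottom rather than at the top.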
