  • egen xtile does not assign observations to 10th decile

    Dear Statalist,

    I am having trouble with xtile(), the user-written egen function.
    My data set consists of data of numerous stocks for the period 1963 - 2016. Each stock is identified by its unique permno.
    For each stock on each date (described as "time") I calculate a measure on which the xtile sorting depends. This measure is called the "breakhigh52w2". The minimum value of this measure is 0.007 and the maximum is 1.
    The distribution of this measure looks as follows:

    Distribution.gph

    Next, I insert the following command:

    egen breakhigh52w_a = xtile(breakhigh52w), by(time) nq(10)

    A histogram gives a visual representation of this variable.

    Histogram.gph

    Clearly, there are far fewer observations in the 10th decile. This confuses me, since for each date at least one observation should be assigned to the 10th decile, right?
    I assume that the large number of observations with a "breakhigh52w2" value of 1 and the boundary points of the deciles are responsible for this.

    Does anyone have an idea how to overcome this problem?

    Thanks in advance,

    Huib

  • #2
    Hello Huib,

    Welcome to the Statalist.

    You may wish to take a look at the FAQ, particularly the topic on how to share graphics in the Stata Forum.


    That said, you may wish to present the output of the 'summarize var' command.

    I suspect there is a concentration of values well below the last decile.
    Last edited by Marcos Almeida; 20 Mar 2017, 07:06.
    Best regards,

    Marcos


    • #3
      Dear Marcos,

      Thanks for your reply. I will edit my question and use the correct extension to share graphics. In addition, I added the summary output of the variable.
      I am sorry about the previous post!

      ---------

      Dear Statalist,

      I am having trouble with xtile(), the user-written egen function.
      My data set consists of data of numerous stocks for the period 1963 - 2016. Each stock is identified by its unique permno.
      For each stock on each date (described as "time") I calculate a measure on which the xtile sorting depends. This measure is called the "breakhigh52w2". The minimum value of this measure is 0.007 and the maximum is 1.
      The distribution of this measure looks as follows:

      [Image: Distribution.png]


      The command summarize, detail gives the following information about the variable:

      [Image: Summary var.PNG]


      Next, I insert the following command:

      egen breakhigh52w_a = xtile(breakhigh52w), by(time) nq(10)

      A histogram gives a visual representation of this variable.

      [Image: Graph.png]


      Clearly, there are far fewer observations in the 10th decile. This confuses me, since for each date at least one observation should be assigned to the 10th decile, right?
      I assume that the large number of observations with a "breakhigh52w2" value of 1 and the boundary points of the deciles are responsible for this.

      Does anyone have an idea how to overcome this problem?

      Thanks in advance,

      Huib


      • #4
        The use of egen, xtile() (from egenmore on SSC, as you are asked to explain) is incidental here; it is just a convenience because you are binning groupwise.

        This looks like a generic problem and a consequence of binning your data when ties are present.

        To make it concrete, suppose that 15% of values are equal to the maximum and you want decile-based bins. What do you expect to happen then?

        No Stata program or function known to me will split those tied values, and assign 10% to one (top) decile bin and 5% to another, the next decile bin.

        The rule that overrides all others is that the same values must end up in the same bin.
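
        As a rough check on whether ties at the maximum are the issue, one could compute, for each date, the share of observations equal to that date's maximum. The following is only a sketch on made-up toy data: the variable names x, xmax and share_at_max are illustrative assumptions, and with the data in this thread x would be replaced by breakhigh52w.

        ```stata
        * Toy data (an assumption for illustration): 100 observations on one
        * "date", 15 of them tied at the maximum value of 1
        clear
        set obs 100
        gen time = 1
        gen double x = cond(_n > 85, 1, _n/100)

        * Per-date maximum, then the share of observations tied at that maximum
        egen double xmax = max(x), by(time)
        egen double share_at_max = mean(x == xmax), by(time)

        * One row per date: the mean equals that date's share of tied maxima
        tabulate time, summarize(share_at_max)
        ```

        A share above roughly 0.10 on a given date suggests that the tied values cover more than one decile, so at least one bin label is likely to go unused on that date.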

        At most your choice here (apart from not binning, usually a very good idea) is to bin on the negation of your variable. Discussion within http://www.stata-journal.com/sjpdf.h...iclenum=pr0054

        Here's an example. It's a discrete example, but the principle is the same. Stata's convention implies that the top bin is empty.

        Code:
        . clear
        
        . matrix freq = [5,10,10,10,10,10,10,10,10,15]
        
        . set obs 10
        number of observations (_N) was 0, now 10
        
        . gen y = _n
        
        . gen freq = freq[1, _n]
        
        . expand freq
        (90 observations created)
        
        . tab y
        
                  y |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  1 |          5        5.00        5.00
                  2 |         10       10.00       15.00
                  3 |         10       10.00       25.00
                  4 |         10       10.00       35.00
                  5 |         10       10.00       45.00
                  6 |         10       10.00       55.00
                  7 |         10       10.00       65.00
                  8 |         10       10.00       75.00
                  9 |         10       10.00       85.00
                 10 |         15       15.00      100.00
        ------------+-----------------------------------
              Total |        100      100.00
        
        . xtile bin1=y , nq(10)
        
        . tab bin1
        
                 10 |
          quantiles |
              of y  |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  1 |         15       15.00       15.00
                  2 |         10       10.00       25.00
                  3 |         10       10.00       35.00
                  4 |         10       10.00       45.00
                  5 |         10       10.00       55.00
                  6 |         10       10.00       65.00
                  7 |         10       10.00       75.00
                  8 |         10       10.00       85.00
                  9 |         15       15.00      100.00
        ------------+-----------------------------------
              Total |        100      100.00
        
        . gen negy = -y
        
        . xtile bin2=negy , nq(10)
        
        . tab bin2
        
                 10 |
          quantiles |
           of negy  |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  1 |         15       15.00       15.00
                  2 |         10       10.00       25.00
                  3 |         10       10.00       35.00
                  4 |         10       10.00       45.00
                  5 |         10       10.00       55.00
                  6 |         10       10.00       65.00
                  7 |         10       10.00       75.00
                  8 |         10       10.00       85.00
                  9 |         10       10.00       95.00
                 10 |          5        5.00      100.00
        ------------+-----------------------------------
              Total |        100      100.00
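
        The same negation idea can be applied groupwise with egen's xtile() from egenmore, as in the original question. This is a minimal sketch under assumptions, not a tested recipe for the data in this thread: it reuses the toy frequencies above, invents a two-valued time variable to stand in for dates, and the relabelling step 11 - bin_neg (so that bin 10 again holds the largest values) is an added convenience that is not part of the example above.

        ```stata
        * Requires egenmore from SSC (ssc install egenmore) for egen's xtile()
        clear
        set obs 10
        gen y = _n
        gen freq = cond(_n == 1, 5, cond(_n == 10, 15, 10))
        expand freq

        * Duplicate the data to mimic two dates: time is 0 for originals, 1 for copies
        expand 2, generate(time)

        * Bin on the negation within each date, so ties at the maximum enter bin 1
        gen negy = -y
        egen bin_neg = xtile(negy), by(time) nq(10)

        * Flip the labels so that higher values of y get higher bin numbers
        gen bin_top = 11 - bin_neg
        tabulate bin_top time
        ```

        In this toy data the tied maximum values end up in bin 10 on both dates. The bins remain unequal in size, because tied values are still kept together.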
