  • Quintiles by value

    Dear Statalist,

    I want to split a variable into quintiles based not on the number of observations, which is what the xtile command does, but on the values of the observations.

    The code I used is this:

    Code:
    xtile port_LB = LB_a_1, nq(5)
    
    table port_LB, contents(n LB_a_1 min LB_a_1 max LB_a_1 mean LB_a_1)

    [Image: stata.png, output of the table command above]

    These quintiles are formed by the number of observations, as we can see in the first column.

    Is there any way to do this by the values of the observations? Does anyone have any idea how to do this?

    Thank you!
    Last edited by Alysson Francisco; 21 Jun 2019, 11:22.

  • #2
    "These quintiles are formed by the number of observations, as we can see in the first column.

    Is there any way to do this by the values of the observations? Does anyone have any idea how to do this?"
    I don't understand what you mean by this. Can you explain, or, better, provide some example data and then show what you want the result to look like.



    • #3
      Hi, Clyde.

      This is part of my database. I renamed LB_a_1 to gross profit and port_LB to quintiles.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float gross_profit byte quintiles
        .2029267 3
       .27466056 4
        .0776501 1
        .1576813 2
       .23777103 3
        .3014265 4
       .08390924 1
       .16547313 2
        .2385117 3
        .3221767 4
        .0828101 1
       .16660033 2
        .1750131 2
       .23489387 3
       .06181969 1
       .12803297 1
       .19096455 3
       .25748774 3
      .070972264 1
       .14391461 2
        .2282676 3
        .3017316 4
        .3786891 4
       .08248884 1
        .1910788 3
       .16843036 2
       .44742125 5
       .09678757 1
       .19497286 3
        .3208095 4
       .44742125 5
       .12232751 1
        .2221124 3
        .3037178 4
         .389981 5
      .065729104 1
        .1168596 1
       .16843036 2
       .28889105 4
       .09105394 1
        .1777167 2
       .22721837 3
       .29732373 4
        .0836201 1
       .15665257 2
       .21167165 3
        .2921291 4
       .07056332 1
       .15794978 2
       .23639095 3
        .2989278 4
       .05041697 1
       .09325901 1
        .1244713 1
        .1809014 3
       .04665191 1
       .13138473 1
        .2094828 3
       .27780348 4
       .08002219 1
        .1402248 2
         .210746 3
       .28453812 4
       .06527575 1
       .12133432 1
       .18889205 3
        .1666926 2
       .24155883 3
       .07321396 1
       .15547764 2
         .227936 3
        .3193174 4
       .07096003 1
       .14997013 2
        .2033346 3
       .27716222 4
       .06871334 1
       .14448956 2
       .19901946 3
       .18628158 3
       .07198721 1
       .14604652 2
        .2338707 3
        .3221313 4
       .07961577 1
       .16129784 2
        .2428835 3
        .3909354 5
       .56055456 5
       .17199475 2
        .3452226 4
       .50433874 5
        .6777239 5
       .18824925 3
         .364068 4
         .459493 5
        .5533788 5
       .14893661 2
       .28366256 4
        .4243927 5
      end
      If I run the xtile command, these are the results:

      [Image: stata1.png, output of the table command after xtile]

      Here N(gross_profit) gives the number of observations in the dataset for each quintile. So the quintiles are based on the number of observations; in other words, the xtile command takes the total number of observations and divides it by 5 to generate the quintiles, and I do not want this.

      I would like to generate quintiles based on the value of gross_profit.

      Ex: If the lowest value of gross_profit is 0.04 and the highest is 1.85, I would like to form quintiles between them, based on the values of the gross_profit variable. So, if the first quintile ends at gross_profit = 0.2, for example, the observations between 0.04 and 0.2 form the first quintile, no matter how many observations fall there.

      The point is to generate the quintiles without specifying the values, because there are a lot of variables to do this for.

      Thank you for any help!
      Last edited by Alysson Francisco; 21 Jun 2019, 15:01.
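
      One possible reading of the request in #3 is five equal-width bins spanning the observed range of the variable. That reading is an assumption, not something the post states, and the names value_bin, gp_min and gp_width below are hypothetical. A minimal sketch under that assumption, using the gross_profit variable from the dataex example:

      Code:
      * Assumption: "quintiles by value" means 5 equal-width bins over [min, max]
      summarize gross_profit, meanonly
      scalar gp_min   = r(min)
      scalar gp_width = (r(max) - r(min)) / 5

      * Bin 1 starts at the minimum; min(5, ...) keeps the maximum in bin 5
      generate byte value_bin = min(5, 1 + floor((gross_profit - scalar(gp_min)) / scalar(gp_width))) if !missing(gross_profit)

      * Counts will generally differ across bins; the bin widths are what is equal
      tabstat gross_profit, by(value_bin) statistics(n min max mean)

      For many variables, the same lines could be wrapped in a foreach loop over a varlist, generating one bin variable per input variable.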



      • #4
        "So, if the first quintile ends at gross_profit = 0.2, for example, the observations between 0.04 and 0.2 form the first quintile, no matter how many observations fall there."
        OK, but where did that number 0.2 come from? How did you derive that number from the data? And what would be the upper ends of the other 4 groups? How are they calculated from the data?

        By the way, whatever the answers to the above questions, you should not refer to these groups as quintiles. The word quintile specifically means five groups containing equal numbers of members (or as equal as possible given ties) based on the ordering.



        • #5
          I share most of Clyde's puzzlement, not least over terminology: whatever you want, they can't be quintile-based bins or classes.

          However, I find it a little easier to guess what you may be seeking, as similar-seeming questions arise here once every few years. These threads and references may help.

          https://www.stata.com/statalist/arch.../msg00883.html

          The thread starting with https://www.stata.com/statalist/arch.../msg00480.html

          https://www.statalist.org/forums/for...g-jenks-splits

          See also cluster kmeans and the community-contributed command

          Code:
          ssc install group1d 
          help group1d
          Despite some small creative pride in the program group1d (creation being here translation rather than origination), which goes back to Hartigan's wonderful book Clustering Algorithms in 1975, I am sceptical that the so-called natural breaks people seek are usually anything other than arbitrary, unrepeatable small gaps in a continuum.

          I have it in mind to extend the program to include L_1 as well as L_2 as a criterion.

          In the case of the kind of data in #3 I would be more likely to work with log profit than with profit.
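
          A minimal sketch of the cluster kmeans route mentioned above, applied to log profit as suggested. The choice of 5 groups, the seed 12345, and the names log_profit and km_group are assumptions made for illustration; nothing in the thread fixes them.

          Code:
          * Logs are defined only for positive values; other observations are left missing
          generate double log_profit = ln(gross_profit) if gross_profit > 0

          * k-means on one variable; a fixed seed makes the random start repeatable
          cluster kmeans log_profit, k(5) start(krandom(12345)) generate(km_group)

          * Group numbers are arbitrary labels, so inspect each group's range
          tabstat gross_profit, by(km_group) statistics(n min max mean)

          Because the group numbers are not ordered by value, they would need recoding (for example by ranking the group means) before being read as low-to-high classes.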



          • #6
            Thank you, Clyde and Nick, for your answers. They really helped me solve my problem.

            Thank you again!
