Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Create a percetile of a variable and compute the mean of another variable inside each percentile

    Hello,

    I have the following dataset:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long gid float animals double Age
    62356 0                      0
    62357 0                      0
    79599 0      .1778533116282972
    79600 0     .17785357331615614
    79601 0      .1778536186029953
    80317 0     .15909446787289028
    80318 1     .17610458395941464
    80319 0      .1778535061320099
    80320 0      .1778534511499288
    80321 0     .17785336784015726
    80322 0        .17785364358528
    80323 0     .17785352006697508
    80324 0     .17785340046719966
    80325 0      .1778538689661957
    80326 0     .17785426020441264
    80327 0     .17785365978238588
    80328 0     .17785336676075758
    80329 0     .17785305873016455
    80330 0     .17785384947953276
    80331 0                      0
    80332 0                      0
    81037 3     .14923540560909032
    81038 2      .1654487674794609
    81039 0     .17785390171007975
    81040 0     .17785403231246733
    81041 0     .17785384652552239
    81042 0     .17785314257483265
    81043 0     .17785357228453336
    81044 0     .17785338302178388
    81045 0     .17785369357522995
    81046 0     .17785348120435174
    81047 0     .17785344378697232
    81048 0      .1778532409188358
    81049 0     .17785387566598843
    81050 0     .18247074342281167
    81051 0    .011753128763974531
    81052 1                      0
    81053 0                      0
    81054 0                      0
    81055 0                      0
    81756 0      .1778543271591957
    81757 0     .17785356209429892
    81758 0     .17785367363626392
    81759 0     .17785313465776653
    81760 0      .1778536476339525
    81761 0     .17785362234644708
    81762 0      .1778533081250373
    81763 0     .17785398037716504
    81764 0     .17785360667215855
    81765 0     .17785288419261747
    81766 0     .17785301989732716
    81767 0      .1778528778077373
    81768 0      .1778540126846906
    81769 0     .17785418244653783
    81770 0      .2192007436492247
    81771 0    .020631879957504234
    81772 0                      0
    81773 0                      0
    81774 0                      0
    81775 0 5.9604641222676946e-08
    81776 0 5.9604641222676946e-08
    82476 0     .17785365924672192
    82477 0      .1778531219408137
    82478 0     .17785362497488677
    82479 0     .17785388698396787
    82480 0      .1778532557042638
    82481 0     .17785362581781783
    82482 0     .17785249066568554
    82483 0     .17785345179188994
    82484 0     .17785268525491915
    82485 0      .1778538015765605
    82486 0      .1778537318877227
    82487 0     .17785374539445797
    82488 0     .17785365189578783
    82489 0     .17785358452277933
    82490 0      .1778539981315248
    82491 0     .24774813358560266
    82492 0                      0
    82493 0                      0
    82494 0                      0
    82495 0                      0
    82496 0                      0
    82497 0                      0
    82498 0                      0
    83197 0     .17785403006737965
    83198 0      .1778535550090794
    83199 0     .17785366138360814
    83200 0     .17785345138392472
    83201 0      .1778537523648822
    83202 0     .17785178819127623
    83203 0     .17785300154017136
    83204 0     .17785340918089787
    83205 0     .17785373389469328
    83206 0     .17785342168173202
    83207 0     .17785335422018989
    83208 0      .1778531530811449
    83209 0     .17785333043048013
    83210 0     .17936670467301497
    83211 0     .06361043014024137
    83212 0                      0
    end

    I would like to classify my variable "animals" in percentiles. I mean, I want to do 10 intervals, being the first one the lower (the first 10% of observations according to the value of variable animals) and the last one the biggest (the 10% biggest values of variable animals). After that, I would like to compute the mean of Age within each percentile. For example, I want to know the mean of Age of the first 10% of observations (it should be percentile one).

    I am trying to do the first step (classify the variable animals in perceltiles) with the following code:

    Code:
    pctile animals_pct = animals, nq(10) genp(percent)
    However, it does not work, it just give me the value of the first 10 observations. Anyone can help me with my problem? Thank you!
    Last edited by Diego Malo; 23 Mar 2022, 05:11.

  • #2
    pctile "does not work" in the sense that it was designed to calculate and display particular percentiles -- which is not what you want.

    You want decile intervals or bins, for which

    Code:
    xtile wanted = animals, nq(10)
    is a first step. I note in advance that you may be frustrated as

    1. xtile respects ties, and observations with the same value will be assigned to the same bin. With a very spiky distribution, decile bin frequencies may be far from one tenth of the total frequency.

    2. xtile can't identify more bins than there are distinct values in the variable being binned.

    animals at first sight seems a very unsuitable variable for this purpose.

    More general comments at

    https://www.stata-journal.com/articl...article=dm0095 esp. Section 6

    https://www.stata-journal.com/articl...article=pr0054 esp. Section 4

    The largest collection of quantile terms I know is at https://stats.stackexchange.com/ques.../235334#235334




    Comment


    • #3
      Thank you Nick Cox for your comment and for the references. It is helpful.

      I showed some observations. My database is much bigger and the most of the observations are 0. When I wrote the command you suggested in the complete database it's just produce two numbers, 1 and 10. Am I doing something wrong?

      Comment


      • #4
        You're doing nothing wrong in this sense. You can't and need not try anything different by way of binning. But as I predicted earlier, the goal of decile bins is evidently doomed to failure with your data.

        All the zero values must be assigned to the same bin. What happens with the rest of the observations depends on their distribution. But if for example they collectively make up less than 10% of the data, then they too are assigned to one bin.

        Here's a demonstration. You need to run the code to see the graph, but the last table shows the same information.

        Code:
        . clear
        
        . set obs 1000
        Number of observations (_N) was 0, now 1,000.
        
        . gen problem = 0 in 1/950
        (50 missing values generated)
        
        . replace problem = 10 * (_n  - 950) in 951/L
        (50 real changes made)
        
        . quantile problem
        
        . xtile disaster=problem, nq(10)
        
        . tab disaster
        
                 10 |
          quantiles |
         of problem |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  1 |        950       95.00       95.00
                 10 |         50        5.00      100.00
        ------------+-----------------------------------
              Total |      1,000      100.00
        
        . tab problem disaster
        
                   |    10 quantiles of
                   |        problem
           problem |         1         10 |     Total
        -----------+----------------------+----------
                 0 |       950          0 |       950 
                10 |         0          1 |         1 
                20 |         0          1 |         1 
                30 |         0          1 |         1 
                40 |         0          1 |         1 
                50 |         0          1 |         1 
                60 |         0          1 |         1 
                70 |         0          1 |         1 
                80 |         0          1 |         1 
                90 |         0          1 |         1 
               100 |         0          1 |         1 
               110 |         0          1 |         1 
               120 |         0          1 |         1 
               130 |         0          1 |         1 
               140 |         0          1 |         1 
               150 |         0          1 |         1 
               160 |         0          1 |         1 
               170 |         0          1 |         1 
               180 |         0          1 |         1 
               190 |         0          1 |         1 
               200 |         0          1 |         1 
               210 |         0          1 |         1 
               220 |         0          1 |         1 
               230 |         0          1 |         1 
               240 |         0          1 |         1 
               250 |         0          1 |         1 
               260 |         0          1 |         1 
               270 |         0          1 |         1 
               280 |         0          1 |         1 
               290 |         0          1 |         1 
               300 |         0          1 |         1 
               310 |         0          1 |         1 
               320 |         0          1 |         1 
               330 |         0          1 |         1 
               340 |         0          1 |         1 
               350 |         0          1 |         1 
               360 |         0          1 |         1 
               370 |         0          1 |         1 
               380 |         0          1 |         1 
               390 |         0          1 |         1 
               400 |         0          1 |         1 
               410 |         0          1 |         1 
               420 |         0          1 |         1 
               430 |         0          1 |         1 
               440 |         0          1 |         1 
               450 |         0          1 |         1 
               460 |         0          1 |         1 
               470 |         0          1 |         1 
               480 |         0          1 |         1 
               490 |         0          1 |         1 
               500 |         0          1 |         1 
        -----------+----------------------+----------
             Total |       950         50 |     1,000 
        
        .
        Just occasionally you can improve on the results of xtile by reversing the order of values. There is an example in pr0054, Usually, that's just tinkering.

        But why do you want these bins anyway? From the sound of it, animals is a predictor variable in some larger project. If so, that's fine. Use it in some way that preserves the information. If it is too skewed, and/or contains outliers, consider using some transformation such as square root, cube root, or log(count + 1).

        It's sufficient that binning works very badly here for it to be abandoned. Beyond that, in principle it degrades the data by discarding information, and is dubious on that ground any way.

        These points have been made again and again in several literatures. There are some references in the papers cited, and they in turn lead to others.

        Comment


        • #5
          Really thank you Nick Cox .

          I see what you means. The objective of doing the bins is to calculate the mean of a different variable in each percentile. I would like to know if the mean of variable "Age" varies depending on the percentile considered.

          My variable "animals" take values from 0 to 56 with a big concentration in 0 and in the first values, 1, 2, etc... Is it not possible to do so much, right? I have tried adding 1 and takings logs to reduce the skewed, but xtile still gives me just two values.

          Comment


          • #6
            No, binning is doomed, as I said. Period. I didn't mean to be understood as suggesting that transformation could improve matters, because it can't. You end up with precisely the same set of frequencies. Rather, you might get clearer graphs or results with some transformation, but not better binning.

            You can go

            Code:
            egen mean_age = mean(age), by(animals) 
            egen count_age = count(age), by(animals) 
            and work from there

            Comment


            • #7
              I understand. Thank you Nick Cox for your help and time. I really appreciate.

              I am thinking that maybe I will create by myself artificial bindings. If "animals" takes value 0, assigns 1; if it takes values 1, 2,3, 4, assigns 2 and so on. And after calculate the mean of "Age" by the artificial variable I made it. It's a bit discrecional but I can do several changes.

              Comment


              • #8
                See p.267 of https://www.stata-journal.com/articl...article=gr0072 for something more systematic.

                Comment

                Working...
                X