Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • larger values and averages

    Dear All, I have the data
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(id profit)
    1 100
    2 130
    3 230
    4 120
    5 160
    end
    and for each id, I'd like to calculate the average value of those values of other `id's larger than the value of `id'. For instance, for id=1, the values of all other `id's are larger than 100 (id=1). As such, I want to calculate y=(130+230+120+160)/4. For id=2, the values of id=3 (230) and id=5 (160) are larger than 130 (id=2). As such, I want to calculate y=(230+160)/2. And so on. Anys suggestions?
    Ho-Chuan (River) Huang
    Stata 19.0, MP(4)

  • #2
    Perhaps this does what you want. Note that my sample data was expanded to include a second observation with the value 120, to test how the code handles multiple observations with the same value.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(id profit)
    1 100
    2 130
    3 230
    4 120
    5 160
    6 120
    end
    sort profit
    generate sf = sum(profit)
    generate sr = sf[_N]-sf
    generate nr = _N-_n
    generate av = 0
    by profit: replace av = sr[_N]/nr[_N]
    list, clean noobs
    Code:
    . list, clean noobs
    
        id   profit    sf    sr   nr         av  
         1      100   100   760    5        152  
         4      120   220   640    4   173.3333  
         6      120   340   520    3   173.3333  
         2      130   470   390    2        195  
         5      160   630   230    1        230  
         3      230   860     0    0          .

    Comment


    • #3
      Dear William, Many thanks for your helpful suggestion.

      Ho-Chuan (River) Huang
      Stata 19.0, MP(4)

      Comment


      • #4



        Here's another way to do it with rangestat (SSC) from Robert Picard and friends. Here 0.1 is just an arbitrary offset sufficient to identify what's higher. In larger problems its choice might be a little more challenging. I used the data example of William Lisowski.

        Code:
        clear
        input float(id profit)
        1 100
        2 130
        3 230
        4 120
        5 160
        6 120
        end
        
        rangestat (mean) profit, interval(profit 0.1 .) 
        
        sort profit 
        
        list , sep(0) 
        
             +-------------------------+
             | id   profit   profit_~n |
             |-------------------------|
          1. |  1      100         152 |
          2. |  6      120   173.33333 |
          3. |  4      120   173.33333 |
          4. |  2      130         195 |
          5. |  5      160         230 |
          6. |  3      230           . |
             +-------------------------+

        Comment


        • #5
          Dear Nick, Thank you so much. Indeed, I was wondering whether -rangestat- can do that.

          Ho-Chuan (River) Huang
          Stata 19.0, MP(4)

          Comment


          • #6
            I will just note that the -rangestat- solution in #4 depends on knowledge that the values of profit are integers, so that 0.1 is guaranteed to be (substantially) less than the smallest difference between values. If prior knowledge about the differences between distinct values is not available, then it gets a bit more complicated:

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input float(id profit)
            1 100
            2 130
            3 230
            4 120
            5 160
            6 120
            end
            
            sort profit
            gen delta = profit - profit[_n-1]
            summ delta if delta > 0
            local epsilon = r(min)/2
            
            rangestat (mean) profit, interval(profit, `epsilon', .)
            It's then a toss-up as to whether this solution is really any simpler than #2.

            (A situation very similar to this arises often in my own work, and I have been tripped up repeatedly by believing I know the minimum difference a priori, only to find out that what I "knew" is incorrect. In this, as in everything else, data are seldom as regular as you think they are before you actually work with them. Now I routinely calculate an epsilon even when I'm pretty sure I know the data.)

            Comment


            • #7
              The rangestat solution depends on the values of profit being roughly commensurate. The example below outlines a potential failure mode.
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input double (id profit)
              1 0.000000001
              2 0.000000002
              3 10000000000
              4 20000000000
              end
              sort profit
              generate double delta = profit - profit[_n-1]
              summarize delta if delta > 0
              generate double epsilon = r(min)/2
              generate double plus = profit+epsilon
              generate double d2 = plus>profit
              format %21.9f profit plus
              list, clean noobs
              Code:
              . list, clean noobs
              
                  id                  profit       delta     epsilon                    plus   d2  
                   1             0.000000001           .   5.000e-10             0.000000002    1  
                   2             0.000000002   1.000e-09   5.000e-10             0.000000003    1  
                   3   10000000000.000000000   1.000e+10   5.000e-10   10000000000.000000000    0  
                   4   20000000000.000000000   1.000e+10   5.000e-10   20000000000.000000000    0
              And a comparison, although perhaps I have misused rangestat - I don't understand its result.
              Code:
              sort profit
              generate double sf = sum(profit)
              generate double sr = sf[_N]-sf
              generate double nr = _N-_n
              generate double av = 0
              by profit: replace av = sr[_N]/nr[_N]
              rangestat (mean) profit, interval(profit,epsilon, .)
              format %21.9f id av profit_mean
              list id profit av profit_mean, clean noobs
              Code:
              . list id profit av profit_mean, clean noobs
              
                           id                  profit                      av            profit_mean  
                  1.000000000             0.000000001   10000000000.000000000   7500000000.000000000  
                  2.000000000             0.000000002   15000000000.000000000   7500000000.000000000  
                  3.000000000   10000000000.000000000   20000000000.000000000   7500000000.000000000  
                  4.000000000   20000000000.000000000                       .   7500000000.000000000
              Last edited by William Lisowski; 02 Feb 2018, 10:14.

              Comment


              • #8
                Dear Clyde, Many thanks for your additional suggestion.
                Ho-Chuan (River) Huang
                Stata 19.0, MP(4)

                Comment


                • #9
                  Dear William, Many thanks for your additional suggestion.

                  Ho-Chuan (River) Huang
                  Stata 19.0, MP(4)

                  Comment


                  • #10
                    Re #7. Yes, -rangestat- has been misapplied. In #6, epsilon was a local macro, that was dereferenced in the -interval()- option, That is, -interval() saw a number in its second argument. When the second or third argument is a number, the corresponding limit of the inclusion interval is calculated by adding the number to the value of the variable. But in #7, epsilon is a variable. When -interval()- has a variable in the second or third argument, the value of the variable itself is taken as the corresponding limit of the inclusion interval. So the code in #7 seeks to include all observations where profit lies between 5e-10 and ., which, in this data set is every observation. The value 7500000000.000000000 is, to within the rounding error of double precision arithmetic, the correct average of profit in all four observations.

                    Comment


                    • #11
                      Thank you, Clyde, for the clarification on rangestat - it's not part of my day-to-day toolkit.

                      The failure to get the comparison right didn't affect the thrust of my argument, which is that concerns with precision may preclude, given a Stata variable x sorted in ascending order, the existence of a single value of epsilon for which, for any i for which Stata evaluates as true the expression x[i+1]>x[i] Stata then also evaluates as true the expressions x[i]+epsilon<=x[i+1] and x[i]+epsilon>x[i]. I think I've said that correctly.

                      The following corrects the comparison in post #7 by using a macro expression that evaluates to a constant epsilon as the second argumet, for the benefit of future readers of this topic.
                      Code:
                      sort profit
                      generate double sf = sum(profit)
                      generate double sr = sf[_N]-sf
                      generate double nr = _N-_n
                      generate double av = 0
                      by profit: replace av = sr[_N]/nr[_N]
                      rangestat (mean) profit, interval(profit,`=epsilon[1]', .)
                      format %21.9f id av profit_mean
                      list id profit av profit_mean, clean noobs
                      Code:
                      . list id profit av profit_mean, clean noobs
                      
                                   id                  profit                      av             profit_mean  
                          1.000000000             0.000000001   10000000000.000000000   10000000000.000000000  
                          2.000000000             0.000000002   15000000000.000000000   15000000000.000000000  
                          3.000000000   10000000000.000000000   20000000000.000000000   15000000000.000000000  
                          4.000000000   20000000000.000000000                       .   20000000000.000000000

                      Comment


                      • #12
                        You are absolutely right about this. Basically I think that the attempt to calculate profit + epsilon is causing an underflow, and producing the original value of profit as the answer, so that for id = 3, -rangestat- is including id's 3 and 4 instead of just id 4. This is a good point. While there will not be many real world data sets where this kind of thing happens, they will pop up just when you least expect them and cause problems.

                        Your approach in #2 is more robust because it relies only on sorting the existing values and on running sums calculated in sorted order, so that underflow is far less likely to happen (though if I set my mind to it, I'm sure I could create an even more pathological data set that would break this code as well). It's a good reminder that finite-precision addition is not associative and that sometimes we have to be careful about how we compute sums and averages. Thanks!

                        As an aside: -rangestat- and, even more, -rangejoin-, have become such a central part of my work that they are a bit like the internet. I know they are recent developments in my lifetime but I can scarcely remember how life without them was even possible.
                        Last edited by Clyde Schechter; 03 Feb 2018, 10:11.

                        Comment


                        • #13
                          William makes a good point on the potential pitfall of using an epsilon for this. For those who may have trouble following the argument, if the calculated epsilon is 0.000000001, then when added to a large number like 20,000,000,000, the result is still 20,000,000,000 because there are not enough bits in a double to store the exact sum. It's also easy to check using:
                          Code:
                          . assert 0.000000001 + 20000000000 == 20000000000
                          
                          .
                          I still think that rangestat is a good tool for this, you just need to find a better interval bound. If the problem is to find the mean of all profits that are higher than the current observation, then the natural thing to do is to find the next value for profit that is different (similar to what William does with his running sums) as the lower bound to use in the interval. Note that if the next value is missing, rangestat will treat the lower interval bound such that all values will match. When the next value for profit is missing, all subsequent values are also missing so we simply replace the calculated mean with a missing value when the lower bound used was missing. Here's a quick example and how it contrasts with William's approach. Note that since I introduce missing values, William's code needs to be modified to group non-missing values together.

                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input double (id profit)
                          1 0.000000001
                          2 0.000000002
                          3 0.000000002
                          4 10000000000
                          5 20000000000
                          6 .
                          7 .
                          end
                          
                          sort profit
                          gen double next = profit[_n+1]
                          bysort profit: replace next = next[_N]
                          rangestat (mean) profit, interval(profit next .)
                          replace profit_mean = . if mi(next)
                          
                          gen touse = !mi(profit)
                          sort touse profit
                          by touse: generate double sf = sum(profit)
                          by touse: generate double sr = sf[_N]-sf
                          by touse: generate double nr = _N-_n
                          by touse: generate double av = 0 if touse
                          by touse profit: replace av = sr[_N]/nr[_N] if touse
                          
                          sort profit
                          format %21.9f id av profit_mean
                          list id profit av next profit_mean, clean noobs
                          and the results
                          Code:
                          . list id profit av next profit_mean, clean noobs
                          
                                       id      profit                      av        next             profit_mean  
                              1.000000000   1.000e-09    7500000000.000000000   2.000e-09    7500000000.000000000  
                              2.000000000   2.000e-09   15000000000.000000000   1.000e+10   15000000000.000000000  
                              3.000000000   2.000e-09   15000000000.000000000   1.000e+10   15000000000.000000000  
                              4.000000000   1.000e+10   20000000000.000000000   2.000e+10   20000000000.000000000  
                              5.000000000   2.000e+10                       .           .                       .  
                              7.000000000           .                       .           .                       .  
                              6.000000000           .                       .           .                       .  
                          
                          .

                          Comment


                          • #14
                            Thinking outside the box a bit on this one but you can also simply group profits and then target higher profits using the group identifier. Since these identifiers are generated based on the sorted values of profit, this amounts to defining an interval that starts in the next group:

                            Code:
                            * Example generated by -dataex-. To install: ssc install dataex
                            clear
                            input float(id profit)
                            1 100
                            2 130
                            3 230
                            4 120
                            5 160
                            6 120
                            7 .
                            end
                            
                            egen gprofit = group(profit)
                            rangestat (mean) profit , interval(gprofit 1 .) 
                            
                            sort profit
                            list
                            and the results
                            Code:
                            . list
                            
                                 +-----------------------------------+
                                 | id   profit   gprofit   profit_~n |
                                 |-----------------------------------|
                              1. |  1      100         1         152 |
                              2. |  6      120         2   173.33333 |
                              3. |  4      120         2   173.33333 |
                              4. |  2      130         3         195 |
                              5. |  5      160         4         230 |
                                 |-----------------------------------|
                              6. |  3      230         5           . |
                              7. |  7        .         .           . |
                                 +-----------------------------------+
                            
                            .

                            Comment


                            • #15
                              Dear Robert, Many thanks for this wonderful solution.

                              Ho-Chuan (River) Huang
                              Stata 19.0, MP(4)

                              Comment

                              Working...
                              X