larger values and averages

River Huang

Join Date: Mar 2016

Posts: 1908
#1

larger values and averages

01 Feb 2018, 19:08

Dear All, I have the data

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(id profit) 1 100 2 130 3 230 4 120 5 160 end

and for each id, I'd like to calculate the average value of those values of other `id's larger than the value of `id'. For instance, for id=1, the values of all other `id's are larger than 100 (id=1). As such, I want to calculate y=(130+230+120+160)/4. For id=2, the values of id=3 (230) and id=5 (160) are larger than 130 (id=2). As such, I want to calculate y=(230+160)/2. And so on. Anys suggestions?

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Tags: None

William Lisowski

Join Date: Dec 2014
Posts: 10150

01 Feb 2018, 19:29

Perhaps this does what you want. Note that my sample data was expanded to include a second observation with the value 120, to test how the code handles multiple observations with the same value.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id profit)
1 100
2 130
3 230
4 120
5 160
6 120
end
sort profit
generate sf = sum(profit)
generate sr = sf[_N]-sf
generate nr = _N-_n
generate av = 0
by profit: replace av = sr[_N]/nr[_N]
list, clean noobs

Code:

. list, clean noobs

    id   profit    sf    sr   nr         av  
     1      100   100   760    5        152  
     4      120   220   640    4   173.3333  
     6      120   340   520    3   173.3333  
     2      130   470   390    2        195  
     5      160   630   230    1        230  
     3      230   860     0    0          .

Comment

River Huang

Join Date: Mar 2016

Posts: 1908
#3

01 Feb 2018, 20:06

Dear William, Many thanks for your helpful suggestion.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35721

02 Feb 2018, 03:00

Here's another way to do it with rangestat (SSC) from Robert Picard and friends. Here 0.1 is just an arbitrary offset sufficient to identify what's higher. In larger problems its choice might be a little more challenging. I used the data example of William Lisowski.

Code:

clear
input float(id profit)
1 100
2 130
3 230
4 120
5 160
6 120
end

rangestat (mean) profit, interval(profit 0.1 .) 

sort profit 

list , sep(0) 

     +-------------------------+
     | id   profit   profit_~n |
     |-------------------------|
  1. |  1      100         152 |
  2. |  6      120   173.33333 |
  3. |  4      120   173.33333 |
  4. |  2      130         195 |
  5. |  5      160         230 |
  6. |  3      230           . |
     +-------------------------+

Comment

River Huang

Join Date: Mar 2016

Posts: 1908
#5

02 Feb 2018, 03:24

Dear Nick, Thank you so much. Indeed, I was wondering whether -rangestat- can do that.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#6

02 Feb 2018, 08:14

I will just note that the -rangestat- solution in #4 depends on knowledge that the values of profit are integers, so that 0.1 is guaranteed to be (substantially) less than the smallest difference between values. If prior knowledge about the differences between distinct values is not available, then it gets a bit more complicated:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(id profit) 1 100 2 130 3 230 4 120 5 160 6 120 end sort profit gen delta = profit - profit[_n-1] summ delta if delta > 0 local epsilon = r(min)/2 rangestat (mean) profit, interval(profit, `epsilon', .)

It's then a toss-up as to whether this solution is really any simpler than #2.

(A situation very similar to this arises often in my own work, and I have been tripped up repeatedly by believing I know the minimum difference a priori, only to find out that what I "knew" is incorrect. In this, as in everything else, data are seldom as regular as you think they are before you actually work with them. Now I routinely calculate an epsilon even when I'm pretty sure I know the data.)
2 likes
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

02 Feb 2018, 10:04

The rangestat solution depends on the values of profit being roughly commensurate. The example below outlines a potential failure mode.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double (id profit)
1 0.000000001
2 0.000000002
3 10000000000
4 20000000000
end
sort profit
generate double delta = profit - profit[_n-1]
summarize delta if delta > 0
generate double epsilon = r(min)/2
generate double plus = profit+epsilon
generate double d2 = plus>profit
format %21.9f profit plus
list, clean noobs

Code:

. list, clean noobs

    id                  profit       delta     epsilon                    plus   d2  
     1             0.000000001           .   5.000e-10             0.000000002    1  
     2             0.000000002   1.000e-09   5.000e-10             0.000000003    1  
     3   10000000000.000000000   1.000e+10   5.000e-10   10000000000.000000000    0  
     4   20000000000.000000000   1.000e+10   5.000e-10   20000000000.000000000    0

And a comparison, although perhaps I have misused rangestat - I don't understand its result.

Code:

sort profit
generate double sf = sum(profit)
generate double sr = sf[_N]-sf
generate double nr = _N-_n
generate double av = 0
by profit: replace av = sr[_N]/nr[_N]
rangestat (mean) profit, interval(profit,epsilon, .)
format %21.9f id av profit_mean
list id profit av profit_mean, clean noobs

Code:

. list id profit av profit_mean, clean noobs

             id                  profit                      av            profit_mean  
    1.000000000             0.000000001   10000000000.000000000   7500000000.000000000  
    2.000000000             0.000000002   15000000000.000000000   7500000000.000000000  
    3.000000000   10000000000.000000000   20000000000.000000000   7500000000.000000000  
    4.000000000   20000000000.000000000                       .   7500000000.000000000

Last edited by William Lisowski; 02 Feb 2018, 10:14.

Comment

River Huang

Join Date: Mar 2016

Posts: 1908
#8

02 Feb 2018, 16:43

Dear Clyde, Many thanks for your additional suggestion.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
River Huang

Join Date: Mar 2016

Posts: 1908
#9

02 Feb 2018, 16:43

Dear William, Many thanks for your additional suggestion.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#10

02 Feb 2018, 17:36

Re #7. Yes, -rangestat- has been misapplied. In #6, epsilon was a local macro, that was dereferenced in the -interval()- option, That is, -interval() saw a number in its second argument. When the second or third argument is a number, the corresponding limit of the inclusion interval is calculated by adding the number to the value of the variable. But in #7, epsilon is a variable. When -interval()- has a variable in the second or third argument, the value of the variable itself is taken as the corresponding limit of the inclusion interval. So the code in #7 seeks to include all observations where profit lies between 5e-10 and ., which, in this data set is every observation. The value 7500000000.000000000 is, to within the rounding error of double precision arithmetic, the correct average of profit in all four observations.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

03 Feb 2018, 04:42

Thank you, Clyde, for the clarification on rangestat - it's not part of my day-to-day toolkit.

The failure to get the comparison right didn't affect the thrust of my argument, which is that concerns with precision may preclude, given a Stata variable x sorted in ascending order, the existence of a single value of epsilon for which, for any i for which Stata evaluates as true the expression x[i+1]>x[i] Stata then also evaluates as true the expressions x[i]+epsilon<=x[i+1] and x[i]+epsilon>x[i]. I think I've said that correctly.

The following corrects the comparison in post #7 by using a macro expression that evaluates to a constant epsilon as the second argumet, for the benefit of future readers of this topic.

Code:

sort profit generate double sf = sum(profit) generate double sr = sf[_N]-sf generate double nr = _N-_n generate double av = 0 by profit: replace av = sr[_N]/nr[_N] rangestat (mean) profit, interval(profit,`=epsilon[1]', .) format %21.9f id av profit_mean list id profit av profit_mean, clean noobs

Code:

. list id profit av profit_mean, clean noobs id profit av profit_mean 1.000000000 0.000000001 10000000000.000000000 10000000000.000000000 2.000000000 0.000000002 15000000000.000000000 15000000000.000000000 3.000000000 10000000000.000000000 20000000000.000000000 15000000000.000000000 4.000000000 20000000000.000000000 . 20000000000.000000000
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#12

03 Feb 2018, 10:08

You are absolutely right about this. Basically I think that the attempt to calculate profit + epsilon is causing an underflow, and producing the original value of profit as the answer, so that for id = 3, -rangestat- is including id's 3 and 4 instead of just id 4. This is a good point. While there will not be many real world data sets where this kind of thing happens, they will pop up just when you least expect them and cause problems.

Your approach in #2 is more robust because it relies only on sorting the existing values and on running sums calculated in sorted order, so that underflow is far less likely to happen (though if I set my mind to it, I'm sure I could create an even more pathological data set that would break this code as well). It's a good reminder that finite-precision addition is not associative and that sometimes we have to be careful about how we compute sums and averages. Thanks!

As an aside: -rangestat- and, even more, -rangejoin-, have become such a central part of my work that they are a bit like the internet. I know they are recent developments in my lifetime but I can scarcely remember how life without them was even possible.

Last edited by Clyde Schechter; 03 Feb 2018, 10:11.
1 like
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#13

03 Feb 2018, 10:58

William makes a good point on the potential pitfall of using an epsilon for this. For those who may have trouble following the argument, if the calculated epsilon is 0.000000001, then when added to a large number like 20,000,000,000, the result is still 20,000,000,000 because there are not enough bits in a double to store the exact sum. It's also easy to check using:

Code:

. assert 0.000000001 + 20000000000 == 20000000000

.

I still think that rangestat is a good tool for this, you just need to find a better interval bound. If the problem is to find the mean of all profits that are higher than the current observation, then the natural thing to do is to find the next value for profit that is different (similar to what William does with his running sums) as the lower bound to use in the interval. Note that if the next value is missing, rangestat will treat the lower interval bound such that all values will match. When the next value for profit is missing, all subsequent values are also missing so we simply replace the calculated mean with a missing value when the lower bound used was missing. Here's a quick example and how it contrasts with William's approach. Note that since I introduce missing values, William's code needs to be modified to group non-missing values together.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double (id profit)
1 0.000000001
2 0.000000002
3 0.000000002
4 10000000000
5 20000000000
6 .
7 .
end

sort profit
gen double next = profit[_n+1]
bysort profit: replace next = next[_N]
rangestat (mean) profit, interval(profit next .)
replace profit_mean = . if mi(next)

gen touse = !mi(profit)
sort touse profit
by touse: generate double sf = sum(profit)
by touse: generate double sr = sf[_N]-sf
by touse: generate double nr = _N-_n
by touse: generate double av = 0 if touse
by touse profit: replace av = sr[_N]/nr[_N] if touse

sort profit
format %21.9f id av profit_mean
list id profit av next profit_mean, clean noobs

and the results

Code:

. list id profit av next profit_mean, clean noobs

             id      profit                      av        next             profit_mean  
    1.000000000   1.000e-09    7500000000.000000000   2.000e-09    7500000000.000000000  
    2.000000000   2.000e-09   15000000000.000000000   1.000e+10   15000000000.000000000  
    3.000000000   2.000e-09   15000000000.000000000   1.000e+10   15000000000.000000000  
    4.000000000   1.000e+10   20000000000.000000000   2.000e+10   20000000000.000000000  
    5.000000000   2.000e+10                       .           .                       .  
    7.000000000           .                       .           .                       .  
    6.000000000           .                       .           .                       .  

.

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#14

04 Feb 2018, 09:01

Thinking outside the box a bit on this one but you can also simply group profits and then target higher profits using the group identifier. Since these identifiers are generated based on the sorted values of profit, this amounts to defining an interval that starts in the next group:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id profit)
1 100
2 130
3 230
4 120
5 160
6 120
7 .
end

egen gprofit = group(profit)
rangestat (mean) profit , interval(gprofit 1 .) 

sort profit
list

and the results

Code:

. list

     +-----------------------------------+
     | id   profit   gprofit   profit_~n |
     |-----------------------------------|
  1. |  1      100         1         152 |
  2. |  6      120         2   173.33333 |
  3. |  4      120         2   173.33333 |
  4. |  2      130         3         195 |
  5. |  5      160         4         230 |
     |-----------------------------------|
  6. |  3      230         5           . |
  7. |  7        .         .           . |
     +-----------------------------------+

.

Comment

River Huang

Join Date: Mar 2016

Posts: 1908
#15

04 Feb 2018, 16:35

Dear Robert, Many thanks for this wonderful solution.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment

Announcement