Unexpected behavior of the egen function median()

Adam Streff

Join Date: Sep 2021
Posts: 4

Unexpected behavior of the egen function median()

28 Sep 2021, 15:46

Hello all,

This is my first post but I've read up on the FAQ so I hope it will be acceptable. To give some context, I'm working with a dataset that records power outages. Each observation is a sensor that records when an outage begins and when it ends. Here is an example dataset:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int(outage_id site_id) long(outage_time restore_time) str24 sensor_id
1 14 1528913151 1528919452 "530039001351363038393739"
1 14 1528913153 1528919542 "200031000951343334363138"
1 19 1528913151 1528919423 "3b0045000151353432393339"
1 36 1528913152 1528935236 "2b004b001251363038393739"
1 36 1528913151 1528935235 "380025001451343334363036"
2 14 1529042683 1529047119 "530039001351363038393739"
2 16 1529042684 1529047117 "43005d000951343334363138"
2 17 1529042684 1529047119 "280021001251363038393739"
2 30 1529042675 1529061132 "48003c001151363038393739"
2 39 1529042682 1529061134 "560044000151353432393339"
2 44 1529042682 1529061134 "500030001951353339373130"
2 46 1529042683 1529061132 "2e001f001251363038393739"
2 46 1529042684 1529061134 "1e0036000951343334363138"
end

The outage is recorded by the sensor (i.e. 'sensor_id'). Each sensor is located at a site (i.e. 'site_id'). Each outage grouping has its own id (i.e. 'outage_id') and is defined as outages that occur around the same time (within 90 seconds of another sensor reporting an outage). The 'outage_time' and 'restore_time' variables record when the outage begins and ends, respectively. These variables will be converted to date-time variables at a later point.

My goal: create a new variable 'med_restore_time' that is the median restore time within each 'outage_id'. I'm using the egen function in Stata 17.0. Here is what I have tried:

Code:

    * begin by looking at what the median should be
    desc restore_time
    quietly sum restore_time if outage_id==1, d
    di %12.0g `r(p50)'
    quietly sum restore_time if outage_id==2, d
    di %12.0g `r(p50)'
    
    * try median using egen
    by outage_id: egen med_restore1 = median(restore_time)
    format %12.0g med_restore1
    desc med_restore1
    
    * now let's try using different storage types
    recast double restore_time
    by outage_id: egen double med_restore2 = median(restore_time) // specify type
    format %12.0g med_restore2
    desc med_restore2

As you can see, calculating the median with egen does not lead to the actual median. I thought it could have had something to do with variable types but that didn't seem to change anything. Why is this behavior happening with the egen function and how do I make it do what I want to do?

Best,
Adam

Tags: by group, egen, median

George Ford

Join Date: Aug 2014

Posts: 3152
#2

28 Sep 2021, 16:16

old problem, apparently.

HTML Code:

https://www.stata.com/statalist/archive/2006-05/msg00770.html
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#3

28 Sep 2021, 16:29

Originally posted by George Ford View Post

old problem, apparently.

HTML Code:

https://www.stata.com/statalist/archive/2006-05/msg00770.html

Sorry, I don't agree. This FAQ is for an unrelated problem which was corrected in version 6.0.
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#4

28 Sep 2021, 16:39

Originally posted by Adam Streff View Post

[/B]function and how do I make it do what I want to do?

Hi Adam, welcome to the Statalist and thank you for providing a reproducible data example.

I cannot reproduce the problem. Your own code will not run as written, but is trivially fixed by sorting by outage_id. However, those results show each method returns identical results.

Generating medians using either -summary- or -egen, median()- as you have yields identical results, as I show with some alterations to your own code.

Code:

* begin by looking at what the median should be desc restore_time quietly sum restore_time if outage_id==1, d gen long method1 = r(p50) if outage_id==1 quietly sum restore_time if outage_id==2, d replace method1 = r(p50) if outage_id==2 * try median using egen bys outage_id: egen method2 = median(restore_time) * compare tabstat method?, by(outage_id) format(%12.0g) nototal assert method1==method2

Result

Code:

. tabstat method?, by(outage_id) format(%12.0g) nototal Summary statistics: Mean Group variable: outage_id outage_id | method1 method2 ----------+-------------------------- 1 | 1528919542 1528919542 2 | 1529061132 1529061132 -------------------------------------

(Note that the mean, shown above, of a constant is no different than showing the first non-missing value.)
Comment

Adam Streff

Join Date: Sep 2021
Posts: 4

29 Sep 2021, 07:33

Originally posted by Leonardo Guizzetti View Post

Hi Adam, welcome to the Statalist and thank you for providing a reproducible data example.

I cannot reproduce the problem. Your own code will not run as written, but is trivially fixed by sorting by outage_id. However, those results show each method returns identical results.

Generating medians using either -summary- or -egen, median()- as you have yields identical results, as I show with some alterations to your own code.

Code:

* begin by looking at what the median should be
desc restore_time
quietly sum restore_time if outage_id==1, d
gen long method1 = r(p50) if outage_id==1
quietly sum restore_time if outage_id==2, d
replace method1 = r(p50) if outage_id==2

* try median using egen
bys outage_id: egen method2 = median(restore_time)

* compare
tabstat method?, by(outage_id) format(%12.0g) nototal
assert method1==method2

Result

Code:

. tabstat method?, by(outage_id) format(%12.0g) nototal

Summary statistics: Mean
Group variable: outage_id

outage_id | method1 method2
----------+--------------------------
1 | 1528919542 1528919542
2 | 1529061132 1529061132
-------------------------------------

(Note that the mean, shown above, of a constant is no different than showing the first non-missing value.)

This is very odd. I copy and pasted your exact code and this is my result which is consistent with what I've been experiencing:

Code:

.         tabstat method?, by(outage_id) format(%12.0g) nototal

Summary statistics: Mean
Group variable: outage_id 

outage_id |      method1      method2
----------+--------------------------
        1 |   1528919542   1528919552
        2 |   1529061132   1529061120
-------------------------------------

I am working on MacOS Big Sur. Could that have anything to do with this weird phenomenon? I don't understand how we could get different results with the exact same inputs...

Comment

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#6

29 Sep 2021, 07:43

I'm stumped. I'm using Stata 17 on Windows, so it may be a difference in the operating system. Absent other evidence, this seems unlikely. Perhaps someone else using Mac can confirm what they experience? If it is a Mac specific issue I would report it to Stata Technical Services.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35713
#7

29 Sep 2021, 08:15

I suspect this is a surprising and long-standing bug in the median() code. It issues a float by default -- if indeed the float is the default numeric storage type.

This happens even if a user has specified another type on the egen call -- because that specification is ignored.

However, my guess is that if a user has specified another default storage type -- some users prefer doubles as a matter of course -- that this won't bite.

On this hypothesis, different users might understandably see different results. However, I believe that operating system is immaterial here.

Last edited by Nick Cox; 29 Sep 2021, 08:18.
2 likes
Comment
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#8

29 Sep 2021, 08:23

To add to Nick's suggestion,

Code:

set type double

then run your code should give you matching results between two methods.
2 likes
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#9

29 Sep 2021, 08:31

Thanks Hua and Nick. I do indeed have my type set to double by default.
Comment
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#10

29 Sep 2021, 08:42

The egen behavior does look weird. We will look into it to see if we can make it better. No promises though as I vaguely remember there is a subtle reason making that change difficult side effect wise.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35713
#11

29 Sep 2021, 08:53

@Hua Peng I am doubly glad you're watching this. One reason is that no-one needs to flag the thread to StataCorp technical support. The other is that you are Hua Peng.

An obvious detail but one perhaps still worth underlining is that the median can be a non-integer even if all the values are integers because with an even number of observations the median is the average of the two middlemost values (called the comedians, naturally). So the median of 1 and 2 is 1.5

Ignoring a user call might have been justified if users were simultaneously insisting on say an int and then being surprised at the result.

A different scenario as in this thread is that the original variable might be a long in which case you might need a double to hold the result.

It's up to StataCorp to puzzle this out, but if I were doing this in my own code I might generate a double and then compress to try to meet the dual goals of the correct answer stored efficiently.
Comment
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#12

29 Sep 2021, 08:56

Thanks Nick Cox for the suggestion. I will discuss with colleagues and see what can be done.
Comment
Adam Streff

Join Date: Sep 2021

Posts: 4
#13

29 Sep 2021, 09:19

Originally posted by Hua Peng (StataCorp) View Post

To add to Nick's suggestion,

Code:

set type double

then run your code should give you matching results between two methods.

Thanks all for the assistance. Setting the type to double does indeed fix my issue. Best of luck in debugging!
Comment
George Ford

Join Date: Aug 2014

Posts: 3152
#14

29 Sep 2021, 09:34

Yes. Thanks. I too had the discrepancy, which is now fixed.
Comment
Bill Sribney (StataCorp)

Join Date: Jun 2017

Posts: 8
#15

29 Sep 2021, 10:14

This is a bug and we will fix it. egen double x = median(...) should work regardless of set type and it doesn't. We will also add an autotype option as egen group has to automatically set the type so no accuracy is lost if the user fails to specify a type that can hold the result to full accuracy.

Bill Sribney (StataCorp)
2 likes
Comment

Announcement

Unexpected behavior of the egen function median()

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment