Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Unexpected behavior of the egen function median()

    Hello all,

    This is my first post but I've read up on the FAQ so I hope it will be acceptable. To give some context, I'm working with a dataset that records power outages. Each observation is a sensor that records when an outage begins and when it ends. Here is an example dataset:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int(outage_id site_id) long(outage_time restore_time) str24 sensor_id
    1 14 1528913151 1528919452 "530039001351363038393739"
    1 14 1528913153 1528919542 "200031000951343334363138"
    1 19 1528913151 1528919423 "3b0045000151353432393339"
    1 36 1528913152 1528935236 "2b004b001251363038393739"
    1 36 1528913151 1528935235 "380025001451343334363036"
    2 14 1529042683 1529047119 "530039001351363038393739"
    2 16 1529042684 1529047117 "43005d000951343334363138"
    2 17 1529042684 1529047119 "280021001251363038393739"
    2 30 1529042675 1529061132 "48003c001151363038393739"
    2 39 1529042682 1529061134 "560044000151353432393339"
    2 44 1529042682 1529061134 "500030001951353339373130"
    2 46 1529042683 1529061132 "2e001f001251363038393739"
    2 46 1529042684 1529061134 "1e0036000951343334363138"
    end
    The outage is recorded by the sensor (i.e. 'sensor_id'). Each sensor is located at a site (i.e. 'site_id'). Each outage grouping has its own id (i.e. 'outage_id') and is defined as outages that occur around the same time (within 90 seconds of another sensor reporting an outage). The 'outage_time' and 'restore_time' variables record when the outage begins and ends, respectively. These variables will be converted to date-time variables at a later point.

    My goal: create a new variable 'med_restore_time' that is the median restore time within each 'outage_id'. I'm using the egen function in Stata 17.0. Here is what I have tried:

    Code:
        * begin by looking at what the median should be
        desc restore_time
        quietly sum restore_time if outage_id==1, d
        di %12.0g `r(p50)'
        quietly sum restore_time if outage_id==2, d
        di %12.0g `r(p50)'
        
        * try median using egen
        by outage_id: egen med_restore1 = median(restore_time)
        format %12.0g med_restore1
        desc med_restore1
        
        * now let's try using different storage types
        recast double restore_time
        by outage_id: egen double med_restore2 = median(restore_time) // specify type
        format %12.0g med_restore2
        desc med_restore2
    As you can see, calculating the median with egen does not lead to the actual median. I thought it could have had something to do with variable types but that didn't seem to change anything. Why is this behavior happening with the egen function and how do I make it do what I want to do?

    Best,
    Adam

  • #2
    old problem, apparently.
    HTML Code:
    https://www.stata.com/statalist/archive/2006-05/msg00770.html

    Comment


    • #3
      Originally posted by George Ford View Post
      old problem, apparently.
      HTML Code:
      https://www.stata.com/statalist/archive/2006-05/msg00770.html
      Sorry, I don't agree. This FAQ is for an unrelated problem which was corrected in version 6.0.

      Comment


      • #4
        Originally posted by Adam Streff View Post
        [/B]function and how do I make it do what I want to do?
        Hi Adam, welcome to the Statalist and thank you for providing a reproducible data example.

        I cannot reproduce the problem. Your own code will not run as written, but is trivially fixed by sorting by outage_id. However, those results show each method returns identical results.

        Generating medians using either -summary- or -egen, median()- as you have yields identical results, as I show with some alterations to your own code.

        Code:
        * begin by looking at what the median should be
        desc restore_time
        quietly sum restore_time if outage_id==1, d
        gen long method1 = r(p50) if outage_id==1
        quietly sum restore_time if outage_id==2, d
        replace method1 = r(p50) if outage_id==2
        
        * try median using egen
        bys outage_id: egen method2 = median(restore_time)
        
        * compare
        tabstat method?, by(outage_id) format(%12.0g) nototal
        assert method1==method2
        Result

        Code:
        . tabstat method?, by(outage_id) format(%12.0g) nototal
        
        Summary statistics: Mean
        Group variable: outage_id
        
        outage_id |      method1      method2
        ----------+--------------------------
                1 |   1528919542   1528919542
                2 |   1529061132   1529061132
        -------------------------------------
        (Note that the mean, shown above, of a constant is no different than showing the first non-missing value.)

        Comment


        • #5
          Originally posted by Leonardo Guizzetti View Post

          Hi Adam, welcome to the Statalist and thank you for providing a reproducible data example.

          I cannot reproduce the problem. Your own code will not run as written, but is trivially fixed by sorting by outage_id. However, those results show each method returns identical results.

          Generating medians using either -summary- or -egen, median()- as you have yields identical results, as I show with some alterations to your own code.

          Code:
          * begin by looking at what the median should be
          desc restore_time
          quietly sum restore_time if outage_id==1, d
          gen long method1 = r(p50) if outage_id==1
          quietly sum restore_time if outage_id==2, d
          replace method1 = r(p50) if outage_id==2
          
          * try median using egen
          bys outage_id: egen method2 = median(restore_time)
          
          * compare
          tabstat method?, by(outage_id) format(%12.0g) nototal
          assert method1==method2
          Result

          Code:
          . tabstat method?, by(outage_id) format(%12.0g) nototal
          
          Summary statistics: Mean
          Group variable: outage_id
          
          outage_id | method1 method2
          ----------+--------------------------
          1 | 1528919542 1528919542
          2 | 1529061132 1529061132
          -------------------------------------
          (Note that the mean, shown above, of a constant is no different than showing the first non-missing value.)
          This is very odd. I copy and pasted your exact code and this is my result which is consistent with what I've been experiencing:

          Code:
          .         tabstat method?, by(outage_id) format(%12.0g) nototal
          
          Summary statistics: Mean
          Group variable: outage_id 
          
          outage_id |      method1      method2
          ----------+--------------------------
                  1 |   1528919542   1528919552
                  2 |   1529061132   1529061120
          -------------------------------------
          I am working on MacOS Big Sur. Could that have anything to do with this weird phenomenon? I don't understand how we could get different results with the exact same inputs...

          Comment


          • #6
            I'm stumped. I'm using Stata 17 on Windows, so it may be a difference in the operating system. Absent other evidence, this seems unlikely. Perhaps someone else using Mac can confirm what they experience? If it is a Mac specific issue I would report it to Stata Technical Services.

            Comment


            • #7
              I suspect this is a surprising and long-standing bug in the median() code. It issues a float by default -- if indeed the float is the default numeric storage type.

              This happens even if a user has specified another type on the egen call -- because that specification is ignored.

              However, my guess is that if a user has specified another default storage type -- some users prefer doubles as a matter of course -- that this won't bite.

              On this hypothesis, different users might understandably see different results. However, I believe that operating system is immaterial here.

              Last edited by Nick Cox; 29 Sep 2021, 08:18.

              Comment


              • #8
                To add to Nick's suggestion,

                Code:
                set type double
                then run your code should give you matching results between two methods.

                Comment


                • #9
                  Thanks Hua and Nick. I do indeed have my type set to double by default.

                  Comment


                  • #10
                    The egen behavior does look weird. We will look into it to see if we can make it better. No promises though as I vaguely remember there is a subtle reason making that change difficult side effect wise.

                    Comment


                    • #11
                      @Hua Peng I am doubly glad you're watching this. One reason is that no-one needs to flag the thread to StataCorp technical support. The other is that you are Hua Peng.

                      An obvious detail but one perhaps still worth underlining is that the median can be a non-integer even if all the values are integers because with an even number of observations the median is the average of the two middlemost values (called the comedians, naturally). So the median of 1 and 2 is 1.5

                      Ignoring a user call might have been justified if users were simultaneously insisting on say an int and then being surprised at the result.

                      A different scenario as in this thread is that the original variable might be a long in which case you might need a double to hold the result.

                      It's up to StataCorp to puzzle this out, but if I were doing this in my own code I might generate a double and then compress to try to meet the dual goals of the correct answer stored efficiently.

                      Comment


                      • #12
                        Thanks Nick Cox for the suggestion. I will discuss with colleagues and see what can be done.

                        Comment


                        • #13
                          Originally posted by Hua Peng (StataCorp) View Post
                          To add to Nick's suggestion,

                          Code:
                          set type double
                          then run your code should give you matching results between two methods.
                          Thanks all for the assistance. Setting the type to double does indeed fix my issue. Best of luck in debugging!

                          Comment


                          • #14
                            Yes. Thanks. I too had the discrepancy, which is now fixed.

                            Comment


                            • #15
                              This is a bug and we will fix it. egen double x = median(...) should work regardless of set type and it doesn't. We will also add an autotype option as egen group has to automatically set the type so no accuracy is lost if the user fails to specify a type that can hold the result to full accuracy.
                              Bill Sribney (StataCorp)

                              Comment

                              Working...
                              X