
  • egen not preserving sort order

    Hi all,

    I've run into some unexpected egen behavior that broke a function that I commonly use in my code. I've boiled it down to the following example:

    Test #1 Code:
    Code:
    /* synthesize */
    clear
    set obs 2
    generate str = cond(_n == 1, "one", "two")
    generate num = _n
    
    /* collapse */
    collapse num, by(str)
    
    /* average */
    set obs 3
    egen avg = mean(num)
    Test #1 Output:
    Code:
         +-----------------+
         | str   num   avg |
         |-----------------|
      1. |         .   1.5 |
      2. | one     1   1.5 |
      3. | two     2   1.5 |
         +-----------------+
    Test #2 Code:
    Code:
    /* synthesize */
    clear
    set obs 2
    generate str = cond(_n == 1, "one", "two")
    generate num = _n
    
    /* DO NOT collapse */
    //collapse num, by(str)
    
    /* average */
    set obs 3
    egen avg = mean(num)
    Test #2 Output:
    Code:
         +-----------------+
         | str   num   avg |
         |-----------------|
      1. | one     1   1.5 |
      2. | two     2   1.5 |
      3. |         .   1.5 |
         +-----------------+
    As you can see, the only difference between the two code snippets is that the first performs a collapse and the second does not. In the output, however, you'll see that egen preserves the sort order in the second output but not the first.

    Maybe I never should have expected egen to preserve sort order, but to me this seems like odd behavior. I'm running the 31 Mar 2020 build of Stata 16.1 MP on macOS. Please let me know if anyone has any thoughts on this!

    Thanks,
    Reed

  • #2
    In your examples, you don't sort on anything, so my guess is that egen doesn't recognise a sort order to respect. But wave this at StataCorp technical services to check whether it is intended behaviour. It's certainly explicit in the egen code that the command is sortpreserve.

    The use of set obs may be messing things up. Again, I can't vouch for what is intended in that respect.
    Last edited by Nick Cox; 14 Apr 2020, 11:50.



    • #3
      You're absolutely right, egen does not recognize that there is a sort order to preserve.

      Adding the following code prior to egen solves the problem:
      Code:
      generate obs_num = _n
      sort obs_num
      I will send this along to technical services and see if they think it needs to be addressed.

      Thank you!
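
      For reference, here is a sketch of Test #1 with the workaround above dropped in just before egen. It assumes the same affected build; obs_num is simply a helper variable that records the current observation order.
      Code:
      /* synthesize */
      clear
      set obs 2
      generate str = cond(_n == 1, "one", "two")
      generate num = _n

      /* collapse (leaves the data marked as sorted by str) */
      collapse num, by(str)

      /* add an observation */
      set obs 3

      /* workaround: sort on a new variable holding the current order */
      generate obs_num = _n
      sort obs_num

      /* average */
      egen avg = mean(num)
      list
      Sorting on obs_num replaces the stale "Sorted by: str" marker with one that matches the actual order of the data, which is presumably why egen then leaves the order alone.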



      • #4
        Below I reproduce Test #1 with some additional output.
        Code:
        . /* synthesize */
        . clear
        
        . set obs 2
        number of observations (_N) was 0, now 2
        
        . generate str = cond(_n == 1, "one", "two")
        
        . generate num = _n
        
        . /* collapse */
        . collapse num, by(str)
        
        . list, clean
        
               str   num  
          1.   one     1  
          2.   two     2  
        
        . describe, short
        
        Contains data
          obs:             2                          
         vars:             2                          
        Sorted by: str
             Note: Dataset has changed since last saved.
        
        . /* add observation */
        . set obs 3
        number of observations (_N) was 2, now 3
        
        . list, clean
        
               str   num  
          1.   one     1  
          2.   two     2  
          3.           .  
        
        . describe, short
        
        Contains data
          obs:             3                          
         vars:             2                          
        Sorted by: str
             Note: Dataset has changed since last saved.
        
        . /* egen */
        . egen avg = mean(num)
        
        . list, clean
        
               str   num   avg  
          1.           .   1.5  
          2.   one     1   1.5  
          3.   two     2   1.5  
        
        . describe, short
        
        Contains data
          obs:             3                          
         vars:             3                          
        Sorted by: str
             Note: Dataset has changed since last saved.
        
        . /* which one was actually sorted? */
        . sort str
        
        . list, clean
        
               str   num   avg  
          1.           .   1.5  
          2.   one     1   1.5  
          3.   two     2   1.5  
        
        . describe, short
        
        Contains data
          obs:             3                          
         vars:             3                          
        Sorted by: str
             Note: Dataset has changed since last saved.
        To me, the problem appears to be that adding the extra observation failed to clear the stored sort sequence: for strings, missing values sort to the top rather than to the bottom, where the new observation was added.
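
        The difference in where missing values sort can be seen in a small sketch. The variable names s and n are made up for illustration and are not part of the original example.
        Code:
        clear
        set obs 3
        generate s = cond(_n == 1, "one", cond(_n == 2, "two", ""))
        generate n = cond(_n == 3, ., _n)

        /* string missing ("") sorts before every nonempty string */
        sort s
        list, clean

        /* numeric missing (.) sorts after every number */
        sort n
        list, clean
        So an observation appended at the bottom by set obs is in the right place for a numeric sort variable but in the wrong place for a string one.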



        • #5
          Thanks for the detailed information. This is a bug and we will fix it in the next Stata 16 executable update. In addition to the string variable case, -set obs- can also make a mistake when the sort list contains more than one numeric variable with extended missing values and they are not in the same order as they appear on the variable list. For example:

          Code:
          . set obs 2
          number of observations (_N) was 0, now 2
          
          . gen x1 = 1
          
          . gen x2 = 2
          
          . replace x1 = .a in 1
          (1 real change made, 1 to missing)
          
          . replace x2 = .a in 1
          (1 real change made, 1 to missing)
          
          . sort x2 x1
          
          . list
          
               +---------+
               | x1   x2 |
               |---------|
            1. |  1    2 |
            2. | .a   .a |
               +---------+
          
          . set obs 3
          number of observations (_N) was 2, now 3
          
          . list
          
               +---------+
               | x1   x2 |
               |---------|
            1. |  1    2 |
            2. | .a   .a |
            3. |  .    . |
               +---------+
          
          . desc
          
          Contains data
            obs:             3                          
           vars:             2                          
          ---------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ---------------------------------------------------------------------------
          x1              float   %9.0g                
          x2              float   %9.0g                
          ---------------------------------------------------------------------------
          Sorted by: x2
               Note: Dataset has changed since last saved.
          Variable x2 is NOT in order (the new . in observation 3 sorts before the .a in observation 2) but is still listed as the sort variable.
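
          One way to confirm this kind of mismatch, sketched here as an illustration rather than an official diagnostic, is to assert that each value of the reported sort variable is at least as large as the one before it:
          Code:
          /* x2 is listed under "Sorted by:", so each value should be >= the previous one */
          capture noisily assert x2 >= x2[_n-1] in 2/l

          /* a nonzero return code means the data are not actually in x2 order */
          display _rc
          This relies on Stata ordering numeric missing values as . < .a < .b and so on, all above any nonmissing number. It checks a single numeric variable only.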
          Last edited by Hua Peng (StataCorp); 15 Apr 2020, 08:28.

