
  • Sort puzzle

    I have the following code (Stata 11):
    Code:
    by year: summarize Misspct
    gen origseq=_n
    sort year
    assert(origseq==_n)
    The first line fails with the error message "not sorted". However, the assert statement passes, so the sort did not move any observation and the data were already in year order. What is going on here? (Maybe it is relevant that the source data (like the author) is a SAS export :-) )

  • #2
    You know that the dataset is so sorted but Stata doesn't. Consider this

    Code:
     
    . clear
    
    . set obs 3
    number of observations (_N) was 0, now 3
    
    . gen y = _n
    
    . by y: su y
    not sorted
    r(5);
    
    . sort y

    Stata's way of knowing that the data are sorted is twofold. In effect, it reasons: I know that the data are so sorted, because

    1. I sorted them earlier.

    2. I have not changed the sort order since.

    It's understandable that you think that Stata should check by looking at the order of the observations and seeing whether it satisfies the instructions, but it doesn't do it that way. Stata checks the sort order by looking for a flag, not by examining the dataset.

    In turn I guess the reason for working this way is that sorting is not trivial for large datasets, so Stata would prefer not to do it unless so instructed.
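
    As a minimal sketch of the usual remedy, the three-observation example above can be made to work either by sorting first (which sets the flag) or by using bysort, which sorts and runs the by-group command in one step. Nothing here goes beyond the example already shown; the variable y is the one created there.

    ```stata
    * Recreate the small example from above
    clear
    set obs 3
    gen y = _n

    * Option 1: sort explicitly, which sets the sort flag, then use by
    sort y
    by y: summarize y

    * Option 2: bysort sorts on y and then runs the by-group command
    bysort y: summarize y
    ```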



    • #3
      Makes sense. Is there a way to assert that the dataset is sorted? (Like the SAS sortedby option). Or even better, to assert that it is sorted but ask Stata to check?



      • #4
        See -help describe-

        If you type

        describe, varlist

        then r(sortlist) lists the variables by which Stata knows the data to be sorted (if any). If r(sortlist) is empty, then Stata does not regard the data as sorted, whatever their actual order. So, you could do something like

        Code:
        des, varlist
        if "`r(sortlist)'" == "" dis "Data are not sorted"
        You would think there would be a simpler way but I don't know what it is.
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        WWW: https://academicweb.nd.edu/~rwilliam/
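
        One possible way to get what #3 asks for (assert the order and have Stata check it) is to test the observations directly, independently of the flag. The sketch below reuses the variable names year and Misspct from #1 and assumes the dataset has at least two observations. It is an illustration, not a built-in equivalent of the SAS sortedby option.

        ```stata
        * Show what Stata currently believes about the sort order
        quietly describe, varlist
        display "Sort flag: `r(sortlist)'"

        * Check the actual order: each year must be >= the one before it.
        * Stops with "assertion is false" if any observation is out of order.
        assert year >= year[_n-1] in 2/l

        * The order is confirmed, so sorting is cheap and sets the flag.
        * The stable option keeps tied observations in their current order.
        sort year, stable
        by year: summarize Misspct
        ```

        Because missing values sort last in Stata, the assertion also holds when any missing years sit at the end of the dataset.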



        • #5
          Well, I don't know of a direct answer to your question in #3. But if you simply issue a command to -sort- the data, the -sort- algorithm is extremely fast for a data set that is already sorted in that order (even if Stata didn't know that from the sortedby flag). See the following:

          Code:
          . clear*
          
          . set obs 1000000
          number of observations (_N) was 0, now 1,000,000
          
          . set seed 1234
          
          . gen x = runiform()
          
          . set rmsg on
          r; t=0.00 19:07:05
          
          . sort x
          r; t=1.11 19:07:09
          
          . replace x = x[2] in 2 // "CHANGE" THAT ACTUALLY CHANGES NOTHING
          (0 real changes made)
          r; t=0.00 19:07:43
          
          . des // BUT IT DOES UNSET THE SORTEDBY FLAG
          
          Contains data
            obs:     1,000,000                         
           vars:             1                         
           size:     4,000,000                         
          ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
          x               float   %9.0g                
          ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
          Sorted by: x
               Note: Dataset has changed since last saved.
          r; t=0.01 19:07:47
          
          . sort x
          r; t=0.05 19:07:53
          // NOTE HOW FAST THE SORT WAS THIS TIME COMPARED TO THE ORIGINAL
          // SPED BY A FACTOR OF MORE THAN 20
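
          For anyone who wants to repeat the comparison without relying on a no-op replace, here is a rough sketch using Stata's timer commands. The variables u and s are made up for illustration: s is created already in ascending order but without any sort flag, and u is in random order. Timings will differ by machine and Stata version, so no particular speed-up is promised.

          ```stata
          clear
          set obs 1000000
          set seed 1234
          * u is in random order; s is already ascending but carries no sort flag
          gen u = runiform()
          gen s = _n

          timer clear
          * Timer 1: sort data that are already in order
          timer on 1
          sort s
          timer off 1

          * Timer 2: sort data that are in random order
          timer on 2
          sort u
          timer off 2

          * Report elapsed seconds for both timers
          timer list
          ```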
