  • Possible to sort a subset of the data?

    I have a large data set built by appending many small data sets, which other sources generate continuously. Each time I append a new small set, I do some work on the large set, which includes sorting it. Sorting the whole large set every time takes a long time. Can I sort just the newly appended small set instead? Or is there a better strategy for this problem? Many thanks.

  • #2
    Check out -ftools- by Sergio Correia at https://github.com/sergiocorreia/ftools. Read the help file for -fsort- to see if your data meets the conditions under which -fsort- would be faster than -sort-.


    • #3
      Thank you, Clyde Schechter. I'm trying the following:

      Code:
      clear all
      set seed 1234
      set obs 10
      g a = runiform(0,1)
      expand 2, g(group)
      
      fsort a
      fsort a if group == 1
      It seems the two -fsort- commands give the same result. Do you have any idea how to make use of the -if- qualifier?


      • #4
        No, no. I think you have misunderstood my response.

        I don't know of any -sort- command that will sort only part of a data set. Frankly, the idea doesn't even make much sense to me. Your context is that you have a large, sorted data set, to which you -append- new data. Then you want to have the entire combined data set re-sorted. Do I have that right? Well, you can't accomplish that by just sorting the newly added part of the data anyway. Sort order is, inherently, a property of the entire data set. If it is sufficient for your purposes that the newly added data be sorted inside the combined data, but not within the sort order of the combined data set, then all you need to do is -sort- the data to be added before you do the -append-. But the resulting combined data set as a whole will not be sorted. Take a look at this example:

        Code:
        //    CREATE TWO TOY DATA SETS, MASTER AND USING
        clear
        input int x
        1
        3
        5
        7
        end
        sort x
        tempfile master
        save `master'
        
        clear
        input int x
        8
        6
        4
        2
        end
        tempfile using
        save `using'
        
        // REPLICATE CHARLIE WONG'S PROCESS
        // READ MASTER INTO MEMORY; IT IS ALREADY SORTED    
        use `master', clear
        sort x
        list
        
        // APPEND USING DATA
        append using `using'
        // DATA SET AS A WHOLE IS NOW UNSORTED
        list
        // SORT IT
        sort x
        list
        
        // SHOW WHAT HAPPENS IF WE SORT THE USING
        // DATA SEPARATELY BEFORE APPENDING IT
        use `using', clear
        sort x
        save `using', replace
        
        use `master', clear
        append using `using'
        //    DATA FROM USING IS IN SORT ORDER
        //  AT THE END OF THE DATA SET, BUT THE DATA AS A WHOLE
        //  IS NOT SORTED
        list
        So if this partially sorted result is acceptable, this will save you the time of re-sorting the entire combined data set which, in your real data, is very large.

        But I suspect this is not an acceptable result. And I don't think there is any way around it. What I was suggesting in my response was that you might be able to save a little time by using -fsort- instead of -sort-. Under certain conditions, which Sergio Correia makes clear in his documentation, -fsort- will sort a very large file noticeably faster than -sort- will. It doesn't do it by "sorting only part of the data." It's a full sort; it just uses an algorithm that may be faster than -sort-'s. That's all. I think that's the best you can do. Sorting is an unavoidably computationally intensive process.
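
        One rough way to check whether -fsort- helps on data of a given size is to time both commands on the same unsorted file. The following is only a sketch under assumptions: the variable name a, the seed, and the one million observations are arbitrary illustrative choices, and the -fsort- branch runs only if -ftools- happens to be installed.

        ```stata
        // Build an unsorted toy data set and keep a copy on disk.
        clear
        set seed 1234
        set obs 1000000
        generate double a = runiform()
        tempfile big
        save `big'

        // Time the built-in -sort- (timer 1).
        timer clear
        timer on 1
        sort a
        timer off 1

        // Time -fsort- (timer 2), but only when -ftools- is installed.
        capture which fsort
        if _rc == 0 {
            use `big', clear
            timer on 2
            fsort a
            timer off 2
        }

        // Show the elapsed times collected above.
        timer list
        ```

        The timings will depend on the machine and on the structure of the real data, so a toy comparison like this is at best a hint.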


        • #5
          Thanks a lot, Clyde. Is there a way to trick Stata into believing that the data as a whole are sorted?


          • #6
            Not that I know of.

            It's really not clear to me what you are actually trying to accomplish. If it is important that the final result data set be sorted, then you need to sort it--there is no way around that. If it doesn't need to be sorted, then why not just skip the final sort? What is it you really want to do here?


            • #7
              I want the final result to be sorted, but without sorting the whole data set. If I can sort only the small data set after appending, the whole data set should in fact be in sorted order as well, though Stata can't detect it. So I wonder if there is a way to trick Stata into treating the data as sorted.


              • #8
                "if i can sort the small data only after appending, the structure of the whole data should actually be sorted as well, though Stata can't detect it"
                So, if I understand you correctly, this means that the data you are appending is always greater (in sort order on whatever your sort key variables are) than the data already in the main data set. Or, depending on which file is put in memory first, it is always less than the data already in the main data set. If this is the case, you can do it. You need to know the range of observation numbers of the newly appended data. So, for example if the newly appended data start in observation 100000 and continue to observation 150000 you can ask Stata to:

                Code:
                * keyvar is a placeholder for your sort key variable(s)
                sort keyvar in 100000/150000
                So you just need to figure out which observation numbers mark the start and end of the newly added data and you can do this.
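
                One way to find those observation numbers is to record _N just before the -append-. The following is a minimal sketch on toy data, assuming every value in the new batch is larger than every value already in the master; the names x, master, newbatch, n0, and first are made up for illustration.

                ```stata
                // Toy master data, already sorted on x: 1 3 5 7.
                clear
                set obs 4
                generate int x = 2*_n - 1
                sort x
                tempfile master
                save `master'

                // Toy new batch, unsorted but entirely above the master's values: 12 10 8.
                clear
                set obs 3
                generate int x = 14 - 2*_n
                tempfile newbatch
                save `newbatch'

                // Record where the old data end, then append the new batch.
                use `master', clear
                local n0 = _N
                append using `newbatch'

                // Sort only the appended range; "l" means the last observation.
                local first = `n0' + 1
                sort x in `first'/l
                list
                ```

                If the new batch can contain values that belong inside the existing range, this shortcut no longer leaves the combined data in order and a full sort is needed.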

                However, if you do this, although the data will in fact be in proper sort order, Stata will not recognize it. So if you then try to do something that requires sort order, Stata will treat it as unsorted data.

                What I don't know is what Stata will do if you then issue a new -sort- command after doing the above. Since the data are in fact already sorted, it may be that the overall sort will go quickly. Or it may go really slowly: some fast sort algorithms actually perform worst with data that is already sorted. For others it can be an average or even best case scenario. I don't know how that works with Stata's sort.


                • #9
                  Thanks a lot Clyde! Very helpful.


                  • #10
                    Just a comment on Clyde Schechter's last paragraph in #8. The following is from the Stata manual:
                    "... sorts already-sorted datasets instantly, so Stata’s ignorance costs us little."
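
                    So a full -sort- on data that are already in order should be cheap, and it lets Stata record the sort order. A small sketch of how one might confirm what Stata has recorded, using the sortedby extended macro function; the variable x and the local name sortvars are illustrative only.

                    ```stata
                    // Toy data that are already in ascending order of x.
                    clear
                    set obs 7
                    generate int x = _n

                    // A full sort, so that Stata records the sort order.
                    sort x

                    // Ask Stata which variables it believes the data are sorted by.
                    local sortvars : sortedby
                    display "sorted by: `sortvars'"
                    ```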


                    • #11
                      I'm not sure I am following entirely, but if it is only a matter of sorting the appended part of the data, which by design should also sit after the existing observations in the bigger data set, can you not just sort the new data set and then append it?
