possible to sort a subset of the data?

charlie wong

Join Date: Jan 2016

Posts: 154
#1

20 Mar 2018, 10:03

I have a large data set compiled by appending many small sets. The small sets are generated continuously from other sources, and I keep appending the new ones. After appending each new small set, I do some work on the large set, including sorting. Sorting the whole large set every time takes a long time. Can I sort just the new small set after appending it? Or is there a better strategy for this problem? Many thanks.
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#2

20 Mar 2018, 17:31

Check out -ftools- by Sergio Correia at https://github.com/sergiocorreia/ftools. Read the help file for -fsort- to see if your data meets the conditions under which -fsort- would be faster than -sort-.
charlie wong

Join Date: Jan 2016

Posts: 154
#3

21 Mar 2018, 10:54

Thank you, Clyde Schechter. I am trying the following:

Code:

clear all
set seed 1234
set obs 10
g a = runiform(0,1)
expand 2, g(group)
fsort a
fsort a if group == 1

The two -fsort- calls seem to give the same result. Do you have any idea how to make use of the if option?
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#4

21 Mar 2018, 12:05

No, no. I think you have misunderstood my response.

I don't know of any -sort- command that will sort only part of a data set. Frankly, the idea doesn't even make much sense to me. Your context is that you have a large, sorted data set, to which you -append- new data. Then you want to have the entire combined data set re-sorted. Do I have that right? Well, you can't accomplish that by just sorting the newly added part of the data anyway. Sort order is, inherently, a property of the entire data set. If it is sufficient for your purposes that the newly added data be sorted inside the combined data, but not within the sort order of the combined data set, then all you need to do is -sort- the data to be added before you do the -append-. But the resulting combined data set as a whole will not be sorted. Take a look at this example:

Code:

// CREATE TWO TOY DATA SETS, MASTER AND USING
clear
input int x
1
3
5
7
end
sort x
tempfile master
save `master'

clear
input int x
8
6
4
2
end
tempfile using
save `using'

// REPLICATE CHARLIE WONG'S PROCESS
// READ MASTER INTO MEMORY; IT IS ALREADY SORTED
use `master', clear
sort x
list

// APPEND USING DATA
append using `using'

// DATA SET AS A WHOLE IS NOW UNSORTED
list

// SORT IT
sort x
list

// SHOW WHAT HAPPENS IF WE SORT THE USING
// DATA SEPARATELY BEFORE APPENDING IT
use `using', clear
sort x
save `using', replace
use `master', clear
append using `using'

// DATA FROM USING IS IN SORT ORDER
// AT THE END OF THE DATA SET, BUT THE DATA AS A WHOLE
// IS NOT SORTED
list

So if this partially sorted result is acceptable, this will save you the time of re-sorting the entire combined data set which, in your real data, is very large.

But I suspect this is not an acceptable result. And I don't think there is any way around it. What I was suggesting in my response was that you might be able to save a little time by using -fsort- instead of -sort-. Under certain conditions, which Sergio Correia makes clear in his documentation, -fsort- will sort a very large file noticeably faster than -sort- will. It doesn't do it by "sorting only part of the data." It's a full sort; it just uses an algorithm that may be faster than -sort-'s. That's all. I think that's the best you can do. Sorting is an unavoidably computationally intensive process.
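To check whether -fsort- actually helps on a particular machine and data layout, a rough timing comparison along these lines may be useful. This is only a sketch: it assumes -ftools- is already installed (for example from SSC), and the variable name a, the tempfile name unsorted, and the size of one million observations are made up for illustration. The result may well go either way depending on the data.

```stata
* Assumes ftools is installed, e.g. via: ssc install ftools
clear
set seed 1234
set obs 1000000
generate double a = runiform()

* Keep an unsorted copy so both commands start from the same data
tempfile unsorted
save `unsorted'

* Timer 1: built-in sort
timer clear
timer on 1
sort a
timer off 1

* Timer 2: fsort from ftools, on the same unsorted data
use `unsorted', clear
timer on 2
fsort a
timer off 2

* Report elapsed seconds for both timers
timer list
```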
charlie wong

Join Date: Jan 2016

Posts: 154
#5

22 Mar 2018, 09:21

Thanks a lot, Clyde. Is there a way to trick Stata into believing that the data as a whole are sorted?
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#6

22 Mar 2018, 11:10

Not that I know of.

It's really not clear to me what you are actually trying to accomplish. If it is important that the final result data set be sorted, then you need to sort it--there is no way around that. If it doesn't need to be sorted, then why not just skip the final sort? What is it you really want to do here?
charlie wong

Join Date: Jan 2016

Posts: 154
#7

22 Mar 2018, 20:15

I want the final result to be sorted, but without sorting the whole data set. If I sort only the small data set after appending, the whole data set should actually be in sorted order as well, though Stata can't detect it. So I wonder whether there is a way to trick Stata into treating the data as sorted.
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#8

22 Mar 2018, 22:37

"if i can sort the small data only after appending, the structure of the whole data should actually be sorted as well, though Stata can't detect it"

So, if I understand you correctly, this means that the data you are appending is always greater (in sort order on whatever your sort key variables are) than the data already in the main data set. Or, depending on which file is put in memory first, it is always less than the data already in the main data set. If this is the case, you can do it. You need to know the range of observation numbers of the newly appended data. So, for example if the newly appended data start in observation 100000 and continue to observation 150000 you can ask Stata to:

Code:

sort x in 100000/150000   // x stands for your sort key variable(s); -sort- requires a varlist

So you just need to figure out which observation numbers mark the start and end of the newly added data and you can do this.
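One way to find those observation numbers is to record _N before the -append-; the new block then runs from the next observation to the last one. The sketch below uses toy data in the style of the earlier example. The names newbatch, n_old, and first are hypothetical, and it assumes, as discussed above, that every appended value is larger than everything already in the master data.

```stata
* Toy master data, already sorted
clear
input int x
1
3
5
7
end
sort x
tempfile master
save `master'

* Toy new batch: unsorted, but every value exceeds the master's maximum
clear
input int x
12
10
9
11
end
tempfile newbatch
save `newbatch'

* Record the size of the master before appending
use `master', clear
local n_old = _N
append using `newbatch'

* Sort only the appended block; observations outside the range stay in place
local first = `n_old' + 1
sort x in `first'/l
list
```

Note that this fails if the new batch is empty, because the range would then start past the last observation.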

However, if you do this, although the data will in fact be in proper sort order, Stata will not recognize it. So if you then try to do something that requires sort order, Stata will treat it as unsorted data.

What I don't know is what Stata will do if you then issue a new -sort- command after doing the above. Since the data are in fact already sorted, it may be that the overall sort will go quickly. Or it may go really slowly: some fast sort algorithms actually perform worst with data that is already sorted. For others it can be an average or even best case scenario. I don't know how that works with Stata's sort.
charlie wong

Join Date: Jan 2016

Posts: 154
#9

22 Mar 2018, 23:50

Thanks a lot Clyde! Very helpful.
Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#10

23 Mar 2018, 06:52

Just a comment on the last paragraph of Clyde Schechter's #8. The following is from the Stata manual:

sorts already-sorted datasets instantly, so Stata’s ignorance costs us little.
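For anyone who wants to see this on their own setup, here is a small sketch. It builds data that are already in ascending order but not flagged as sorted, times a full -sort-, and shows the sort flag before and after using the sortedby macro function. The variable name x and the number of observations are arbitrary choices for illustration.

```stata
clear
set obs 1000000
generate long x = _n          // ascending already, but Stata has no sort flag yet

* Which variables does Stata think the data are sorted by? (empty at this point)
local before : sortedby
display "sorted by before: [`before']"

* Time a full sort on the already-ordered data
timer clear 1
timer on 1
sort x
timer off 1
timer list 1

* After the full sort, the data are flagged as sorted by x
local after : sortedby
display "sorted by after: [`after']"
```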
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#11

23 Mar 2018, 07:59

I'm not sure I am following entirely, but if it is only a matter of sorting the appended part of the data, which by design should also be placed after the existing observations in the bigger data set, can you not just sort the new data set first and then append it?