Setting version, seed, and sortseed not sufficient for reproducibility?

Bernd Beber

Join Date: May 2021

Posts: 4
#1

26 May 2021, 05:12

The following code appears to produce different results depending on the Stata 16.1 edition I use.

Code:

clear all
version 16.1
set seed 872387
set sortseed 636445
set obs 1000
gen id = _n
gen x = 0
replace x = 1 in 200/600
sort x
list id in 1/5

In Stata 16.1 MP (2-core), I get: 985, 742, 906, 188, 931.
In Stata 16.1 SE, I get: 998, 112, 992, 943, 690.

Here is another example using merge:

Code:

clear all
version 16.1
set seed 872387
set sortseed 636445
set obs 1
gen x = 1
tempfile s
save `s'
clear
set obs 1000
gen id = _n
gen x = 0
replace x = 1 in 200/600
merge m:1 x using `s'
list id in 1/10

Again, in Stata 16.1 MP (2-core), I get: 985, 742, 906, 188, 931.
In Stata 16.1 SE, I get: 998, 112, 992, 943, 690.

I understand that I can work around this problem by doing a stable sort (prior to merge, in the second example). But I'm surprised that setting the version, seed, and sortseed does not appear to be sufficient to ensure reproducibility across editions of the same Stata version. Is that true, or am I missing something? Is this due to how sorting is parallelized?
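A minimal sketch of the stable-sort workaround, reusing the variables from the first example. It assumes that keeping tied observations in their existing relative order is acceptable for the analysis; no seed or sortseed is set, because a stable sort does not jumble ties.

```stata
clear all
version 16.1
set obs 1000
gen id = _n
gen x = 0
replace x = 1 in 200/600
* stable keeps observations with equal x in their previous relative order,
* so the result is determined by the data rather than by the tie jumbler
sort x, stable
list id in 1/5
```

Because the rows with x = 0 keep their original order, the first five ids listed are 1 through 5. The same idea applies to the merge example: run a stable sort on x before calling merge.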

Note that I did not observe this problem with fewer observations. The following produced the same output in both Stata 16.1 SE and Stata 16.1 MP (2-core):

Code:

clear all
version 16.1
set sortseed 636445
set obs 100
gen id = _n
gen x = 0
replace x = 1 in 20/60
sort x
list id in 1/5

Thank you!
daniel klein

Join Date: Mar 2014

Posts: 3834
#2

26 May 2021, 05:25

Originally posted by Bernd Beber

Is this due to how sorting is parallelized?

Could be. Try

Code:

set processors 1
...

in your Stata MP edition.

I cannot reproduce the problem because I do not have access to Stata MP at the moment. If the problem is reproducible, this should be reported to tech-support.
Bernd Beber

Join Date: May 2021

Posts: 4
#3

26 May 2021, 05:41

Thank you! If I add that line in the MP edition, I do get the same result.

Code:

clear all
version 16.1
set processors 1
set seed 872387
set sortseed 636445
set obs 1000
gen id = _n
gen x = 0
replace x = 1 in 200/600
sort x
list id in 1/5

This produces the same output in Stata 16.1 SE and MP: 998, 112, 992, 943, 690.

So this does suggest that this could be due to parallelized sorting.

Thanks again.
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 344
#4

26 May 2021, 07:32

Daniel is right. When sort is parallelized, it's essentially a different algorithm from the serial sort. It will produce a different sort order for ties. Furthermore, a different number of processors may produce a different sort order for ties as well. Please see the section "Sorting with ties" in https://www.stata.com/manuals/dsort.pdf for a detailed discussion.
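One way to sidestep tie ordering altogether is to sort on a set of variables that uniquely identifies each observation, so that no ties remain. The following is a sketch under that assumption, using the id variable from the original example as the tie-breaker; it is an illustration, not a quotation from the manual.

```stata
clear all
version 16.1
set obs 1000
gen id = _n
gen x = 0
replace x = 1 in 200/600
* isid exits with an error unless x and id jointly identify each observation
isid x id
* with no ties left, the sort order is fully determined by the data
sort x id
list id in 1/5
```

With a unique sort key, the resulting order should not depend on the edition, the number of processors, or the sort seed.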
Bernd Beber

Join Date: May 2021

Posts: 4
#5

26 May 2021, 08:36

Thanks for the response. That makes sense. However, I don't see the documentation at the link you provided directly addressing this issue. It does not mention that serial sort and parallelized sort effectively use different algorithms. I think it would be helpful if it did.

The best-practice suggestions in Section Sorting with ties all make sense, of course. But I wouldn't be surprised if many users will expect that setting version, seed, and sortseed will ensure reproducibility across editions, even if the code includes e.g. merge m:1 without a prior stable sort. It turns out that this is not the case, and I think it'd be helpful to have this mentioned in the documentation for set sortseed.

Right now the documentation says that "access to the sorting seed is provided solely for those doing replication studies," and "using set sortseed outside of such comparisons is strongly discouraged." But if I'm understanding your response correctly, it turns out that set sortseed is not actually sufficient to ensure reproducibility for those doing replication studies. So in fact even those doing replication studies should be strongly discouraged from relying on set sortseed to ensure replicability (unless they also specify the number of processors used).
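As a rough sketch of how a replication script could record the settings that tie orderings depend on, the lines below print a few system values to the log. The choice of which c() values to report is an assumption made here for illustration, not something the documentation prescribes.

```stata
* log the environment alongside the results
display "Stata version: " c(stata_version)
display "MP edition: " c(MP) "   SE or MP: " c(SE)
display "Processors in use: " c(processors)
* in Stata/MP only, pin the number of processors before any sort
if c(MP) {
    set processors 1
}
```

Keeping this output in the log lets a later reader check whether a mismatch in tie ordering could come from a different edition or processor count.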
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 344
#6

26 May 2021, 08:43

Bernd Beber: Good point. It is documented, but in https://www.stata.com/manuals/psetso...etsortrngstate

"Fourth, it is crucial that sort be fast, and Stata makes no attempt to blunt that speed with false reproducibility. Stata/SE and Stata/MP use the jumbler differently and so produce different orderings of ties, even when starting from the same seed/state. What’s more, Stata/MP with two processors and Stata/MP with four processors also produce different orderings of ties. The older qsort (prior to Stata 17) and the newer fsort (Stata 17 and beyond) also use the jumbler differently and produce different orderings of ties, even when starting from the same seed/state. (See [P] set sortmethod for a discussion of qsort and fsort.) So any reproducibility produced by set sortrngstate is specific to the edition of Stata that you are running and which sort method is being used."

I will ask our documentation team to consider adding a link to the above from the sort documentation.
Bernd Beber

Join Date: May 2021

Posts: 4
#7

26 May 2021, 08:59

Great, thank you again!