  • Setting version, seed, and sortseed not sufficient for reproducibility?

    The following code appears to produce different results depending on the Stata 16.1 edition I use.

    Code:
    clear all
    version 16.1
    set seed 872387
    set sortseed 636445
    set obs 1000
    gen id = _n
    gen x = 0
    replace x = 1 in 200/600
    sort x
    list id in 1/5
    In Stata 16.1 MP (2-core), I get: 985, 742, 906, 188, 931.
    In Stata 16.1 SE, I get: 998, 112, 992, 943, 690.

    Here is another example using merge:

    Code:
    clear all
    version 16.1
    set seed 872387
    set sortseed 636445
    set obs 1
    gen x = 1
    tempfile s
    save `s'
    clear
    set obs 1000
    gen id = _n
    gen x = 0
    replace x = 1 in 200/600
    merge m:1 x using `s'
    list id in 1/10
    Again, in Stata 16.1 MP (2-core), I get: 985, 742, 906, 188, 931.
    In Stata 16.1 SE, I get: 998, 112, 992, 943, 690.

    I understand that I can work around this problem by doing a stable sort (prior to merge, in the second example). But I'm surprised that setting the version, seed, and sortseed does not appear to be sufficient to ensure reproducibility across editions of the same Stata version. Is that true, or am I missing something? Is this due to how sorting is parallelized?
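
    For reference, the stable-sort workaround mentioned above could look roughly like the sketch below. It reuses the data from the first example and shows two options: the stable option of sort, and adding the unique variable id as a tie-breaker. It is an illustration only, not a statement about how any edition sorts internally.

    ```stata
    clear all
    version 16.1
    set obs 1000
    gen id = _n
    gen x = 0
    replace x = 1 in 200/600

    * Option 1: a stable sort keeps tied observations in their current order,
    * so the result does not depend on the sort seed
    sort x, stable
    list id in 1/5

    * Option 2: add a unique variable as a tie-breaker so that no ties remain
    sort x id
    list id in 1/5
    ```

    Both options list ids 1 through 5 here, because the observations with x == 0 keep their original order. In the merge example, running sort x id after the merge should pin down the order in the same way, since id is unique.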

    Note that I did not observe this problem with fewer observations. The following produced the same output in both Stata 16.1 SE and Stata 16.1 MP (2-core):

    Code:
    clear all
    version 16.1
    set sortseed 636445
    set obs 100
    gen id = _n
    gen x = 0
    replace x = 1 in 20/60
    sort x
    list id in 1/5
    Thank you!

  • #2
    Originally posted by Bernd Beber
    Is this due to how sorting is parallelized?
    Could be. Try

    Code:
    set processors 1
    ...
    in your Stata MP edition.

    I cannot reproduce the problem because I do not have access to Stata MP at the moment. If the problem is reproducible, this should be reported to tech-support.


    • #3
      Thank you! If I add that line in the MP edition, I do get the same result.

      Code:
      clear all
      version 16.1
      set processors 1
      set seed 872387
      set sortseed 636445
      set obs 1000
      gen id = _n
      gen x = 0
      replace x = 1 in 200/600
      sort x
      list id in 1/5
      This produces the same output in Stata 16.1 SE and MP: 998, 112, 992, 943, 690.

      So this does suggest that this could be due to parallelized sorting.

      Thanks again.


      • #4
        Daniel is right. When sort is parallelized, it is essentially a different algorithm from the serial sort, and it will produce a different sort order for ties. Furthermore, a different number of processors may produce a different sort order for ties as well. Please see the section "Sorting with ties" in https://www.stata.com/manuals/dsort.pdf for a detailed discussion.
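
        Since the order of ties can depend on the edition and on the number of processors, it may help to record both next to any results that involve sorting. A minimal sketch, assuming the c-class values c(stata_version), c(MP), and c(processors) behave as described in the creturn documentation:

        ```stata
        * Record the Stata version, edition, and processor setting in the log
        display "Stata version:   " c(stata_version)
        display "Stata/MP:        " c(MP)
        display "Processors used: " c(processors)

        * In Stata/MP, restrict sorting to a single processor, as tried in #3.
        * The guard skips the command in editions that are not MP.
        if c(MP) {
            set processors 1
        }
        ```

        This only documents and fixes the processor count. It does not by itself guarantee the same order of ties across editions.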


        • #5
          Thanks for the response. That makes sense. However, I don't see the documentation at the link you provided directly addressing this issue. It does not mention that serial sort and parallelized sort effectively use different algorithms. I think it would be helpful if it did.

          The best-practice suggestions in Section Sorting with ties all make sense, of course. But I wouldn't be surprised if many users will expect that setting version, seed, and sortseed will ensure reproducibility across editions, even if the code includes e.g. merge m:1 without a prior stable sort. It turns out that this is not the case, and I think it'd be helpful to have this mentioned in the documentation for set sortseed.

          Right now the documentation says that "access to the sorting seed is provided solely for those doing replication studies," and "using set sortseed outside of such comparisons is strongly discouraged." But if I'm understanding your response correctly, it turns out that set sortseed is not actually sufficient to ensure reproducibility for those doing replication studies. So in fact even those doing replication studies should be strongly discouraged from relying on set sortseed to ensure replicability (unless they also specify the number of processors used).


          • #6
            Bernd Beber: Good point. It is documented, but in https://www.stata.com/manuals/psetso...etsortrngstate

            "Fourth, it is crucial that sort be fast, and Stata makes no attempt to blunt that speed with false reproducibility. Stata/SE and Stata/MP use the jumbler differently and so produce different orderings of ties, even when starting from the same seed/state. What’s more, Stata/MP with two processors and Stata/MP with four processors also produce different orderings of ties. The older qsort (prior to Stata 17) and the newer fsort (Stata 17 and beyond) also use the jumbler differently and produce different orderings of ties, even when starting from the same seed/state. (See [P] set sortmethod for a discussion of qsort and fsort.) So any reproducibility produced by set sortrngstate is specific to the edition of Stata that you are running and which sort method is being used."

            I will ask our documentation team to consider adding a link to the above from the sort documentation.
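
            Given that seed-based reproducibility is specific to the edition and sort method, one defensive pattern is to check for ties before sorting and to break them explicitly. A rough sketch using isid on the data from the first example (the tie-breaker id is specific to that example):

            ```stata
            clear all
            version 16.1
            set obs 1000
            gen id = _n
            gen x = 0
            replace x = 1 in 200/600

            * isid exits with an error if the listed variables do not uniquely
            * identify the observations, i.e. if sorting on them would leave ties
            capture isid x
            if _rc {
                display as text "x alone leaves ties; adding id as a tie-breaker"
                isid x id
                sort x id
            }
            else {
                sort x
            }
            list id in 1/5
            ```

            With unique sort keys the resulting order does not involve the sort seed at all, so it should not matter which edition or how many processors are used.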


            • #7
              Great, thank you again!
