How does Stata randomize observations with the same value when using the command "sort" without "stable"?

Nina Caroline

Join Date: Jul 2015

Posts: 14
#1

How does Stata randomize observations with the same value when using the command "sort" without "stable"?

14 Jun 2016, 09:44

Hello!

I am looking at a randomization file in which no seed was set and am trying to recreate it.

The randomization is created through two steps:
1) Sorting by two characteristics (that make the sorting non-unique)
2) Drawing a bunch of runiforms

The Stata help code on "set seed" gives me the following information: "The sequences the random-number functions produce are determined by the seed, which is just a number and which is set to 123456789 every time Stata is launched. This means that runiform() produces the same sequence each time you start Stata. The first time you use runiform() after Stata is launched, runiform() returns 0.348871704556195. The second time you use it, runiform() returns 0.266885709753138. The third time you use it, ...."

Hence, I was thinking that I have to just write a code that keeps running the randomization do-file until the outcome assignment matches the actual assignment - thus, basically just run the do-file as many times as did the person who wrote initially wrote and ran it. However, this is ignoring point 1. I was assuming that when stata "randomly" orders observations within a non-unique sorting, it also uses the same algorithm as for a random distribution without a seed. However, this does not seem to be the case. Does anyone know how I could try recreate the sorting? How is Stata doing the randomization with the sort command without the option stable?

Thank you very much!!
Tags: sort randomization seed
daniel klein

Join Date: Mar 2014

Posts: 3859
#2

14 Jun 2016, 09:56

I have no answer, but I doubt your strategy of running the do-file as many times as it ran the first time will necessarily give you the expected result. That assumes that by the time the result was created, Stata was launched and then the do-file - and only(!) this do-file - has been executed a given number of times. Are you sure that no other random process was involved when the result was produced, i.e., that the series of runiform() numbers was created without anything else happening in between?

By the way, I think this is an interesting threat. Is it at all possible to replicate a randomized result without having the starting seed?

Best
Daniel
1 like
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#3

14 Jun 2016, 13:06

Nina,

1) call the person who gave your the file, report the problem that no seed was set and you can't reproduce the results. Everyone will understand you.

2) If you wish to dig into this, note that the text you cited differs across different Stata versions. E.g. in Stata 13:

The sequences these functions produce are determined by the seed, which is just a number and which is set to 123456789 every time Stata is launched. This means that runiform() produces the same
sequence each time you start Stata. The first time you use runiform() after Stata is launched, runiform() returns 0.136984078446403146. The second time you use it, runiform() returns
0.643220667960122228. The third time you use it, ....

This is because different Stata versions come with different pseudorandom number generators.

Kreshna Gopal (StataCorp) has recently described the RNG in Stata 14:
http://blog.stata.com/2016/03/10/how...bers-in-stata/

See also his very useful explanation in the Statalist:
http://www.statalist.org/forums/foru...eneral/1310621

Note also that sort is paralleled:
http://www.stata.com/statamp/statamp.pdf

I read that that, it is not precluded that you will get different results depending on the number of CPUs you have in the machine. It is not clear whether any race conditions come into play, which could mean that the results are not reproducible in principle, even if you know all the parameters of the machine that was used for randomization.

daniel klein , there is probably multiple values of seed that would yield identical sort orders. If you know only some information e.g. selected persons were A,C,B,Q,D,X,J in this order, you can't infer anything about the order of non-selected persons from A..Z if this is what you mean by "threat". However with long sequences it is probably more likely that you will be able to find one unique seed value that would match the known sequence, but at the same time it is also more computationally intensive. Do you have a practical case where this is useful?

Best, Sergiy Radyakin
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#4

14 Jun 2016, 13:58

I've always wondered why "the ordering of observations with equal values of varlist is randomized." (from the help file for the sort command). If the word "randomized" here is to be taken literally, doesn't that introduce an extra step that would slow things down? (Whereas the help file says that sort varlist, stable is actually slower than sort varlist) I guess I've assumed that what is really meant here is that the order is random in that it depends on the starting point of the data set (and perhaps other factors) and can't generally be predicted in advance. If that assumption is incorrect, what is the point of actually randomizing the order?
Comment
daniel klein

Join Date: Mar 2014

Posts: 3859
#5

14 Jun 2016, 15:06

Originally posted by Sergiy Radyakin View Post

if this is what you mean by "threat".

Stupid spelling error. Threat should have been thread (much less threatening). Sorry for misunderstanding.

Also, no practical case, just wondering if, theoretically, you would be able to replicate a series of runiform() that I give you, if I do not give you the seed I used. I suspect with a lot of time, in principle it should be possible.

Best
Daniel
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#6

14 Jun 2016, 16:43

Joe Canner I think the randomization is a by-product of the performance optimization, not the intentional addition. See Stata Journal Article.
I am not sure the exact sorting algorithm employed by Stata is known, but back in 2008 when I looked at its behavior I came to a conclusion that it was likely quicksort:
http://www.stata.com/statalist/archi.../msg00810.html
(I am not sure anymore what was the critical hint and it may well have changed since then.)

There is a discussion of stability of quicksort here:
http://stackoverflow.com/questions/1...oning-approach
and in Wikipedia:
https://en.wikipedia.org/wiki/Quicksort

daniel klein of course it should be possible since it is pseudo-random numbers we are talking about, so they have a final repetition cycle. One way would be to simply tabulate all the random numbers Stata produces and then seek the sequence you mention. While the randomizations I described above are likely to be multiple (resulting from different seed values) the exact random numbers are likely to be unique for each seed, so it should be enough to know just one random value to find the corresponding seed value. The table is likely to be huge: Stata 14 uses a random number generator with a period of 2^19937-1. This means that you will run out of elementary particles in the Universe much sooner than you get to record the first 1% of your table. Of course you don't have to record the table and can compute the numbers one after another on the fly. But similarly a plain brute force attack is pretty much doomed. The key to success is probably to exploit the sequence of the values somehow, as well as their absolute values. Somehow in the back of my memory I maintain that the RSA tokens used in many OTP systems had weaknesses, which could be exploited by observing the values near zero, e.g.: what does 000-018 change to? what does 000-007 change to? that somehow allowed to significantly cut down the search time for the private key, and with enough observations bring that to the lifetime frame. I have no details or source for that, cryptoanalysts will be able to tell you more.

Bottom line: set the seed in your programs and take a note of the randomization algorithm and software version if you care about reproducibility!

Best, Sergiy Radyakin
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#7

14 Jun 2016, 16:56

Thanks, Sergiy; I haven't studied sorting algorithms in 30+ years so I had forgotten about the properties of quicksort. That looks like a likely candidate, since it has to decide which side of the pivot to put things and records that are equal to the pivot can go either way. I imagine that the algorithm runs faster if the two halves are balanced and that would require randomizing equal values to one side or the other.
Comment
Matthew White

Join Date: Apr 2014

Posts: 29
#8

15 Jun 2016, 12:00

I think the Stata "sort seed" determines how sort order ties are broken. You can set this using set sortseed. See help sortseed for more on the difference between set seed and set sortseed. Hopefully the Stata version and the sort seed together encapsulate all aspects of randomness as it relates to sorting, including algorithm, etc.

In general, for randomizations to replicate easily, the version and regular seed must be set, and the dataset sorted by a unique variable list.

When you set seed, the seed has to be between 0 and ~2 billion, and I would guess that it's the same for set sortseed. Obviously 2 billion seeds * 2 billion sort seeds is a lot of combinations to try, but maybe you could determine one independently of the other? That probably depends on the specifics of the dataset and do-file...

Last edited by Matthew White; 15 Jun 2016, 12:03.
Comment

Announcement

How does Stata randomize observations with the same value when using the command "sort" without "stable"?

Comment

Comment

Comment

Comment

Comment

Comment

Comment