Finding the original random sample

Nimi Hoffmann

Join Date: Dec 2017

Posts: 1
#1

Finding the original random sample

28 Dec 2017, 01:46

Hi all

I recently generated a multistage sample using gsample and a sort command.

In the initial iteration, I set the seed before drawing the sample. However, I did not realise that the sort command also has a random element to it. As a consequence, the first sample I drew is not consistent with subsequent samples.

I have now resolved this issue by running a complete sort after my partial sorts. (Set sortseed resolves the issue if the code is run afresh after closing Stata and re-opening it. Set sortseed does not resolve the issue if the sample is drawn more than once in a single Stata session. However, running a complete sort after a partial sort resolves the issue fully. This means that the same sample is drawn, regardless of whether it is drawn repeatedly in the same Stata session, or in a new Stata session. Paul Seed's very helpful post explains why: https://www.stata.com/statalist/arch.../msg00816.html )

My problem now is that the cat is out of the bag. The initial sample that I drew is being used for the survey, and there is nothing I can do about this. I would like to figure out how to draw the same original sample if at all possible.

Here's one approach to the problem I have been toying with. I think I ran the code for the original sample a couple of times in Stata. This means that the original starting point for the pseudorandom generator changed several times. So, one possible solution is to run code that keeps drawing the sample until it matches with the original sample that I drew. During this process I would display c(seed) so that I could figure out what the original starting point was in the pseudorandom generator. However, the set of starting points for the pseudorandom generators is very large, and I suspect my labtop wouldn't be able to cope with this (see - help set seed - the subsection on preserving and restoring the random-number generator state).

Any advice or suggestion in this regard would be deeply appreciated.

Here is the basic code that now generates a consistent sample:
************************************************** ************************
clear
use "masterlist.dta"

by province quintile, sort: gen sample_quintile = learners_proportion*sample_province //sample size for each quintile in each province
bysort province (emis): summ province //this stabilises the sort order so that the same 'random' sample is drawn each time

set seed 300
generate sampled = 0
forval qnum = 1/5 { forval pnum = 1/9 {
gsample sample_quintile [w=learners] if provincecd == `pnum' & quintile == `qnum', strata(quintile) generate(sampledtemp) replace
replace sampledtemp = 1 if sampledtemp > 1 // sampledtemp is sometimes = 2 because of sampling without replacement
replace sampled = 1 if sampledtemp == 1
}
}

Warmly,
Nimi
Tags: None
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#2

29 Dec 2017, 09:41

Nimi,

regretably the default behavior with sort and related commands is "fast" not "reproducible", so unless you are knowledgeable about the problems and know the options-countermeasures your results will not be reproducible by default. See also here:
https://www.stata.com/statalist/arch.../msg00224.html

On the specific problem that you are facing (a rather unpleasant one). You wrote:

However, the set of starting points for the pseudorandom generators is very large,

True. Astronomical. So a blind attack on that is rather challenging. But, two thoughts:

1) you don't need to find that ONE seed which resulted in your sample. You need to find ANY seed which results in the sample you've ended up with.
If you were to toss a coin, and say your random process resulted in a TAILS, then roughly half of the seed candidates would satisfy your search criteria. With a single die - 1/6, etc. So ultimately the probability (and hence the length of the search) depends on your sample size: the larger the sample, the larger should be number of matches, and hence longer the search.

2) Stata always starts with the same seed. And by the time you saved your sample it ran through a finite number of uses of the random number generator.
That number will NOT be astronomical. At most we are talking about the same duration as was the duration of your session in which you were generating your sample. By moving the generator 1 seed forward you should be able to recover this original seed within final reachable period of time. (and you don't have to do all 45 cycles, if one is giving differing results, scratch that seed).

The advantage of 2 is that if your session included any other calls to rnd-generator simply repeating the sampling process may not be fruitful, while moving the generator by 1 seed will recover it 100%. The cost is that for large samples it is proportionately slower.

By recollecting your memories of the session when the sample was generated, you should be able to pick the appropriate method (or just run both in parallel).

Best and good luck, Sergiy Radyakin
Comment
Kreshna Gopal (StataCorp)

StataCorp Employee

Join Date: Apr 2014

Posts: 43
#3

29 Dec 2017, 13:09

I think it may be possible to recover the seed (or rather the state) of the pseudo random-number generator (PNRG) that was used to generate the original sample. First a few remarks:

1. The PNRG used for runiform() (and other random-number functions for other distributions) is independent of the PNRG used for sorting, which randomly break ties in sorts. Here we want to recover the state of the latter.

2. Sergiy Radyakin is correct in asserting that although there are many starting points for PRNGs, Stata always start up with default seeds. So, it should not take long to reach the PNRG state that was reached in a previous Stata session, for both PNRGs i.e. for runiform() and sorting.

3. The PRNGs in Stata works as follows; we can set the seed (or use the default seed) and as random numbers are generated, the state of the PNRG (basically a string of characters) changes. The state of the PRNG can be saved and restored for reproducing results.

Nimi Hoffmann is on the right track in the approach used to recover how the original sample was generated. I could not run the code provided without a dataset. But here is a simplified version of what I think and hope simulates what happened, and how we can recover the seed/state used to create the initial sample. The main point is to use query sortseed and r(sortseed).

First, we start a Stata session, and call sorting routines and gsample several times that effectively move the state of the PRNG for sorting to an unknown state, since we did not set sortseed. (I assume we did set seed, but that constraint too can be relaxed).

For example, we fire up Stata and call sort and gsample 10 times in a loop as follows to create an initial sample.

Code:

forvalues i=1/10 { sysuse auto, clear set seed 12345 // set sortseed 54321 sort rep78 gsample 10 summ price di r(mean) }

Here, set sortseed is commented out. If we did set sortseed, the same sample is created in every iteration.

If we run the code above (in a fresh Stata 15.1 run), the mean price of the initial sample is 6285.9.

Now we need to find the state of the PRNG used for sorting that was used to create this sample. We run the following code in a new Stata session (otherwise the original sample will never be re-created). Here the mean of price is used to verify if it is the same sample, but a more stringent verification may be needed. Note that the seed/state is saved through query sortseed and r(sortseed) iteratively until we reach the right one. help sortseed has more details.

Code:

local sample_mean = 6285.9 local current_mean = 0 local iteration = 0 while (`sample_mean' != `current_mean') { local iteration = `iteration' + 1 di "Iteration `iteration'" sysuse auto, clear set seed 12345 query sortseed local current_sortseed = r(sortseed) set sortseed `current_sortseed' sort rep78 gsample 10 summ price local current_mean = r(mean) } local sample_sortseed `current_sortseed' // Save the right seed/state // We can now verify if sample_sortseed is the correct one sysuse auto, clear set seed 12345 set sortseed `sample_sortseed' sort rep78 gsample 10 summ price assert r(mean) == `sample_mean'

Here we assume that we have used set seed to create the sample, But if the seed used for runiform() is also unkown, we can use the same brute-force search approach to find that PRNG state too. It should not take too long, unless the PRNG was called a large number of times, maybe in a loop.

I hope this helps.

-- Kreshna
2 likes
Comment

Announcement

Finding the original random sample

Comment

Comment