Hi all

I recently generated a multistage sample using gsample and a sort command.

In the initial iteration, I set the seed before drawing the sample. However, I did not realise that the sort command also has a random element to it. As a consequence, the first sample I drew is not consistent with subsequent samples.

I have now resolved this issue by running a complete sort after my partial sorts. (set sortseed resolves the issue only if the code is run afresh after closing and re-opening Stata; it does not help if the sample is drawn more than once in a single Stata session. Running a complete sort after a partial sort resolves the issue fully: the same sample is drawn whether the code is run repeatedly in one Stata session or in a new session. Paul Seed's very helpful post explains why: https://www.stata.com/statalist/arch.../msg00816.html )
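To make the difference between a partial and a complete sort concrete, here is a minimal sketch on toy data. The variable values are made up for illustration, and emis is assumed to be a unique school identifier (isid checks that assumption).

```stata
* Toy data: emis is a hypothetical unique school identifier
clear
set seed 12345
set obs 6
generate province = ceil(_n/3)     // two provinces, three schools each
generate emis = 100 + _n
generate double u = runiform()
sort u                             // shuffle the rows

* Partial sort: province has ties, so the order of schools within a
* province is not guaranteed to be the same from run to run
sort province

* Complete sort: emis breaks every tie, so the row order is fully determined
isid emis                          // stops with an error if emis is not unique
sort province emis
list province emis, sepby(province)
```

Because the final sort key identifies every row uniquely, the row order going into the sampling step no longer depends on how ties happened to be broken earlier.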

My problem now is that the cat is out of the bag. The initial sample that I drew is being used for the survey, and there is nothing I can do about this.

__I would like to figure out how to draw the same original sample, if at all possible.__ Here's one approach to the problem that I have been toying with. I think I ran the code for the original sample a couple of times in Stata, which means that the starting point of the pseudorandom-number generator changed several times. So, one possible solution is to run code that keeps drawing the sample until it matches the original sample, displaying c(seed) along the way so that I can work out what the original starting point was. However, the set of possible starting points is very large, and I suspect my laptop wouldn't be able to cope with this (see - help set seed -, the subsection on preserving and restoring the random-number generator state).
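A rough sketch of that search on toy data follows. It is only an illustration under stated assumptions: a plain runiform() draw stands in for gsample, the variable names sampled_orig and sampled_try are hypothetical, and the "original" sample is pretended to come from the third draw after set seed 300. It also assumes the row order at the time of each draw is already reproduced, so it does not address the sort problem itself.

```stata
* Toy data standing in for the master list (emis is a hypothetical ID)
clear
set obs 200
generate emis = _n

* Pretend the original sample was the 3rd draw after set seed 300
set seed 300
forvalues r = 1/3 {
    generate byte sampled_orig`r' = runiform() < 0.1
}
rename sampled_orig3 sampled_orig
drop sampled_orig1 sampled_orig2

* Search: restart from the known seed and redraw until the draw matches
set seed 300
local match 0
forvalues attempt = 1/50 {
    generate byte sampled_try = runiform() < 0.1
    quietly count if sampled_try != sampled_orig
    local ndiff = r(N)             // save before drop overwrites r()
    drop sampled_try
    if `ndiff' == 0 {
        local match `attempt'
        continue, break
    }
}
display "matched on attempt `match'"
```

The search stays small here because it only replays a handful of draws from a known seed. Searching over arbitrary generator states, rather than repeated runs from seed 300, would be a far larger problem.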

Any advice or suggestion in this regard would be deeply appreciated.

Here is the basic code that now generates a consistent sample:

************************************************** ************************

clear

use "masterlist.dta"

by province quintile, sort: gen sample_quintile = learners_proportion*sample_province //sample size for each quintile in each province

bysort province (emis): summ province //this stabilises the sort order so that the same 'random' sample is drawn each time

set seed 300

generate sampled = 0

forval qnum = 1/5 {

forval pnum = 1/9 {

gsample sample_quintile [w=learners] if provincecd == `pnum' & quintile == `qnum', strata(quintile) generate(sampledtemp) replace

replace sampledtemp = 1 if sampledtemp > 1 // sampledtemp is sometimes 2 because gsample samples with replacement unless the wor option is specified

replace sampled = 1 if sampledtemp == 1

}

}

Warmly,

Nimi

## Comment