Hi all
I recently generated a multistage sample using gsample and a sort command.
In the initial iteration, I set the seed before drawing the sample. However, I did not realise that the sort command also has a random element to it. As a consequence, the first sample I drew is not consistent with subsequent samples.
I have now resolved this issue by running a complete sort after my partial sorts. (set sortseed resolves the issue only if the code is run afresh after closing and re-opening Stata; it does not resolve the issue if the sample is drawn more than once in a single Stata session. Running a complete sort after a partial sort resolves the issue fully: the same sample is drawn whether it is drawn repeatedly in the same Stata session or in a new one. Paul Seed's very helpful post explains why: https://www.stata.com/statalist/arch.../msg00816.html )
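To make the distinction concrete, here is a minimal sketch of a partial versus a complete sort, using the variable names from my code further down. It assumes that emis is unique within province; isid stops with an error if that assumption is wrong.

```stata
* Partial sort: observations tied on province and quintile are ordered
* arbitrarily, so the row order can differ from one run to the next
sort province quintile

* Complete sort: a key that is unique within province leaves no ties
isid province emis            // assumption: emis is unique within province
sort province quintile emis
```

The stable option of sort is an alternative, but as far as I can tell it only keeps ties in their previous relative order, so it helps only if that previous order was itself reproducible.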
My problem now is that the cat is out of the bag. The initial sample that I drew is being used for the survey, and there is nothing I can do about this. I would like to figure out how to draw the same original sample if at all possible.
Here's one approach to the problem that I have been toying with. I think I ran the code for the original sample a couple of times in Stata, which means that the starting point of the pseudorandom number generator changed several times. One possible solution is to run code that keeps drawing the sample until it matches the original sample that I drew. During this process I would display c(seed) so that I could work out what the original starting point of the generator was. However, the set of possible starting points is very large, and I suspect my laptop wouldn't be able to cope with this (see - help set seed - the subsection on preserving and restoring the random-number generator state).
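Roughly, the search I have in mind would look like the toy sketch below. Everything in it is a stand-in: the data are simulated, original_sampled is a hypothetical 0/1 variable marking the fielded sample, and a single runiform() draw replaces the real gsample loop.

```stata
clear
set obs 100
generate id = _n

* Stand-in for the sample that is already in the field
set seed 42
generate byte original_sampled = runiform() < 0.2

* Try candidate seeds until the redrawn sample matches the original
local found 0
forvalues s = 1/100 {
    set seed `s'
    generate byte sampled = runiform() < 0.2
    capture assert sampled == original_sampled
    if _rc == 0 {
        display "match at seed `s'"
        local found 1
        continue, break
    }
    drop sampled
}
display "found = `found'"
```

This only searches states that set seed can reach with the candidate values tried, and it does nothing about the arbitrary tie order from sort, which in my case also differed between runs.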
Any advice or suggestion in this regard would be deeply appreciated.
Here is the basic code that now generates a consistent sample:
**************************************************************************
clear
use "masterlist.dta"
by province quintile, sort: gen sample_quintile = learners_proportion*sample_province //sample size for each quintile in each province
bysort province (emis): summ province //this stabilises the sort order so that the same 'random' sample is drawn each time
set seed 300
generate sampled = 0
forval qnum = 1/5 {
    forval pnum = 1/9 {
        gsample sample_quintile [w=learners] if provincecd == `pnum' & quintile == `qnum', strata(quintile) generate(sampledtemp) replace
        replace sampledtemp = 1 if sampledtemp > 1 // sampledtemp is sometimes 2 because gsample samples with replacement unless wor is specified
        replace sampled = 1 if sampledtemp == 1
    }
}
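For anyone who wants to check this kind of consistency on their own data, here is a small toy sketch of the test I mean. The data and the names id, grp, pick1 and pick2 are made up for illustration, and runiform() stands in for the gsample call.

```stata
clear
set obs 20
generate id = _n
generate grp = mod(_n, 4)

forvalues run = 1/2 {
    sort grp                  // partial sort: order within grp is arbitrary
    sort grp id               // complete sort: id is unique, so no ties remain
    set seed 300
    generate byte pick`run' = runiform() < 0.5
}

* Stops with an error if the two draws differ for any observation
assert pick1 == pick2
```

If the complete sort is removed, the assert may or may not fail on any given run, which is exactly the inconsistency I ran into.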
Warmly,
Nimi