I have a question about how Stata's random-number settings interact with the ‘sort’ command. I want to sample randomly within strata. To do this, I first sort on hh_id (which uniquely identifies observations in the dataset) and then create a random variable, r8, to sort on. The code extract and output are below:
set seed 28479019
set sortseed 4874290
sort hh_id
gen r8 = runiform()
label var r8 "random variable for drawing sample"
gen n1 = _n
bys sample_strata: gen n4 = _n
bys sample_strata (n1): gen n3 = _n
corr n3 n4
areg n4 n3, absorb(sample_strata) cluster(sample_strata)
---------------------
. corr n3 n4
(obs=3,936)

             |       n3       n4
-------------+------------------
          n3 |   1.0000
          n4 |   0.6843   1.0000

. areg n4 n3, absorb(sample_strata) cluster(sample_strata)

Linear regression, absorbing indicators         Number of obs     =      3,172
Absorbed variable: sample_strata                No. of categories =         69
                                                F(1, 68)          =       5.02
                                                Prob > F          =     0.0283
                                                R-squared         =     0.4676
                                                Adj R-squared     =     0.4558
                                                Root MSE          =    33.0733

                         (Std. err. adjusted for 69 clusters in sample_strata)
------------------------------------------------------------------------------
             |               Robust
          n4 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          n3 |   .0848091   .0378442     2.24   0.028     .0092921    .1603261
       _cons |   44.35502   1.834133    24.18   0.000     40.69507    48.01497
------------------------------------------------------------------------------
---------------------------
n3 should give the order of hh_id within sample_strata, which it does. But n4 is not identical to n3, even though the data were previously sorted by hh_id. The Stata documentation (here and here and here) seems to suggest that every time I use a ‘sort’ or ‘bysort’ command where a group (here ‘sample_strata’) contains multiple observations, Stata orders the observations within that group randomly. However, the correlation and areg results suggest that the ordering within group was:
- NOT the previous sorting (by hh_id)
- NOT fully random, since the correlation between the new ordering and the previous ordering by hh_id is still quite high (I have tried this with various seeds, and the conclusion is the same)
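For comparison, here is a minimal sketch of two ways the within-stratum order could be pinned down so that it no longer depends on how ties are broken. It is an illustration under stated assumptions, not a diagnosis of the result above: the toy dataset (100 households, 5 strata) and the variable names n4_stable and draw_order are hypothetical stand-ins. The first approach relies on the stable option of sort, which keeps tied observations in their existing relative order. The second names the tie-breaker explicitly in parentheses after the by-variable.

```stata
* Toy data standing in for the real file (hypothetical: 100 households, 5 strata)
clear
set obs 100
gen hh_id = _n
gen sample_strata = mod(_n, 5)

set seed 28479019
isid hh_id                  // stops with an error if hh_id is not unique
sort hh_id
gen r8 = runiform()

* Approach 1: a stable sort keeps the current (hh_id) order among ties
sort sample_strata, stable
by sample_strata: gen n4_stable = _n

* Approach 2: state the tie-breaker explicitly
bysort sample_strata (hh_id): gen n3 = _n

* Both counters should now agree in every observation
assert n3 == n4_stable

* Random draw order within stratum, with hh_id breaking any ties in r8
bysort sample_strata (r8 hh_id): gen draw_order = _n
```

With either approach the ordering is determined entirely by the listed variables, so the sample drawn from draw_order should be reproducible from the seed alone, without relying on sortseed.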