Sorting not unique, and not random (internal sort randomizer)

dennis timo

Join Date: Sep 2023

Posts: 7
#1

25 Sep 2023, 06:18

I have a question regarding the randomisation engine used in combination with the ‘sort’ command. I want to sample randomly within strata, and to do this, I first sort uniquely (by hh_id, which is unique in the dataset), and create a random variable r8 to sort. Below is the code extract, and output:

set seed 28479019
set sortseed 4874290
sort hh_id
gen r8 = runiform()
label var r8 "random variable for drawing sample"
gen n1 = _n
bys sample_strata: gen n4 = _n
bys sample_strata (n1): gen n3 = _n
corr n3 n4
areg n4 n3, absorb(sample_strata) cluster(sample_strata)

---------------------

. corr n3 n4
(obs=3,936)

             |       n3       n4
-------------+------------------
          n3 |   1.0000
          n4 |   0.6843   1.0000

. areg n4 n3, absorb(sample_strata) cluster(sample_strata)

Linear regression, absorbing indicators            Number of obs     =   3,172
Absorbed variable: sample_strata                   No. of categories =      69
                                                   F(1, 68)          =    5.02
                                                   Prob > F          =  0.0283
                                                   R-squared         =  0.4676
                                                   Adj R-squared     =  0.4558
                                                   Root MSE          = 33.0733

                         (Std. err. adjusted for 69 clusters in sample_strata)
------------------------------------------------------------------------------
             |               Robust
          n4 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          n3 |   .0848091   .0378442     2.24   0.028     .0092921    .1603261
       _cons |   44.35502   1.834133    24.18   0.000     40.69507    48.01497
------------------------------------------------------------------------------

---------------------------

n3 should give the order by hh_id within sample_strata, which it does. But n4 is not identical, even though the data were previously sorted by hh_id. The Stata documentation (here and here and here) seems to suggest that every time I use a ‘sort’ or ‘bysort’ command where a group (here ‘sample_strata’) contains multiple observations, Stata randomly orders the observations within the group. However, the correlation and areg results suggest that the sorting within group was:
NOT the previous sorting (by hh_id)

NOT fully random, since the correlation between the new sorting and the previous sorting by hh_id is still quite high (and I've tried this with various seeds, and it gives the same conclusion)

So, I was wondering how exactly the random ordering within group after ‘sort’ or ‘bysort’ is implemented? Is it fully random? If not, how is it not fully random?
dennis timo

Join Date: Sep 2023

Posts: 7
#2

25 Sep 2023, 10:30

The 'sort' helpfile says the following, but it would be great to learn exactly what the algorithm does:

"It turns out that our first results were also randomly ordered. That is true because sort performs a
quick randomized jumbling before sorting. We were already getting a randomized order within the ties.
Do not use this in practice. The randomization performed by sort is designed solely to make sort
faster by preventing any possibility of an initial ordering that defeats the sort algorithm and makes the
sorting much slower. If you want a random ordering within ties, then use a random-number generator
with good properties like the one implemented in runiform(). For more about the random-number
generator, see [R] set seed and the references therein."
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#3

25 Sep 2023, 11:19

You have a few things going on with your dataset that make it difficult to follow what exactly is happening without seeing your data.

Originally posted by dennis timo

I first sort uniquely (by hh_id, which is unique in the dataset),....

Ok, but then you immediately re-sort by strata. My guess is that sorting by hh_id alone does not order the data in such a way as to form discrete blocks that each map directly to one stratum.

Originally posted by dennis timo

... and create a random variable r8 to sort.

Ok, but you never use it.

As I understand it, Stata resolves ties internally by random order only if there is a tie in the sort key(s). Further, many people on this forum feel that it is bad practice to sort on a non-unique key. It is worse practice to sort by non-unique keys only to then rely on data that will not be guaranteed to exist in some expected order.

Can you distill this to a minimal working example?

Originally posted by dennis timo

I want to sample randomly within strata

Then you can use -gsample- for this if you don't want to do it by hand. If you want to do it by hand, why not push your unique stratum and hh_id identifiers to a separate dataset, randomly select them, then merge with your existing data?
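
For the by-hand route, a minimal sketch of a reproducible draw within strata might look like the following. It reuses the variable names sample_strata and hh_id from #1; the toy data, the draw size of 10 per stratum, and the names u and sampled are made up for illustration. The point is only that the sort key (sample_strata u hh_id) is unique, so nothing is left to the tie-breaking done by -sort-.

```stata
clear
set seed 28479019
set obs 1000
gen long hh_id = _n                       // toy data: unique household id
gen int sample_strata = mod(_n, 20) + 1   // toy data: 20 strata of 50 households

isid hh_id                                // stop with an error if hh_id is not unique
sort hh_id                                // fix the row order before drawing random numbers
gen double u = runiform()                 // one random draw per household

* unique sort key: stratum, then random draw, then hh_id as a final tie-breaker
sort sample_strata u hh_id
by sample_strata: gen byte sampled = _n <= 10   // hypothetical: 10 households per stratum

isid sample_strata u hh_id                // confirm the sort key has no ties
tab sample_strata sampled
```

Because hh_id is the last sort key, the order is fully determined even in the unlikely event that two households draw the same value of u.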
dennis timo

Join Date: Sep 2023

Posts: 7
#4

25 Sep 2023, 11:37

Thanks Leonardo. I agree it is bad practice, and I would not use this code. But -- if this happens by mistake, I would like to know what exactly the sorting algorithm does in the case of ties. It clearly does not seem to fully randomly order within ties, but it also clearly does not maintain the previous ordering within the ties. That's what I'm trying to find out.
dennis timo

Join Date: Sep 2023

Posts: 7
#5

25 Sep 2023, 11:42

Here is a minimal example. I create random groups, first sort by id, then run a bysort group: command. The bysort neither maintains the previous ranking by id (n3) nor produces a fully random order (the ranking n4 is still systematically correlated with n3, the previous ranking by id). So the question is: what exactly is the sorting algorithm doing when there are ties?

insobs 5000
gen id = _n
gen r = runiform()
sort r
gen group = (_n - mod(_n,100))/100
sort id
bys group: gen n4 = _n
bys group (id): gen n3 = _n
corr n3 n4
areg n3 n4, absorb(group)
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#6

25 Sep 2023, 12:30

Thanks for the example. I took the liberty of making some slight changes to create the (intended) group size of 100.

First, let's take the random uniform out of the equation. The code below creates identical sort orders based on n3 and n4.

Code:

clear *
cls
set seed 18
set sortseed 18

set obs 5000

gen id = _n
gen group = (_n - (mod(_n-1, 100) + 1) )/100 + 1

sort id
bys group: gen n4 = _n
bys group (id): gen n3 = _n
assert n3==n4

Now let's look at the example closer to your own in #5.

Code:

clear *
cls
set seed 18
set sortseed 18

set obs 5000

gen id = _n
gen double r = runiform()

sort r
gen group = (_n - (mod(_n-1, 100) + 1) )/100 + 1

sort id
bys group: gen n4 = _n
list in 1/10 // notice the random sort order if you omit the -set sortseed- statement and run this code multiple times
bys group (id): gen n3 = _n
assert n3==n4

First, this code creates a group id based on a randomized order as determined by the random number -r-. This order is uniquely determined, so it is deterministic.
Second, the data are sorted by -id-, resulting in randomized group numbers in the resulting dataset. Again, this is deterministic, because id is unique.
Next, the data are sorted by group. Here we have our first non-deterministic sort because there are ties in -group-, namely among all members of the same group. This is where Stata uses the sortseed to break ties arbitrarily, essentially using the same technique as we used with -r- to randomly order observations within tied groups.
The final bysort is again deterministic, first sorting by group and id, then performing operations by group.
There is no reason to expect n3 and n4 to be identical, hence the error after the -assert- statement.
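
If the goal is instead for tied observations to keep their previous relative order, -sort- has a stable option. Here is a minimal sketch reusing the setup from the second block above. The variable name n5 is introduced only for illustration, and the sort and the -by- are done as two separate steps so that stable can be specified.

```stata
clear *
set seed 18
set obs 5000

gen id = _n
gen double r = runiform()

sort r
gen group = (_n - (mod(_n-1, 100) + 1) )/100 + 1

sort id
sort group, stable            // ties in group keep the order left by -sort id-
by group: gen n5 = _n         // n5 is a hypothetical name
bys group (id): gen n3 = _n   // unique key: rank by id within group
assert n3==n5                 // holds because the stable sort preserved the id order
```

This only shows that a stable sort preserves the earlier order; sorting on a unique key, as in the -bys group (id)- line, remains the safer habit.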

Daniel Schaefer

Join Date: Mar 2020
Posts: 814

25 Sep 2023, 13:47

I think there may be a small misunderstanding in Leonardo's otherwise excellent explanation in #6. I don't think OP necessarily expects n3 and n4 to be the same: OP is surprised that n3 and n4 are strongly correlated in #1 and uses the -areg- command to try to control for any between group variance. The relationship is still significant in the -areg-.

I just want to take a moment to point out that neither the code in #5 nor the code in #6 produces a strong correlation between n3 and n4 on my machine. The code from #6 does technically yield a significant correlation (p < 0.05) with seed 18, but the significant result goes away on every other seed I tried, so this appears to be a consequence of random chance.

So, OP, it seems like neither your example code nor Leonardo's code can reproduce the strong correlation shown in #1. I'm skeptical that the tie breaking algorithm is anything but pseudorandom. There must be something else going on in your data or your code that isn't accounted for here.

Code:

clear
insobs 5000
gen id = _n
gen r = runiform()
sort r
gen group = (_n - mod(_n,100))/100
sort id
bys group: gen n4 = _n
bys group (id): gen n3 = _n
pwcorr n3 n4, sig
areg n3 n4, absorb(group)
assert n3==n4

Code:

. pwcorr n3 n4, sig

             |       n3       n4
-------------+------------------
          n3 |   1.0000
             |
             |
          n4 |   0.0085   1.0000
             |   0.5494
             |

. areg n3 n4, absorb(group)

Linear regression, absorbing indicators         Number of obs     =      5,000
Absorbed variable: group                        No. of categories =         51
                                                F(   1,   4948)   =       0.31
                                                Prob > F          =     0.5795
                                                R-squared         =     0.0007
                                                Adj R-squared     =    -0.0096
                                                Root MSE          =    29.0078

------------------------------------------------------------------------------
          n3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          n4 |   .0078784   .0142158     0.55   0.579    -.0199909    .0357477
       _cons |    50.0825   .8265985    60.59   0.000       48.462      51.703
------------------------------------------------------------------------------
F test of absorbed indicators: F(50, 4948) = 0.058            Prob > F = 1.000

. assert n3==n4
4,946 contradictions in 5,000 observations
assertion is false

Code:

clear *
cls
set seed 18
set sortseed 18

set obs 5000

gen id = _n
gen double r = runiform()

sort r
gen group = (_n - (mod(_n-1, 100) + 1) )/100 + 1

sort id
bys group: gen n4 = _n
bys group (id): gen n3 = _n

pwcorr n3 n4, sig
areg n3 n4, absorb(group)
assert n3==n4

Code:

. pwcorr n3 n4, sig

             |       n3       n4
-------------+------------------
          n3 |   1.0000
             |
             |
          n4 |   0.0285   1.0000
             |   0.0439
             |

. areg n3 n4, absorb(group)

Linear regression, absorbing indicators         Number of obs     =      5,000
Absorbed variable: group                        No. of categories =         50
                                                F(   1,   4949)   =       4.02
                                                Prob > F          =     0.0450
                                                R-squared         =     0.0008
                                                Adj R-squared     =    -0.0093
                                                Root MSE          =    29.0026

------------------------------------------------------------------------------
          n3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          n4 |   .0284925    .014209     2.01   0.045     .0006365    .0563486
       _cons |   49.06113   .8265098    59.36   0.000      47.4408    50.68145
------------------------------------------------------------------------------
F test of absorbed indicators: F(49, 4949) = -0.000           Prob > F = 1.000

. assert n3==n4
4,954 contradictions in 5,000 observations
assertion is false

dennis timo

Join Date: Sep 2023
Posts: 7

25 Sep 2023, 15:59

Thanks Daniel. This is now very strange. I've run your code on my machine (and with other seeds), and I still get t-statistics of 8 or 9 in the areg command. Could this have something to do with my machine or version? I'm using Stata/MP 17.0 (8-core).

Code:

clear
set seed 18
set sortseed 18

insobs 5000
gen id = _n
gen r = runiform()
sort r
gen group = (_n - mod(_n,100))/100
sort id
bys group: gen n4 = _n
bys group (id): gen n3 = _n
pwcorr n3 n4, sig
areg n3 n4, absorb(group)
assert n3==n4

Output:

Code:

. pwcorr n3 n4, sig

             |       n3       n4
-------------+------------------
          n3 |   1.0000 
             |
             |
          n4 |   0.1391   1.0000 
             |   0.0000
             |

. 
. areg n3 n4, absorb(group)

Linear regression, absorbing indicators            Number of obs     =   5,000
Absorbed variable: group                           No. of categories =      51
                                                   F(1, 4948)        =   96.94
                                                   Prob > F          =  0.0000
                                                   R-squared         =  0.0198
                                                   Adj R-squared     =  0.0097
                                                   Root MSE          = 28.7287

------------------------------------------------------------------------------
          n3 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          n4 |    .138617    .014079     9.85   0.000     .1110159    .1662181
       _cons |   43.48279    .818644    53.12   0.000     41.87788    45.08769
------------------------------------------------------------------------------
F test of absorbed indicators: F(50, 4948) = 0.044            Prob > F = 1.000

Last edited by dennis timo; 25 Sep 2023, 16:07.

Daniel Schaefer

Join Date: Mar 2020
Posts: 814

25 Sep 2023, 16:13

I've tried this on Stata/SE 17.0 for Windows (64-bit x86-64) and Stata/SE 16.1 for Windows (64-bit x86-64). Notably, the same seed produces different results across versions. However, there don't appear to be significant correlations on my end. Suppose we make sure to set the seed for your minimal example. Do we get the same results? Here are mine (Stata/SE 17.0 for Windows (64-bit x86-64)):

Code:

clear
set seed 18
set sortseed 18
insobs 5000
gen id = _n
gen r = runiform()
sort r
gen group = (_n - mod(_n,100))/100
sort id
bys group: gen n4 = _n
bys group (id): gen n3 = _n
pwcorr n3 n4, sig
areg n3 n4, absorb(group)

Code:

. pwcorr n3 n4, sig

             |       n3       n4
-------------+------------------
          n3 |   1.0000
             |
             |
          n4 |  -0.0086   1.0000
             |   0.5455
             |

. areg n3 n4, absorb(group)

Linear regression, absorbing indicators            Number of obs     =   5,000
Absorbed variable: group                           No. of categories =      51
                                                   F(1, 4948)        =    0.41
                                                   Prob > F          =  0.5199
                                                   R-squared         =  0.0007
                                                   Adj R-squared     = -0.0096
                                                   Root MSE          = 29.0075

------------------------------------------------------------------------------
          n3 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          n4 |  -.0091491   .0142157    -0.64   0.520    -.0370181    .0187198
       _cons |   50.94205   .8265896    61.63   0.000     49.32157    52.56253
------------------------------------------------------------------------------
F test of absorbed indicators: F(50, 4948) = 0.060            Prob > F = 1.000

Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#10

25 Sep 2023, 16:20

Hmmm... maybe the difference has something to do with multithreading in Stata/MP 17.0?

Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014
Posts: 346

#11

25 Sep 2023, 16:24

-sort- in Stata/MP first splits the data according to the number of processors in use, then sorts each section, and finally merges the sorted sections. The tie-breaking therefore depends not only on the initial randomization but also on how the data are split. Hence a different number of processors yields a different order for ties. For the above example, in Stata 17, if we set processors to 1, we get:

Code:

. set processors 1
    The maximum number of processors or cores being used is 1.  It can be set to any number between 1 and 4.

. clear

.
. set seed 18

.
. set sortseed 18

.
.
.
. insobs 5000
(5000 observations added)

.
. gen id = _n

.
. gen r = runiform()

.
. sort r

.
. gen group = (_n - mod(_n,100))/100

.
. sort id

.
. bys group: gen n4 = _n

.
. bys group (id): gen n3 = _n

.
. pwcorr n3 n4, sig

             |       n3       n4
-------------+------------------
          n3 |   1.0000
             |
             |
          n4 |  -0.0086   1.0000
             |   0.5455
             |

.
. areg n3 n4, absorb(group)

Linear regression, absorbing indicators            Number of obs     =   5,000
Absorbed variable: group                           No. of categories =      51
                                                   F(1, 4948)        =    0.41
                                                   Prob > F          =  0.5199
                                                   R-squared         =  0.0007
                                                   Adj R-squared     = -0.0096
                                                   Root MSE          = 29.0075

------------------------------------------------------------------------------
          n3 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          n4 |  -.0091491   .0142157    -0.64   0.520    -.0370181    .0187198
       _cons |   50.94205   .8265896    61.63   0.000     49.32157    52.56253
------------------------------------------------------------------------------
F test of absorbed indicators: F(50, 4948) = 0.060            Prob > F = 1.000

.
. assert n3==n4
4,947 contradictions in 5,000 observations
assertion is false
r(9);

And if we set processors to 4, we get:

Code:

. set processors 4
    The maximum number of processors or cores being used is changed from 1 to 4.  It can be set to any number between 1 and 4

. clear

.
. set seed 18

.
. set sortseed 18

.
.
.
. insobs 5000
(5000 observations added)

.
. gen id = _n

.
. gen r = runiform()

.
. sort r

.
. gen group = (_n - mod(_n,100))/100

.
. sort id

.
. bys group: gen n4 = _n

.
. bys group (id): gen n3 = _n

.
. pwcorr n3 n4, sig

             |       n3       n4
-------------+------------------
          n3 |   1.0000
             |
             |
          n4 |   0.1336   1.0000
             |   0.0000
             |

.
. areg n3 n4, absorb(group)

Linear regression, absorbing indicators            Number of obs     =   5,000
Absorbed variable: group                           No. of categories =      51
                                                   F(1, 4948)        =   89.26
                                                   Prob > F          =  0.0000
                                                   R-squared         =  0.0183
                                                   Adj R-squared     =  0.0082
                                                   Root MSE          = 28.7506

------------------------------------------------------------------------------
          n3 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          n4 |   .1331167   .0140897     9.45   0.000     .1054946    .1607388
       _cons |   43.76044   .8192675    53.41   0.000     42.15431    45.36657
------------------------------------------------------------------------------
F test of absorbed indicators: F(50, 4948) = 0.045            Prob > F = 1.000

.
. assert n3==n4
4,934 contradictions in 5,000 observations
assertion is false
r(9);

Note: the default sort algorithm changed between Stata 16 and Stata 17. Running the above example in Stata 16 will give similar, but not identical, results to Stata 17. In Stata 17, if you would like to get the same results as in Stata 16, either put -version 16- in front or use -set sortmethod qsort-.
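
Pulling together the settings mentioned in this thread, a preamble for reproducing a given tie order across runs might be sketched as below. It assumes Stata 17; -set sortmethod- is taken from the note above and is left commented out, and the -c(MP)- guard is added here so the block also runs on non-MP flavors. Relying on the tie order is still not recommended; a unique sort key avoids the issue entirely.

```stata
version 17                  // assumption: Stata 17 or later
if c(MP) {
    set processors 1        // Stata/MP only: ties no longer depend on how the data are split
}
set seed 18                 // seed for runiform()
set sortseed 18             // seed for the jumbling -sort- performs before sorting
* set sortmethod qsort      // per the note above: Stata 16 sort behavior in Stata 17

display c(processors)       // number of processors currently in use
```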

Last edited by Hua Peng (StataCorp); 25 Sep 2023, 16:28.

dennis timo

Join Date: Sep 2023
Posts: 7

#12

25 Sep 2023, 16:25

Yep. I still get the same (and a colleague of mine, also on Stata/MP 17.0, is getting the same):

Code:

clear
set seed 18
set sortseed 18
insobs 5000
gen id = _n
gen r = runiform()
sort r
gen group = (_n - mod(_n,100))/100
sort id
bys group: gen n4 = _n
bys group (id): gen n3 = _n
pwcorr n3 n4, sig
areg n3 n4, absorb(group)

Code:

. pwcorr n3 n4, sig

             |       n3       n4
-------------+------------------
          n3 |   1.0000
             |
             |
          n4 |   0.1391   1.0000
             |   0.0000
             |

.
. areg n3 n4, absorb(group)

Linear regression, absorbing indicators            Number of obs     =   5,000
Absorbed variable: group                           No. of categories =      51
                                                   F(1, 4948)        =   96.94
                                                   Prob > F          =  0.0000
                                                   R-squared         =  0.0198
                                                   Adj R-squared     =  0.0097
                                                   Root MSE          = 28.7287

------------------------------------------------------------------------------
          n3 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          n4 |    .138617    .014079     9.85   0.000     .1110159    .1662181
       _cons |   43.48279    .818644    53.12   0.000     41.87788    45.08769
------------------------------------------------------------------------------
F test of absorbed indicators: F(50, 4948) = 0.044            Prob > F = 1.000

Last edited by dennis timo; 25 Sep 2023, 16:33.

dennis timo

Join Date: Sep 2023

Posts: 7
#13

25 Sep 2023, 16:36

Yes, I checked -set processors- as well. This did the trick. OK, thank you all so much! I learned a lot!