  • Sorting not unique, and not random (internal sort randomizer)

    I have a question regarding the randomisation engine used in combination with the ‘sort’ command. I want to sample randomly within strata, and to do this, I first sort uniquely (by hh_id, which is unique in the dataset), and create a random variable r8 to sort. Below is the code extract, and output:

    set seed 28479019
    set sortseed 4874290
    sort hh_id
    gen r8 = runiform()
    label var r8 "random variable for drawing sample"
    gen n1 = _n
    bys sample_strata: gen n4 = _n
    bys sample_strata (n1): gen n3 = _n
    corr n3 n4
    areg n4 n3, absorb(sample_strata) cluster(sample_strata)

    ---------------------

    . corr n3 n4
    (obs=3,936)

    | n3 n4
    -------------+------------------
    n3 | 1.0000
    n4 | 0.6843 1.0000


    . areg n4 n3, absorb(sample_strata) cluster(sample_strata)

    Linear regression, absorbing indicators Number of obs = 3,172
    Absorbed variable: sample_strata No. of categories = 69
    F(1, 68) = 5.02
    Prob > F = 0.0283
    R-squared = 0.4676
    Adj R-squared = 0.4558
    Root MSE = 33.0733

    (Std. err. adjusted for 69 clusters in sample_strata)
    ------------------------------------------------------------------------------
    | Robust
    n4 | Coefficient std. err. t P>|t| [95% conf. interval]
    -------------+----------------------------------------------------------------
    n3 | .0848091 .0378442 2.24 0.028 .0092921 .1603261
    _cons | 44.35502 1.834133 24.18 0.000 40.69507 48.01497
    ------------------------------------------------------------------------------

    ---------------------------

    n3 should give the order within sample_strata by hh_id, which it does. But n4 is not identical, even though the data were previously sorted by hh_id. The Stata documentation seems to suggest that every time I use a ‘sort’ or ‘bysort’ command where a group (here ‘sample_strata’) contains multiple observations, Stata randomly orders the observations within it. However, the correlation and areg results suggest that the sorting within group was:
    1. NOT the previous sorting (by hh_id)
    2. NOT fully random, since the correlation between the new sorting and the previous sorting by hh_id is still quite high (and I've tried this with various seeds, and it gives the same conclusion)
    So, I was wondering how exactly the random ordering within group after ‘sort’ or ‘bysort’ is implemented? Is it fully random? If not, how is it not fully random?

  • #2
    The 'sort' helpfile says the following, but it would be great to learn exactly what the algorithm does:

    ''It turns out that our first results were also randomly ordered. That is true because sort performs a
    quick randomized jumbling before sorting. We were already getting a randomized order within the ties.
    Do not use this in practice. The randomization performed by sort is designed solely to make sort
    faster by preventing any possibility of an initial ordering that defeats the sort algorithm and makes the
    sorting much slower. If you want a random ordering within ties, then use a random-number generator
    with good properties like the one implemented in runiform(). For more about the random-number
    generator, see [R] set seed and the references therein.''
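
A minimal sketch of what the help file recommends, assuming toy data: the variable names id, group, u, and rand_rank are illustrative and not from the original posts. The random order within each group comes from an explicit runiform() key, and id is appended as a final tie-breaker so the sort key is unique. Run it as a do-file.

```stata
clear
set seed 28479019                 // seed for runiform(); value borrowed from post #1
set obs 1000
gen long id = _n                  // unique identifier (toy data)
gen int group = ceil(_n/100)      // 10 groups of 100 observations each
gen double u = runiform()         // explicit random key
* Sort on group, then the random key; id breaks any remaining ties
sort group u id
by group: gen int rand_rank = _n  // random position within group
isid group rand_rank              // each (group, rank) pair occurs exactly once
```

Because the full key (group u id) is unique, the resulting order depends only on the seed and the order of the data when u was generated, not on how sort breaks ties.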


    • #3
      You have a few things going on with your dataset that make it difficult to follow what exactly is happening without seeing your data.

      Originally posted by dennis timo:
      I first sort uniquely (by hh_id, which is unique in the dataset),....
      Ok, but then you immediately re-sort by strata. My guess is that sorting by hh_id alone would not arrange the data in such a way as to form discrete blocks that each map directly to one stratum.

      Originally posted by dennis timo:
      ... and create a random variable r8 to sort.
      Ok, but you never use it.

      As I understand it, Stata resolves ties internally by random order only if there is a tie in the sort key(s). Further, many people on this forum feel that it is bad practice to sort on a non-unique key. It is worse practice to sort by non-unique keys only to then rely on data that will not be guaranteed to exist in some expected order.
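
A small sketch of two ways to avoid depending on tie-breaking, assuming toy data that reuse the names hh_id and sample_strata from #1 (n3_again and n_stable are illustrative). The first adds the unique id as a tie-breaker in parentheses; the second uses the stable option of -sort-, which keeps the existing relative order within ties.

```stata
clear
set obs 500
gen long hh_id = _n                         // unique household id (toy data)
gen int sample_strata = mod(_n, 5) + 1      // 5 strata interleaved across hh_id
isid hh_id                                  // stops with an error if hh_id is not unique

* Option 1: unique sort key (sample_strata hh_id), grouped by sample_strata
bysort sample_strata (hh_id): gen int n3 = _n
bysort sample_strata (hh_id): gen int n3_again = _n
assert n3 == n3_again                       // same order every time

* Option 2: stable sort preserves the previous hh_id order within strata
sort hh_id
sort sample_strata, stable
by sample_strata: gen int n_stable = _n
assert n_stable == n3
```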

      Can you distill this to a minimal working example?


      Originally posted by dennis timo:
      I want to sample randomly within strata
      Then you can use -gsample- for this if you don't want to do it by hand. If you want to do it by hand, why not push your unique stratum and hh_id identifiers to a separate dataset, randomly select them, then merge with your existing data?
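
One way the by-hand draw could look, as a sketch on toy data. The names hh_id, sample_strata, and r8 follow #1; the sample size k and the indicator sampled are hypothetical. The data are put in a unique order before runiform() is called, and the random key is actually used in the sort, with hh_id as a final tie-breaker. Run it as a do-file.

```stata
clear
set seed 28479019                        // value borrowed from post #1
set obs 500
gen long hh_id = _n                      // unique household id (toy data)
gen int sample_strata = mod(_n, 5) + 1   // 5 strata of 100 households each
local k 10                               // households to draw per stratum (hypothetical)

sort hh_id                               // fix the order before drawing random numbers
gen double r8 = runiform()
* Within each stratum, order by the random key and keep the first k
bysort sample_strata (r8 hh_id): gen byte sampled = _n <= `k'
tab sample_strata sampled
```

With a fixed seed and a unique sort key, the same households are selected on every run, whatever tie-breaking rule -sort- uses.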


      • #4
        Thanks Leonardo. I agree it is bad practice, and I would not use this code. But -- if this happens by mistake, I would like to know what exactly the sorting algorithm does in the case of ties. It clearly does not seem to fully randomly order within ties, but it also clearly does not maintain the previous ordering within the ties. That's what I'm trying to find out.


        • #5
          Here is a minimal example. I create random groups, first sort by id, then run a bysort group: command. The bysort neither maintains the previous ranking by id (n3), nor is it fully random (since the ranking n4 is still systematically correlated with n3, the previous ranking by id). So the question: What exactly is the sorting algorithm doing when there are ties?

          insobs 5000
          gen id = _n
          gen r = runiform()
          sort r
          gen group = (_n - mod(_n,100))/100
          sort id
          bys group: gen n4 = _n
          bys group (id): gen n3 = _n
          corr n3 n4
          areg n3 n4, absorb(group)


          • #6
            Thanks for the example. I took the liberty to make some slight changes to create the (intended) group size of 100.

            First, let's take the random uniform out of the equation. The code below creates identical sort orders based on n3 and n4.

            Code:
            clear *
            cls
            set seed 18
            set sortseed 18
            
            set obs 5000
            
            gen id = _n
            gen group = (_n - (mod(_n-1, 100) + 1) )/100 + 1
            
            sort id
            bys group: gen n4 = _n
            bys group (id): gen n3 = _n
            assert n3==n4
            Now let's look at the example closer to your own in #5.

            Code:
            clear *
            cls
            set seed 18
            set sortseed 18
            
            set obs 5000
            
            gen id = _n
            gen double r = runiform()
            
            sort r
            gen group = (_n - (mod(_n-1, 100) + 1) )/100 + 1
            
            sort id
            bys group: gen n4 = _n
            list in 1/10 // notice the random sort order if you omit the -set sortseed- statement and run this code multiple times
            bys group (id): gen n3 = _n
            assert n3==n4
            This code creates a group id based on a randomized ordering determined by the random number -r-. This order is uniquely determined, so it is deterministic.
            Second, the data are sorted by -id-, resulting in randomized group numbers in the resulting dataset. Again, this is deterministic -- id is unique.
            Next, the data are sorted by group. Here we have our first non-deterministic sort because there are ties in -group-, namely among all the members of the same group. This is where Stata uses the sortseed to break ties arbitrarily, essentially using the same technique as we used with -r- to randomly order the data within tied groups.
            The final bysort is again deterministic, first sorting by group and id, then performing operations by group.
            There is no reason to expect n3 and n4 to be identical, hence the error after the -assert- statement.


            • #7
              I think there may be a small misunderstanding in Leonardo's otherwise excellent explanation in #6. I don't think OP necessarily expects n3 and n4 to be the same: OP is surprised that n3 and n4 are strongly correlated in #1 and uses the -areg- command to try to control for any between group variance. The relationship is still significant in the -areg-.

              I just want to take a moment to point out that neither the code in #5 nor the code in #6 produces strong correlations between n3 and n4 on my machine. One run does technically show a significant (p < 0.05) correlation, but I tried a few different seeds and the significant result went away on every other seed I tried, so this appears to be a consequence of random chance.

              So, OP, it seems like neither your example code nor Leonardo's code can reproduce the strong correlation shown in #1. I'm skeptical that the tie breaking algorithm is anything but pseudorandom. There must be something else going on in your data or your code that isn't accounted for here.

              Code:
              clear
              insobs 5000
              gen id = _n
              gen r = runiform()
              sort r
              gen group = (_n - mod(_n,100))/100
              sort id
              bys group: gen n4 = _n
              bys group (id): gen n3 = _n
              pwcorr n3 n4, sig
              areg n3 n4, absorb(group)
              assert n3==n4
              Code:
              . pwcorr n3 n4, sig
              
                           |       n3       n4
              -------------+------------------
                        n3 |   1.0000
                           |
                           |
                        n4 |   0.0085   1.0000
                           |   0.5494
                           |
              
              . areg n3 n4, absorb(group)
              
              Linear regression, absorbing indicators         Number of obs     =      5,000
              Absorbed variable: group                        No. of categories =         51
                                                              F(   1,   4948)   =       0.31
                                                              Prob > F          =     0.5795
                                                              R-squared         =     0.0007
                                                              Adj R-squared     =    -0.0096
                                                              Root MSE          =    29.0078
              
              ------------------------------------------------------------------------------
                        n3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                        n4 |   .0078784   .0142158     0.55   0.579    -.0199909    .0357477
                     _cons |    50.0825   .8265985    60.59   0.000       48.462      51.703
              ------------------------------------------------------------------------------
              F test of absorbed indicators: F(50, 4948) = 0.058            Prob > F = 1.000
              
              . assert n3==n4
              4,946 contradictions in 5,000 observations
              assertion is false
              Code:
              clear *
              cls
              set seed 18
              set sortseed 18
              
              set obs 5000
              
              gen id = _n
              gen double r = runiform()
              
              sort r
              gen group = (_n - (mod(_n-1, 100) + 1) )/100 + 1
              
              sort id
              bys group: gen n4 = _n
              bys group (id): gen n3 = _n
              
              pwcorr n3 n4, sig
              areg n3 n4, absorb(group)
              assert n3==n4
              Code:
              . pwcorr n3 n4, sig
              
                           |       n3       n4
              -------------+------------------
                        n3 |   1.0000
                           |
                           |
                        n4 |   0.0285   1.0000
                           |   0.0439
                           |
              
              . areg n3 n4, absorb(group)
              
              Linear regression, absorbing indicators         Number of obs     =      5,000
              Absorbed variable: group                        No. of categories =         50
                                                              F(   1,   4949)   =       4.02
                                                              Prob > F          =     0.0450
                                                              R-squared         =     0.0008
                                                              Adj R-squared     =    -0.0093
                                                              Root MSE          =    29.0026
              
              ------------------------------------------------------------------------------
                        n3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                        n4 |   .0284925    .014209     2.01   0.045     .0006365    .0563486
                     _cons |   49.06113   .8265098    59.36   0.000      47.4408    50.68145
              ------------------------------------------------------------------------------
              F test of absorbed indicators: F(49, 4949) = -0.000           Prob > F = 1.000
              
              . assert n3==n4
              4,954 contradictions in 5,000 observations
              assertion is false


              • #8
                Thanks Daniel. This is now very strange. I've run your code on my machine (and with other seeds), and I still get t-statistics of 8 to 9 in the areg command. Could this have something to do with my machine or version? I'm using Stata/MP 17.0 (8-core).

                Code:
                clear
                set seed 18
                set sortseed 18
                
                insobs 5000
                gen id = _n
                gen r = runiform()
                sort r
                gen group = (_n - mod(_n,100))/100
                sort id
                bys group: gen n4 = _n
                bys group (id): gen n3 = _n
                pwcorr n3 n4, sig
                areg n3 n4, absorb(group)
                assert n3==n4
                Output:

                Code:
                . pwcorr n3 n4, sig
                
                             |       n3       n4
                -------------+------------------
                          n3 |   1.0000 
                             |
                             |
                          n4 |   0.1391   1.0000 
                             |   0.0000
                             |
                
                . 
                . areg n3 n4, absorb(group)
                
                Linear regression, absorbing indicators            Number of obs     =   5,000
                Absorbed variable: group                           No. of categories =      51
                                                                   F(1, 4948)        =   96.94
                                                                   Prob > F          =  0.0000
                                                                   R-squared         =  0.0198
                                                                   Adj R-squared     =  0.0097
                                                                   Root MSE          = 28.7287
                
                ------------------------------------------------------------------------------
                          n3 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                          n4 |    .138617    .014079     9.85   0.000     .1110159    .1662181
                       _cons |   43.48279    .818644    53.12   0.000     41.87788    45.08769
                ------------------------------------------------------------------------------
                F test of absorbed indicators: F(50, 4948) = 0.044            Prob > F = 1.000
                Last edited by dennis timo; 25 Sep 2023, 16:07.


                • #9
                  I've tried this on Stata/SE 17.0 for Windows (64-bit x86-64) and Stata/SE 16.1 for Windows (64-bit x86-64). Notably, the same seed produces different results across versions. However, there don't appear to be significant correlations on my end. Suppose we make sure to set the seed for your minimal example. Do we get the same results? Here are mine (Stata/SE 17.0 for Windows (64-bit x86-64)):

                  Code:
                  clear
                  set seed 18
                  set sortseed 18
                  insobs 5000
                  gen id = _n
                  gen r = runiform()
                  sort r
                  gen group = (_n - mod(_n,100))/100
                  sort id
                  bys group: gen n4 = _n
                  bys group (id): gen n3 = _n
                  pwcorr n3 n4, sig
                  areg n3 n4, absorb(group)
                  Code:
                  . pwcorr n3 n4, sig
                  
                               |       n3       n4
                  -------------+------------------
                            n3 |   1.0000
                               |
                               |
                            n4 |  -0.0086   1.0000
                               |   0.5455
                               |
                  
                  . areg n3 n4, absorb(group)
                  
                  Linear regression, absorbing indicators            Number of obs     =   5,000
                  Absorbed variable: group                           No. of categories =      51
                                                                     F(1, 4948)        =    0.41
                                                                     Prob > F          =  0.5199
                                                                     R-squared         =  0.0007
                                                                     Adj R-squared     = -0.0096
                                                                     Root MSE          = 29.0075
                  
                  ------------------------------------------------------------------------------
                            n3 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                            n4 |  -.0091491   .0142157    -0.64   0.520    -.0370181    .0187198
                         _cons |   50.94205   .8265896    61.63   0.000     49.32157    52.56253
                  ------------------------------------------------------------------------------
                  F test of absorbed indicators: F(50, 4948) = 0.060            Prob > F = 1.000


                  • #10
                    Hmmm... maybe the difference has something to do with multithreading in Stata/MP 17.0?


                    • #11
                      -sort- in Stata/MP first splits the data according to how many processors are in use, then sorts each section of the data, then performs a merge sort of the sorted sections. The tie break depends not only on the initial randomization but also on how the data are split. Hence a different number of processors will yield a different order for ties. For the above example, in Stata 17, if we set processors to 1, we get:

                      Code:
                      . set processors 1
                          The maximum number of processors or cores being used is 1.  It can be set to any number between 1 and 4.
                      
                      . clear
                      
                      .
                      . set seed 18
                      
                      .
                      . set sortseed 18
                      
                      .
                      .
                      .
                      . insobs 5000
                      (5000 observations added)
                      
                      .
                      . gen id = _n
                      
                      .
                      . gen r = runiform()
                      
                      .
                      . sort r
                      
                      .
                      . gen group = (_n - mod(_n,100))/100
                      
                      .
                      . sort id
                      
                      .
                      . bys group: gen n4 = _n
                      
                      .
                      . bys group (id): gen n3 = _n
                      
                      .
                      . pwcorr n3 n4, sig
                      
                                   |       n3       n4
                      -------------+------------------
                                n3 |   1.0000
                                   |
                                   |
                                n4 |  -0.0086   1.0000
                                   |   0.5455
                                   |
                      
                      .
                      . areg n3 n4, absorb(group)
                      
                      Linear regression, absorbing indicators            Number of obs     =   5,000
                      Absorbed variable: group                           No. of categories =      51
                                                                         F(1, 4948)        =    0.41
                                                                         Prob > F          =  0.5199
                                                                         R-squared         =  0.0007
                                                                         Adj R-squared     = -0.0096
                                                                         Root MSE          = 29.0075
                      
                      ------------------------------------------------------------------------------
                                n3 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                      -------------+----------------------------------------------------------------
                                n4 |  -.0091491   .0142157    -0.64   0.520    -.0370181    .0187198
                             _cons |   50.94205   .8265896    61.63   0.000     49.32157    52.56253
                      ------------------------------------------------------------------------------
                      F test of absorbed indicators: F(50, 4948) = 0.060            Prob > F = 1.000
                      
                      .
                      . assert n3==n4
                      4,947 contradictions in 5,000 observations
                      assertion is false
                      r(9);
                      And if we set processors to 4, we get:

                      Code:
                      . set processors 4
                          The maximum number of processors or cores being used is changed from 1 to 4.  It can be set to any number between 1 and 4
                      
                      . clear
                      
                      .
                      . set seed 18
                      
                      .
                      . set sortseed 18
                      
                      .
                      .
                      .
                      . insobs 5000
                      (5000 observations added)
                      
                      .
                      . gen id = _n
                      
                      .
                      . gen r = runiform()
                      
                      .
                      . sort r
                      
                      .
                      . gen group = (_n - mod(_n,100))/100
                      
                      .
                      . sort id
                      
                      .
                      . bys group: gen n4 = _n
                      
                      .
                      . bys group (id): gen n3 = _n
                      
                      .
                      . pwcorr n3 n4, sig
                      
                                   |       n3       n4
                      -------------+------------------
                                n3 |   1.0000
                                   |
                                   |
                                n4 |   0.1336   1.0000
                                   |   0.0000
                                   |
                      
                      .
                      . areg n3 n4, absorb(group)
                      
                      Linear regression, absorbing indicators            Number of obs     =   5,000
                      Absorbed variable: group                           No. of categories =      51
                                                                         F(1, 4948)        =   89.26
                                                                         Prob > F          =  0.0000
                                                                         R-squared         =  0.0183
                                                                         Adj R-squared     =  0.0082
                                                                         Root MSE          = 28.7506
                      
                      ------------------------------------------------------------------------------
                                n3 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                      -------------+----------------------------------------------------------------
                                n4 |   .1331167   .0140897     9.45   0.000     .1054946    .1607388
                             _cons |   43.76044   .8192675    53.41   0.000     42.15431    45.36657
                      ------------------------------------------------------------------------------
                      F test of absorbed indicators: F(50, 4948) = 0.045            Prob > F = 1.000
                      
                      .
                      . assert n3==n4
                      4,934 contradictions in 5,000 observations
                      assertion is false
                      r(9);
                      Note: the default sort algorithm changed between Stata 16 and Stata 17. Running the above example in Stata 16 gives similar but not identical results to Stata 17. In Stata 17, if you would like to get the same results as Stata 16, either prefix the command with version 16 or set sortmethod qsort.
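
A hedged sketch of how the settings discussed above could be pinned at the top of a do-file. It assumes Stata 17; the c(MP) guard is there because -set processors- is available only in Stata/MP. The group formula follows #6 in spirit (50 groups of 100) and is not part of the example above. Pinning these settings helps when studying tie behaviour, but the safer habit is still a unique sort key, as in the last two lines.

```stata
clear
* Only Stata/MP has -set processors-; skip it in other flavors
if c(MP) == 1 {
    set processors 1
}
set seed 18
set sortseed 18

set obs 5000
gen long id = _n
gen double r = runiform()
sort r
gen int group = ceil(_n/100)          // 50 groups of exactly 100
sort id
* Unique key (group id): the order cannot depend on how ties are broken
bysort group (id): gen int n3 = _n
isid group n3
```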
                      Last edited by Hua Peng (StataCorp); 25 Sep 2023, 16:28.


                      • #12
                        Yep. I still get the same (and a colleague of mine, also on Stata/MP 17.0, is getting the same):

                        Code:
                        clear
                        set seed 18
                        set sortseed 18
                        insobs 5000
                        gen id = _n
                        gen r = runiform()
                        sort r
                        gen group = (_n - mod(_n,100))/100
                        sort id
                        bys group: gen n4 = _n
                        bys group (id): gen n3 = _n
                        pwcorr n3 n4, sig
                        areg n3 n4, absorb(group)


                        Code:
                        . pwcorr n3 n4, sig
                        
                                     |       n3       n4
                        -------------+------------------
                                  n3 |   1.0000
                                     |
                                     |
                                  n4 |   0.1391   1.0000
                                     |   0.0000
                                     |
                        
                        .
                        . areg n3 n4, absorb(group)
                        
                        Linear regression, absorbing indicators            Number of obs     =   5,000
                        Absorbed variable: group                           No. of categories =      51
                                                                           F(1, 4948)        =   96.94
                                                                           Prob > F          =  0.0000
                                                                           R-squared         =  0.0198
                                                                           Adj R-squared     =  0.0097
                                                                           Root MSE          = 28.7287
                        
                        ------------------------------------------------------------------------------
                                  n3 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                        -------------+----------------------------------------------------------------
                                  n4 |    .138617    .014079     9.85   0.000     .1110159    .1662181
                               _cons |   43.48279    .818644    53.12   0.000     41.87788    45.08769
                        ------------------------------------------------------------------------------
                        F test of absorbed indicators: F(50, 4948) = 0.044            Prob > F = 1.000
                        Last edited by dennis timo; 25 Sep 2023, 16:33.


                        • #13
                          Yes, I checked 'set processors' as well. This did the trick. OK, thank you all so much! I learned a lot!
