Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reasonable randomly shuffling values within groups

    Hi everyone!

    I have data on students exam scores and dates. Here is a subset of the data:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long id int test_date float(day_order week n_week n_day)
    1 16560 1 283 1 1
    1 16579 1 286 1 1
    1 16589 1 287 2 2
    1 16589 2 287 2 2
    1 16602 1 289 1 1
    1 16615 1 291 2 1
    1 16618 1 291 2 1
    4 15465 1 127 3 1
    4 15467 1 127 3 2
    4 15467 2 127 3 2
    4 15481 1 129 1 1
    4 15494 1 131 1 1
    4 15500 1 132 1 1
    4 15517 1 134 1 1
    8 15833 1 179 1 1
    8 15847 1 181 1 1
    8 15858 1 183 2 2
    8 15858 2 183 2 2
    8 15864 1 184 2 1
    8 15866 1 184 2 1
    8 15871 1 185 3 1
    8 15873 1 185 3 2
    8 15873 2 185 3 2
    8 15878 1 186 2 1
    8 15882 1 186 2 1
    end
    format %td test_date
    Variables description:
    • id: student identifier
    • test_date: the date of the exam
    • day_order: the ordinary number of the exam in the day. For example, if two exams were taken on the same date, the first gets day_order == 1 and the second gets day_order == 2
    • week: the ordinary number of the week
    • n_week: number of exams in the week
    • n_day: number of exams in the day
    I want to randomly shuffle the values of n_week and n_day within each student. For instance, for id == 4, assign randomly the values 1 or 3 for n_week and the values 1 or 2 to n_day. Note that the n_week and n_day should be shuffled independently, i.e., the shuffling is not done on the "real" combinations of n_week and n_day obtained by the original data. Here is an example of such a shuffle:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long id int test_date float(day_order week n_week n_day)
    1 16560 1 283 2 1
    1 16579 1 286 2 2
    1 16589 1 287 2 1
    1 16589 2 287 1 1
    1 16602 1 289 1 1
    1 16615 1 291 1 2
    1 16618 1 291 2 1
    4 15465 1 127 3 1
    4 15467 1 127 1 1
    4 15467 2 127 1 2
    4 15481 1 129 3 1
    4 15494 1 131 1 1
    4 15500 1 132 3 1
    4 15517 1 134 1 2
    8 15833 1 179 3 1
    8 15847 1 181 2 1
    8 15858 1 183 2 1
    8 15858 2 183 2 2
    8 15864 1 184 2 2
    8 15866 1 184 3 1
    8 15871 1 185 1 2
    8 15873 1 185 3 1
    8 15873 2 185 2 1
    8 15878 1 186 1 2
    8 15882 1 186 2 1
    end
    format %td test_date
    However, I want the shuffle to generate reasonable combinations of n_week and n_day. For example, the shuffled combination in the sixth row (id == 1 and test_date == 16615) does not make sense because the number of weekly exams is 1 (n_week == 1) and the number of daily exams is 2 (n_day == 2). Put it differently, n_day should not exceed n_week.


    I'd appreciate your help!

  • #2
    Thanks for the data example!

    Sorting columns independent of other columns is surprisingly difficult in Stata. That said, you should be able to do this with frames:

    Code:
    gen long obsnum = _n
    foreach var in n_week n_day{
        frame put id `var', into(`var'_frame)
        frame `var'_frame {
            gen randnum = runiform()
            rename `var' `var'_rand
            sort id randnum
            gen long obsnum = _n
        }
        frlink 1:1 obsnum, frame(`var'_frame)
        frget `var'_rand, from(`var'_frame)
        frame drop `var'_frame
        drop `var'_frame
    }
    drop obsnum
    list, clean noobs
    Code:
    . list, clean noobs
    
        id   test_date   day_or~r   week   n_week   n_day   n_week~d   n_day_~d  
         1   04may2005          1    283        1       1          1          1  
         1   23may2005          1    286        1       1          1          2  
         1   02jun2005          1    287        2       2          2          2  
         1   02jun2005          2    287        2       2          2          1  
         1   15jun2005          1    289        1       1          2          1  
         1   28jun2005          1    291        2       1          2          1  
         1   01jul2005          1    291        2       1          1          1  
         4   05may2002          1    127        3       1          3          1  
         4   07may2002          1    127        3       2          1          1  
         4   07may2002          2    127        3       2          3          2  
         4   21may2002          1    129        1       1          1          2  
         4   03jun2002          1    131        1       1          1          1  
         4   09jun2002          1    132        1       1          1          1  
         4   26jun2002          1    134        1       1          3          1  
         8   08may2003          1    179        1       1          1          1  
         8   22may2003          1    181        1       1          2          1  
         8   02jun2003          1    183        2       2          3          1  
         8   02jun2003          2    183        2       2          2          1  
         8   08jun2003          1    184        2       1          2          2  
         8   10jun2003          1    184        2       1          1          2  
         8   15jun2003          1    185        3       1          2          2  
         8   17jun2003          1    185        3       2          2          1  
         8   17jun2003          2    185        3       2          3          1  
         8   22jun2003          1    186        2       1          2          1  
         8   26jun2003          1    186        2       1          3          2
    Edit: I may not have read carefully enough. I'm not sure what you mean by this:

    However, I want the shuffle to generate reasonable combinations of n_week and n_day. For example, the shuffled combination in the sixth row (id == 1 and test_date == 16615) does not make sense because the number of weekly exams is 1 (n_week == 1) and the number of daily exams is 2 (n_day == 2). Put it differently, n_day should not exceed n_week.
    Hopefully what I have above gets you a little closer to what you want.
    Last edited by Daniel Schaefer; 04 Dec 2023, 15:02.

    Comment


    • #3
      Okay, I think I see what you mean. So you definitely don't want n_week and n_day strictly independent: you want to guarantee that n_day is less than n_week. Okay, that complicates things a bit. The following should meet your needs. Make sure the assertion at the end passes.

      Code:
      // STEP 1: shuffle n_week randomly
      gen long obsnum = _n
      frame put id n_week, into(n_week_frame)
      frame n_week_frame {
          gen randnum = runiform()
          rename n_week n_week_rand
          sort id randnum
          gen long obsnum = _n
      }
      frlink 1:1 obsnum, frame(n_week_frame)
      frget n_week_rand, from(n_week_frame)
      frame drop n_week_frame
      drop n_week_frame
      drop obsnum
      
      // STEP 2: Sort n_day so that it does not exceed n_week
      sort id n_week_rand
      gen long obsnum = _n
      frame put id n_day, into(n_day_sort_frame)
      frame n_day_sort_frame {
          sort id n_day
          gen long obsnum = _n
      }
      drop n_day
      frlink 1:1 obsnum, frame(n_day_sort_frame)
      frget n_day, from(n_day_sort_frame)
      frame drop n_day_sort_frame
      drop n_day_sort_frame
      drop obsnum
      
      // STEP 3: Sort n_day randomly within id and n_weeks_rand
      gen long obsnum = _n
      frame put id n_day n_week_rand, into(n_day_frame)
      frame n_day_frame {
          gen randnum = runiform()
          rename n_day n_day_rand
          sort id n_week_rand randnum
          gen long obsnum = _n
      }
      frlink 1:1 obsnum, frame(n_day_frame)
      frget n_day_rand, from(n_day_frame)
      frame drop n_day_frame
      drop n_day_frame
      drop obsnum
      
      assert n_day_rand <= n_week_rand
      Code:
      . list, clean noobs
      
          id   test_date   day_or~r   week   n_week   n_week~d   n_day   n_day_~d  
           1   04may2005          1    283        1          1       1          1  
           1   02jun2005          1    287        2          1       1          1  
           1   23may2005          1    286        1          1       1          1  
           1   01jul2005          1    291        2          2       1          1  
           1   15jun2005          1    289        1          2       1          1  
           1   28jun2005          1    291        2          2       2          2  
           1   02jun2005          2    287        2          2       2          2  
           4   05may2002          1    127        3          1       1          1  
           4   03jun2002          1    131        1          1       1          1  
           4   07may2002          2    127        3          1       1          1  
           4   26jun2002          1    134        1          1       1          1  
           4   07may2002          1    127        3          3       1          2  
           4   09jun2002          1    132        1          3       2          1  
           4   21may2002          1    129        1          3       2          2  
           8   10jun2003          1    184        2          1       1          1  
           8   17jun2003          2    185        3          1       1          1  
           8   02jun2003          1    183        2          2       1          2  
           8   02jun2003          2    183        2          2       1          1  
           8   22may2003          1    181        1          2       1          1  
           8   26jun2003          1    186        2          2       1          1  
           8   17jun2003          1    185        3          2       1          1  
           8   15jun2003          1    185        3          2       2          1  
           8   08may2003          1    179        1          3       2          2  
           8   08jun2003          1    184        2          3       2          2  
           8   22jun2003          1    186        2          3       2          2
      Edit: I keep your original n_day/week variables and create new variables called n_day/week_rand. You could alternatively clean up the data like this at the end:

      Code:
      drop n_day n_week
      rename n_day_rand n_day
      rename n_week_rand n_week
      Last edited by Daniel Schaefer; 04 Dec 2023, 15:49.

      Comment


      • #4
        Does this do what you want (for whatever reason)?

        Code:
        assert n_day <= n_week
        
        tempvar sort_order
        generate `c(obs_t)' `sort_order' = _n
        
        preserve
        
        drop if n_day == n_week 
        
        tempvar random n
        
        foreach var in n_week n_day {
            
            generate double `random' = runiform()
            sort id `random'
            by id (`random') : generate long `n' = _n
            sort id `sort_order'
            by id : generate `var'_random = `var'[`n']
            drop `n'
            drop `random'
            
        }
        
        tempfile tmp
        save "`tmp'"
        
        restore
        
        merge 1:1 `sort_order' id using "`tmp'" ///
            , keepusing(*random) nogenerate assert(master match)
        
        replace n_week_random = n_week if mi(n_week_random)
        replace n_day_random  = n_day  if mi(n_day_random)
        
        sort id `sort_order'
        
        assert n_day <= n_week

        Comment

        Working...
        X