Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Simulating partially crossed (or non-nested) data

    I am interested in simulating data in which two factors are partially as opposed to fully crossed.

    In data with fully crossed factors, all individuals are exposed to all cases of the other factor. An example of this would be a psychology study in which individual raters (factor 1) view the same set of targets (factor 2). Say there are 10 targets total. Each rater views all 10 targets. Simulating such fully crossed data has been shown in presentations by Isabel Canette (StataCorp) here and Joseph Coveney in this thread.

    The simulation I'm currently trying to figure out is the partially crossed case. Going back to the psychology study example, say rater 1 views targets 1-4, rater 2 views targets 3-6, rater 3 views targets 5-8, rater 4 views targets 7-10, rater 5 views targets 1-2 and 6-7. With this few raters, it is easy enough to code up, but beginning to be tricky b/c you have to code each rater separately:
    Code:
    clear* 
    set version 16
    set seed 1357 
    set sortseed 793 
    
    * raters (rid)
    set obs 5                             // number of raters
    gen rid = _n                        // rater identifier 
    
    *targets (tid)
    expand 4 
    replace tid = _n-2 if rid==2
    replace tid = _n-4 if rid==3
    replace tid = _n-6 if rid==4
    replace tid = _n-16 if rid==5
    replace tid = _n-13 if rid==5 & rating_num>=3
    My question how do I generalize the code such that I can simulate a partially crossed structure for hundreds of individual raters and 50+ targets. Is there a way to narrow the possibilities (e.g., 30% of raters share at least two targets) to make the coding more tractable?

    Any insights are greatly appreciated.

  • #2
    Assuming you have enough memory to do this, here's the approach I would take. I would first create an n_raters * n_targets matrix in which the r,t element is 1 if rater r evaluates target t and 0 otherwise.

    Then I'd generate a fully crossed raters * targets data set, and then iterate over the elements of the matrix, dropping those observations where the rater target pair has a 0 in the corresponding matrix element. This may not be the most efficient way to do it, but I think it is the most transparent and most easily maintained or modified way.

    Code:
    clear*
    
    matrix M = J(5, 10, 0) // 5 raters & 10 targets
    forvalues t = 1/4 {
        matrix M[1, `t'] = 1
    }
    forvalues t = 3/6 {
        matrix M[2, `t'] = 1
    }
    forvalues t = 5/8 {
        matrix M[3, `t'] = 1
    }
    forvalues t = 7/10 {
        matrix M[4, `t'] = 1
    }
    foreach t of numlist 1 2 6 7 {
        matrix M[5, `t'] = 1
    }
    
    set obs 5
    gen rid = _n
    expand 10
    by rid, sort: gen tid = _n
    forvalues r = 1/5 {
        forvalues t = 1/10 {
            drop if M[`r', `t'] == 0 & rid == `r' & tid == `t'
        }
    }
    
    tab rid tid
    From there you can simulate random draws of the resulting ratings according to whatever the rating model you want to simulate is.

    Comment


    • #3
      It dawns on me that without sacrificing transparency, the efficiency of the above if the number of observations in the data set is large, could be improved:
      Code:
      clear*
      
      matrix M = J(5, 10, 0) // 5 raters & 10 targets
      forvalues t = 1/4 {
          matrix M[1, `t'] = 1
      }
      forvalues t = 3/6 {
          matrix M[2, `t'] = 1
      }
      forvalues t = 5/8 {
          matrix M[3, `t'] = 1
      }
      forvalues t = 7/10 {
          matrix M[4, `t'] = 1
      }
      foreach t of numlist 1 2 6 7 {
          matrix M[5, `t'] = 1
      }
      
      set obs 5
      gen rid = _n
      expand 10
      by rid, sort: gen tid = _n
      forvalues r = 1/5 {
          forvalues t = 1/10 {
              if M[`r', `t'] == 0 {
                  drop if rid == `r' & tid == `t'
              }
          }
      }
      
      tab rid tid
      By "factoring out" the interrogation of M[`r', `t'] = 0, it is not so much that we evaluate that expression only once instead of once per observation (which is nice, but would seldom be noticeably faster), but we also can skip the interrogation of rid == `r' & tid == `t' in every observation of the entire data set when M[`r', `t'] == 1. That part would be a noticeable time-saving if the data set is very large. (A very large data set could result if the number of raters and targets were both large, or if there were a large number of replications of rating for each rater-target pair.)

      An even faster approach in a huge data set, but less transparent, would be to replace the loops with a program executed under control of -runby-.
      Code:
      clear*
      
      matrix M = J(5, 10, 0) // 5 raters & 10 targets
      forvalues t = 1/4 {
          matrix M[1, `t'] = 1
      }
      forvalues t = 3/6 {
          matrix M[2, `t'] = 1
      }
      forvalues t = 5/8 {
          matrix M[3, `t'] = 1
      }
      forvalues t = 7/10 {
          matrix M[4, `t'] = 1
      }
      foreach t of numlist 1 2 6 7 {
          matrix M[5, `t'] = 1
      }
      
      set obs 5
      gen rid = _n
      expand 10
      by rid, sort: gen tid = _n
      
      capture program drop one_r_t_pair
      program define one_r_t_pair
          if M[rid[1], tid[1]] == 0 {
              drop _all
          }
          exit
      end
      
      runby one_r_t_pair, by(rid tid)
      
      tab rid tid
      -runby- is written by Robert Picard and me, and is available from SSC.

      Comment


      • #4
        Thank you so much, Clyde! What an elegant solution. I am going to play with it when I have more time today and will get back to you with any further questions about implementation. If I wanted to set up random effects for the rid and tid factors, is it similar to how we do it in the nested case?
        Code:
        bysort rid: gen re_rid = rnormal(0,3)        // rater id random intercept 
        bysort tid: gen re_tid = rnormal(0,2)        // target id random intercept

        Comment


        • #5
          Well, that isn't exactly the way it is done in the nested case either. You have to make sure that each random effect is constant at its appropriate level.

          So, for the nested case, if e (residual) is nested in u (next level up):
          Code:
          by u_identifier, sort: gen re_u = rnormal(0, `sd_u') if _n == 1
          by u_identifier: replace re_u = re_u[1]
          gen e = rnormal(0, `sd_e')
          For crossed random effects (or multiple membership, aka partially crossed) at the rid and tid levels:
          Code:
          by rid, sort: gen rid_re = rnormal(0, `sd_rid') if _n == 1
          by rid: replace rid_re = rid_re[1]
          by tid, sort: gen tid_re = rnormal(0, `sd_tid') if _n == 1
          by tid: replace tid_re = tid_re[1]
          Note: The above code assumes that the standard deviations for the random effects at each level have been pre-specified in correspondingly named local macros.

          Comment

          Working...
          X