Simulating partially crossed (or non-nested) data

Erik Ruzek

Join Date: Oct 2017

Posts: 441
#1

Simulating partially crossed (or non-nested) data

25 Sep 2023, 15:38

I am interested in simulating data in which two factors are partially as opposed to fully crossed.

In data with fully crossed factors, all individuals are exposed to all cases of the other factor. An example of this would be a psychology study in which individual raters (factor 1) view the same set of targets (factor 2). Say there are 10 targets total. Each rater views all 10 targets. Simulating such fully crossed data has been shown in presentations by Isabel Canette (StataCorp) here and Joseph Coveney in this thread.

The simulation I'm currently trying to figure out is the partially crossed case. Going back to the psychology study example, say rater 1 views targets 1-4, rater 2 views targets 3-6, rater 3 views targets 5-8, rater 4 views targets 7-10, rater 5 views targets 1-2 and 6-7. With this few raters, it is easy enough to code up, but beginning to be tricky b/c you have to code each rater separately:

Code:

clear* set version 16 set seed 1357 set sortseed 793 * raters (rid) set obs 5 // number of raters gen rid = _n // rater identifier *targets (tid) expand 4 replace tid = _n-2 if rid==2 replace tid = _n-4 if rid==3 replace tid = _n-6 if rid==4 replace tid = _n-16 if rid==5 replace tid = _n-13 if rid==5 & rating_num>=3

My question how do I generalize the code such that I can simulate a partially crossed structure for hundreds of individual raters and 50+ targets. Is there a way to narrow the possibilities (e.g., 30% of raters share at least two targets) to make the coding more tractable?

Any insights are greatly appreciated.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

25 Sep 2023, 16:50

Assuming you have enough memory to do this, here's the approach I would take. I would first create an n_raters * n_targets matrix in which the r,t element is 1 if rater r evaluates target t and 0 otherwise.

Then I'd generate a fully crossed raters * targets data set, and then iterate over the elements of the matrix, dropping those observations where the rater target pair has a 0 in the corresponding matrix element. This may not be the most efficient way to do it, but I think it is the most transparent and most easily maintained or modified way.

Code:

clear* matrix M = J(5, 10, 0) // 5 raters & 10 targets forvalues t = 1/4 { matrix M[1, `t'] = 1 } forvalues t = 3/6 { matrix M[2, `t'] = 1 } forvalues t = 5/8 { matrix M[3, `t'] = 1 } forvalues t = 7/10 { matrix M[4, `t'] = 1 } foreach t of numlist 1 2 6 7 { matrix M[5, `t'] = 1 } set obs 5 gen rid = _n expand 10 by rid, sort: gen tid = _n forvalues r = 1/5 { forvalues t = 1/10 { drop if M[`r', `t'] == 0 & rid == `r' & tid == `t' } } tab rid tid

From there you can simulate random draws of the resulting ratings according to whatever the rating model you want to simulate is.
1 like
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30147

25 Sep 2023, 19:20

It dawns on me that without sacrificing transparency, the efficiency of the above if the number of observations in the data set is large, could be improved:

Code:

clear*

matrix M = J(5, 10, 0) // 5 raters & 10 targets
forvalues t = 1/4 {
    matrix M[1, `t'] = 1
}
forvalues t = 3/6 {
    matrix M[2, `t'] = 1
}
forvalues t = 5/8 {
    matrix M[3, `t'] = 1
}
forvalues t = 7/10 {
    matrix M[4, `t'] = 1
}
foreach t of numlist 1 2 6 7 {
    matrix M[5, `t'] = 1
}

set obs 5
gen rid = _n
expand 10
by rid, sort: gen tid = _n
forvalues r = 1/5 {
    forvalues t = 1/10 {
        if M[`r', `t'] == 0 {
            drop if rid == `r' & tid == `t'
        }
    }
}

tab rid tid

By "factoring out" the interrogation of M[`r', `t'] = 0, it is not so much that we evaluate that expression only once instead of once per observation (which is nice, but would seldom be noticeably faster), but we also can skip the interrogation of rid == `r' & tid == `t' in every observation of the entire data set when M[`r', `t'] == 1. That part would be a noticeable time-saving if the data set is very large. (A very large data set could result if the number of raters and targets were both large, or if there were a large number of replications of rating for each rater-target pair.)

An even faster approach in a huge data set, but less transparent, would be to replace the loops with a program executed under control of -runby-.

Code:

clear*

matrix M = J(5, 10, 0) // 5 raters & 10 targets
forvalues t = 1/4 {
    matrix M[1, `t'] = 1
}
forvalues t = 3/6 {
    matrix M[2, `t'] = 1
}
forvalues t = 5/8 {
    matrix M[3, `t'] = 1
}
forvalues t = 7/10 {
    matrix M[4, `t'] = 1
}
foreach t of numlist 1 2 6 7 {
    matrix M[5, `t'] = 1
}

set obs 5
gen rid = _n
expand 10
by rid, sort: gen tid = _n

capture program drop one_r_t_pair
program define one_r_t_pair
    if M[rid[1], tid[1]] == 0 {
        drop _all
    }
    exit
end

runby one_r_t_pair, by(rid tid)

tab rid tid

-runby- is written by Robert Picard and me, and is available from SSC.

Comment

Erik Ruzek

Join Date: Oct 2017

Posts: 441
#4

26 Sep 2023, 06:31

Thank you so much, Clyde! What an elegant solution. I am going to play with it when I have more time today and will get back to you with any further questions about implementation. If I wanted to set up random effects for the rid and tid factors, is it similar to how we do it in the nested case?

Code:

bysort rid: gen re_rid = rnormal(0,3) // rater id random intercept bysort tid: gen re_tid = rnormal(0,2) // target id random intercept
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#5

26 Sep 2023, 09:23

Well, that isn't exactly the way it is done in the nested case either. You have to make sure that each random effect is constant at its appropriate level.

So, for the nested case, if e (residual) is nested in u (next level up):

Code:

by u_identifier, sort: gen re_u = rnormal(0, `sd_u') if _n == 1 by u_identifier: replace re_u = re_u[1] gen e = rnormal(0, `sd_e')

For crossed random effects (or multiple membership, aka partially crossed) at the rid and tid levels:

Code:

by rid, sort: gen rid_re = rnormal(0, `sd_rid') if _n == 1 by rid: replace rid_re = rid_re[1] by tid, sort: gen tid_re = rnormal(0, `sd_tid') if _n == 1 by tid: replace tid_re = tid_re[1]

Note: The above code assumes that the standard deviations for the random effects at each level have been pre-specified in correspondingly named local macros.
1 like
Comment

Announcement