  • How to simulate clustered data with a specific intra-class correlation

    Dear Statalist,

    I am trying to simulate clustered data with a specific intra-class correlation. One approach I came up with is:

    Code:
    clear
    set seed 1
    matrix C = (1, 0.5 \ 0.5 , 1)
    drawnorm y1 y2, n(100) corr(C)
    gen i = _n
    reshape long y, i(i) j(j)
    qui mixed y || i:
    estat icc
    This creates 2 observations (j) in each of 100 clusters (i), with an ICC of 0.5. Is there an approach that lets me vary the number of observations per cluster at random?

    Thanks so much for your consideration
    KS

  • #2
    Code:
    clear
    set seed 1
    set obs 100
    
    
    gen cluster_num = _n
    gen obs_per_cluster = rpoisson(5)
    expand obs_per_cluster
    by cluster_num, sort: gen member_num = _n
    
    local rho = 0.5
    local sd_u = sqrt(`rho')
    local sd_e = sqrt(1-`rho')
    
    by cluster_num (member_num), sort: gen u = rnormal(0, `sd_u') if _n == 1
    by cluster_num (member_num): replace u = u[1]
    gen e = rnormal(0, `sd_e')
    gen y = u + e
    
    mixed y || cluster_num:
    estat icc
    Here, I have randomized the number of observations per cluster using a Poisson distribution to illustrate the approach, but you can do that part any way you like.

    The logic is simple. The ICC is, by definition, var u/(var u + var e), where u is the cluster-level intercept and e is the residual. In your example, the total variance var u + var e was set to 1, so I assumed the same here. A little algebra then says that the variance of u must be the desired value of the ICC. So sample u with a standard deviation equal to the square root of the desired ICC, and sample e with a standard deviation equal to the square root of 1 minus the desired ICC. Then add u and e, and your variable y will have the desired ICC.
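
    The algebra referred to above can be written out as follows. This is only a sketch of the reasoning; the symbols \rho (the target ICC), \sigma_u^2 (variance of u), and \sigma_e^2 (variance of e) are notation introduced here, not taken from the code.

    ```latex
    \[
    \rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2},
    \qquad
    \sigma_u^2 + \sigma_e^2 = 1
    \;\Longrightarrow\;
    \sigma_u^2 = \rho,
    \quad
    \sigma_e^2 = 1 - \rho .
    \]
    ```

    With \rho = 0.5, both standard deviations equal sqrt(0.5), which is about 0.707.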



    • #3
      I know this is an older post, but it has led me to a question. I was wondering how this might be generalized so that two normally distributed random variables covary with y (e.g., y = u + x1 + x2 + e) while still letting me control the clustering through the ICC. If this question has already been asked, please point me to the correct post. Thank you.



      • #4
        I'm afraid I don't understand what is being asked in #3. Can you clarify?



        • #5
          Dear community,
          I have a question about this quote:

          Originally posted by Clyde Schechter:
          In your example, the total variance var u + var e was set to 1, so I assumed the same here. A little algebra then says that the variance of u must be the desired value of the ICC.
          What if the total variance were different, say 17? How should I proceed then?




          • #6
            I would use the exact same code, and then, at the end add:
            Code:
            replace y = y*sqrt(17)
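
            A short sketch of why this rescaling works, assuming y = u + e was generated with total variance 1 as in #2. The constant c and the symbols \sigma_u^2, \sigma_e^2, and \rho are notation introduced here for illustration.

            ```latex
            \[
            \operatorname{Var}(c\,y) = c^2\left(\sigma_u^2 + \sigma_e^2\right) = c^2,
            \qquad
            \frac{c^2 \sigma_u^2}{c^2 \sigma_u^2 + c^2 \sigma_e^2}
            = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} = \rho .
            \]
            ```

            Setting c = sqrt(17) therefore gives a total variance of 17 and leaves the ICC unchanged, because both variance components are multiplied by the same factor.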



            • #7
              Then you would need to decide how you want the total variance to be distributed across the levels, and then specify the variances accordingly. For example, if you wanted the ICC in a two-level model to be .50, you would divide the variance equally between the levels. My setup is a little different from Clyde's, as I specify the standard deviations of u and e immediately after creating the identifiers for those levels, but the end result is the same:
              Code:
              clear*
              version 16
              set seed 1
              set obs 100
              
              *clusters
              gen cluster_num = _n
              gen obs_per_cluster = rpoisson(5) // randomize number of obs/cluster with Poisson distribution w/ a mean of 5
              gen u = rnormal(0, sqrt(17/2)) // cluster-level SD, about 2.915: half of the total variance of 17
              
              *members within clusters
              expand obs_per_cluster
              by cluster_num, sort: gen member_num = _n
              gen e = rnormal(0, sqrt(17/2)) // residual SD, about 2.915: the other half
              
              gen y = u + e
              
              mixed y || cluster_num:
              estat icc
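
              The same idea can be sketched with the total variance and the ICC held in local macros, so that either can be changed in one place. The macro names total_var, rho, sd_u, and sd_e are illustrative assumptions, not part of the code above; the split uses var u = rho * total variance and var e = (1 - rho) * total variance.

              ```stata
              clear
              set seed 1
              set obs 100

              * Assumed targets (hypothetical macro names): total variance and ICC
              local total_var = 17
              local rho = 0.5
              local sd_u = sqrt(`rho' * `total_var')        // SD of the cluster intercept
              local sd_e = sqrt((1 - `rho') * `total_var')  // SD of the residual

              * clusters
              gen cluster_num = _n
              gen obs_per_cluster = rpoisson(5)
              gen u = rnormal(0, `sd_u')   // one draw per cluster, copied by expand

              * members within clusters
              expand obs_per_cluster
              gen e = rnormal(0, `sd_e')   // one draw per observation
              gen y = u + e

              mixed y || cluster_num:
              estat icc
              ```

              With only 100 clusters, the estimated ICC will vary around the target from seed to seed rather than match it exactly.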



              • #8
                Thank you, Clyde Schechter and Erik Ruzek. Very useful insights.
