  • cluster vs clustermat

    I'm trying to understand the behaviour of cluster vs clustermat. My assumption is that if I create a pairwise matrix of the distances between observations, computed from the variables (squared Euclidean for Ward's linkage), the results from clustermat on the distances should be identical to those from cluster on the variables.

    When the variables are continuous this holds, but when they take discrete values (and thus carry a greater risk of ties) the results from cluster and clustermat differ, sometimes dramatically.

    Given c1-c3 drawn as discrete uniform integers from 1 to 10, and j1 = c1 + rnormal()/100, I get this:
    Code:
    . // Test 1: categorical, cluster vs clustermat
    . tab ccv5 ccd5
    
               |                          ccd5
          ccv5 |         1          2          3          4          5 |     Total
    -----------+-------------------------------------------------------+----------
             1 |        74          6        175          6          0 |       261
             2 |         0          0          0        149         49 |       198
             3 |         0          0          0          0        121 |       121
             4 |        73        109          0          0          0 |       182
             5 |         0         41          0        196          1 |       238
    -----------+-------------------------------------------------------+----------
         Total |       147        156        175        351        171 |     1,000
    
    
    . // Test 2: with jitter, cluster vs clustermat
    . tab jcv5 jcd5
    
               |                          jcd5
          jcv5 |         1          2          3          4          5 |     Total
    -----------+-------------------------------------------------------+----------
             1 |       258          0          0          0          0 |       258
             2 |         0        180          0          0          0 |       180
             3 |         0          0        147          0          0 |       147
             4 |         0          0          0        157          0 |       157
             5 |         0          0          0          0        258 |       258
    -----------+-------------------------------------------------------+----------
         Total |       258        180        147        157        258 |     1,000
    
    
    . // Test 3: with and without jitter, variables
    . tab ccv5 jcv5
    
               |                          jcv5
          ccv5 |         1          2          3          4          5 |     Total
    -----------+-------------------------------------------------------+----------
             1 |        80          3        139          3         36 |       261
             2 |         4         36          2        148          8 |       198
             3 |         0          0          0          6        115 |       121
             4 |       113          0          0          0         69 |       182
             5 |        61        141          6          0         30 |       238
    -----------+-------------------------------------------------------+----------
         Total |       258        180        147        157        258 |     1,000
    That is, with a little jitter cluster and clustermat give identical results, but with ties they differ substantially. What's also disturbing is that adding such a small amount of jitter changes the cluster solution so much (Test 3). Complete code is shown below.

    This is pushing clustering hard, because there is no structure in the data. I also understand that ties mean there is no single deterministic solution. But I would like to understand what's going on, and why cluster and clustermat differ: is it in their treatment of ties, or is there something about the calculation of distances that I'm getting wrong?

    Brendan

    Code to replicate:
    Code:
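    clear
    set matsize 2000
    set obs 1000
    // No seed is set, so the counts will differ from the tables above on each run
    // Three discrete uniform variables on the integers 1-10
    gen c1 = 1 + int(runiform()*10)
    gen c2 = 1 + int(runiform()*10)
    gen c3 = 1 + int(runiform()*10)
    // j1 is c1 plus a small amount of normal jitter, which breaks the ties
    gen j1 = c1 + rnormal()/100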
    
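    // Pairwise Euclidean (L2) distances between observations, categorical only,
    // then squared elementwise in Mata for use with Ward's linkage
    matrix dissimilarity cc1 = c1 c2 c3, L2
    mata: st_matrix("cc2", st_matrix("cc1") :^ 2)
    
    // The same, with slight jitter on one variable
    matrix dissimilarity jc1 = j1 c2 c3, L2
    mata: st_matrix("jc2", st_matrix("jc1") :^ 2)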
    
    // Cluster variables: categorical only
    cluster wards c1 c2 c3
    cluster gen ccv5 = groups(5)
    
    // Cluster variables, with slight jitter on one
    cluster wards j1 c2 c3
    cluster gen jcv5 = groups(5)
    
    // Cluster distance matrix, categorical only
    clustermat wards cc2, add
    cluster gen ccd5 = groups(5)
    
    // Cluster distance matrix, with slight jitter
    clustermat wards jc2, add
    cluster gen jcd5 = groups(5)
    
    // Test 1: categorical, cluster vs clustermat
    tab ccv5 ccd5
    // Test 2: with jitter, cluster vs clustermat
    tab jcv5 jcd5
    // Test 3: with and without jitter, variables
    tab ccv5 jcv5
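
    As a rough check on how heavily tied the distances are, here is a minimal sketch that counts the distinct values in each squared-distance matrix. It assumes cc2 and jc2 from the code above are still in memory; vc and vj are just illustrative names, and this is not part of the results reported above.
    Code:
    mata:
    // Lower triangle of each squared-distance matrix as a column vector
    // (vech() includes the diagonal, so the 1000 zero self-distances are counted too)
    vc = vech(st_matrix("cc2"))
    vj = vech(st_matrix("jc2"))
    // Number of entries, then number of distinct distance values
    rows(vc), rows(uniqrows(vc))
    rows(vj), rows(uniqrows(vj))
    end
    Since c1-c3 are integers from 1 to 10, each squared distance in cc2 is a sum of three squared differences of at most 81 each, so it can take at most 244 distinct values (0 to 243) across roughly half a million pairs. Ties are therefore pervasive in the categorical case.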
    Last edited by Brendan Halpin; 04 Jan 2016, 11:04.