I'm trying to understand the behaviour of cluster vs clustermat, on the assumption that if I create a pairwise matrix of the distances between observations on the variables (squared Euclidean for Ward's linkage), the results from clustermat on the distances should be identical to those from cluster on the variables.
When the variables are continuous this holds, but when they take discrete values (and thus carry a greater risk of ties) the results from cluster and clustermat differ, sometimes dramatically.
Given c1-c3 drawn from a discrete uniform distribution on 1-10, and j1 = c1 + rnormal()/100, I get the three cross-tabulations shown in the output below.
That is, with a little jitter the results from cluster and clustermat are identical (Test 2), but with ties they are very different (Test 1). What's also disturbing is that the results with the small amount of jitter are very different from those without it (Test 3). Complete code is shown below.
Code:
. // Test 1: categorical, cluster vs clustermat
. tab ccv5 ccd5

           |                         ccd5
      ccv5 |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
         1 |        74          6        175          6          0 |       261
         2 |         0          0          0        149         49 |       198
         3 |         0          0          0          0        121 |       121
         4 |        73        109          0          0          0 |       182
         5 |         0         41          0        196          1 |       238
-----------+-------------------------------------------------------+----------
     Total |       147        156        175        351        171 |     1,000

. // Test 2: with jitter, cluster vs clustermat
. tab jcv5 jcd5

           |                         jcd5
      jcv5 |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
         1 |       258          0          0          0          0 |       258
         2 |         0        180          0          0          0 |       180
         3 |         0          0        147          0          0 |       147
         4 |         0          0          0        157          0 |       157
         5 |         0          0          0          0        258 |       258
-----------+-------------------------------------------------------+----------
     Total |       258        180        147        157        258 |     1,000

. // Test 3: with and without jitter, variables
. tab ccv5 jcv5

           |                         jcv5
      ccv5 |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
         1 |        80          3        139          3         36 |       261
         2 |         4         36          2        148          8 |       198
         3 |         0          0          0          6        115 |       121
         4 |       113          0          0          0         69 |       182
         5 |        61        141          6          0         30 |       238
-----------+-------------------------------------------------------+----------
     Total |       258        180        147        157        258 |     1,000
This is pushing clustering hard, because there is no structure in the data. I also understand that ties mean there is no single deterministic solution. But I would like to understand what's going on, and why cluster and clustermat differ: is it in their treatment of ties, or is there something about the calculation of distances that I'm getting wrong?
Brendan
Code to replicate:
Code:
set matsize 2000
set obs 1000

gen c1 = 1 + int(runiform()*10)
gen c2 = 1 + int(runiform()*10)
gen c3 = 1 + int(runiform()*10)
gen j1 = c1 + rnormal()/100

matrix dissimilarity cc1 = c1 c2 c3, L2
mata: st_matrix("cc2", st_matrix("cc1") :^ 2)
matrix dissimilarity jc1 = j1 c2 c3, L2
mata: st_matrix("jc2", st_matrix("jc1") :^ 2)

// Cluster variables: categorical only
cluster wards c1 c2 c3
cluster gen ccv5 = groups(5)

// Cluster variables, with slight jitter on one
cluster wards j1 c2 c3
cluster gen jcv5 = groups(5)

// Cluster distance matrix, categorical only
clustermat wards cc2, add
cluster gen ccd5 = groups(5)

// Cluster distance matrix, with slight jitter
clustermat wards jc2, add
cluster gen jcd5 = groups(5)

// Test 1: categorical, cluster vs clustermat
tab ccv5 ccd5

// Test 2: with jitter, cluster vs clustermat
tab jcv5 jcd5

// Test 3: with and without jitter, variables
tab ccv5 jcv5
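A rough sketch of three checks that might help separate the two explanations (a mistake in the distances versus the handling of ties). It is only a sketch under stated assumptions: the seed, the smaller sample of 200, and the names cc2chk, u, g_orig, g_shuf, orig, and shuf are illustrative and not part of the run above. The third check assumes that tie-breaking may depend on the order of the observations, which is a guess to be tested rather than documented behaviour.

```stata
// Sketch only: seed, sample size, and the new names are illustrative assumptions
clear
set seed 12345
set obs 200
gen c1 = 1 + int(runiform()*10)
gen c2 = 1 + int(runiform()*10)
gen c3 = 1 + int(runiform()*10)

// Check 1: squaring the L2 matrix should match the built-in L2squared measure
matrix dissimilarity cc1 = c1 c2 c3, L2
mata: st_matrix("cc2", st_matrix("cc1") :^ 2)
matrix dissimilarity cc2chk = c1 c2 c3, L2squared
display "max relative difference: " mreldif(cc2, cc2chk)

// Check 2: how many distinct dissimilarity values are there among all pairs?
mata:
D = st_matrix("cc2")
n = rows(D)
v = vech(D)    // lower triangle as a column vector (diagonal zeros included)
printf("%g pairs, %g distinct values\n", n*(n-1)/2, rows(uniqrows(v)))
end

// Check 3: same variables, same method, different observation order
cluster wardslinkage c1 c2 c3, name(orig)
cluster generate g_orig = groups(5), name(orig)
gen double u = runiform()
sort u
cluster wardslinkage c1 c2 c3, name(shuf)
cluster generate g_shuf = groups(5), name(shuf)
tabulate g_orig g_shuf
```

If the first check reports a difference near zero, the distance calculation is probably not the problem. If the second shows far fewer distinct values than pairs, and the third cross-tabulation is not a simple relabelling of groups, that would point towards ties rather than distances.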