I'm trying to understand the behaviour of cluster vs clustermat, on the assumption that if I create a pairwise matrix of the distances between observations (squared Euclidean for Ward's linkage), the results from clustermat on that matrix should be identical to those from cluster on the variables.
This holds when the variables are continuous, but when they take discrete values (and thus carry a greater risk of ties) the results from cluster and clustermat differ, sometimes dramatically.
Given c1-c3 drawn from a discrete uniform distribution on 1-10, and j1 = c1 + rnormal()/100, I get the three cross-tabulations shown below.
That is, with a little jitter the results from cluster and clustermat are identical (Test 2), but with ties they are very different (Test 1). What's also disturbing is that the cluster results with and without the small amount of jitter are very different from each other (Test 3). Complete code is shown below.
Code:
. // Test 1: categorical, cluster vs clustermat
. tab ccv5 ccd5
           |                         ccd5
      ccv5 |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
         1 |        74          6        175          6          0 |       261
         2 |         0          0          0        149         49 |       198
         3 |         0          0          0          0        121 |       121
         4 |        73        109          0          0          0 |       182
         5 |         0         41          0        196          1 |       238
-----------+-------------------------------------------------------+----------
     Total |       147        156        175        351        171 |     1,000
. // Test 2: with jitter, cluster vs clustermat
. tab jcv5 jcd5
           |                         jcd5
      jcv5 |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
         1 |       258          0          0          0          0 |       258
         2 |         0        180          0          0          0 |       180
         3 |         0          0        147          0          0 |       147
         4 |         0          0          0        157          0 |       157
         5 |         0          0          0          0        258 |       258
-----------+-------------------------------------------------------+----------
     Total |       258        180        147        157        258 |     1,000
. // Test 3: with and without jitter, variables
. tab ccv5 jcv5
           |                         jcv5
      ccv5 |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
         1 |        80          3        139          3         36 |       261
         2 |         4         36          2        148          8 |       198
         3 |         0          0          0          6        115 |       121
         4 |       113          0          0          0         69 |       182
         5 |        61        141          6          0         30 |       238
-----------+-------------------------------------------------------+----------
     Total |       258        180        147        157        258 |     1,000
This is pushing clustering hard, because there is no structure in the data. I also understand that ties mean there is no single deterministic solution. But I would like to understand what's going on, and why cluster and clustermat differ: is it in their treatment of ties, or is there something about the calculation of distances that I'm getting wrong?
Brendan
Code to replicate:
Code:
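// Start from an empty dataset so that set obs works on a rerun
clear
// Room for a 1000 x 1000 matrix (matters in older Stata versions)
set matsize 2000
set obs 1000
// Three discrete uniform variables on 1-10
gen c1 = 1 + int(runiform()*10)
gen c2 = 1 + int(runiform()*10)
gen c3 = 1 + int(runiform()*10)
// Copy of c1 with a small amount of jitter
gen j1 = c1 + rnormal()/100
// Euclidean (L2) distances, then squared element-wise in Mata
matrix dissimilarity cc1 = c1 c2 c3, L2
mata: st_matrix("cc2", st_matrix("cc1") :^ 2)
matrix dissimilarity jc1 = j1 c2 c3, L2
mata: st_matrix("jc2", st_matrix("jc1") :^ 2)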
// Cluster variables: categorical only
cluster wards c1 c2 c3
cluster gen ccv5 = groups(5)
// Cluster variables, with slight jitter on one
cluster wards j1 c2 c3
cluster gen jcv5 = groups(5)
// Cluster distance matrix, categorical only
clustermat wards cc2, add
cluster gen ccd5 = groups(5)
// Cluster distance matrix, with slight jitter
clustermat wards jc2, add
cluster gen jcd5 = groups(5)
// Test 1: categorical, cluster vs clustermat
tab ccv5 ccd5
// Test 2: with jitter, cluster vs clustermat
tab jcv5 jcd5
// Test 3: with and without jitter, variables
tab ccv5 jcv5
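One way to separate the two possibilities raised above (tie handling vs. the distance calculation) is sketched here. It assumes c1-c3, ccv5 and the matrix cc2 from the replication code are still in memory; the names cc2x and ccx5 are hypothetical, introduced only for this check. The idea is that taking L2 distances and then squaring them may not return exact integers in floating point, so distances that are truly tied could end up minutely different. Computing the squared distances directly with the L2squared measure avoids that round trip. This is a diagnostic sketch, not a confirmed explanation of the discrepancy.

```stata
// How many observations are exact duplicates on c1-c3 (zero-distance ties)?
duplicates report c1 c2 c3

// Squared Euclidean distances computed directly, with no sqrt-then-square step.
// cc2x is a hypothetical name used only for this check.
matrix dissimilarity cc2x = c1 c2 c3, L2squared

// Largest absolute difference between the two versions of the matrix.
// Anything other than exactly 0 means squaring the L2 distances has
// perturbed some values, which could break ties in a different order.
mata: max(abs(st_matrix("cc2") - st_matrix("cc2x")))

// Rerun clustermat on the directly computed matrix and compare with cluster
clustermat wards cc2x, add
cluster gen ccx5 = groups(5)
tab ccv5 ccx5
```

If the reported maximum difference is zero and the last tabulation still disagrees, the distance calculation is probably not the culprit and the difference more likely lies in how ties are resolved.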
