cluster vs clustermat

Brendan Halpin

04 Jan 2016, 11:00

I'm trying to understand the behaviour of cluster vs clustermat, on the assumption that if I create a pairwise matrix of the distances between observations, computed from the variables (squared Euclidean for Ward's linkage), the results from clustermat on the distances should be identical to those from cluster on the variables.

When the variables are continuous this holds, but when they take discrete values (and thus carry a greater risk of ties) the results from cluster and clustermat differ, sometimes dramatically.

Given c1-c3 drawn as discrete uniform integers from 1 to 10, and j1 = c1 + rnormal()/100, I get this:

Code:

. // Test 1: categorical, cluster vs clustermat
. tab ccv5 ccd5

           |                          ccd5
      ccv5 |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
         1 |        74          6        175          6          0 |       261
         2 |         0          0          0        149         49 |       198
         3 |         0          0          0          0        121 |       121
         4 |        73        109          0          0          0 |       182
         5 |         0         41          0        196          1 |       238
-----------+-------------------------------------------------------+----------
     Total |       147        156        175        351        171 |     1,000


. // Test 2: with jitter, cluster vs clustermat
. tab jcv5 jcd5

           |                          jcd5
      jcv5 |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
         1 |       258          0          0          0          0 |       258
         2 |         0        180          0          0          0 |       180
         3 |         0          0        147          0          0 |       147
         4 |         0          0          0        157          0 |       157
         5 |         0          0          0          0        258 |       258
-----------+-------------------------------------------------------+----------
     Total |       258        180        147        157        258 |     1,000


. // Test 3: with and without jitter, variables
. tab ccv5 jcv5

           |                          jcv5
      ccv5 |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
         1 |        80          3        139          3         36 |       261
         2 |         4         36          2        148          8 |       198
         3 |         0          0          0          6        115 |       121
         4 |       113          0          0          0         69 |       182
         5 |        61        141          6          0         30 |       238
-----------+-------------------------------------------------------+----------
     Total |       258        180        147        157        258 |     1,000

That is, when there's a little jitter, cluster and clustermat give identical results (Test 2), but with ties they are very different (Test 1). What's also disturbing is that, for cluster on the variables, the solution with the small amount of jitter is very different from the solution without it (Test 3). Complete code is shown below.

This is pushing clustering hard, because there is no structure in the data. I also understand that ties mean there is no single deterministic solution. But I would like to understand what's going on, and why cluster and clustermat differ: is it in their treatment of ties, or is there something about the calculation of distances that I'm getting wrong?
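One way to separate the two possibilities is to check the distance calculation on its own. The sketch below is not part of the original analysis: it rebuilds the categorical data with an arbitrary seed (an assumption, since no seed is set in the replication code), computes squared Euclidean distances both by squaring the L2 matrix and by asking matrix dissimilarity for L2squared directly, and then compares the two. Squaring a square root in floating point need not return the exact integer, so distances that are exactly tied in one matrix may differ by rounding error in the other. The matrix names d1, d2a and d2b are illustrative.

```stata
clear
set matsize 2000
set seed 12345                 // arbitrary seed (assumption; the original code sets none)
set obs 1000
gen c1 = 1 + int(runiform()*10)
gen c2 = 1 + int(runiform()*10)
gen c3 = 1 + int(runiform()*10)

// Squared Euclidean distances, route 1: L2 distances squared elementwise
matrix dissimilarity d1 = c1 c2 c3, L2
mata: st_matrix("d2a", st_matrix("d1") :^ 2)

// Route 2: squared Euclidean distances computed directly
matrix dissimilarity d2b = c1 c2 c3, L2squared

// Largest relative difference between the two matrices
display mreldif(d2a, d2b)

// Number of entries that are not bit-for-bit equal
mata: sum(st_matrix("d2a") :!= st_matrix("d2b"))

// How many observations share identical (c1, c2, c3) values
duplicates report c1 c2 c3
```

If the entry count is non-zero, the two matrices agree only up to rounding, which would be enough to break ties differently. Whether that accounts for the discrepancy is not established here.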

Brendan

Code to replicate:

Code:

clear
set matsize 2000
set obs 1000
// c1-c3: discrete uniform integers on 1..10
gen c1 = 1 + int(runiform()*10)
gen c2 = 1 + int(runiform()*10)
gen c3 = 1 + int(runiform()*10)
// j1: c1 plus a small amount of normal jitter
gen j1 = c1 + rnormal()/100

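// Euclidean (L2) distances between observations, squared elementwise in Mata
matrix dissimilarity cc1 = c1 c2 c3, L2
mata: st_matrix("cc2", st_matrix("cc1") :^ 2)

// Same, with the jittered variable in place of c1
matrix dissimilarity jc1 = j1 c2 c3, L2
mata: st_matrix("jc2", st_matrix("jc1") :^ 2)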

// Cluster variables: categorical only
cluster wards c1 c2 c3
cluster gen ccv5 = groups(5)

// Cluster variables, with slight jitter on one
cluster wards j1 c2 c3
cluster gen jcv5 = groups(5)

// Cluster distance matrix, categorical only
clustermat wards cc2, add
cluster gen ccd5 = groups(5)

// Cluster distance matrix, with slight jitter
clustermat wards jc2, add
cluster gen jcd5 = groups(5)

// Test 1: categorical, cluster vs clustermat
tab ccv5 ccd5
// Test 2: with jitter, cluster vs clustermat
tab jcv5 jcd5
// Test 3: with and without jitter, variables
tab ccv5 jcv5
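A further check, sketched here under the assumption that ties are resolved in a way that depends on the order of the observations, is to cluster the same categorical data twice, once in the original order and once after a random shuffle, and cross-tabulate the two five-group solutions. The seed, the cluster names orig and shuf, and the variables u, g_orig and g_shuf are all illustrative and not part of the code above.

```stata
clear
set seed 12345                 // arbitrary seed (assumption)
set obs 1000
gen c1 = 1 + int(runiform()*10)
gen c2 = 1 + int(runiform()*10)
gen c3 = 1 + int(runiform()*10)

// Ward's linkage on the data in its original order
cluster wards c1 c2 c3, name(orig)
cluster gen g_orig = groups(5), name(orig)

// Shuffle the observations, then repeat the same analysis
gen double u = runiform()
sort u
cluster wards c1 c2 c3, name(shuf)
cluster gen g_shuf = groups(5), name(shuf)

// If ties did not matter, every row would have a single non-zero cell
tab g_orig g_shuf
```

If the cross-tabulation is far from a permuted diagonal, observation order alone changes the solution, and the cluster vs clustermat difference may be another instance of the same sensitivity.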

Last edited by Brendan Halpin; 04 Jan 2016, 11:04.
