Using a Stopping Rule with R² in a cluster analysis

Alina Loesche

Join Date: Oct 2015

Posts: 2
#1


26 Oct 2015, 01:14

I ran a cluster analysis on several variables and applied the two statistical stopping rules available in Stata to determine the appropriate number of clusters:
the Calinski and Harabasz (1974) pseudo-F index and the Duda and Hart (1973) Je(2)/Je(1) index.
However, because those two stopping rules suggest different cluster solutions, I want to apply another stopping rule suggested by the strategic management literature.
This rule examines how the tightness of the group structure develops, that is, how much an additional group would contribute to the overall fit of the clusters.
This "tightness of group structure" is measured by R², the proportion of the variance of the clustering variables that is accounted for by the cluster solution.
The rule suggests the following:
the clusters obtained should explain at least 65 percent of the overall variance (R² >= 0.65);
stop when an additional cluster adds less than 5 percentage points to the overall fit of the cluster model (Delta R² < 0.05).

I was wondering how I can obtain this R² so that I can apply the stopping rule to my analysis.
My idea is to regress each cluster solution (the solutions differ in the number of groups) on the clustering variables ($xlist) and compare the R² values Stata reports:

regress cluster_2groups $xlist
regress cluster_3groups $xlist
regress clus2006_4groups $xlist

Unfortunately, I have no idea whether this approach is correct or whether I'm completely mistaken. Any help would be great!
wbuchanan

Join Date: Mar 2014

Posts: 1361
#2

26 Oct 2015, 06:03

What are you hoping to use the clusters for? In your example above, you are essentially treating the cluster membership variable as continuous, whereas a multinomial logit/probit might be better suited to modeling the cluster indicators. You also need to consider the scale on which your indicators are measured (e.g., nominal, ordinal, interval, ratio, discrete, etc.), since different scales use different metrics for correlations and require different algorithms when estimating the clusters. More importantly, how many clusters does the literature suggest exist? If your sample is wildly different, you may want to consider digging into the results a bit more first.

Brendan Halpin

Join Date: Mar 2014
Posts: 152

01 Apr 2016, 09:27

A late response is likely hardly better than none at all, but here goes.

You can achieve this with the discrepancy module (you can also calculate the CH and Duda-Hart rules with it). Whether your rule is much better than CH or DH is another question. It effectively does the same calculation as CH without taking into account the number of clusters.
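To make the relationship between the two statistics explicit, here is a sketch in standard ANOVA notation. The symbols are labels chosen for this illustration, not taken from the discrepancy documentation: T is the total sum of squares over all clustering variables, W_k the within-cluster sum of squares for k clusters, B_k = T - W_k the between-cluster sum of squares, and n the number of observations.

```latex
R^2_k = \frac{B_k}{T} = 1 - \frac{W_k}{T},
\qquad
F_k = \frac{B_k/(k-1)}{W_k/(n-k)}
    = \frac{R^2_k/(k-1)}{\left(1 - R^2_k\right)/(n-k)}
```

Under these definitions the stopping rule in the question amounts to requiring R^2_k >= 0.65 and stopping once R^2_{k+1} - R^2_k < 0.05. In a hierarchical clustering, where each solution is nested in the next, W_k cannot increase with k, so R^2_k cannot decrease, whereas F_k is penalized through the (k-1) and (n-k) terms.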

The example below shows how to do this with discrepancy, using four variables from the NLSW88 demo data.

Regards,

Brendan

Code:

// Install discrepancy module
ssc install discrepancy

// Load NLSW88 demo dataset and keep only complete cases
// (ttl_exp is the full name of the variable abbreviated as ttl)
sysuse nlsw88, clear
foreach var of varlist ttl_exp age grade wage {
  drop if missing(`var')
}

// Cluster with Ward's linkage and look at the Calinski-Harabasz
// stopping statistics (the default rule of -cluster stop-)
cluster wardslinkage ttl_exp age grade wage
cluster stop


// Create an n x n matrix of squared Euclidean distances, to replicate
// what -cluster stop- does
// (-set matsize- matters only in older Stata versions)
set matsize 3000
matrix dissim xx = ttl_exp age grade wage, L2squared


// Do discrepancy analysis for 2 to 15 groups
// Discrepancy insists that the data is sorted by ID
gen id=_n
sort id
cluster generate g = groups(2/15)
forvalues x = 2/15 {
  discrepancy g`x', distmat(xx) id(id) niter(1)
}

// Note that the F-stat is exactly the same as the Calinski-Harabasz pseudo-F.
// The R2 is based on the same sums of squares, but it doesn't take the
// number of clusters into account (it can only go up, unlike F)
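As a cross-check that does not depend on the discrepancy module, the same R² can be sketched with built-in commands by summing the within-cluster and total sums of squares over the clustering variables. This is a minimal sketch under the same setup as above; the names w, grp, vars, ssw, sst, r2 and r2prev are made up for the illustration, and the result should match the discrepancy R² only because the distances are squared Euclidean.

```stata
// Sketch: R2 = 1 - (within-cluster SS) / (total SS), summed over the
// clustering variables, for k = 2, ..., 15 groups
sysuse nlsw88, clear
local vars ttl_exp age grade wage
foreach v of local vars {
    drop if missing(`v')
}

// Ward's linkage (squared Euclidean distance is its default)
cluster wardslinkage `vars', name(w)
cluster generate grp = groups(2/15), name(w)

local r2prev = 0                      // one cluster explains nothing
forvalues k = 2/15 {
    local ssw = 0
    local sst = 0
    foreach v of local vars {
        // one-way ANOVA of each variable on cluster membership
        quietly anova `v' grp`k'
        local ssw = `ssw' + e(rss)            // within-cluster SS
        local sst = `sst' + e(rss) + e(mss)   // total SS
    }
    local r2 = 1 - `ssw'/`sst'
    display "k = `k'   R2 = " %6.4f (`r2') "   gain = " %6.4f (`r2' - `r2prev')
    local r2prev = `r2'
}
```

Reading down the output, the rule from the original question would pick the smallest k whose R2 is at least 0.65 and whose successor shows a gain below 0.05. Note that the variables enter unstandardized here, as in the example above, so variables with larger variance dominate the sums of squares.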
