I performed a cluster analysis with various variables and applied the statistical stopping rules available in Stata to determine the appropriate number of clusters:
the Caliński and Harabasz (1974) pseudo-F index and the Duda and Hart (1973) Je(2)/Je(1) index.
However, because those two stopping rules suggest different cluster solutions, I want to apply another stopping rule that is suggested in the strategic management literature.
This stopping rule looks at how the tightness of the group structure changes as groups are added, that is, at how much each additional group contributes to the overall fit of the cluster solution.
This “tightness of group structure” is measured by R², the proportion of the variance in the clustering variables that is accounted for by the cluster solution.
The rule suggests the following:
- the clusters obtained explain at least 65 percent of the overall variance (R² >= 0.65)
- stop when an additional cluster adds less than 5 percent to the overall fit of the cluster model (Delta R² < 0.05)
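Written out, this is how I read the rule. The notation is my own and not taken from the cited literature, and pooling the sums of squares over the p clustering variables is an assumption on my part: SSW_j(k) is the within-cluster sum of squares of variable j in the k-group solution, and SST_j is its total sum of squares.

```latex
R^2_k = 1 - \frac{\sum_{j=1}^{p} SSW_j(k)}{\sum_{j=1}^{p} SST_j},
\qquad
\text{choose the smallest } k \text{ such that } R^2_k \ge 0.65
\text{ and } R^2_{k+1} - R^2_k < 0.05 .
```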
My idea is to regress each cluster solution (the solutions differ in the number of groups) on the clustering variables ($xlist) and compare the R² values Stata reports:
regress cluster_2groups $xlist
regress cluster_3groups $xlist
regress clus2006_4groups $xlist
Unfortunately, I have no idea whether this approach is correct or whether I'm completely mistaken. Any help would be great!
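One alternative I could imagine is sketched here. It assumes that the R² in the rule means the between-cluster sum of squares divided by the total sum of squares, pooled over the clustering variables. Under that reading, each clustering variable would be regressed on the group indicators, not the group variable on the clustering variables. The sketch uses only the stored results e(mss) and e(rss) from regress and the variable names from my commands above; the pooling itself is my assumption, not something the rule states.

```stata
* Sketch only: pooled R2 = between-cluster SS / total SS over all clustering variables.
* Assumes $xlist holds the clustering variables and the group variables are
* coded as non-negative integers (as cluster generate produces).
foreach g of varlist cluster_2groups cluster_3groups clus2006_4groups {
    local ssb = 0
    local sst = 0
    foreach v of global xlist {
        * one-way model: variable on group indicators
        quietly regress `v' i.`g'
        local ssb = `ssb' + e(mss)
        local sst = `sst' + e(mss) + e(rss)
    }
    display "`g': pooled R2 = " %5.3f `ssb'/`sst'
}
```

If the clustering was run on standardized variables, the same standardized versions would presumably need to be used here, since otherwise variables with large scales dominate the pooled sums of squares.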