  • Using a Stopping Rule with R² in a cluster analysis

    I performed a cluster analysis on several variables and applied the stopping rules available in Stata to determine the appropriate number of clusters:
    the Calinski and Harabasz (1974) index and the Duda and Hart (1973) Je(2)/Je(1) index.
    However, as those two stopping rules suggest different cluster solutions, I want to apply another stopping rule, suggested by the strategic management literature.
    This stopping rule examines the development of the tightness of the group structures in terms of the contribution that an additional group would make to the overall fit of the clusters.
    This “tightness of group structure” is measured by R², the proportion of variance of the explanatory variables accounted for by the cluster solution.
    The rule suggests the following:
    1. the clusters obtained explain at least 65 percent of the overall variance (R² >= 0.65)
    2. stop when an additional cluster adds less than 5 percent to the overall fit of the cluster model (Delta R² < 0.05)
    I was wondering how I can obtain this R² so that I can apply the stopping rule to my analysis.
    My idea is to regress each cluster solution (they differ in the number of groups) on the clustering variables ($xlist) and compare the R² values Stata gives me:

    regress cluster_2groups $xlist
    regress cluster_3groups $xlist
    regress clus2006_4groups $xlist

    Unfortunately I have no idea whether this approach is correct or whether I'm completely mistaken. Any help would be great!

  • #2
    What are you hoping to use the clusters for? In your example above, you are essentially treating the cluster membership variable as continuous, whereas a multinomial logit/probit might be better suited to modeling the cluster indicators. You also need to consider the scale on which your indicators are measured (e.g., nominal, ordinal, interval, ratio, discrete, etc.), since different scales use different metrics for correlations and require different algorithms when estimating the clusters. More importantly, how many clusters does the literature suggest exist? If your sample is wildly different, you may want to consider digging into the results a bit more first.


    • #3
      A late response is probably hardly better than none at all, but here goes.

      You can achieve this with the discrepancy module (you can also calculate the CH and Duda-Hart rules with it). Whether your rule is much better than CH or DH is another question. It effectively does the same calculation as CH without taking into account the number of clusters.
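
      To spell that relationship out (this is standard sums-of-squares notation, assumed here for illustration, not output from the module): with n observations, k clusters, total sum of squares T, within-cluster sum of squares W and between-cluster sum of squares B = T - W, the two statistics can be written as

      ```latex
      R^2 = \frac{B}{T} = 1 - \frac{W}{T}, \qquad
      \mathrm{CH}(k) = \frac{B/(k-1)}{W/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}
      ```

      The only difference is the division by the degrees of freedom, k-1 and n-k, which is what lets CH penalise additional clusters.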

      This example shows how, with the demo NLSW88 data, using four variables.

      Regards,

      Brendan

      Code:
      // Install discrepancy module
      ssc install discrepancy
      
      // Load NLSW88 demo dataset and keep only complete cases
      sysuse nlsw88, clear
      foreach var of varlist ttl_exp age grade wage {
        drop if missing(`var')
      }
      
      // Cluster and look at Calinski-Harabasz stopping statistics
      cluster wardslinkage ttl_exp age grade wage
      cluster stop
      
      
      // Create matrix of squared Euclidean distances, to replicate what
      // -cluster stop- does (the matrix has one row and column per observation)
      set matsize 3000
      matrix dissimilarity xx = ttl_exp age grade wage, L2squared
      
      
      // Do discrepancy analysis for 2 to 15 groups
      // Discrepancy insists that the data is sorted by ID
      gen id=_n
      sort id
      cluster generate g = groups(2/15)
      forvalues x = 2/15 {
        discrepancy g`x', distmat(xx) id(id) niter(1)
      }
      
      // Note that the F-stat is exactly the same as the Calinski-Harabasz F
      // The R2 is based on the same data, but it doesn't take the number of
      // clusters into account (it can only go up, unlike F)
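
      If you want the R² values side by side so the 65 percent / 5 percent rule can be read off directly, a minimal sketch is to rebuild R² by hand from one-way ANOVA sums of squares. It assumes the g2-g15 variables created by -cluster generate- above and the same four clustering variables; the local macro names are only illustrative. With squared Euclidean distances this is the same between/total ratio, so I would expect it to agree with the R² that discrepancy reports, but treat that as something to check rather than a guarantee.

      ```stata
      // Sketch only: R2 = between SS / total SS, summed over the clustering
      // variables, for each solution from 2 to 15 groups.
      // Assumes g2-g15 already exist (see -cluster generate- above).
      local prev = .
      forvalues k = 2/15 {
        local bss = 0
        local tss = 0
        foreach v of varlist ttl_exp age grade wage {
          // One-way ANOVA of each variable on cluster membership
          quietly anova `v' g`k'
          local bss = `bss' + e(mss)
          local tss = `tss' + e(mss) + e(rss)
        }
        local r2 = `bss'/`tss'
        // Delta R2 is missing for the first solution (nothing to compare with)
        local delta = `r2' - `prev'
        display "groups = `k'" _col(14) "R2 = " %6.4f `r2' _col(30) "Delta R2 = " %6.4f `delta'
        local prev = `r2'
      }
      ```

      Reading down the list, the rule would pick the first solution with R² >= 0.65 for which the next row's Delta R² falls below 0.05. The variables are unstandardised here, as in the example above, so the ones with the largest variance dominate the sums of squares.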
