  • How to employ centroids of an initial cluster solution for kmeans method

    I am performing a cluster analysis and want to combine hierarchical and non-hierarchical techniques.
    I want to proceed as follows:
    1. Hierarchical cluster analysis using Ward's method to determine the appropriate number of clusters and the cluster centroids.
    2. K-means cluster analysis, using the number of groups and the centroids from the Ward's method solution.
    I'm now struggling with how to use the centroids of the initial cluster solution in the k-means method.
    I know that I can obtain the means of all clustering variables for the different clusters with the following code:
    tabstat varlist, by(clusterXY)
    Is this vector of means then the cluster centroid? And how can I employ these centroids in kmeans?
    I found the random [(seed#)] and prandom [(seed#)] option, where one can define a random number seed. Is that the option I need? And if yes, how do I proceed? As far as I understood it, I can only type in one number as a “seed”, but I have 12 variable means as my cluster centroids.


  • #2
    I realise this answer might be too late for you but hope it may help others. I've applied this approach many times because I tend to work with very large data sets, albeit previously in SPSS. It turns out that it is deceptively simple to do in Stata - and I say this only after spending a good amount of time working it out!
    When running the hierarchical clustering, we need to include an option for saving our preferred cluster solution from our cluster analysis results. Stata sees this as creating a grouping variable. So, at its simplest, in the example below, I save my cluster analysis results with the name CL1, and from CL1 I generate a variable for the preferred 5-cluster solution, CL1R5, giving me my variable with 5 groups.
    cluster s var1 var2 var3 var4 var5, measure(L2) name(CL1)
    cluster gen CL1R5 = groups(5), name(CL1)
    I match this variable CL1R5 back onto the full sample and then run the K-means method specifying CL1R5 as the starting values/centroids:
    cluster kmeans var1 var2 var3 var4 var5, k(5) start(group(CL1R5))
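    For anyone who wants to try the two steps end to end, here is a minimal sketch under some assumptions of mine: the data are simulated toy data, Ward's linkage (cluster wardslinkage) is swapped in for the linkage used above because that is what the original question asked about, and the name KM5 is just a hypothetical label for the k-means result. With start(group(varname)), Stata computes the starting centers as the group means of that variable, so the vector of means that tabstat reports per cluster is the centroid being used and does not have to be typed in by hand.

```stata
* Toy data (an assumption for illustration): 200 observations, 5 clustering variables
clear
set seed 12345
set obs 200
forvalues j = 1/5 {
    generate var`j' = rnormal()
}

* Step 1: hierarchical clustering with Ward's linkage
cluster wardslinkage var1-var5, measure(L2squared) name(CL1)

* Save the 5-group solution as a grouping variable
cluster generate CL1R5 = groups(5), name(CL1)

* Centroids of the initial solution: per-cluster means of the clustering variables
tabstat var1-var5, by(CL1R5) statistics(mean)

* Step 2: k-means started from that partition; the starting centers
* are the group means of CL1R5
cluster kmeans var1-var5, k(5) start(group(CL1R5)) name(KM5)

* Compare the initial and final cluster assignments
tabulate CL1R5 KM5
```

    As far as I can tell, the random[(seed#)] and prandom[(seed#)] start options only control randomly chosen starting centers or partitions, so they are not needed when the starting groups come from a hierarchical solution like this.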
    I hope I've understood your problem correctly.
