
  • Generating statistics to determine the optimal number of clusters

    Hello,

    I am using k-means clustering to partition observations into clusters, based on a number of similar variables. I have done lots of reading on different ways of determining an appropriate number of clusters in the data, so my question does not concern that. I have settled on assessing a number of cluster options and comparing them against one another to determine the optimal solution. So, I plan to run the cluster analysis with 2, 3, 4, 5, etc. clusters and compare the sum of squared distances between the members and the centroids. Then I plan to choose the option at which the squared distances or sums of squared error stop substantially decreasing.

    I am fairly unfamiliar with the cluster kmeans function in Stata, so I don't know how to get these statistics. So, my question is: how can I get the sum of squared distances from the cluster centers or the sums of squared error in Stata?

    I decided to post here, as I usually get such great insight from users, so I would appreciate any help. Thank you in advance.

    - JC7821

  • #2
    Unless you've followed in the footsteps of FM-2030 and legally changed your name (and can prove it!), please be reminded that the preference in this forum is for full real names.

    I don't use the cluster commands, but there is an option (keepcenters) that keeps the centers once the clustering process is over. The task is then to set up the data appropriately so you can compute whatever it is you want. Below is an example, though it can be done in other ways.

    Code:
    clear
    set more off
    
    *----- example data -----
    
    webuse labtech
    
    *----- what you want ? -----
    
    // unique identifiers
    gen obs = _n
    
    // cluster
    local km = 8
    set seed 13586
    
    cluster kmeans x1 x2 x3 x4, k(`km') name(k8) keepcenters
    
    // save original + centers
    tempfile orig
    save "`orig'"
    
    // save centers and rename
    keep in -`km'/L
    replace k8 = _n
    drop labtech obs
    
    rename x* x*m
    
    tempfile centers
    save "`centers'"
    
    // merge centers with original
    merge 1:m k8 using "`orig'", ///
        assert(match) keep(match) nogenerate
    
    // reshape to facilitate computations
    reshape long x x@m, i(obs) j(char)
    
    // some computation
    // don't know if this makes sense!
    bysort k8 : egen ssdist = total((x - xm)^2)
    I set the data to long form, but this is not strictly necessary. Variables x1, x2, ... are then identified by the variable char (for characteristic); for example, an observation with char == 1 has values for x and xm that correspond to x1 (original variable) and x1m (group center for x1), and so on.

    You can devise other ways of computing this while keeping the data in wide form. The decision should depend on the number of variables and observations involved in the analysis (reshaping affects both), on any further computations, and on personal preference.
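One possible wide-form version is sketched below. It assumes the same labtech example and the same merge of centers as in the code above (with assert(match using) in the merge); the variable names sqdist and wss_k are made up for illustration.

```stata
clear
set more off
webuse labtech

gen obs = _n
local km = 8
set seed 13586
cluster kmeans x1 x2 x3 x4, k(`km') name(k8) keepcenters

// save original + centers
tempfile orig
save "`orig'"

// keep only the appended centers and rename x1 -> x1m, etc.
keep in -`km'/L
replace k8 = _n
drop labtech obs
rename x* x*m

// attach each observation's own center
merge 1:m k8 using "`orig'", ///
    assert(match using) keep(match) nogenerate

// squared Euclidean distance of each observation to its center
gen double sqdist = 0
forvalues j = 1/4 {
    quietly replace sqdist = sqdist + (x`j' - x`j'm)^2
}

// within-cluster sum of squares, per cluster and overall
bysort k8 : egen double wss_k = total(sqdist)
quietly summarize sqdist
display "total within-cluster SS = " r(sum)
```

This avoids the reshape and leaves one row per observation, which may be more convenient if further per-observation computations are planned.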
    You should:

    1. Read the FAQ carefully.

    2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

    3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

    4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.


    • #3
      Thank you for your response. I have written to the site administrator to change my username.

      I have used the keepcenters option, but I think I need to go back and do some more reading about k-means clustering, because your response totally lost me. I don't know if it was the reshape to long format or something else, but I think I'm missing something. I'll think about this some more, but I need to understand more before doing anything here.

      Thanks, again.


      • #4
        Thanks for changing your user name.

        There was a typo in my previous code, which I fix here: the -merge- should use assert(match using) rather than assert(match).

        Try looking at the data set before the -reshape-:
        Code:
        clear
        set more off
        
        *----- example data -----
        
        webuse labtech
        
        *----- what you want ? -----
        
        // unique identifiers
        gen obs = _n
        
        // cluster
        local km = 8
        set seed 13586
        
        cluster kmeans x1 x2 x3 x4, k(`km') name(k8) keepcenters
        
        // save original + centers
        tempfile orig
        save "`orig'"
        
        // keep centers and rename
        keep in -`km'/L
        replace k8 = _n
        drop labtech obs
        
        rename x* x*m
        
        // merge centers with original
        merge 1:m k8 using "`orig'", ///
            assert(match using) keep(match) nogenerate
            
        sort obs
        order obs x? x??
        list in 1/10
        Variables x1m, x2m, ... , are the centers for the corresponding group assigned to that observation (variable k8). Maybe you find it less confusing this way.
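To compare several values of k, as in the original question, the same idea can be put in a loop. The following is only a sketch under some assumptions: it uses the labtech example, an arbitrary range of k from 2 to 8, and it computes each center as the within-group mean instead of using keepcenters. The names vars, km2 ... km8, sq2 ... sq8 and ctr are made up for illustration.

```stata
clear
set more off
webuse labtech

local vars x1 x2 x3 x4
set seed 13586

forvalues k = 2/8 {
    // one k-means solution per value of k; group variable is km`k'
    cluster kmeans `vars', k(`k') name(km`k')

    // squared distance of each observation to its group mean
    gen double sq`k' = 0
    foreach v of local vars {
        bysort km`k' : egen double ctr = mean(`v')
        quietly replace sq`k' = sq`k' + (`v' - ctr)^2
        drop ctr
    }

    // total within-cluster sum of squares for this k
    quietly summarize sq`k'
    display "k = `k'   WSS = " %9.3f r(sum)
}
```

The displayed totals can then be compared across k to see where they stop decreasing substantially. Because the starting centers are chosen at random, the results depend on the seed.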


        • #5
          Thanks again for writing back.

          I think that this makes a bit more sense to me. Now, I just have to figure out whether I should convert my data file to long format for these purposes or apply this to my wide file, somehow. I'll post back with my results.
