Generating statistics to determine the optimal number of clusters

Jean Charles

Join Date: Dec 2014

Posts: 6
#1

10 Mar 2015, 12:30

Hello,

I am using k-means clustering to partition observations into clusters, based on a number of similar variables. I have done lots of reading on different ways of determining an appropriate number of clusters in the data, so my question does not concern that. I have settled on assessing a number of cluster options and comparing them against one another to determine the optimal solution. So, I plan to run the cluster analysis with 2, 3, 4, 5, etc. clusters and compare the sum of squared distances between the members and their cluster centroids. Then I plan to choose the option at which the sum of squared distances (the sum of squared errors) stops decreasing substantially.
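
In symbols, the quantity compared across solutions is the within-cluster sum of squares. The notation is introduced here only for illustration and is not taken from the Stata manual: x_i is observation i measured on p variables, C_j is the set of observations assigned to cluster j, and c_j is that cluster's centroid.

```latex
W_k \;=\; \sum_{j=1}^{k} \sum_{i \in C_j} \lVert x_i - c_j \rVert^2
    \;=\; \sum_{j=1}^{k} \sum_{i \in C_j} \sum_{v=1}^{p} \left( x_{iv} - c_{jv} \right)^2
```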

I am fairly unfamiliar with the cluster kmeans command in Stata, so I don't know how to get these statistics. So, my question is: how can I get the sum of squared distances from the cluster centers (the sum of squared errors) in Stata?

I decided to post here, as I usually get such great insight from users, so I would appreciate any help. Thank you in advance.

- JC7821
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#2

10 Mar 2015, 22:10

Unless you've followed in the footsteps of FM-2030 and legally changed your name (and can prove it!), please be reminded that the preference in this forum is for full real names.

I don't use the cluster commands, but there is an option that keeps the centers once the clustering process is over. The task is then to set up the data appropriately so you can compute whatever it is you want. Below is an example, but it can be done in other ways.

Code:

    clear
    set more off

    *----- example data -----
    webuse labtech

    *----- what you want ? -----

    // unique identifiers
    gen obs = _n

    // cluster
    local km = 8
    set seed 13586
    cluster kmeans x1 x2 x3 x4, k(`km') name(k8) keepcenters

    // save original + centers
    tempfile orig
    save "`orig'"

    // save centers and rename
    keep in -`km'/L
    replace k8 = _n
    drop labtech obs
    rename x* x*m

    tempfile centers
    save "`centers'"

    // merge centers with original
    merge 1:m k8 using "`orig'", ///
        assert(match) keep(match) nogenerate

    // reshape to facilitate computations
    reshape long x x@m, i(obs) j(char)

    // some computation
    // don't know if this makes sense!
    bysort k8 : egen ssdist = total((x - xm)^2)

I set the data to long form, but this is not strictly necessary. Variables x1, x2, ... are then identified by the variable char (for characteristic); for example, an observation with char == 1 has values for x and xm that correspond to x1 (original variable) and x1m (group center for x1), and so on.

You can devise other ways of computing this while keeping the data in wide form. The decision should depend on the number of variables and observations involved in the analysis (reshaping affects both), on any further computations, and on personal preference.
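
One wide-form variant is sketched below. It is only an illustration under stated assumptions: it recomputes each group center as the group mean with egen instead of using the rows appended by keepcenters (for kmeans these should be the same centers), and the names x1m, ..., x4m and sqdist are chosen here for the example.

```stata
clear
set more off
webuse labtech

// cluster (no keepcenters: centers are recomputed below as group means)
set seed 13586
cluster kmeans x1 x2 x3 x4, k(8) name(k8)

// squared Euclidean distance from each observation to its group center
generate double sqdist = 0
foreach v of varlist x1 x2 x3 x4 {
    egen double `v'm = mean(`v'), by(k8)   // assumed name: x1m, x2m, ...
    quietly replace sqdist = sqdist + (`v' - `v'm)^2
}

// per-cluster sum of squared distances, attached to every observation
egen double ssdist = total(sqdist), by(k8)

// one row per cluster: group size and sum of squared distances
tabstat sqdist, by(k8) statistics(n sum)
```

No merge or reshape is needed here, at the cost of one extra variable per clustering variable.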

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Jean Charles

Join Date: Dec 2014

Posts: 6
#3

11 Mar 2015, 04:21

Thank you for your response and I have written to the site administrator to change my username.

I have used the keepcenters option, but I think I might need to go back and do some more reading about k-means clustering, because your response totally lost me. I don't know if it was the long format or what, but I think I'm missing something. I'll think about this some more, but I think I need to understand more before doing anything here.

Thanks, again.
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#4

11 Mar 2015, 10:19

Thanks for changing your user name.

There seems to be a typo in my previous code (which I fix here).

Try looking at the data set before the -reshape-:

Code:

    clear
    set more off

    *----- example data -----
    webuse labtech

    *----- what you want ? -----

    // unique identifiers
    gen obs = _n

    // cluster
    local km = 8
    set seed 13586
    cluster kmeans x1 x2 x3 x4, k(`km') name(k8) keepcenters

    // save original + centers
    tempfile orig
    save "`orig'"

    // keep centers and rename
    keep in -`km'/L
    replace k8 = _n
    drop labtech obs
    rename x* x*m

    // merge centers with original
    // (`orig' also contains the appended center rows, which find no match
    // in the master data, hence assert(match using))
    merge 1:m k8 using "`orig'", ///
        assert(match using) keep(match) nogenerate

    sort obs
    order obs x? x??
    list in 1/10

Variables x1m, x2m, ... are the centers of the group assigned to that observation (variable k8). Maybe you will find it less confusing this way.

Jean Charles

Join Date: Dec 2014

Posts: 6
#5

11 Mar 2015, 21:53

Thanks again for writing back.

I think that this makes a bit more sense to me. Now, I just have to figure out whether I should convert my data file to long format for these purposes or apply this to my wide file, somehow. I'll post back with my results.
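
For the comparison across cluster counts described in #1, a minimal sketch on the wide file follows. It rests on assumptions not made in the thread: the range 2 to 8 is arbitrary, centers are recomputed as group means with egen rather than read from keepcenters, and the names W, sq, ctr and km2, ..., km8 are invented for the example. Because the default starting centers are random, the figures depend on the seed.

```stata
clear
set more off
webuse labtech

set seed 13586

// results matrix: one row per solution (assumed names: W, k, wss)
matrix W = J(7, 2, .)
matrix colnames W = k wss

forvalues k = 2/8 {
    cluster kmeans x1 x2 x3 x4, k(`k') name(km`k')

    // squared distance of each observation to its own group center
    generate double sq = 0
    foreach v of varlist x1 x2 x3 x4 {
        egen double ctr = mean(`v'), by(km`k')
        quietly replace sq = sq + (`v' - ctr)^2
        drop ctr
    }

    // total within-cluster sum of squares for this k
    quietly summarize sq
    local row = `k' - 1
    matrix W[`row', 1] = `k'
    matrix W[`row', 2] = r(sum)
    drop sq
}

matrix list W
```

Reading down the wss column shows where adding another cluster stops reducing the sum of squared distances by much.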