  • Cluster stopping rule

    Dear Forum,

    I am conducting a cluster analysis with six binary criteria:

    Code:
        cluster wardslinkage criteria1 criteria2 criteria3 criteria4 criteria5 criteria6, measure(L2) name(CLUSTER1)
    Now I would like to examine the number of clusters I should use.
    I am applying the Duda-Hart stopping rule and the Calinski-Harabasz stopping rule, but I am receiving different results.
    The Duda-Hart Je(2)/Je(1) index indicates 4 clusters, while the pseudo-T-squared is lowest at 13 clusters.

    The Calinski-Harabasz pseudo-F increases monotonically and is highest at 15 clusters.


    How should I interpret the results?
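
    For what it is worth, the monotone behaviour of the pseudo-F can be reproduced outside Stata too. Below is a Python sketch using scikit-learn on synthetic binary data (nothing here is the poster's dataset; all names are illustrative) that computes the Calinski-Harabasz index for Ward solutions across candidate cluster counts:

```python
# Illustrative sketch only (synthetic binary data, not the poster's dataset):
# compute the Calinski-Harabasz pseudo-F for Ward solutions with k = 2..15.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)  # six binary "criteria"

scores = {}
for k in range(2, 16):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

best_k = max(scores, key=scores.get)  # k with the largest pseudo-F
print(best_k, {k: round(v, 1) for k, v in scores.items()})
```

    In Stata, if I recall the syntax correctly, the same table comes from `cluster stop CLUSTER1, rule(calinski) groups(2/15)` on the existing hierarchy.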

  • #2
    Personally, I do not have much faith in these statistical tests and would prefer a theoretical argument. If you have good reason to believe that certain groups are important for explaining your findings, that is a much stronger argument. And 15 clusters is really a lot; do you really think you can explain the main differences between all of them?
    Best wishes

    Stata 18.0 MP | ORCID | Google Scholar



    • #3
      Felix Bittmann No, exactly, that is my point: 15 clusters does not make a lot of sense to me.

      Only the Duda Hart stopping rule (Je(2)/Je(1)) supports my initial idea of 4 clusters

      I was just wondering whether something is wrong with my analysis.

      If nothing is wrong, how would you proceed? Would the following make sense after identifying 4 clusters with the command above?

      Code:
      cluster gen CL1R = groups(4), name(CLUSTER1)
      cluster kmeans criteria1 criteria2 criteria3 criteria4 criteria5 criteria6, k(4) start(g(CL1R)) name(clus)



      • #4
        If you are happy with 4 clusters then you can simply generate them as described using

        Code:
        cluster gen CL1R = groups(4), name(CLUSTER1)
        No need to conduct a second clustering. Did you look at a dendrogram after the ward clustering? This can also be very interesting and helpful.
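
        To illustrate the dendrogram route outside Stata as well, here is a small Python/scipy sketch (synthetic data, purely illustrative) that builds the Ward tree and cuts it into 4 groups, the analogue of `cluster gen CL1R = groups(4)`:

```python
# Sketch: build a Ward linkage tree and cut it into (at most) 4 groups,
# mirroring "cluster gen CL1R = groups(4)". Synthetic data, not the poster's.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(100, 6)).astype(float)

Z = linkage(X, method="ward")                     # hierarchical Ward clustering
groups = fcluster(Z, t=4, criterion="maxclust")   # cut the tree into 4 clusters
d = dendrogram(Z, no_plot=True)                   # tree structure; drop no_plot to draw it
print(sorted(set(groups)))
```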
        Best wishes




        • #5
          Thanks Felix Bittmann

          No, I have not looked at the dendrogram yet.

          But since the literature I follow uses k-means, I would like to go with k-means clustering.

          Is the code above suitable for that?



          • #6
            Be aware that k-means and Ward will not always give the same solution. In that case you might want to compare how different the clusters generated by the two algorithms are.
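
            One simple way to make that comparison concrete is the adjusted Rand index; the Python sketch below (synthetic data, illustrative names only) scores the agreement between a Ward partition and a k-means partition:

```python
# Sketch: quantify how much a Ward solution and a k-means solution agree,
# using the adjusted Rand index (1.0 = identical partitions). Synthetic data.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(150, 6)).astype(float)

ward = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

ari = adjusted_rand_score(ward, km)
print(round(ari, 3))  # values well below 1 mean the two algorithms disagree
```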
            Best wishes




            • #7
              Thanks Felix Bittmann

              In the literature, some researchers use a two-step approach: first hierarchical clustering (Ward) to determine the number of clusters, then k-means to actually "do" the clustering.
              That is why I want to do it this way.

              Do you think the code mentioned in the posts above suits this purpose?
              I am especially not sure about the
              start(g(CL1R)) in #3
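
              For what it is worth, the seeding step that `start(g(CL1R))` performs can be sketched in Python (synthetic data; this is an analogue of the idea, not Stata's implementation): compute the mean of each Ward group and hand those means to k-means as starting centers:

```python
# Sketch of the two-step approach: derive 4 Ward groups, then use their
# group means as starting centers for k-means -- the analogue of seeding
# "cluster kmeans ..., start(g(CL1R))" with a prior grouping. Synthetic data.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(150, 6)).astype(float)

ward_groups = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)
centers = np.vstack([X[ward_groups == g].mean(axis=0) for g in range(4)])

km = KMeans(n_clusters=4, init=centers, n_init=1).fit(X)  # seeded, not random
print(km.labels_.shape)
```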



              • #8
                On that I cannot give any advice. Does this affect your findings much? What I take from the literature is that there are so many algorithms and options that in most cases it is not clear which procedure is best. I would rely on the results and see which gives the highest validity for my own research question.
                Best wishes




                • #9
                  Thanks!

                  No, not really.

                  Does anyone else have an opinion on that?

