  • Cluster stopping rule

    Dear Forum,

    I am conducting a cluster analysis with six binary criteria:

    Code:
        cluster wardslinkage criteria1 criteria2 criteria3 criteria4 criteria5 criteria6, measure(L2) name(CLUSTER1)
    Now I would like to examine the number of clusters I should use.
    I am applying the Duda-Hart stopping rule and the Calinski-Harabasz stopping rule, but I am receiving different results.
    The Duda-Hart Je(2)/Je(1) index indicates 4 clusters, while the pseudo-T-squared is lowest at 13 clusters.

    The Calinski-Harabasz pseudo-F increases monotonically and is highest at 15 clusters.


    How should I interpret the results?
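
    For what it is worth, the monotone behaviour of the pseudo-F can be reproduced outside Stata too. Below is a Python sketch using scikit-learn on synthetic binary data (nothing here is the poster's dataset; all names are illustrative) that computes the Calinski-Harabasz index for Ward solutions across candidate cluster counts:

```python
# Illustrative sketch only (synthetic binary data, not the poster's dataset):
# compute the Calinski-Harabasz pseudo-F for Ward solutions with k = 2..15.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)  # six binary "criteria"

scores = {}
for k in range(2, 16):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

best_k = max(scores, key=scores.get)  # k with the largest pseudo-F
print(best_k, {k: round(v, 1) for k, v in scores.items()})
```

    In Stata, if I recall the syntax correctly, the same table comes from `cluster stop CLUSTER1, rule(calinski) groups(2/15)` on the existing hierarchy.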

  • #2
    Personally, I do not have much faith in these statistical tests and would prefer a theoretical argument. If you have good reason to believe that certain groups are important for explaining your findings, that is a much stronger argument. And 15 clusters is really a lot; do you really think you can explain the main differences between all of them?
    Best wishes

    Stata 18.0 MP | ORCID | Google Scholar



    • #3
      Felix Bittmann No, exactly, that is my point: 15 clusters does not make a lot of sense to me.

      Only the Duda Hart stopping rule (Je(2)/Je(1)) supports my initial idea of 4 clusters

      I was just wondering whether something is wrong with my analysis.

      If nothing is wrong, how would you proceed? Would the following make sense after identifying 4 clusters with the command above?

      Code:
      cluster gen CL1R = groups(4), name(CLUSTER1)
      cluster kmeans criteria1 criteria2 criteria3 criteria4 criteria5 criteria6, k(4) start(g(CL1R)) name(clus)



      • #4
        If you are happy with 4 clusters then you can simply generate them as described using

        Code:
        cluster gen CL1R = groups(4), name(CLUSTER1)
        No need to conduct a second clustering. Did you look at a dendrogram after the ward clustering? This can also be very interesting and helpful.
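
        To illustrate the dendrogram route outside Stata as well, here is a small Python/scipy sketch (synthetic data, purely illustrative) that builds the Ward tree and cuts it into 4 groups, the analogue of `cluster gen CL1R = groups(4)`:

```python
# Sketch: build a Ward linkage tree and cut it into (at most) 4 groups,
# mirroring "cluster gen CL1R = groups(4)". Synthetic data, not the poster's.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(100, 6)).astype(float)

Z = linkage(X, method="ward")                     # hierarchical Ward clustering
groups = fcluster(Z, t=4, criterion="maxclust")   # cut the tree into 4 clusters
d = dendrogram(Z, no_plot=True)                   # tree structure; drop no_plot to draw it
print(sorted(set(groups)))
```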
        Best wishes




        • #5
          Thanks Felix Bittmann

          No, I have not looked at the dendrogram yet.

          But since the literature I follow uses k-means, I would like to go with k-means clustering.

          Is the code above suitable for that?



          • #6
            Be aware that k-means and Ward will not always give the same solution. In that case you might want to compare how different the clusters generated by the two algorithms are.
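
            One simple way to make that comparison concrete is the adjusted Rand index; the Python sketch below (synthetic data, illustrative names only) scores the agreement between a Ward partition and a k-means partition:

```python
# Sketch: quantify how much a Ward solution and a k-means solution agree,
# using the adjusted Rand index (1.0 = identical partitions). Synthetic data.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(150, 6)).astype(float)

ward = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

ari = adjusted_rand_score(ward, km)
print(round(ari, 3))  # values well below 1 mean the two algorithms disagree
```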
            Best wishes




            • #7
              Thanks Felix Bittmann

              In the literature, some researchers use a two-step approach: first hierarchical clustering (Ward) to determine the number of clusters, then k-means to actually "do" the clustering.
              That is why I want to do it this way.

              Do you think the code mentioned in the posts above suits this purpose?
              I am especially not sure about the
              start(g(CL1R)) in #3
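
              For what it is worth, the seeding step that `start(g(CL1R))` performs can be sketched in Python (synthetic data; this is an analogue of the idea, not Stata's implementation): compute the mean of each Ward group and hand those means to k-means as starting centers:

```python
# Sketch of the two-step approach: derive 4 Ward groups, then use their
# group means as starting centers for k-means -- the analogue of seeding
# "cluster kmeans ..., start(g(CL1R))" with a prior grouping. Synthetic data.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(150, 6)).astype(float)

ward_groups = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)
centers = np.vstack([X[ward_groups == g].mean(axis=0) for g in range(4)])

km = KMeans(n_clusters=4, init=centers, n_init=1).fit(X)  # seeded, not random
print(km.labels_.shape)
```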



              • #8
                On that I cannot give any advice. Does this affect your findings much? What I take from the literature is that there are so many algorithms and options that in most cases it is not clear which procedure is best. I would rely on the results and see which gives the highest validity for my own research question.
                Best wishes




                • #9
                  Thanks!

                  No, not really.

                  Does anyone else have an opinion on that?

