  • "out of sample prediction" in cluster analysis

    I know very little about cluster analysis (so please excuse if this question has an obvious answer), but before I embark on a considerable learning journey, I would like to know if something is possible. I would like to use 1/2 of a data set to carry out a cluster analysis and then use the clustering algorithm to assign observations from the 2nd half of the data set to groups created at some level of the dendrogram produced by the analysis of the first half.

    I carried out a very simple (single linkage) cluster analysis using 1/2 of my data and then used cluster generate to generate groups, but groups were only created for the observations used in the analysis. Is what I want to do possible ... and if so, can someone point me in the direction of how to do it?

    Thank you for your assistance.
    Ian

  • #2
    I suspect that the answer is "no" for most (if not all) forms of cluster analysis, at least in Stata. Certainly I couldn't see a way when I looked. Discriminant analysis has features to handle training and validation sets, so it may be a better bet for you.
    Last edited by Paul T Seed; 19 Nov 2014, 10:33.



    • #3
      I agree with Paul.

      In addition, it could make sense to compare classifications from different subsets, and/or to plot classifications on something like a plot of principal component (PC) scores. Or even to plot PC scores and see whether there are groups. For PCs, read correspondence analysis results, or whatever is appropriate.

      In general, cluster analysis does what it's told: it finds clusters to the best of its ability. It's not so obvious how to work out independently whether those clusters are genuine or reproducible, which indeed may be a rewording of your question.
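
      The PC-score plot suggested above could be sketched roughly as follows. This is a minimal illustration, not a recipe: it uses the auto data, a two-group single-linkage solution, and illustrative variable names (pc1, pc2, clusters) chosen here for the example.

      Code:
      webuse auto, clear
      cluster singlelinkage price trunk turn
      cluster generate clusters = groups(2)

      * principal components of the same variables used in the clustering
      pca price trunk turn
      predict pc1 pc2, score

      * plot the cluster labels in PC space to see whether groups separate
      twoway (scatter pc2 pc1 if clusters == 1) ///
             (scatter pc2 pc1 if clusters == 2), ///
             legend(order(1 "Cluster 1" 2 "Cluster 2"))

      If the clusters look well separated in the PC plot (or fail to, on a second subset), that is informal evidence about how genuine or reproducible they are.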



      • #4
        To extend the above suggestions: you can take your cluster solution, obtain cluster membership, use the same variables on which you implemented the clustering to train discrim, and then extend those classification rules to the new data.

        For instance (using a randomly split-half dataset)

        Code:
        webuse auto, clear
        set seed 12345
        generate random = runiform()

        * cluster only the random training half
        cluster singlelinkage price trunk turn if random > .5
        cluster generate clusters = groups(2)

        * train LDA on the clustered half, then classify the held-out half
        discrim lda price trunk turn, group(clusters)
        predict cv_cluster if random <= .5
        That idea is (very) loosely based on Fraley & Raftery (2002).

        Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458), 611-631.
        Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
        ----
        Research Fellow
        Fors Marsh

        ----
        Version 18.0 MP



        • #5
          Thank you to all who have responded. This feedback has been very helpful.
          Ian
