  • "out of sample prediction" in cluster analysis

    I know very little about cluster analysis (so please excuse if this question has an obvious answer), but before I embark on a considerable learning journey, I would like to know if something is possible. I would like to use 1/2 of a data set to carry out a cluster analysis and then use the clustering algorithm to assign observations from the 2nd half of the data set to groups created at some level of the dendrogram produced by the analysis of the first half.

    I carried out a very simple (single linkage) cluster analysis using 1/2 of my data and then used cluster generate to generate groups, but groups were only created for the observations used in the analysis. Is what I want to do possible ... and if so, can someone point me in the direction of how to do it?

    Thank you for your assistance.
    Ian

  • #2
    I suspect that the answer is "no" for most (if not all) forms of cluster analysis, at least in Stata. Certainly I couldn't see a way when I looked. Discriminant analysis has features to handle training and validation sets, so it may be a better bet for you.
    Last edited by Paul T Seed; 19 Nov 2014, 10:33.



    • #3
      I agree with Paul.

      In addition, it could make sense to compare classifications from different subsets, and/or to plot classifications on something like a plot of principal component (PC) scores. Or even to plot PC scores and see whether there are groups. For PCs, read correspondence analysis results, or whatever is appropriate.

      In general, cluster analysis does what it's told: it finds clusters to the best of its ability. It's not so obvious how to work out independently whether those clusters are genuine or reproducible, which indeed may be a rewording of your question.
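
      The PC-score plot suggested above could be sketched roughly as follows. This is a minimal illustration, not a recipe: it uses the auto data, a two-group single-linkage solution, and illustrative variable names (pc1, pc2, clusters) chosen here for the example.

      Code:
      webuse auto, clear
      cluster singlelinkage price trunk turn
      cluster generate clusters = groups(2)

      * principal components of the same variables used in the clustering
      pca price trunk turn
      predict pc1 pc2, score

      * plot the cluster labels in PC space to see whether groups separate
      twoway (scatter pc2 pc1 if clusters == 1) ///
             (scatter pc2 pc1 if clusters == 2), ///
             legend(order(1 "Cluster 1" 2 "Cluster 2"))

      If the clusters look well separated in the PC plot (or fail to, on a second subset), that is informal evidence about how genuine or reproducible they are.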



      • #4
        To extend the above suggestions: you can take your cluster solution, obtain cluster membership, use the same variables on which you implemented the clustering to train discrim, and then extend those classification rules to the new data.

        For instance (using a randomly split-half dataset)

        Code:
        webuse auto, clear
        set seed 12345
        generate random = runiform()

        * cluster only the random training half
        cluster singlelinkage price trunk turn if random > .5
        cluster generate clusters = groups(2)

        * train LDA on the clustered half, then classify the held-out half
        discrim lda price trunk turn, group(clusters)
        predict cv_cluster if random <= .5
        That idea is (very) loosely based on Fraley & Raftery (2002).

        Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458), 611-631.
        Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
        ----
        Research Fellow
        Fors Marsh

        ----
        Version 18.0 MP



        • #5
          Thank you to all who have responded. This feedback has been very helpful.
          Ian
