Variable selection for cluster analysis

Steven Ho

Join Date: Jul 2018

Posts: 3
#1


30 Jul 2018, 07:37

Hi everyone!
I am trying to perform cluster analysis to identify subtypes of a disease. I have more than 50 variables for the analysis, both continuous and categorical. I am planning to use Ward’s linkage method with Gower’s dissimilarity coefficient. However, after reading relevant papers, I learned that some of the candidate variables could be noise variables, and that including them in the cluster analysis can mask the true cluster structure. Therefore, variable selection is recommended before cluster analysis.
My question is: Is there any module in STATA that can perform variable selection for cluster analysis? Any algorithm of variable selection is fine for me.
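For readers unfamiliar with Gower's coefficient: it averages one dissimilarity per variable, using the absolute difference divided by the observed range for continuous variables and a simple match/mismatch (0 or 1) for categorical ones. A minimal sketch in Python rather than Stata, assuming no missing values and equal weights; the patient records and variable meanings are hypothetical:

```python
def gower_dissimilarity(a, b, ranges):
    """Gower dissimilarity between two records.

    ranges[k] is the observed range of variable k if it is continuous,
    or None if variable k is categorical.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0   # categorical: mismatch counts as 1
        elif r > 0:
            total += abs(x - y) / r           # continuous: range-scaled difference
        # a constant continuous variable (r == 0) contributes nothing
    return total / len(ranges)

# Hypothetical patients: (age in years, biomarker level, sex, smoker)
patients = [
    (34, 1.2, "F", "no"),
    (61, 3.8, "M", "yes"),
    (58, 3.5, "M", "no"),
]
is_categorical = [False, False, True, True]

ranges = [
    None if cat else max(p[k] for p in patients) - min(p[k] for p in patients)
    for k, cat in enumerate(is_categorical)
]

matrix = [[gower_dissimilarity(p, q, ranges) for q in patients] for p in patients]
for row in matrix:
    print(["%.3f" % d for d in row])
```

Every variable gets the same weight here, so each noise variable among 50 dilutes the contribution of an informative one, which is the masking concern raised above.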

Last edited by Steven Ho; 30 Jul 2018, 07:40.
Tags: cluster analysis, variable selection
Nick Cox

Join Date: Mar 2014

Posts: 35632
#2

30 Jul 2018, 08:38

I don't see how there could be. For all any automated method can know or tell, one variable not at all related to any others might be really useful for a cluster analysis, so even PCA could miss something crucial. My only rather feeble advice is not to throw them all in, but to get expert advice on which are important and to keep plotting the data. Ward's method may find isolated balls in data space when they exist but it is fairly useless for bananas or do[ugh]nuts or more complicated shapes, let alone continua of variation.
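The point about balls versus doughnuts can be illustrated with a toy example. This is a minimal sketch in Python rather than Stata, assuming a naive Ward agglomeration on made-up two-dimensional points; the function names and data are hypothetical and no real data set is involved:

```python
import math

def centroid(cluster):
    n = len(cluster)
    return (sum(p[1] for p in cluster) / n, sum(p[2] for p in cluster) / n)

def merge_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    (ax, ay), (bx, by) = centroid(a), centroid(b)
    return len(a) * len(b) / (len(a) + len(b)) * ((ax - bx) ** 2 + (ay - by) ** 2)

def ward(points, k=2):
    """Naive Ward agglomeration: always merge the cheapest pair until k remain."""
    clusters = [[p] for p in points]          # each point is (label, x, y)
    while len(clusters) > k:
        pairs = ((merge_cost(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        _, i, j = min(pairs)
        merged = clusters.pop(j)              # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters

def is_pure(clusters):
    """True if every cluster contains points from one true group only."""
    return all(len({p[0] for p in c}) == 1 for c in clusters)

def blob(label, cx, cy):
    return [(label, cx + dx, cy + dy) for dx in range(3) for dy in range(3)]

def ring(label, radius, n=16):
    return [(label,
             radius * math.cos(2 * math.pi * t / n),
             radius * math.sin(2 * math.pi * t / n)) for t in range(n)]

blobs = blob("left", 0, 0) + blob("right", 100, 0)   # two isolated balls
rings = ring("inner", 1.0) + ring("outer", 5.0)      # a doughnut around a core

blobs_recovered = is_pure(ward(blobs))
rings_recovered = is_pure(ward(rings))
print("blobs recovered:", blobs_recovered)
print("rings recovered:", rings_recovered)
```

In this sketch the two blobs come back intact, while the two-cluster solution for the concentric rings mixes inner and outer points: both rings share a centroid, so merging the inner group into part of the outer ring is always cheaper for Ward than closing the outer ring first.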

PS You evidently didn't read the FAQ Advice to the end as requested: https://www.statalist.org/forums/help#spelling

Last edited by Nick Cox; 30 Jul 2018, 08:40.
Steven Ho

Join Date: Jul 2018

Posts: 3
#3

30 Jul 2018, 09:09

Thank you Mr. Nick Cox. My apologies for using STATA instead of Stata. Actually I have asked the experts about the important variables, and the 50 variables I mentioned in the question are what I kept following the experts' advice. You kindly advised me to keep plotting the data. My question is: how should I judge the importance of variables by plotting them? Could you please give me more details?
Nick Cox

Join Date: Mar 2014

Posts: 35632
#4

30 Jul 2018, 09:40

Not easily. 50 variables is more than I advise playing with at once. If the experts can't give indications of very important, important and less important variables then they are not experts or they're guessing too and the project is a fishing expedition. A desperate solution is to throw variables into a PCA and look at, say, all possible scatter plots for the most important components. But even then it could be that the low-order components distinguish better. Much depends on whether your variables refer to patients, their circumstances and measures more directly related to the condition(s) concerned.
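The caveat about low-order components can be shown with a two-variable toy case. This is a minimal sketch in Python rather than Stata, assuming made-up data in which two hypothetical groups differ only by a small offset across the main trend; the closed-form eigenvector used here requires a nonzero covariance:

```python
import math

def pca_2d(rows):
    """PCA of two variables via the closed-form eigen-decomposition of the
    2x2 covariance matrix. Returns (eigenvalues, list of (PC1, PC2) scores)."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    a = sum((x - mx) ** 2 for x, _ in rows) / n          # var(x)
    c = sum((y - my) ** 2 for _, y in rows) / n          # var(y)
    b = sum((x - mx) * (y - my) for x, y in rows) / n    # cov(x, y), assumed != 0
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1 = (a + c) / 2 + half_gap
    lam2 = (a + c) / 2 - half_gap
    norm = math.hypot(b, lam1 - a)
    v1 = (b / norm, (lam1 - a) / norm)                   # direction of PC1
    v2 = (-v1[1], v1[0])                                 # PC2 is orthogonal to PC1
    scores = [((x - mx) * v1[0] + (y - my) * v1[1],
               (x - mx) * v2[0] + (y - my) * v2[1]) for x, y in rows]
    return (lam1, lam2), scores

def separates(u, v):
    """True if the two sets of scores do not overlap at all."""
    return max(u) < min(v) or max(v) < min(u)

# Hypothetical groups: same strong trend, small opposite offsets.
group_a = [(x, 0.5 * x + 1.0) for x in range(-10, 11)]
group_b = [(x, 0.5 * x - 1.0) for x in range(-10, 11)]

(lam1, lam2), scores = pca_2d(group_a + group_b)
sa, sb = scores[:len(group_a)], scores[len(group_a):]

pc1_share = lam1 / (lam1 + lam2)
pc1_separates = separates([s[0] for s in sa], [s[0] for s in sb])
pc2_separates = separates([s[1] for s in sa], [s[1] for s in sb])
print("PC1 share of variance: %.3f" % pc1_share)
print("PC1 separates groups:", pc1_separates)
print("PC2 separates groups:", pc2_separates)
```

Here the first component carries nearly all of the variance yet leaves the groups overlapping, while the second component splits them cleanly. With 50 variables the same thing could happen on any component, so scatter plots restricted to the leading ones may miss it.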
Steven Ho

Join Date: Jul 2018

Posts: 3
#5

30 Jul 2018, 09:49

I get the picture. Thank you again Mr. Nick Cox.
