  • Variable selection for cluster analysis

    Hi everyone!
    I am trying to perform cluster analysis to identify subtypes of a disease. I have more than 50 variables for the analysis, both continuous and categorical. I am planning to use Ward's linkage method with Gower's dissimilarity coefficient. However, after reading relevant papers, I learned that some of the candidate variables could be noise variables, and that including them in the cluster analysis would mask the true cluster structure. Variable selection is therefore recommended before cluster analysis.
    My question is: Is there any module in STATA that can perform variable selection for cluster analysis? Any variable selection algorithm would be fine for me.
    Last edited by Steven Ho; 30 Jul 2018, 07:40.
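
To make the concern about noise variables concrete, here is a minimal sketch of Gower's dissimilarity for mixed data, written in plain Python rather than Stata. It assumes the common form of the coefficient (range-scaled absolute differences for numeric variables, simple matching for categorical ones, averaged with equal weights) and ignores missing values. The four patients, the column names, and the `keep` argument are hypothetical and serve only as illustration; this is not a description of how any particular package computes the measure.

```python
def column_ranges(rows, kinds):
    """Observed range of each numeric column; None for categorical columns."""
    ranges = []
    for j, kind in enumerate(kinds):
        if kind == "num":
            values = [row[j] for row in rows]
            ranges.append(max(values) - min(values))
        else:
            ranges.append(None)
    return ranges


def gower(a, b, kinds, ranges, keep=None):
    """Gower's dissimilarity between records a and b over the columns in keep."""
    cols = range(len(kinds)) if keep is None else keep
    parts = []
    for j in cols:
        if kinds[j] == "num":
            # Range-scaled absolute difference; a constant column contributes 0.
            parts.append(abs(a[j] - b[j]) / ranges[j] if ranges[j] else 0.0)
        else:
            # Simple matching for categorical values.
            parts.append(0.0 if a[j] == b[j] else 1.0)
    return sum(parts) / len(parts)


# Hypothetical patients: (biomarker, genotype, noise). Rows 0-1 and rows 2-3
# are meant to be two subtypes; the third column is unrelated to subtype.
rows = [(1.0, "A", 10.0), (1.2, "A", 90.0), (5.0, "B", 15.0), (5.2, "B", 85.0)]
kinds = ["num", "cat", "num"]
ranges = column_ranges(rows, kinds)

within_all = gower(rows[0], rows[1], kinds, ranges)
between_all = gower(rows[0], rows[2], kinds, ranges)
within_sel = gower(rows[0], rows[1], kinds, ranges, keep=[0, 1])
between_sel = gower(rows[0], rows[2], kinds, ranges, keep=[0, 1])

print(f"all variables: within={within_all:.3f} between={between_all:.3f}")
print(f"without noise: within={within_sel:.3f} between={between_sel:.3f}")
```

Because every variable gets the same weight in the average, the single noise column pulls the within-subtype and between-subtype dissimilarities toward each other. With 50 variables the same dilution could come from many columns at once.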

  • #2
    I don't see how there could be. For all any automated method can know or tell, one variable not at all related to any others might be really useful for a cluster analysis, so even PCA could miss something crucial. My only rather feeble advice is not to throw them all in, but to get expert advice on which are important and to keep plotting the data. Ward's method may find isolated balls in data space when they exist, but it is fairly useless for bananas or do[ugh]nuts or more complicated shapes, let alone continua of variation.

    PS You evidently didn't read the FAQ Advice to the end as requested: https://www.statalist.org/forums/help#spelling
    Last edited by Nick Cox; 30 Jul 2018, 08:40.



    • #3
      Thank you Mr. Nick Cox. My apologies for using STATA instead of Stata. Actually, I have asked the experts about the important variables, and the 50 variables I mentioned in the question are the ones I kept on their advice. You kindly advise me to keep plotting the data. My question is: how should I judge the importance of variables by plotting them? Could you please give me more details?



      • #4
        Not easily. 50 variables is more than I advise playing with at once. If the experts can't give indications of very important, important and less important variables, then they are not experts, or they're guessing too and the project is a fishing expedition. A desperate solution is to throw variables into a PCA and look at, say, all possible scatter plots for the most important components. But even then it could be that the low-order components distinguish better. Much depends on whether your variables refer to patients, their circumstances and measures more directly related to the condition(s) concerned.
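
The caution about low-order components can be shown with a small sketch, again in plain Python rather than Stata. It assumes a PCA of the covariance matrix of just two unstandardized variables, for which the principal axis has a closed form; standardizing the variables first would change the picture. The data are made up, and the group labels are known here only so the separation can be checked, which is not the situation in a real cluster analysis. The names `pca_2d` and `separated` are hypothetical helpers.

```python
import math


def pca_2d(points):
    """PCA of two variables from their covariance matrix (closed form)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Angle of the major axis of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # Component scores: centered data rotated onto the principal axes.
    scores = [((x - mx) * c + (y - my) * s, -(x - mx) * s + (y - my) * c)
              for x, y in points]
    var1 = sxx * c * c + 2 * sxy * s * c + syy * s * s
    var2 = sxx * s * s - 2 * sxy * s * c + syy * c * c
    return scores, (var1, var2)


def separated(a, b):
    """True if the two sets of scores occupy non-overlapping intervals."""
    return max(a) < min(b) or max(b) < min(a)


# Made-up data: the first variable varies widely but ignores the grouping;
# the second varies little overall but splits the two groups.
group0 = [(-10, 0.0), (-5, 0.1), (0, -0.1), (5, 0.1), (10, 0.0)]
group1 = [(-9, 1.0), (-4, 0.9), (1, 1.1), (6, 0.9), (11, 1.0)]

scores, variances = pca_2d(group0 + group1)
share_pc1 = variances[0] / sum(variances)
pc1 = [s[0] for s in scores]
pc2 = [s[1] for s in scores]
k = len(group0)

print(f"share of variance on PC1: {share_pc1:.3f}")
print("groups separated on PC1:", separated(pc1[:k], pc1[k:]))
print("groups separated on PC2:", separated(pc2[:k], pc2[k:]))
```

In this constructed case almost all of the variance sits on the first component, yet only the second one tells the groups apart, which is why looking at scatter plots of the leading components alone can mislead.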



        • #5
          I get the picture. Thank you again Mr. Nick Cox.

