Cluster analysis on large dataset.

Gabriel Fernandez

Join Date: Apr 2014

Posts: 12
#1

23 May 2014, 05:11

Hello,
I'm having problems with cluster analysis.
I'm still working on the subject I mentioned here (http://www.statalist.org/forums/foru...osite-variable).

People above me specifically asked me to group variables by theme (eating well, socio-economic, school life, ...), using multiple correspondence analysis (MCA) and hierarchical classification.
For each theme, I have 3-10 dummy variables, and sometimes a categorical variable with 3 categories. There are no continuous variables.
There are around 12000 observations for each variable.

I first ran an MCA, which confirmed that my variables were linked. Everything went well.

For the hierarchical classification, I tried the cluster linkage commands, using 5 variables from the same theme (3 of them are strongly linked together; the other 2 are linked to each other but not to the first 3).
Only the singlelinkage option seems to work (it takes about 1 min for Stata to generate the cluster data); the other linkage options took more than 10 min and I aborted the command. I used the option "measure(matching)".

I then wanted to use the "cluster stop" command and got no result (the command runs without error but displays nothing). I do get a result if I cluster only 30 observations.
Can you explain why?

"cluster list" gives me:
. cluster list
_clus_7  (type: hierarchical,  method: single,  similarity: matching)
      vars: _clus_7_id (id variable)
            _clus_7_ord (order variable)
            _clus_7_hgt (height variable)
     other: cmd: cluster singlelinkage v1 v2 v3 v4 v5, measure(matching)
            varlist: v1 v2 v3 v4 v5
            range: 1 0

I asked for a "cluster tree" and got the error message "too many leaves; consider using the cutvalue() or cutnumber() options".
I tried the cutnumber() option and got "cannot cut exactly x groups because of ties in the dendrogram" whatever x is (except 32, probably because 32 = 2^5, i.e. one group for each possible combination of values of the 5 variables).
I tried the cutvalue() option, which worked only for values from 0.8 to 0.999... .
The dendrograms I managed to get always had the same look:

This doesn't look at all like a normal tree...
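
One possible explanation for the ties, offered as a sketch rather than a diagnosis: with 5 dummy variables there are at most 2^5 = 32 distinct response patterns, and the matching coefficient on 5 variables can only take the values 0, 0.2, 0.4, 0.6, 0.8 and 1. Most of the 12000 observations are therefore exact duplicates of one another. Assuming the variables are named v1 to v5 as in the cluster list output (pattern and n_obs are made-up names), the patterns can be counted like this:

```stata
* Number each distinct combination of the five dummies (at most 2^5 = 32)
egen pattern = group(v1 v2 v3 v4 v5)
tabulate pattern

* Same information as a small dataset: one row per pattern with its frequency
preserve
contract v1 v2 v3 v4 v5, freq(n_obs)
list, clean
restore
```

If the table shows only a few dozen rows, a tree built on the observations can branch at only a handful of distinct heights, which would be consistent with the cutnumber() and cutvalue() behaviour described above.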

Lastly, I wanted to do the cluster analysis via a matrix, but I could not generate one because matsize was too small (even when set to 11000).

As a reminder, my ultimate goal is to generate composite variables from dummy variables that are closely linked. Is there a way to do it via the cluster command? How?

Thank you for your help.

Paul T Seed

Join Date: Apr 2014

Posts: 66
#2

23 May 2014, 07:37

I have generally been disappointed by cluster analysis, so am not surprised by your problems. I have also had people above me telling me to use it in situations where it is completely inappropriate, as you seem to have done. I don't know why it has such an undeserved reputation. It is designed to group observations, not variables, so even a completely successful cluster analysis would not take you noticeably nearer to your goal of grouping the variables.

There are two methods that generally work:
1) For variables with a theme (as you describe them) just group them by theme.

2) If you want to use the statistical structure, use factor analysis (or principal component analysis). They generally give similar results, but factor analysis has a better theoretical basis and attempts to adjust for measurement error in arriving at the latent variables.
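
A minimal sketch of option 2 in Stata, assuming the dummies of one theme are named v1 to v5 (as in the cluster list output earlier in the thread) and using a made-up name, theme_f1, for the saved score. It works from ordinary Pearson correlations, which is a simplification for dummy variables; Stata also has a tetrachoric command that may suit binary items better.

```stata
* Assumed names: v1-v5 are the dummy variables of one theme
* Principal-component factors from the ordinary (Pearson) correlation matrix
factor v1 v2 v3 v4 v5, pcf

* Rotate the retained factors (varimax by default) to ease interpretation
rotate

* Save the score on the first rotated factor; theme_f1 is a made-up name
predict theme_f1
```

Running this one theme at a time, rather than on all variables at once, keeps each score tied to the variables of that theme only.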

I know less about Multiple Correspondence Analysis and hierarchical clustering of variables. But I think they are very different from cluster analysis as traditionally defined.

BW
Gabriel Fernandez

Join Date: Apr 2014

Posts: 12
#3

23 May 2014, 10:07

I think you just showed me my main mistake: I was clustering observations, not variables.
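
For reference, a hedged sketch of how variables rather than observations might be clustered in Stata: build a dissimilarity matrix between the variables and pass it to clustermat. The names v1 to v5 come from the cluster list output in the first post; D, varclus and varname are made up for illustration, and average linkage is just one possible choice.

```stata
* Assumed names: v1-v5 are the dummies; D, varclus and varname are made up
* Dissimilarity between VARIABLES (5 x 5), not between the 12000 observations
matrix dissimilarity D = v1 v2 v3 v4 v5, variables matching dissim(oneminus)
matrix list D

* clustermat with clear replaces the data in memory, hence preserve/restore
preserve
clustermat averagelinkage D, name(varclus) clear labelvar(varname)
cluster dendrogram varclus, labels(varname)
restore
```

Because the matrix has one row per variable, it stays tiny (5 x 5 here), so the matsize limit that blocked the observation-level matrix should not come into play.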

About the 2 methods you propose:
1) That's what I want to do, but just summing the variables of a theme is not very interesting. However, this is the easiest way and I may do it if I don't find a more "statistical" way.

2) I think MCA is a kind of factor analysis (I was told about MCFA, multiple component factor analysis, but found nothing about it in Stata), and I tried it too, but I don't know what to do with the results: I get 35 factors for 35 variables, and the coefficients of each factor never go beyond 0.6.
I tried keeping only the most significant factors, but again, my goal is to develop a variable "eating well", not a factor that is mostly linked to the variables of this theme but still linked to all the other variables.
I don't understand how it is even possible to interpret this kind of factor.
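
For the simple summing approach in point 1, a rough sketch under assumed names: v1 to v5 stand in for the dummies of the "eating well" theme and eating_well is a made-up name for the composite. The alpha step is an optional check on whether the items hang together, not something proposed in the thread.

```stata
* Assumed names: v1-v5 are the dummies of the "eating well" theme
* Check how consistently the items move together before combining them
alpha v1 v2 v3 v4 v5, item

* Simple composite: number of items equal to 1 (missing if all items missing)
egen eating_well = rowtotal(v1 v2 v3 v4 v5), missing
tabulate eating_well
```

Unlike a factor score computed on all 35 variables, this composite depends only on the variables of its own theme.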