Hello,
I'm having problems with cluster analysis.
I'm still working on the subject I mentioned here (http://www.statalist.org/forums/foru...osite-variable).
My supervisors specifically asked me to try to group variables by theme (eating well, socio-economic, school life, ...), using multiple correspondence analysis (MCA) and hierarchical classification.
For each theme, I have 3-10 dummy variables, and sometimes a categorical variable with 3 categories. There are no continuous variables.
There are around 12000 observations for each variable.
I first ran the MCA, which confirmed that my variables were linked. Everything went well.
For the hierarchical classification, I tried the cluster linkage commands, using 5 variables from the same theme (3 of them are strongly linked together; the other 2 are linked to each other but not to the first 3).
Only the singlelinkage option seems to work (it takes about 1 minute for Stata to generate the cluster data); the other linkage options took more than 10 minutes, so I aborted the command. I used the option
"measure(matching)"
I then wanted to use the "cluster stop" command and got no result (the command runs without error but displays nothing). I did get a result when I clustered only 30 observations.
Can you explain why?
"cluster list" gives me :
. cluster list
_clus_7 (type: hierarchical, method: single, similarity: matching)
vars: _clus_7_id (id variable)
_clus_7_ord (order variable)
_clus_7_hgt (height variable)
other: cmd: cluster singlelinkage v1 v2 v3 v4 v5,
measure(matching)
varlist: v1 v2 v3 v4 v5
range: 1 0
I asked for a "cluster tree" and got the "too many leaves; consider using the cutvalue() or cutnumber() options" error message.
I tried the cutnumber option and got "cannot cut exactly x groups because of ties in the dendrogram" whatever x is (except 32, probably because 32 = 2^5, i.e. one group for each possible combination of values of the 5 dummy variables).
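To make the 32 = 2^5 remark concrete, here is a minimal Python sketch (not Stata code). It assumes that "matching" is the simple matching coefficient, i.e. the share of variables on which two observations agree, and that all 5 variables are 0/1 dummies. The function and variable names are mine, for illustration only.

```python
from itertools import product

def matching_similarity(a, b):
    """Simple matching coefficient: share of variables on which a and b agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

n_vars = 5  # five 0/1 dummies, as in the post

# Every observation must show one of these response patterns.
patterns = list(product((0, 1), repeat=n_vars))

# All similarity values that any pair of observations can take.
similarities = sorted({matching_similarity(a, b) for a in patterns for b in patterns})

print(len(patterns))   # number of distinct patterns
print(similarities)    # distinct similarity values
```

Under these assumptions there are only 32 possible patterns and only 6 possible similarity values (0, 0.2, ..., 1), so with around 12000 observations almost every pair is tied with many other pairs.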
I tried the cutvalue option, which worked only for values from 0.8 to 0.999... .
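The 0.8 boundary can be reproduced outside Stata with a small Python sketch of single linkage on the distinct response patterns. It assumes the simple matching coefficient and, importantly, that all 32 patterns actually occur in the data, which I have not checked. The helper names are hypothetical and this is not how Stata implements the command.

```python
from itertools import combinations, product

def matching_similarity(a, b):
    """Share of variables on which the two patterns agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def single_linkage_groups(points, threshold):
    """Number of groups when every pair with similarity >= threshold is chained together."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if matching_similarity(points[i], points[j]) >= threshold:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

# Assumption: every one of the 2^5 patterns is present in the data.
patterns = list(product((0, 1), repeat=5))

groups_by_threshold = {t: single_linkage_groups(patterns, t) for t in (1.0, 0.9, 0.8, 0.6)}
print(groups_by_threshold)
```

In this sketch the patterns stay in 32 separate groups for any threshold above 0.8 and collapse into a single group at 0.8, because every pattern can be reached from any other by changing one variable at a time. That would leave only one level at which the tree can be cut, which seems consistent with the cutvalue range and the cutnumber error I describe, though I cannot confirm that this is what Stata does internally.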
The dendrograms I managed to get always looked the same:
[dendrogram image]
This doesn't look at all like a normal tree...
Lastly, I wanted to do the cluster analysis via a dissimilarity matrix, but I could not generate one because matsize was too small (even when set to 11000).
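For scale, here is the arithmetic behind the matrix problem as a short Python sketch. The numbers come from this post (about 12000 observations, matsize set to 11000, 5 dummies). Working on distinct response patterns instead of individual observations is only an idea I am considering, not something I have tried.

```python
n_obs = 12000            # approximate number of observations
matsize_tried = 11000    # matsize value that was still too small
n_patterns_max = 2 ** 5  # at most 32 distinct patterns for 5 dummies

# An observation-level matrix needs one row and column per observation.
fits_observation_level = n_obs <= matsize_tried

# Number of distinct pairs whose (dis)similarity has to be computed.
full_pairs = n_obs * (n_obs - 1) // 2
pattern_pairs = n_patterns_max * (n_patterns_max - 1) // 2

print(fits_observation_level)  # False: 12000 rows exceed 11000
print(full_pairs)              # 71994000
print(pattern_pairs)           # 496
```

A pattern-level matrix would be at most 32 x 32, but it would ignore how many observations share each pattern unless that is handled separately.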
As a reminder, my ultimate goal is to generate composite variables from dummy variables that are closely linked. Is there a way to do it with the cluster command? How?
Thank you for your help.