Bootstrapping K-means cluster analysis + canonical correlation analysis

Bart Lubberts

Join Date: Sep 2017

Posts: 5
#1

21 Sep 2017, 04:19

Hi -

I would like to bootstrap a K-means cluster analysis.

My dataset includes 110 clients. For each client I took 4 measurements (x, y, z, b).
Based on these measurements I want to divide the clients into two groups (A and B).
I then propose a formula that predicts whether a client belongs to group A or B.
In addition, for each measurement I calculate a cut-off point that assigns a client to group A or B.

I used the following syntax for the cluster analysis in Stata:
cluster kmeans x y z b, k(2) measure(L2) name(name) start(krandom)

I then ran a canonical linear discriminant analysis (candisc) to obtain coefficients for each of the 4 measurements. Syntax:
candisc x y z b, group(name)
To display the raw (unstandardized) canonical coefficients. Syntax:
matrix list e(L_unstd)
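
To make "the formula" concrete, here is a rough sketch of how the raw coefficients could be turned into a score per client. It assumes candisc has just been run as above with two groups, so that e(L_unstd) has a single column, and that the constant sits in a row named _cons. The names L and score are only illustrative.

```stata
* Sketch only: run directly after  candisc x y z b, group(name)
* Assumes one discriminant function (two groups) and a _cons row in e(L_unstd).
matrix L = e(L_unstd)
matrix list L

* Build the linear score by hand from the raw coefficients.
* rownumb() looks each row up by name, so the row order does not matter.
generate double score = L[rownumb(L, "_cons"), 1]   ///
    + L[rownumb(L, "x"), 1] * x                     ///
    + L[rownumb(L, "y"), 1] * y                     ///
    + L[rownumb(L, "z"), 1] * z                     ///
    + L[rownumb(L, "b"), 1] * b

* Compare the score across the two k-means groups.
tabstat score, by(name) statistics(n mean min max)
```

If the _cons row is not present in the stored matrix, score will come out missing, so it is worth checking the matrix list output first.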

No questions so far...
Now I worry that the cluster analysis and the resulting formula and cut-off points are sample-dependent, so I want to create a sample-independent formula.

Hence:
1. I would like to draw multiple bootstrap samples (n=1000), with replacement, from my dataset.
2. On each of these bootstrap samples I will run my previous K-means cluster analysis: cluster kmeans x y z b, k(2) measure(L2) name(name) start(krandom)
3. For each new clustering, I will then calculate the raw canonical coefficients and build the formula.
4. I will then calculate the cut-off points.
5. Lastly, I will pool the results of 3. and 4. to get a sample-independent formula and sample-independent cut-off points.

My questions are:
What should the Stata syntax be for steps 1 and 2?
And can (or should) it be combined with step 3?

All help is much appreciated.
Thank you,
Bart

Last edited by Bart Lubberts; 21 Sep 2017, 04:37.
Bart Lubberts
#2

21 Sep 2017, 06:40

I have now written the following syntax, combining steps 1, 2 and 3. Unfortunately it gives an error: "invalid reps"
bootstrap cluster kmeans x y z b, k(2) measure(L2) start(krandom)=r(cluster kmeans x y z b, k(2) measure(L2) start(krandom)), reps (100) cluster(x y z b) idcluster(newcluster): candisc x y z b, group(newcluster)

- Thank you for your help -
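
For reference, the bootstrap prefix expects the form "bootstrap exp_list, options: command", so one possible way to combine steps 1 to 3 is to wrap the clustering and candisc in a small r-class program and bootstrap that. The following is only a sketch under assumptions: the program name kmboot, the cluster name bk, the returned names (cx, cy, ...), the seed and the saved file name are all made up for illustration, and it again assumes e(L_unstd) has one column and a _cons row. Note also that bootstrap's own cluster() and idcluster() options are meant for resampling pre-existing groups of observations, not for k-means clusters, and that k-means group labels are arbitrary, so the sign of the coefficients may flip between replications and would need aligning before pooling.

```stata
* Sketch only: kmboot, bk and the returned names are hypothetical.
capture program drop kmboot
program define kmboot, rclass
    * Remove the cluster analysis (and its variable) left by a previous run.
    capture cluster drop bk
    capture drop bk

    * Step 2: k-means on the current (re)sample.
    cluster kmeans x y z b, k(2) measure(L2) name(bk) start(krandom)

    * Step 3: raw canonical coefficients for the new grouping.
    candisc x y z b, group(bk)
    matrix L = e(L_unstd)
    return scalar cx    = L[rownumb(L, "x"), 1]
    return scalar cy    = L[rownumb(L, "y"), 1]
    return scalar cz    = L[rownumb(L, "z"), 1]
    return scalar cb    = L[rownumb(L, "b"), 1]
    return scalar ccons = L[rownumb(L, "_cons"), 1]
end

* Step 1: bootstrap draws the resamples and calls kmboot on each one.
bootstrap cx=r(cx) cy=r(cy) cz=r(cz) cb=r(cb) ccons=r(ccons), ///
    reps(1000) seed(12345) saving(kmboot_reps, replace): kmboot
```

The saved file kmboot_reps.dta would then hold one row of coefficients per replication, which could be inspected (and sign-aligned) before any pooling in step 5.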