Hi -
I would like to bootstrap a k-means cluster analysis.
My dataset includes 110 clients. On each client I took 4 measurements (x, y, z, b).
Based on the measurements I aimed to divide these clients into two groups (A and B).
Thereafter, I propose a formula that can predict if a client belongs to group A or B.
In addition, I calculate for each measurement a cut-off point that assigns a client to group A or B.
I used the following Stata syntax for the cluster analysis:
cluster kmeans x y z b, k(2) measure(L2) name(name) start(krandom)
I then performed a canonical linear discriminant analysis to obtain a coefficient for each of the 4 measurements. Syntax:
candisc x y z b, group(name)
To display the raw (unstandardized) canonical coefficients, I used:
matrix list e(L_unstd)
No questions so far...
Now I worry that the cluster analysis and the related formula and cut-off points are sample dependent, so I want to create a sample-independent formula.
Hence,
1. I would like to draw 1000 bootstrap samples from my dataset, each sampled with replacement.
2. On each of these bootstrap samples I will rerun my previous k-means cluster analysis --> cluster kmeans x y z b, k(2) measure(L2) name(name) start(krandom)
3. For each new cluster solution, I will then calculate the raw canonical coefficients and build the formula.
4. I will then calculate the cut-off points.
5. Lastly, I will pool the results of 3. and 4. to get a sample-independent formula and sample-independent cut-off points.
My questions are:
What should the Stata syntax be for 1. and 2.?
And can (or should) it be combined with 3. ?
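To make the question concrete, here is a rough, untested sketch of how I imagine steps 1 to 3 might be combined, by wrapping the clustering and the discriminant analysis in an rclass program and passing it to the bootstrap prefix. The program name km_candisc, the cluster name bsclus, the seed, and the file name bs_results are placeholders I made up. I am assuming that e(L_unstd) has one row per measurement plus a row named _cons, and that forcing the coefficient on x to be positive is an acceptable way to deal with the arbitrary 1/2 labelling of the clusters across replicates. Both assumptions may be wrong.

```stata
capture program drop km_candisc
program define km_candisc, rclass
    tempname L s
    // Drop any cluster solution carried over from an earlier run
    capture cluster drop bsclus
    // Step 2: k-means on the current (resampled) data
    cluster kmeans x y z b, k(2) measure(L2) name(bsclus) start(krandom)
    // Step 3: discriminant analysis on the new clusters
    candisc x y z b, group(bsclus)
    matrix `L' = e(L_unstd)
    // Assumption: fix the sign so the coefficient on x is positive,
    // because cluster labels 1/2 are arbitrary in each replicate
    scalar `s' = cond(el(`L', rownumb(`L', "x"), 1) < 0, -1, 1)
    return scalar b_x    = `s' * el(`L', rownumb(`L', "x"), 1)
    return scalar b_y    = `s' * el(`L', rownumb(`L', "y"), 1)
    return scalar b_z    = `s' * el(`L', rownumb(`L', "z"), 1)
    return scalar b_b    = `s' * el(`L', rownumb(`L', "b"), 1)
    // Assumption: the constant is stored in a row named _cons
    // (el() returns missing if that row does not exist)
    return scalar b_cons = `s' * el(`L', rownumb(`L', "_cons"), 1)
end

// Step 1: 1000 bootstrap samples drawn with replacement
bootstrap bx=r(b_x) by=r(b_y) bz=r(b_z) bb=r(b_b) cons=r(b_cons), ///
    reps(1000) seed(12345) saving(bs_results, replace): km_candisc

// Percentile intervals for the pooled coefficients
estat bootstrap, percentile
```

If something like this is reasonable, I assume the cut-off points from step 4 could be returned from the same program as additional r() scalars and pooled in the same way.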
All help is much appreciated.
Thank you,
Bart