  • Bootstrapping K-means cluster analysis + canonical correlation analysis

    Hi -

    I would like to bootstrap a k-means cluster analysis.

    My dataset includes 110 clients. On each client I took 4 measurements (x, y, z, b).
    Based on these measurements I aimed to divide the clients into two groups (A and B).
    I then propose a formula that predicts whether a client belongs to group A or B.
    In addition, I calculate a cut-off point for each measurement that assigns a client to group A or B.

    I used the following syntax for the cluster analysis in Stata:
    cluster kmeans x y z b, k(2) measure(L2) name(name) start(krandom)

    I then performed a canonical discriminant analysis to obtain coefficients for each of the 4 measurements. Syntax:
    candisc x y z b, group(name)
    To display the raw (unstandardized) canonical coefficients:
    matrix list e(L_unstd)

    No questions so far...
    Now I worry that the cluster analysis and the related formula and cut-off points are sample-dependent, so I want to create a sample-independent formula.

    Hence,
    1. I would like to draw 1000 bootstrap samples (with replacement) from my dataset.
    2. On each of these bootstrap samples I will run my previous k-means cluster analysis --> cluster kmeans x y z b, k(2) measure(L2) name(name) start(krandom)
    3. For each new cluster solution, I will then calculate the raw canonical coefficients and build the formula.
    4. I will then calculate the cut-off points.
    5. Lastly, I will pool the results of 3. and 4. to get a sample-independent formula and sample-independent cut-off points.

    My questions are:
    What should the Stata syntax be for 1. and 2.?
    And can (or should) it be combined with 3.?

    All help is much appreciated.
    Thank you,
    Bart

    Last edited by Bart Lubberts; 21 Sep 2017, 04:37.

  • #2
    I have now written the following syntax, combining 1., 2. and 3. Unfortunately it gives an error: "invalid reps"
    bootstrap cluster kmeans x y z b, k(2) measure(L2) start(krandom)=r(cluster kmeans x y z b, k(2) measure(L2) start(krandom)), reps (100) cluster(x y z b) idcluster(newcluster): candisc x y z b, group(newcluster)

    - Thank you for your help -
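One possible way to combine steps 1 to 3 is sketched below; it is untested here. The bootstrap prefix expects the form "bootstrap exp_list, options: command", where exp_list refers to results returned by the command. Because cluster kmeans and candisc are two separate commands, the sketch wraps them in a small r-class program and lets bootstrap resample the data and call that program. The names bskmeans, bsgrp, b_x ... b_cons and bs_coefs are hypothetical, the seed is arbitrary, and the sign normalization is an assumption added here: the sign of a canonical function is arbitrary, so without some convention the coefficients from different replications cannot be pooled meaningfully. The sketch also assumes that e(L_unstd) lists the variables in varlist order, followed by a _cons row.

```stata
* Hypothetical wrapper: one k-means run plus one candisc run per call
capture program drop bskmeans
program define bskmeans, rclass
    * Drop any cluster solution left over from an earlier replication
    capture cluster drop bsgrp
    * Steps 2 and 3 of the post: cluster, then discriminate on the clusters
    quietly cluster kmeans x y z b, k(2) measure(L2) name(bsgrp) start(krandom)
    quietly candisc x y z b, group(bsgrp)
    tempname L
    matrix `L' = e(L_unstd)
    * Assumption: fix the arbitrary sign by forcing the x coefficient positive
    local s = cond(`L'[1,1] < 0, -1, 1)
    return scalar b_x = `s' * `L'[1,1]
    return scalar b_y = `s' * `L'[2,1]
    return scalar b_z = `s' * `L'[3,1]
    return scalar b_b = `s' * `L'[4,1]
    * Look the constant up by row name; el() returns missing if it is absent
    return scalar b_cons = `s' * el(`L', rownumb(`L', "_cons"), 1)
end

* Step 1: bootstrap draws the samples with replacement and calls the program
bootstrap b_x=r(b_x) b_y=r(b_y) b_z=r(b_z) b_b=r(b_b) b_cons=r(b_cons), ///
    reps(1000) seed(12345) saving(bs_coefs, replace): bskmeans
```

The /// line continuation only works in a do-file. With saving(), the replicate coefficients are written to bs_coefs.dta, so their distribution can be inspected before deciding how to pool them for step 5. The cut-off points of step 4 are not covered, since the post does not say how they are computed.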
