
  • Bootstrap random samples (with replacement) for cluster analysis - bsample or bootstrap code?

    Hi,

    I am trying to bootstrap random samples (with replacement) for my k-means cluster analysis.
    My dataset contains 110 samples and 4 variables.
    I use the following syntax for my cluster analysis:

    cluster kmeans var1 var2 var3 var4, k(2) measure(L2) name(clustervar) start(krandom)

    First I want to draw a new dataset (of the same size as the original) by resampling the original dataset with replacement, and then cluster the new dataset; I will repeat this several times.

    Should I use -bsample- or -bootstrap-?
    Is there an extensive description of how to write the syntax? I found the Stata documentation for -bootstrap- and -bsample- helpful but not sufficient.

    Thank you

  • #2
    What exactly is it you intend to bootstrap here? Generally when we bootstrap a command, whether through the convenience of -bootstrap-, or directly with -bsample-, the command returns some scalar statistic(s) whose values we calculate in various bootstrap replicates, and then we examine the distribution of that (those) statistic(s) across replicates. But -cluster kmeans- doesn't calculate any scalar statistics. So I don't know what you're hoping to do here.

    Ordinarily people use -bootstrap- because it takes care of the "dirty work" of looping over replications, calling -bsample- to get the samples, running the command you are bootstrapping, and accumulating and summarizing the results. But one thing is clear: -bootstrap- expects you to tell it what scalar results you want to sample. And for your -cluster kmeans- command, there isn't anything suitable. If you are trying to bootstrap the cluster assignments themselves in some way, then, because that is a vector, not a scalar, you will not be able to use -bootstrap- and you will have to write your own loop, calls to -bsample-, and management of a -postfile- to accumulate your results. What you will do with those results, in the end, isn't clear to me as I don't see a way to interpret them.
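
    For what it's worth, the manual approach described above (your own loop, calls to -bsample-, and a -postfile- to accumulate results) has this general shape. This is an untested sketch: the statistic posted, the replication count, and the file name "results" are placeholders for whatever you actually want to accumulate.

    Code:
    tempname memhold
    postfile `memhold' rep stat using results, replace
    forvalues r = 1/100 {
        preserve
        bsample                     // resample with replacement, same size as original
        summarize var1, meanonly    // placeholder: compute whatever you need
        post `memhold' (`r') (r(mean))
        restore                     // recover the original data for the next replicate
    }
    postclose `memhold'
    use results, clear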



    • #3
      Thank you Clyde Schechter, I appreciate your help!

      The overall goal is to create a sample-independent formula that can divide my dataset into two groups (group1 and group2).

      My dataset contains 110 samples and 4 variables.
      1. I have clustered the dataset into two groups using: "cluster kmeans var1 var2 var3 var4, k(2) measure(L2) name(clustervar) start(random)"
      2. I then performed a canonical correlation analysis to create coefficients for each of the 4 measurements, using: "candisc var1 var2 var3 var4, group(clustervar)"
      3. And then listed the raw canonical coefficients, using: "matrix list e(L_unstd)"

      This leads to the following sample-dependent formula:
      0.76*var1 + 0.82*var2 + 1.17*var3 - 0.20*var4 = y
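
      (If it helps, a formula like that can be applied to the data with -generate-; a one-line sketch using the sample-dependent coefficients above:

      Code:
      generate y = 0.76*var1 + 0.82*var2 + 1.17*var3 - 0.20*var4
      )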

      Now I want to create a sample-independent formula.
      So I was thinking I could create multiple bootstrapped samples (with replacement) of my dataset.
      On each of these bootstrapped samples I can run my previous k-means cluster analysis and then do the canonical correlation analysis.
      For each bootstrapped sample (covering Group1 and Group2) I will create a new formula, and then take the averages of the coefficients.
      I hope this clarifies what I am trying to do.

      Do you have a suggestion for programming this in Stata, or where to start (-bsample-?)
      Thanks again,
      Bart



      • #4
        Well, I can get you partly there. I'm not familiar with -candisc- and I really have only the barest understanding of canonical correlation analysis. But here's the gist of what you could do:

        Code:
        clear *
        
        capture program drop one_sample
        program define one_sample, rclass
            // cluster the (re)sampled data into two groups
            cluster kmeans var1 var2 var3 var4, k(2) measure(L2) ///
                name(clustervar) start(random)
            // canonical discriminant analysis on the cluster assignments
            candisc var1 var2 var3 var4, group(clustervar)
            // return the four raw canonical coefficients as scalars
            matrix M = e(L_unstd)
            forvalues i = 1/4 {
                return scalar coef`i' = M[`i', 1]
            }
            exit
        end
        
        use my_dataset
        
        bootstrap coef1 = r(coef1) coef2 = r(coef2) coef3 = r(coef3) coef4 = r(coef4), ///
            reps(100) saving(coefficients, replace) seed(1234): one_sample

        This will get you a new data set, coefficients.dta, with 100 sets of coefficients for var1-var4 obtained from bootstrap samples. You can then -use- that file and average them, or do whatever else you want. You can, of course, pick the number of replications you want and use your favorite random-number seed.
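
        For the averaging step, something along these lines should work (untested sketch; it assumes the saved variables are named coef1-coef4, as in the -bootstrap- call above):

        Code:
        use coefficients, clear
        summarize coef1-coef4           // look at the distributions first
        collapse (mean) coef1-coef4     // or reduce to a single row of means
        list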

        Notes:
        1. I have not tested this code. Beware of typos, unbalanced parens and quotes, etc. I haven't thought about issues that may arise if either -cluster kmeans- or -candisc- fails to execute correctly for some of the samples, etc. But this will at least give you a start.
        2. I'm not sure that your idea of averaging these coefficients over an ensemble of bootstrap samples actually makes sense. My understanding of bootstrap, limited as it is, is that it generates usable estimates of standard errors in situations where analytic calculation of those is difficult or impossible. My understanding is that the means of bootstrap estimates are in fact not an improvement over the original-sample estimates in any useful sense. But, I don't want to push too hard on this argument as it is at the fringes of my understanding.



        • #5
          Thank you Clyde Schechter for your help!
