I would like to calculate and plot the gap statistic introduced by Tibshirani et al (2001) as a form of robustness check to test the cluster validity of a clustering exercise based on Gower method and Wards linkage.
I have successfully calculated the "within-cluster dispersion" on my real data (number of clusters k =10) implementing these lines of code (step 1):
I am struggling to calculate the within cluster dispersion on simulated data (number of clusters k=10) using these lines of codes (step 2):
With the above mentioned lines of code I get this error:
right after this:
I have noticed that the first step of the analysis works only if the max number of clusters to be tested is inputed in numerical form and I am applying this very same rule also to the second step of the analysis.
I wish I could perform then the comparison between real data and simulated data using these final lines of code (step 3):
I am very new to MATA and I would really appreciate if anybody could help me identifying which commands should be changed/improved to perform a complete and successful gap statistic using Stata 19.
Many thanks in advance.
Anna
I have successfully calculated the "within-cluster dispersion" on my real data (number of clusters k =10) implementing these lines of code (step 1):
Code:
use "mydb", clear
local vars "cat_var_1 cat_var_2 cat_var_3 cat_var_4 num_var_w_1 num_var_w_2 num_var_w_3 num_var_w_4 num_var_w_5 "
local K_max = 10
local B = 10
matrix dissimilarity G_real = `vars', gower
mata:
D = st_matrix("G_real")
n = rows(D)
K_max = st_numscalar("K_max") // pass local to Mata
results = J(10, 2, .)
for (k=1; k<=10; k++) {
// generate clusters using Ward's linkage in Mata (vectorized)
g = J(n,1,1) // placeholder: assign all to cluster 1 (replace with clustering in Mata)
// compute Wk safely using D and g
Wk = 0
for (i=1; i<=n; i++) {
for (j=i+1; j<=n; j++) {
if (g[i] == g[j]) Wk = Wk + D[i,j]
}
}
if (Wk > 0) results[k,2] = ln(Wk)
results[k,1] = k
}
st_matrix("results", results)
end
matrix list results
Code:
clear all
macro drop _all
local B = 10
local K_max = 10
local vars "cat_var_1 cat_var_2 cat_var_3 cat_var_4 num_var_w_1 num_var_w_2 num_var_w_3 num_var_w_4 num_var_w_5"
* Load your dataset
use "mydb", clear
* Number of observations
summarize `vars'
local n = r(N)
* Store min/max for each variable
matrix mins = J(1, `: word count `vars'', .)
matrix maxs = J(1, `: word count `vars'', .)
local i = 1
foreach v of local vars {
quietly summarize `v'
matrix mins[1,`i'] = r(min)
matrix maxs[1,`i'] = r(max)
local ++i
}
* Initialize cumulative logWk matrix
matrix logWk_sim_total = J(10,1,0)
*-----------------------------------------------
* Run simulations fully in Mata
*-----------------------------------------------
mata:
B = st_numscalar("B")
K_max = st_numscalar("K_max")
n = st_numscalar("n")
mins = st_matrix("mins")
maxs = st_matrix("maxs")
// Import logWk_sim_total safely as column vector
logWk_sim_total = st_matrix("logWk_sim_total")
logWk_sim_total = vec(logWk_sim_total)
vars_count = cols(mins)
for (b=1; b<=B; b++) {
// Generate uniform simulated data
X = J(n, vars_count, 0)
for (j=1; j<=vars_count; j++) {
X[,j] = runiform(n,1)*(maxs[1,j]-mins[1,j]) + mins[1,j]
}
// Compute Gower distance
D = J(n,n,0)
for (i=1; i<=n; i++) {
for (j=i+1; j<=n; j++) {
D[i,j] = mean(abs(X[i,.]-X[j,.]) / (maxs - mins))
D[j,i] = D[i,j]
}
}
// Loop over cluster counts
for (k=1; k<=10; k++) {
// Simple cluster assignment: split into k equal-sized clusters
cluster = ceil((1::n)'*k/n)
// Compute Wk
Wk = 0
for (i=1; i<=n; i++) {
for (j=i+1; j<=n; j++) {
if (cluster[i,1]==cluster[j,1]) Wk = Wk + D[i,j]
}
}
// Add safely using single index
if (Wk>0) logWk_sim_total[k] = logWk_sim_total[k] + ln(Wk)
}
}
// Return cumulative logWk to Stata
st_matrix("logWk_sim_total", logWk_sim_total)
end
*-----------------------------------------------
* View cumulative simulated logWk
*-----------------------------------------------
matrix list logWk_sim_total
HTML Code:
<istmt>: 3200 conformability error (3 lines skipped)
Code:
// Add safely using single index > if (Wk>0) logWk_sim_total[k] = logWk_sim_total[k] + ln(Wk) > } > }
I wish I could perform then the comparison between real data and simulated data using these final lines of code (step 3):
Code:
matrix list logWk_sim_total
* Convert cumulative logWk_sim_total to average over B
matrix logWk_sim_avg = logWk_sim_total / `B'
* Retrieve logWk_real from results matrix
matrix logWk_real = J(`K_max',1,.)
forvalues k = 1/`K_max' {
logWk_real[`k',1] = results[`k',2]
}
* Compute Gap = average logWk_sim - logWk_real
matrix Gap = logWk_sim_avg - logWk_real
* Optional: store results together
matrix final = logWk_real, logWk_sim_avg, Gap
matrix colnames final = "logWk_real" "logWk_sim_avg" "Gap"
* View Gap statistic table
matrix list final
* Plot Gap vs k
twoway (line Gap 1/`K_max'), xlabel(1/`K_max') ///
title("Gap Statistic by Number of Clusters") ///
ytitle("Gap") xtitle("Number of clusters (k)")
Many thanks in advance.
Anna
