Gap statistics calculation as a robustness check for testing cluster validity using Gower dissimilarity matrix and Wards linkage

Anna Novaresio

Join Date: Jan 2021
Posts: 18

Gap statistics calculation as a robustness check for testing cluster validity using Gower dissimilarity matrix and Wards linkage

20 Mar 2026, 06:40

I would like to calculate and plot the gap statistic introduced by Tibshirani et al (2001) as a form of robustness check to test the cluster validity of a clustering exercise based on Gower method and Wards linkage.

I have successfully calculated the "within-cluster dispersion" on my real data (number of clusters k =10) implementing these lines of code (step 1):

Code:

 use "mydb", clear

local vars "cat_var_1 cat_var_2 cat_var_3 cat_var_4 num_var_w_1 num_var_w_2 num_var_w_3 num_var_w_4 num_var_w_5 "
local K_max = 10
local B = 10

matrix dissimilarity G_real = `vars', gower

mata:
D = st_matrix("G_real")
n = rows(D)
K_max = st_numscalar("K_max")  // pass local to Mata
results = J(10, 2, .)

for (k=1; k<=10; k++) {
    // generate clusters using Ward's linkage in Mata (vectorized)
    g = J(n,1,1)  // placeholder: assign all to cluster 1 (replace with clustering in Mata)
    // compute Wk safely using D and g
    Wk = 0
    for (i=1; i<=n; i++) {
        for (j=i+1; j<=n; j++) {
            if (g[i] == g[j]) Wk = Wk + D[i,j]
        }
    }
    if (Wk > 0) results[k,2] = ln(Wk)
    results[k,1] = k
}
st_matrix("results", results)
end

matrix list results

I am struggling to calculate the within cluster dispersion on simulated data (number of clusters k=10) using these lines of codes (step 2):

Code:

clear all
macro drop _all

local B = 10
local K_max = 10
local vars "cat_var_1 cat_var_2 cat_var_3 cat_var_4 num_var_w_1 num_var_w_2 num_var_w_3 num_var_w_4 num_var_w_5"

* Load your dataset
use "mydb", clear

* Number of observations
summarize `vars'
local n = r(N)

* Store min/max for each variable
matrix mins = J(1, `: word count `vars'', .)
matrix maxs = J(1, `: word count `vars'', .)
local i = 1
foreach v of local vars {
    quietly summarize `v'
    matrix mins[1,`i'] = r(min)
    matrix maxs[1,`i'] = r(max)
    local ++i
}

* Initialize cumulative logWk matrix
matrix logWk_sim_total = J(10,1,0)

*-----------------------------------------------
* Run simulations fully in Mata
*-----------------------------------------------
mata:
B = st_numscalar("B")
K_max = st_numscalar("K_max")
n = st_numscalar("n")
mins = st_matrix("mins")
maxs = st_matrix("maxs")

// Import logWk_sim_total safely as column vector
logWk_sim_total = st_matrix("logWk_sim_total")
logWk_sim_total = vec(logWk_sim_total)

vars_count = cols(mins)

for (b=1; b<=B; b++) {

    // Generate uniform simulated data
    X = J(n, vars_count, 0)
    for (j=1; j<=vars_count; j++) {
        X[,j] = runiform(n,1)*(maxs[1,j]-mins[1,j]) + mins[1,j]
    }

    // Compute Gower distance
    D = J(n,n,0)
    for (i=1; i<=n; i++) {
        for (j=i+1; j<=n; j++) {
            D[i,j] = mean(abs(X[i,.]-X[j,.]) / (maxs - mins))
            D[j,i] = D[i,j]
        }
    }

    // Loop over cluster counts
    for (k=1; k<=10; k++) {

        // Simple cluster assignment: split into k equal-sized clusters
        cluster = ceil((1::n)'*k/n)

        // Compute Wk
        Wk = 0
        for (i=1; i<=n; i++) {
            for (j=i+1; j<=n; j++) {
                if (cluster[i,1]==cluster[j,1]) Wk = Wk + D[i,j]
            }
        }

        // Add safely using single index
        if (Wk>0) logWk_sim_total[k] = logWk_sim_total[k] + ln(Wk)
    }
}

// Return cumulative logWk to Stata
st_matrix("logWk_sim_total", logWk_sim_total)
end
*-----------------------------------------------
* View cumulative simulated logWk
*-----------------------------------------------
matrix list logWk_sim_total

With the above mentioned lines of code I get this error:

HTML Code:

  <istmt>:  3200  conformability error (3 lines skipped)

right after this:

Code:

 // Add safely using single index
>         if (Wk>0) logWk_sim_total[k] = logWk_sim_total[k] + ln(Wk)
>     }
> }

I have noticed that the first step of the analysis works only if the max number of clusters to be tested is inputed in numerical form and I am applying this very same rule also to the second step of the analysis.
I wish I could perform then the comparison between real data and simulated data using these final lines of code (step 3):

Code:

matrix list logWk_sim_total

* Convert cumulative logWk_sim_total to average over B
matrix logWk_sim_avg = logWk_sim_total / `B'

* Retrieve logWk_real from results matrix
matrix logWk_real = J(`K_max',1,.)
forvalues k = 1/`K_max' {
    logWk_real[`k',1] = results[`k',2]
}

* Compute Gap = average logWk_sim - logWk_real
matrix Gap = logWk_sim_avg - logWk_real

* Optional: store results together
matrix final = logWk_real, logWk_sim_avg, Gap
matrix colnames final = "logWk_real" "logWk_sim_avg" "Gap"

* View Gap statistic table
matrix list final

* Plot Gap vs k
twoway (line Gap 1/`K_max'), xlabel(1/`K_max') ///
    title("Gap Statistic by Number of Clusters") ///
    ytitle("Gap") xtitle("Number of clusters (k)")

I am very new to MATA and I would really appreciate if anybody could help me identifying which commands should be changed/improved to perform a complete and successful gap statistic using Stata 19.
Many thanks in advance.
Anna

Tags: None

Announcement

Gap statistics calculation as a robustness check for testing cluster validity using Gower dissimilarity matrix and Wards linkage