Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Gap statistics calculation as a robustness check for testing cluster validity using Gower dissimilarity matrix and Wards linkage

    I would like to calculate and plot the gap statistic introduced by Tibshirani et al (2001) as a form of robustness check to test the cluster validity of a clustering exercise based on Gower method and Wards linkage.

    I have successfully calculated the "within-cluster dispersion" on my real data (number of clusters k =10) implementing these lines of code (step 1):
    Code:
     use "mydb", clear
    
    local vars "cat_var_1 cat_var_2 cat_var_3 cat_var_4 num_var_w_1 num_var_w_2 num_var_w_3 num_var_w_4 num_var_w_5 "
    local K_max = 10
    local B = 10
    
    matrix dissimilarity G_real = `vars', gower
    
    mata:
    D = st_matrix("G_real")
    n = rows(D)
    K_max = st_numscalar("K_max")  // pass local to Mata
    results = J(10, 2, .)
    
    for (k=1; k<=10; k++) {
        // generate clusters using Ward's linkage in Mata (vectorized)
        g = J(n,1,1)  // placeholder: assign all to cluster 1 (replace with clustering in Mata)
        // compute Wk safely using D and g
        Wk = 0
        for (i=1; i<=n; i++) {
            for (j=i+1; j<=n; j++) {
                if (g[i] == g[j]) Wk = Wk + D[i,j]
            }
        }
        if (Wk > 0) results[k,2] = ln(Wk)
        results[k,1] = k
    }
    st_matrix("results", results)
    end
    
    matrix list results
    I am struggling to calculate the within cluster dispersion on simulated data (number of clusters k=10) using these lines of codes (step 2):
    Code:
    clear all
    macro drop _all
    
    local B = 10
    local K_max = 10
    local vars "cat_var_1 cat_var_2 cat_var_3 cat_var_4 num_var_w_1 num_var_w_2 num_var_w_3 num_var_w_4 num_var_w_5"
    
    * Load your dataset
    use "mydb", clear
    
    * Number of observations
    summarize `vars'
    local n = r(N)
    
    * Store min/max for each variable
    matrix mins = J(1, `: word count `vars'', .)
    matrix maxs = J(1, `: word count `vars'', .)
    local i = 1
    foreach v of local vars {
        quietly summarize `v'
        matrix mins[1,`i'] = r(min)
        matrix maxs[1,`i'] = r(max)
        local ++i
    }
    
    * Initialize cumulative logWk matrix
    matrix logWk_sim_total = J(10,1,0)
    
    *-----------------------------------------------
    * Run simulations fully in Mata
    *-----------------------------------------------
    mata:
    B = st_numscalar("B")
    K_max = st_numscalar("K_max")
    n = st_numscalar("n")
    mins = st_matrix("mins")
    maxs = st_matrix("maxs")
    
    // Import logWk_sim_total safely as column vector
    logWk_sim_total = st_matrix("logWk_sim_total")
    logWk_sim_total = vec(logWk_sim_total)
    
    vars_count = cols(mins)
    
    for (b=1; b<=B; b++) {
    
        // Generate uniform simulated data
        X = J(n, vars_count, 0)
        for (j=1; j<=vars_count; j++) {
            X[,j] = runiform(n,1)*(maxs[1,j]-mins[1,j]) + mins[1,j]
        }
    
        // Compute Gower distance
        D = J(n,n,0)
        for (i=1; i<=n; i++) {
            for (j=i+1; j<=n; j++) {
                D[i,j] = mean(abs(X[i,.]-X[j,.]) / (maxs - mins))
                D[j,i] = D[i,j]
            }
        }
    
        // Loop over cluster counts
        for (k=1; k<=10; k++) {
    
            // Simple cluster assignment: split into k equal-sized clusters
            cluster = ceil((1::n)'*k/n)
    
            // Compute Wk
            Wk = 0
            for (i=1; i<=n; i++) {
                for (j=i+1; j<=n; j++) {
                    if (cluster[i,1]==cluster[j,1]) Wk = Wk + D[i,j]
                }
            }
    
            // Add safely using single index
            if (Wk>0) logWk_sim_total[k] = logWk_sim_total[k] + ln(Wk)
        }
    }
    
    // Return cumulative logWk to Stata
    st_matrix("logWk_sim_total", logWk_sim_total)
    end
    *-----------------------------------------------
    * View cumulative simulated logWk
    *-----------------------------------------------
    matrix list logWk_sim_total
    With the above mentioned lines of code I get this error:
    HTML Code:
      <istmt>:  3200  conformability error (3 lines skipped)
    right after this:
    Code:
     // Add safely using single index
    >         if (Wk>0) logWk_sim_total[k] = logWk_sim_total[k] + ln(Wk)
    >     }
    > }
    I have noticed that the first step of the analysis works only if the max number of clusters to be tested is inputed in numerical form and I am applying this very same rule also to the second step of the analysis.
    I wish I could perform then the comparison between real data and simulated data using these final lines of code (step 3):
    Code:
    matrix list logWk_sim_total
    
    * Convert cumulative logWk_sim_total to average over B
    matrix logWk_sim_avg = logWk_sim_total / `B'
    
    * Retrieve logWk_real from results matrix
    matrix logWk_real = J(`K_max',1,.)
    forvalues k = 1/`K_max' {
        logWk_real[`k',1] = results[`k',2]
    }
    
    * Compute Gap = average logWk_sim - logWk_real
    matrix Gap = logWk_sim_avg - logWk_real
    
    * Optional: store results together
    matrix final = logWk_real, logWk_sim_avg, Gap
    matrix colnames final = "logWk_real" "logWk_sim_avg" "Gap"
    
    * View Gap statistic table
    matrix list final
    
    * Plot Gap vs k
    twoway (line Gap 1/`K_max'), xlabel(1/`K_max') ///
        title("Gap Statistic by Number of Clusters") ///
        ytitle("Gap") xtitle("Number of clusters (k)")
    I am very new to MATA and I would really appreciate if anybody could help me identifying which commands should be changed/improved to perform a complete and successful gap statistic using Stata 19.
    Many thanks in advance.
    Anna
Working...
X