Hey folks!
I'm not a Stata pro by any means, but I want to replicate results from a study looking at CVD risk factor clusters. In their methods section they did:
- a 2-stage cluster analysis combining hierarchical and non-hierarchical (k-means) clustering methods
- The first stage was based on squared Euclidean distance and Ward's minimum variance algorithm to form initial cluster centers
- These non-random starting points were then applied at the second stage of k-means clustering to identify homogeneous subgroups (clusters)
- Reliability of the cluster solution was examined by splitting the sample into 2 random subsamples
- The clustering procedure was repeated to check for agreement (kappa, κ) in the cluster solution between the subsamples and the total sample
Would love it if someone could review or edit my code to make sure I'm doing the above correctly:
1) Identified my CVD risk variables (sbp, dbp, trig, total chol, hdl, ldl, eGFR, BMI, WC, %BF)
2) Ran correlations and removed total cholesterol because its correlation with LDL was above 0.9
pwcorr varlist, sig
3) Generated z scores for all variables
foreach var in sbp_adj_bs dbp_adj_bs ldl_value_num_bs chol_hdl_value_bs hdl_value_bs trig_value_bs bmi_calc_bs body_fat_per_bs waist_bs glucose_value_bs eGFR_bs {
egen z_`var' = std(`var')
}
4) Perform hierarchical clustering using Ward's method
cluster wardslinkage z_sbp_adj_bs z_dbp_adj_bs z_ldl_value_num_bs z_chol_hdl_value_bs z_hdl_value_bs z_trig_value_bs z_bmi_calc_bs z_body_fat_per_bs z_waist_bs z_glucose_value_bs z_eGFR_bs, name(hclust_bs) measure(L2squared)
5) Create a dendrogram to visualize the clusters and decide the number of clusters
cluster dendrogram hclust_bs, cutnumber(4)
6) Generate initial cluster centers
cluster generate clusterID_hclust_bs = group(4), name(hclust_bs)
7) Perform k-means clustering using the initial centers from hierarchical clustering
cluster kmeans z_sbp_adj_bs z_dbp_adj_bs z_ldl_value_num_bs z_chol_hdl_value_bs z_hdl_value_bs z_trig_value_bs z_bmi_calc_bs z_body_fat_per_bs z_waist_bs z_glucose_value_bs z_eGFR_bs, k(4) start(group(clusterID_hclust_bs)) name(kclust_bs) generate(clusterID_kmeans_bs)
cluster stop
8) Validate the clustering
* Sample 1
preserve
sample 1000, count
cluster kmeans z_sbp_adj_bs z_dbp_adj_bs z_ldl_value_num_bs z_chol_hdl_value_bs z_hdl_value_bs z_trig_value_bs z_bmi_calc_bs z_body_fat_per_bs z_waist_bs z_glucose_value_bs z_eGFR_bs, k(4) name(kclust1_4) generate(clusterID_sample1_4)
save sample1_4_temp.dta, replace
restore
* Sample 2
preserve
sample 1000, count
cluster kmeans z_sbp_adj_bs z_dbp_adj_bs z_ldl_value_num_bs z_chol_hdl_value_bs z_hdl_value_bs z_trig_value_bs z_bmi_calc_bs z_body_fat_per_bs z_waist_bs z_glucose_value_bs z_eGFR_bs, k(4) name(kclust2_4) generate(clusterID_sample2_4)
save sample2_4_temp.dta, replace
restore
9) Load samples and compute Kappa statistic
use sample1_4_temp.dta, clear
merge 1:1 _n using sample2_4_temp.dta, gen(_merge22)
kap clusterID_sample1_4 clusterID_sample2_4
10) I'm not sure how to validate the subgroups against the total sample, and I'm not sure I'm using the correct variable names when comparing the subgroups. Should I be using the k-means clusters or the initial (hierarchical) clusters?
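For what it's worth, here is an untested sketch of what I think the split-half check could look like, so you can tell me where it goes wrong. It assumes the Ward grouping variable is named clusterID_hclust_bs (as in step 7) and that clusterID_kmeans_bs holds the full-sample k-means solution. The names half, kclust_h1/kclust_h2 and clusterID_h1/clusterID_h2 are placeholders I made up, and the seed is arbitrary. It also assumes the cluster numbers line up between runs, which I would check with tabulate before trusting the kappa.

```stata
* Untested sketch: split the sample once into two non-overlapping halves,
* re-run k-means in each half, and compare with the full-sample solution.
local zvars z_sbp_adj_bs z_dbp_adj_bs z_ldl_value_num_bs z_chol_hdl_value_bs z_hdl_value_bs z_trig_value_bs z_bmi_calc_bs z_body_fat_per_bs z_waist_bs z_glucose_value_bs z_eGFR_bs

* Random split without re-sorting the data (halves are only roughly equal)
set seed 12345
generate byte half = cond(runiform() < 0.5, 1, 2)

forvalues h = 1/2 {
    * k-means within this half, started from the Ward groups
    cluster kmeans `zvars' if half == `h', k(4) start(group(clusterID_hclust_bs)) name(kclust_h`h') generate(clusterID_h`h')
    * Agreement between half-sample and full-sample assignments for the same people
    kap clusterID_kmeans_bs clusterID_h`h' if half == `h'
}
```

Because both cluster variables sit on the same rows, this avoids merging two separately drawn samples by _n. I don't know whether the paper re-ran the Ward step inside each half as well; this sketch reuses the full-sample Ward groups as starting points.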
Any help is much appreciated!