  • Cluster analysis help!

    Hey folks!

    I'm not a Stata pro by any means, but I want to replicate results from a study looking at CVD risk factor clusters. In their methods section, the authors describe:
    - A 2-stage cluster analysis combining hierarchical and non-hierarchical (k-means) clustering methods
    - The first stage was based on squared Euclidean distance and Ward's minimum variance algorithm to form initial cluster centers
    - These non-random starting points were then applied at the second stage of k-means clustering to identify homogeneous subgroups (clusters)
    - Reliability of the cluster solution was examined by splitting the sample into 2 random subsamples
    - The clustering procedure was repeated to check for agreement (kappa, κ) in the cluster solution between the subsamples and the total sample

    Would love if someone can review or edit my code to make sure I'm doing the above correctly:

    1) Identified my CVD risk variables (sbp, dbp, trig, total chol, hdl, ldl, eGFR, BMI, WC, %BF)
    2) Ran pairwise correlations and removed total cholesterol because its correlation with LDL was above 0.9
    pwcorr varlist, sig
    3) Generated z scores for all variables
    foreach var in sbp_adj_bs dbp_adj_bs ldl_value_num_bs chol_hdl_value_bs hdl_value_bs trig_value_bs bmi_calc_bs body_fat_per_bs waist_bs glucose_value_bs eGFR_bs {
    egen z_`var' = std(`var')
    }

    4) Perform hierarchical clustering using Ward's method on squared Euclidean distance
    cluster wardslinkage z_sbp_adj_bs z_dbp_adj_bs z_ldl_value_num_bs z_chol_hdl_value_bs z_hdl_value_bs z_trig_value_bs z_bmi_calc_bs z_body_fat_per_bs z_waist_bs z_glucose_value_bs z_eGFR_bs, name(hclust_bs) measure(L2squared)

    5) Create a dendrogram to visualize the clusters and decide the number of clusters
    cluster dendrogram hclust_bs, cutnumber(4)
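
    To back up the choice of 4 from the dendrogram, one option might be to look at the stopping-rule statistics for the hierarchical solution. This is only a rough sketch, assuming the hclust_bs analysis from step 4 is still in memory; the range of 2 to 8 groups is an arbitrary choice, not something taken from the study.

```stata
* Calinski-Harabasz pseudo-F for 2 to 8 groups
* (larger values point to more distinct clustering)
cluster stop hclust_bs, rule(calinski) groups(2/8)

* Duda-Hart Je(2)/Je(1) index and pseudo-T-squared
* (this rule is only available for hierarchical analyses)
cluster stop hclust_bs, rule(duda) groups(2/8)
```

    If these statistics and the dendrogram disagree, that would be worth flagging before moving on to the k-means stage.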
    6) Generate initial cluster centers
    cluster generate clusterID_hclust_bs = groups(4), name(hclust_bs)

    7) Perform k-means clustering using the initial centers from hierarchical clustering
    cluster kmeans z_sbp_adj_bs z_dbp_adj_bs z_ldl_value_num_bs z_chol_hdl_value_bs z_hdl_value_bs z_trig_value_bs z_bmi_calc_bs z_body_fat_per_bs z_waist_bs z_glucose_value_bs z_eGFR_bs, k(4) start(group(clusterID_hclust_bs)) name(kclust_bs) generate(clusterID_kmeans_bs)
    cluster stop
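
    To see whether the k-means clusters resemble the risk profiles reported in the study, a quick profile of the standardized means per cluster might help. This is a sketch that assumes clusterID_kmeans_bs was created by the k-means command in step 7.

```stata
* Mean, SD, and n of each standardized risk factor within each k-means cluster
tabstat z_sbp_adj_bs z_dbp_adj_bs z_ldl_value_num_bs z_chol_hdl_value_bs ///
    z_hdl_value_bs z_trig_value_bs z_bmi_calc_bs z_body_fat_per_bs ///
    z_waist_bs z_glucose_value_bs z_eGFR_bs, ///
    by(clusterID_kmeans_bs) statistics(mean sd n)
```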

    8) Validate the clustering
    * Sample 1
    preserve
    sample 1000, count
    cluster kmeans z_sbp_adj_bs z_dbp_adj_bs z_ldl_value_num_bs z_chol_hdl_value_bs z_hdl_value_bs z_trig_value_bs z_bmi_calc_bs z_body_fat_per_bs z_waist_bs z_glucose_value_bs z_eGFR_bs, k(4) name(kclust1_4) generate(clusterID_sample1_4)
    save sample1_4_temp.dta, replace
    restore


    * Sample 2
    preserve
    sample 1000, count
    cluster kmeans z_sbp_adj_bs z_dbp_adj_bs z_ldl_value_num_bs z_chol_hdl_value_bs z_hdl_value_bs z_trig_value_bs z_bmi_calc_bs z_body_fat_per_bs z_waist_bs z_glucose_value_bs z_eGFR_bs, k(4) name(kclust2_4) generate(clusterID_sample2_4)
    save sample2_4_temp.dta, replace
    restore


    9) Load samples and compute Kappa statistic
    use sample1_4_temp.dta, clear
    merge 1:1 _n using sample2_4_temp.dta, gen(_merge22)
    kap clusterID_sample1_4 clusterID_sample2_4


    10) I'm not sure how to validate the subgroups against the total sample, or whether I'm using the correct variable names when comparing the subgroups. Should I be using the k-means cluster IDs or the initial (hierarchical) ones?
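
    One thing that may be wrong with steps 8 and 9: the two sample draws pick different observations, so merge 1:1 _n pairs up unrelated people and the kappa would not measure agreement on the same individuals. A possible alternative is sketched below. It is untested against the study's actual procedure and assumes the full-sample solution is stored in clusterID_kmeans_bs; the seed and the names half, hc_half*, hc_id_half*, km_half*, km_id_half*, and km_id_split are made up for illustration.

```stata
* Randomly assign every observation to one of two roughly equal subsamples
set seed 12345
generate byte half = cond(runiform() < 0.5, 1, 2)

local zvars z_sbp_adj_bs z_dbp_adj_bs z_ldl_value_num_bs z_chol_hdl_value_bs ///
    z_hdl_value_bs z_trig_value_bs z_bmi_calc_bs z_body_fat_per_bs ///
    z_waist_bs z_glucose_value_bs z_eGFR_bs

forvalues h = 1/2 {
    * Stage 1: Ward's linkage on squared Euclidean distance within this half
    cluster wardslinkage `zvars' if half == `h', name(hc_half`h') measure(L2squared)
    cluster generate hc_id_half`h' = groups(4), name(hc_half`h')

    * Stage 2: k-means started from the Ward group means
    cluster kmeans `zvars' if half == `h', k(4) start(group(hc_id_half`h')) ///
        name(km_half`h') generate(km_id_half`h')
}

* One variable holding each person's subsample-based cluster
generate km_id_split = cond(half == 1, km_id_half1, km_id_half2)

* Cluster numbers are arbitrary in each run, so inspect the cross-tab and
* recode km_id_split to match the full-sample labels before reading kappa
tabulate km_id_split clusterID_kmeans_bs
kap km_id_split clusterID_kmeans_bs
```

    Because every observation keeps both its subsample-based and its full-sample cluster ID in the same dataset, no merge is needed. Under this approach the comparison uses the final k-means IDs, not the initial hierarchical ones.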

    Any help is much appreciated!