I am performing kmeans cluster analysis on several sets of variables. I would like to have the cluster analysis generate the same assignments for each distinct set of variables every time I run my do-file; however, this is not always the case (even after using set seed and specifying the seed number in the kmeans command).
Results (Run 1):
Results (Run 2):
For _clus_2, the assignments (number of observations and cluster means) differ between the first and second runs.
I understand kmeans clustering does not always have a unique solution, but is there any way to avoid getting different solutions when I run my script multiple times? I cannot figure out why this happens, especially why it always seems to be an issue for the kmeans command that comes second in the script (and not the first).
Thank you!
Code:
sysuse nlsw88.dta, clear
set seed 123
cluster kmeans age race married grade wage hours ttl_exp tenure, k(2) start(krandom(123))
cluster kmeans age race married grade wage hours ttl_exp, k(2) start(krandom(123))
bysort _clus_1: summ age if _clus_1 != .
bysort _clus_2: summ age if _clus_2 != .
Code (Results, Run 1):
. bysort _clus_1: summ age if _clus_1 != .

---------------------------------------------------------------------------------------------
-> _clus_1 = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        424    39.45991    3.057108         34         45

---------------------------------------------------------------------------------------------
-> _clus_1 = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        678    39.22714    2.985198         34         46

---------------------------------------------------------------------------------------------
-> _clus_1 = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      1,123    38.98842    3.102361         34         45

. bysort _clus_2: summ age if _clus_2 != .

---------------------------------------------------------------------------------------------
-> _clus_2 = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        353    39.19263    3.018363         34         46

---------------------------------------------------------------------------------------------
-> _clus_2 = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      1,527    39.06156    3.067631         34         46

---------------------------------------------------------------------------------------------
-> _clus_2 = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        360        39.5    3.059302         34         45
Code (Results, Run 2):
. bysort _clus_1: summ age if _clus_1 != .

---------------------------------------------------------------------------------------------
-> _clus_1 = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        424    39.45991    3.057108         34         45

---------------------------------------------------------------------------------------------
-> _clus_1 = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        678    39.22714    2.985198         34         46

---------------------------------------------------------------------------------------------
-> _clus_1 = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      1,123    38.98842    3.102361         34         45

. bysort _clus_2: summ age if _clus_2 != .

---------------------------------------------------------------------------------------------
-> _clus_2 = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        369    39.23306    3.026135         34         46

---------------------------------------------------------------------------------------------
-> _clus_2 = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      1,511     39.0503    3.065745         34         46

---------------------------------------------------------------------------------------------
-> _clus_2 = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        360        39.5    3.059302         34         45