Making cluster kmeans completely reproducible

Joe Moran

Join Date: May 2021
Posts: 14

06 Jun 2023, 09:51

I am performing kmeans cluster analysis on several sets of variables. I would like to have the cluster analysis generate the same assignments for each distinct set of variables every time I run my do-file; however, this is not always the case (even after using set seed and specifying the seed number in the kmeans command).

Code:

sysuse nlsw88.dta, clear
set seed 123
cluster kmeans age race married grade wage hours ttl_exp tenure, k(2) start(krandom(123))
cluster kmeans age race married grade wage hours ttl_exp, k(2) start(krandom(123))

bysort _clus_1: summ age if _clus_1 != .
bysort _clus_2: summ age if _clus_2 != .

Results (Run 1):

Code:

. bysort _clus_1: summ age if _clus_1 != .

---------------------------------------------------------------------------------------------
-> _clus_1 = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        424    39.45991    3.057108         34         45

---------------------------------------------------------------------------------------------
-> _clus_1 = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        678    39.22714    2.985198         34         46

---------------------------------------------------------------------------------------------
-> _clus_1 = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      1,123    38.98842    3.102361         34         45

. bysort _clus_2: summ age if _clus_2 != .

---------------------------------------------------------------------------------------------
-> _clus_2 = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        353    39.19263    3.018363         34         46

---------------------------------------------------------------------------------------------
-> _clus_2 = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      1,527    39.06156    3.067631         34         46

---------------------------------------------------------------------------------------------
-> _clus_2 = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        360        39.5    3.059302         34         45

Results (Run 2):

Code:

. bysort _clus_1: summ age if _clus_1 != .

---------------------------------------------------------------------------------------------
-> _clus_1 = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        424    39.45991    3.057108         34         45

---------------------------------------------------------------------------------------------
-> _clus_1 = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        678    39.22714    2.985198         34         46

---------------------------------------------------------------------------------------------
-> _clus_1 = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      1,123    38.98842    3.102361         34         45


. bysort _clus_2: summ age if _clus_2 != .

---------------------------------------------------------------------------------------------
-> _clus_2 = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        369    39.23306    3.026135         34         46

---------------------------------------------------------------------------------------------
-> _clus_2 = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      1,511     39.0503    3.065745         34         46

---------------------------------------------------------------------------------------------
-> _clus_2 = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        360        39.5    3.059302         34         45

For _clus_2 the assignments (# obs. and means) differ between the first and second runs.

I understand kmeans clustering does not always have a unique solution, but is there any way to avoid getting different solutions when I run my script multiple times? I cannot figure out why this happens, especially why it always seems to be an issue for the kmeans command that comes second in the script (and not the first).

Thank you!

Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#2

06 Jun 2023, 10:19

I can't reproduce this behavior on my Stata 17 instance. When I run the code at the top of #1 multiple times, I get exactly the same results. When I set k(3) in both cluster commands, I consistently reproduce your first set of clusters, albeit with a different ordering of the cluster labels (for _clus_1, my cluster 1 is your cluster 3, and my cluster 3 is your cluster 1).
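Since the cluster numbering can differ between otherwise matching solutions, one way to make labels easier to compare is to renumber the clusters by size. This is only a minimal sketch: the names km, km_size, and km_by_size are made up for illustration, and ties in cluster size are broken by the original label.

```stata
sysuse nlsw88.dta, clear
cluster kmeans age race married grade wage hours ttl_exp tenure, k(3) name(km) start(krandom(123))

* Size of each observation's cluster (missing for observations left unclustered)
bysort km: generate long km_size = _N if !missing(km)

* Renumber the clusters 1..k in ascending order of size
egen km_by_size = group(km_size km)

* Cross-tabulate the original and size-ordered labels
tabulate km km_by_size
```

Two solutions that contain the same groups would then carry the same km_by_size labels as long as no two clusters have the same size.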

FernandoRios

Join Date: Apr 2014
Posts: 2471

06 Jun 2023, 10:24

I cannot reproduce the error you report.
See the following example:

Code:

sysuse nlsw88.dta, clear
set seed 123
cluster kmeans age race married grade wage hours ttl_exp tenure, k(3) start(krandom(123))
cluster kmeans age race married grade wage hours ttl_exp, k(3) start(krandom(123))
cluster kmeans age race married grade wage hours ttl_exp tenure, k(3) start(krandom(123))
cluster kmeans age race married grade wage hours ttl_exp, k(3) start(krandom(123))
tab _clus_2 _clus_4
tab _clus_1 _clus_3

Also, your code produces 2 clusters, but you report 3, so perhaps there is something else you are doing in your program.
HTH
F


Joe Moran

Join Date: May 2021

Posts: 14
#4

06 Jun 2023, 11:42

FernandoRios you are correct that I have a mistake in the number of clusters I put in my code. I can't edit my original post, but the correct number of clusters should be 3, not 2.

I myself am now having trouble replicating the issue. I have noticed it mostly when running the code on different days after closing Stata, rather than running a do-file several times consecutively. Going forward I'll use a different option that takes as much randomness out of the process as possible, such as sorting the data and then using firstk/everykth.
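A minimal sketch of that approach, assuming idcode uniquely identifies observations in nlsw88 (isid checks this) and that start(everykth) is an acceptable starting rule for the analysis. The names km_a and km_b are made up for illustration.

```stata
sysuse nlsw88.dta, clear

* Fix the observation order on a unique key so the starting centers
* do not depend on how the data happen to be sorted
isid idcode
sort idcode

* Deterministic start: no random draw is used to choose the starting centers
cluster kmeans age race married grade wage hours ttl_exp, k(3) name(km_a) start(everykth)

* Repeat the identical command under a different name and compare the groups;
* assert stops the do-file if any observation is assigned differently
cluster kmeans age race married grade wage hours ttl_exp, k(3) name(km_b) start(everykth)
assert km_a == km_b
```

Sorting on a unique key matters here because firstk and everykth pick their starting centers from the current order of the observations.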

Daniel Schaefer

Join Date: Mar 2020
Posts: 814

06 Jun 2023, 14:56

It definitely sounds like there's some difference in the state of the data across different days. I was also thinking that sorting the data differently might lead to different results, and since you sort in the script above, I played around with it a little. Long story short, it doesn't seem like order matters. Consider the following script:

Code:

sysuse nlsw88.dta, clear
set seed 123

* First random ordering of the observations
generate order1 = runiform()
sort order1
cluster kmeans age race married grade wage hours ttl_exp, k(3) start(krandom(123))
cluster kmeans age race married grade wage hours ttl_exp tenure, k(3) start(krandom(123))
* summarize stores the observation count in r(N); after by:, r(N) holds
* the result from the last by-group that was summarized
bysort _clus_1: summ age if _clus_1 != .
scalar clust_1_obs = r(N)
bysort _clus_2: summ age if _clus_2 != .
scalar clust_2_obs = r(N)

* Second random ordering
generate order2 = runiform()
sort order2
cluster kmeans age race married grade wage hours ttl_exp, k(3) start(krandom(123))
cluster kmeans age race married grade wage hours ttl_exp tenure, k(3) start(krandom(123))
bysort _clus_3: summ age if _clus_3 != .
scalar clust_3_obs = r(N)
bysort _clus_4: summ age if _clus_4 != .
scalar clust_4_obs = r(N)

* Third random ordering
generate order3 = runiform()
sort order3
cluster kmeans age race married grade wage hours ttl_exp, k(3) start(krandom(123))
cluster kmeans age race married grade wage hours ttl_exp tenure, k(3) start(krandom(123))
bysort _clus_5: summ age if _clus_5 != .
scalar clust_5_obs = r(N)
bysort _clus_6: summ age if _clus_6 != .
scalar clust_6_obs = r(N)

* Compare the stored counts across the three orderings
assert clust_1_obs == clust_3_obs & clust_1_obs == clust_5_obs
assert clust_2_obs == clust_4_obs & clust_2_obs == clust_6_obs

Granted, the assertions at the end don't guarantee that the same observations are assigned to the same cluster, but it looks like the number of observations assigned to a cluster stays the same across different random orders of observations.
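A stricter check would compare the assignments themselves rather than counts. Here is a minimal sketch, not something I have verified the outcome of: two solutions describe the same partition, up to relabeling, when every label in one maps to exactly one label in the other. The names u1, u2, run_a, run_b, split_ab, and split_ba are made up for illustration.

```stata
sysuse nlsw88.dta, clear
set seed 123

* Two runs of the same analysis under different random sort orders
generate double u1 = runiform()
sort u1
cluster kmeans age race married grade wage hours ttl_exp, k(3) name(run_a) start(krandom(123))
generate double u2 = runiform()
sort u2
cluster kmeans age race married grade wage hours ttl_exp, k(3) name(run_b) start(krandom(123))

* Flag observations whose run_a cluster is spread over several run_b clusters,
* and the reverse; within each by-group the second variable is sorted, so
* comparing its first and last values detects any variation
bysort run_a (run_b): generate byte split_ab = run_b[1] != run_b[_N]
bysort run_b (run_a): generate byte split_ba = run_a[1] != run_a[_N]

* Zero flagged observations means the two partitions are identical
count if split_ab | split_ba
display "observations in clusters that do not map one-to-one: " r(N)
tabulate run_a run_b
```

If the count is zero, the cross-tabulation has exactly one nonzero cell in each row and column, even when the labels themselves are permuted.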
