  • Making cluster kmeans completely reproducible

    I am performing kmeans cluster analysis on several sets of variables. I would like to have the cluster analysis generate the same assignments for each distinct set of variables every time I run my do-file; however, this is not always the case (even after using set seed and specifying the seed number in the kmeans command).

    Code:
    sysuse nlsw88.dta, clear
    set seed 123
    cluster kmeans age race married grade wage hours ttl_exp tenure, k(2) start(krandom(123))
    cluster kmeans age race married grade wage hours ttl_exp, k(2) start(krandom(123))
    
    bysort _clus_1: summ age if _clus_1 != .
    bysort _clus_2: summ age if _clus_2 != .
    Results (Run 1):
    Code:
    . bysort _clus_1: summ age if _clus_1 != .
    
    ---------------------------------------------------------------------------------------------
    -> _clus_1 = 1
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |        424    39.45991    3.057108         34         45
    
    ---------------------------------------------------------------------------------------------
    -> _clus_1 = 2
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |        678    39.22714    2.985198         34         46
    
    ---------------------------------------------------------------------------------------------
    -> _clus_1 = 3
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |      1,123    38.98842    3.102361         34         45
    
    . bysort _clus_2: summ age if _clus_2 != .
    
    ---------------------------------------------------------------------------------------------
    -> _clus_2 = 1
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |        353    39.19263    3.018363         34         46
    
    ---------------------------------------------------------------------------------------------
    -> _clus_2 = 2
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |      1,527    39.06156    3.067631         34         46
    
    ---------------------------------------------------------------------------------------------
    -> _clus_2 = 3
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |        360        39.5    3.059302         34         45
    Results (Run 2):
    Code:
    . bysort _clus_1: summ age if _clus_1 != .
    
    ---------------------------------------------------------------------------------------------
    -> _clus_1 = 1
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |        424    39.45991    3.057108         34         45
    
    ---------------------------------------------------------------------------------------------
    -> _clus_1 = 2
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |        678    39.22714    2.985198         34         46
    
    ---------------------------------------------------------------------------------------------
    -> _clus_1 = 3
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |      1,123    38.98842    3.102361         34         45
    
    
    . bysort _clus_2: summ age if _clus_2 != .
    
    ---------------------------------------------------------------------------------------------
    -> _clus_2 = 1
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |        369    39.23306    3.026135         34         46
    
    ---------------------------------------------------------------------------------------------
    -> _clus_2 = 2
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |      1,511     39.0503    3.065745         34         46
    
    ---------------------------------------------------------------------------------------------
    -> _clus_2 = 3
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             age |        360        39.5    3.059302         34         45
    For _clus_2 the assignments (# obs. and means) differ between the first and second runs.

    I understand kmeans clustering does not always have a unique solution, but is there any way to avoid getting different solutions when I run my script multiple times? I cannot figure out why this happens, especially why it always seems to be an issue for the kmeans command that comes second in the script (and not the first).

    Thank you!

  • #2
    I can't reproduce this behavior on my Stata 17 instance. When I run the code at the top of #1 multiple times, I get exactly the same results each time. When I set k(3) in both cluster commands, I consistently reproduce your first set of clusters, albeit with the clusters in a different order (for _clus_1, my cluster 1 is your cluster 3 and my cluster 3 is your cluster 1).


    • #3
      I cannot reproduce the error you report.
      See the following example:

      Code:
      sysuse nlsw88.dta, clear
      set seed 123
      cluster kmeans age race married grade wage hours ttl_exp tenure, k(3) start(krandom(123))
      cluster kmeans age race married grade wage hours ttl_exp, k(3) start(krandom(123))
      cluster kmeans age race married grade wage hours ttl_exp tenure, k(3) start(krandom(123))
      cluster kmeans age race married grade wage hours ttl_exp, k(3) start(krandom(123))
      tab _clus_2 _clus_4
      tab _clus_1 _clus_3
      Also, your code produces 2 clusters, but you report 3, so perhaps there is something else you are doing in your program.
      HTH
      F


      • #4
        FernandoRios, you are correct that I made a mistake in the number of clusters in my code. I can't edit my original post, but the correct number of clusters should be 3, not 2.

        I myself am now having trouble replicating the issue. I have noticed it mostly when running the code on different days after closing Stata, rather than running a do-file several times consecutively. Going forward I'll use a different option that takes as much randomness out of the process as possible, such as sorting the data and then using firstk/everykth.
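
        A minimal sketch of that idea, assuming idcode uniquely identifies observations in nlsw88 and that a deterministic starting rule is acceptable for the analysis. The names km_full and km_notenure are hypothetical labels chosen for illustration:

        ```stata
        sysuse nlsw88.dta, clear

        * Put the data in a fixed, unambiguous order before clustering.
        * isid exits with an error if idcode does not uniquely identify rows.
        isid idcode
        sort idcode

        * start(everykth) builds the starting centers from the data order
        * alone (observations 1, 1+k, 1+2k, ... form the first group, and so
        * on), so no random-number seed is involved.
        * start(firstk) is the analogous option that uses the first k
        * observations as starting centers.
        cluster kmeans age race married grade wage hours ttl_exp tenure, ///
            k(3) start(everykth) name(km_full)
        cluster kmeans age race married grade wage hours ttl_exp, ///
            k(3) start(everykth) name(km_notenure)

        * Explicit names avoid depending on the automatic _clus_# numbering
        tabulate km_full km_notenure
        ```

        With this setup the only inputs to the starting centers are the data and the sort key, so a change in results across sessions would point to a change in the data rather than in the random-number state.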


        • #5
          It definitely sounds like there's some difference in the state of the data across different days. I was also thinking that sorting the data differently might lead to different results, and since you sort in the script above, I played around with it a little. Long story short, it doesn't seem like order matters. Consider the following script:

          Code:
          sysuse nlsw88.dta, clear
          set seed 123
          
          * Round 1: shuffle the data into a random order, then cluster
          generate order1 = runiform()
          sort order1
          cluster kmeans age race married grade wage hours ttl_exp, k(3) start(krandom(123))
          cluster kmeans age race married grade wage hours ttl_exp tenure, k(3) start(krandom(123))
          * summarize stores the observation count in r(N), not r(obs).
          * After a by-prefix, r() holds the results of the last by-group
          * only, so each scalar is the size of the highest-numbered cluster.
          bysort _clus_1: summ age if _clus_1 != .
          scalar clust_1_obs = r(N)
          bysort _clus_2: summ age if _clus_2 != .
          scalar clust_2_obs = r(N)
          
          * Round 2: a second random order
          generate order2 = runiform()
          sort order2
          cluster kmeans age race married grade wage hours ttl_exp, k(3) start(krandom(123))
          cluster kmeans age race married grade wage hours ttl_exp tenure, k(3) start(krandom(123))
          bysort _clus_3: summ age if _clus_3 != .
          scalar clust_3_obs = r(N)
          bysort _clus_4: summ age if _clus_4 != .
          scalar clust_4_obs = r(N)
          
          * Round 3: a third random order
          generate order3 = runiform()
          sort order3
          cluster kmeans age race married grade wage hours ttl_exp, k(3) start(krandom(123))
          cluster kmeans age race married grade wage hours ttl_exp tenure, k(3) start(krandom(123))
          bysort _clus_5: summ age if _clus_5 != .
          scalar clust_5_obs = r(N)
          bysort _clus_6: summ age if _clus_6 != .
          scalar clust_6_obs = r(N)
          
          * Compare the stored counts across the three orderings
          assert clust_1_obs == clust_3_obs & clust_1_obs == clust_5_obs
          assert clust_2_obs == clust_4_obs & clust_2_obs == clust_6_obs
          Granted, the assertions at the end don't guarantee that the same observations are assigned to the same cluster, but it looks like the number of observations assigned to a cluster stays the same across different random orders of observations.
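
          One way to check membership rather than counts, sketched here under the assumption that two partitions count as the same if they agree up to a relabeling of the cluster numbers. The names run_a, run_b, order_a, and order_b are hypothetical and chosen only for this illustration:

          ```stata
          sysuse nlsw88.dta, clear
          set seed 123

          * First run, with the data in a random order
          generate double order_a = runiform()
          sort order_a
          cluster kmeans age race married grade wage hours ttl_exp, ///
              k(3) start(krandom(123)) name(run_a)

          * Second run, with the data in a different random order
          generate double order_b = runiform()
          sort order_b
          cluster kmeans age race married grade wage hours ttl_exp, ///
              k(3) start(krandom(123)) name(run_b)

          * Visual check: if the partitions match, each row of the table
          * has all of its observations in a single column.
          tabulate run_a run_b

          * Programmatic check: within each run_a cluster, run_b must be
          * constant, and vice versa. assert stops the do-file with an
          * error if any observation changed cluster between the two runs.
          bysort run_a (run_b): assert run_b[1] == run_b[_N]
          bysort run_b (run_a): assert run_a[1] == run_a[_N]
          ```

          If both assertions pass, the two runs grouped the observations identically even if the cluster numbers were permuted. If one fails, the tabulate output shows which clusters were split.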
