  • Clustering question

    Hello,

    I am trying to create clusters based on 2023 data, which I first standardize. However, when I compare the clusters in terms of unemployment rates and other variables, they do not seem to make much sense. I'm not sure if I'm doing something wrong. I would appreciate any help.

    These are the data:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int year byte provinceid double(URs UR EmpRatio LFratio MR)
    2018  1              12.35              5.375 30.430211867698837  32.85178320571595  .5426973157231925
    2019  1 11.658029409924163  4.752437353991695  32.05721011261588 34.563086017470404  .8540473720531603
    2020  1 12.730342460007845   6.50335637452168  32.98489487756435  35.33847901234444 1.2306175747293717
    2021  1  9.373707955818615  5.202815308821132 35.584454873869184  37.22215777258736  .8819009515600033
    2022  1  7.474195346486567   4.23872446827472  33.57272579573623 34.746707227414305  .9457833968206055
    2023  1  6.604579937262361 3.5800130679269504 31.652628456022125 32.677683981134145  .8885859699462477
    2024  1  5.090213819352838 2.4773771650518834  29.18454629495786  29.98798770355475                  .
    2018  2             10.625              4.425 26.998325032513176  28.87121583197143  .5940687153860856
    2019  2 11.397258302443111   4.66878242750713 29.032677375995096 31.237413545175848  .9121220494100961
    2020  2 13.242822876800982 7.7737876140527895 29.517196291212766 31.377913671918506  1.298947538994027
    2021  2  11.90852428936289  7.309834126005734 31.311644147147188  32.94622398335078  .9521217777559889
    2022  2  9.748424419155176  5.425124159448911 32.467840274754536 34.023139685192085 1.0121008362618305
    2023  2  8.706860063900347  4.771663842221802  32.00446487555084  33.38401923574991  .7235251074953089
    2024  2  8.470093095659127  4.038823216793128 30.439935705145338  31.91363511954331                  .
    2018  3             19.525               9.95  32.62587421704779  36.50773498906684  .4806960307931351
    2019  3 16.732333084043468  8.486406062689364 36.519613444321934  40.13612004842655  .7836311622760433
    2020  3  17.33530510286297   9.53548513717227  39.98327067005958  43.75588863295183 1.1467203630463205
    2021  3  14.86652508913857   8.72206073749032  41.54229317597175 44.540586617802276  .8095905314965542
    2022  3 13.025327426107197  7.572034102818827 38.859036335091545   41.2954895820473  .8824793726417115
    2023  3 10.901821015618438  6.411935501982576 36.320405127467865 38.150683396789056 1.0653874159786303
    2024  3  9.453383184500643  5.107710168557733  34.32088114246889  35.96806943408799                  .
    2018  4 15.624999999999998  7.574999999999999   25.5584885946506 27.996957728718012 .48914706216077314
    2019  4  6.103323503305055  2.820793601511717 29.309978291176243  30.33462456995147  .8226533335650124
    2020  4 13.845712680557586  7.272316063577074 29.602060620201076 31.860637542960557 1.1860533229535113
    2021  4 10.981501806323811  6.054023107035276  30.76654127135598 32.469574683979396  .8614781319867313
    2022  4  9.965853056743356  5.512639219405295 27.460159257237333 28.818376837250685  .9272286770499407
    2023  4  9.523446014091498  5.163216688633633 26.986186913184827  28.28670023268507  .9595429885795778
    2024  4  10.16489365797814  4.958525541005978 25.641048980966513  27.12706870461413                  .
    2018  5              7.525                3.6  27.85262324011244 29.034797300317262  .5513727154745637
    2019  5   9.27417248733516  4.107987592393874  29.80467038175889  31.50183250357534   .841617497375321
    2020  5 10.770267374889949  5.901530412045825 31.059182700522953  32.75389797536409 1.2137075521057565
    2021  5  9.833514783744137  5.230685934875252  33.24829609868632 34.945558846499154  .8612021965910904
    2022  5 7.6804407594422175  4.377384809000454  33.42635338639271   34.6222983883267  .9209674940522476
    2023  5  6.452390849083533 3.4001033404796894  32.46960006427522  33.52902376932861  .9344399861571077
    2024  5  5.472160522648359 2.4234543885742403 30.492457768997195  31.47589866384184                  .
    2018  6                 14              8.075  35.81512042015576  38.28261563514906 .48882160806925656
    2019  6 13.979615610669871  8.353215825113447  39.17925352307709  41.74188033747022  .7955533408894084
    2020  6 16.482947217667768 10.035485112995369  39.84376179482921 42.919674266851935  1.191071646303972
    2021  6 12.000707426927493  7.715183466652195  41.39910614910515   43.4152230535524  .8447515378475231
    2022  6   11.3074287647264  7.109505253296397 39.440409320414524  41.30717019204819  .9187472218222895
    2023  6 10.079168955886923   6.09419113350804   38.2748958525581  39.97110583366338  .9845643774622804
    2024  6 11.905536243177275  5.682197871566705  36.28526228143214  38.84859549727798                  .
    2018  7               15.4  9.149999999999999 39.059729811913954  41.94534814908255 .40170933558548927
    2019  7  5.731793014411348  3.583087040110987 41.851720604150906 42.805669395291446  .6926781030142755
    2020  7 16.196496586941862 10.601693232320926   42.4385110202734  45.27174727116201 1.0499162483552231
    2021  7  15.69096015318504 10.407712670312467  43.46470981261579 46.188436937589856  .7090543389948297
    2022  7 11.035023123035371 7.0255287929532395  41.21768246750106 43.075290595485185  .7883080724465613
    2023  7  7.413147720663639 4.7819804752729915  39.55979104990613  40.68401575225343 1.3024064841216072
    2024  7  7.979406440895296  3.880797939287378  36.35874146552447   37.9781642611691                  .
    2018  8             16.325              8.525 31.701885171453437  34.65706538462746  .4718150428682777
    2019  8  8.880533277374058  4.984529851973081  36.00476114430872  37.54422003047602   .782199287678461
    2020  8 14.689139692335464   8.57769505114198  36.54836957116239  39.16659820646748 1.1395496426608027
    2021  8 11.670833102952749  6.796071665082726 38.153209306230615  40.25883081258736  .7928810530675923
    2022  8 12.020921337463786  7.206251224814134  34.86741415354112  36.77553935083211  .8649305066444393
    2023  8  9.746570131689271 5.6168055739602565 31.875475723761355  33.33401541689895  1.096868091335806
    2024  8  7.311143154160881  3.435266416134927 29.059318720981913  30.27446303591142                  .
    2018  9               19.8             11.475  38.14744728508535   42.1072664702267 .36279232524139254
    2019  9 11.221446306755654 6.8409323059523555  37.01599128481703  38.84243541272009  .6213864187201215
    2020  9 14.062455446313017  9.221750564946015 29.108777643952116  30.74842190844095  .9716202453987731
    2021  9 11.402491651959423  7.599050651100434 32.530027886959004 33.926523613885195  .6221325532258855
    2022  9 11.260713621473545  7.429537539415032  32.58917665015485  33.99616198000607   .713867059014291
    2023  9 11.178906858870109  7.306630012287824  39.34563966576502  41.06096655607506  1.490878723937556
    2024  9 11.491316367861266  6.491295998264713  37.55082726335536 39.672143427002396                  .
    2018 10              18.45             12.075  40.26090916253474  43.40822119087513  .5034389131178759
    2019 10 20.750355507362947 13.731256604355586   41.8408836423139 45.546708474145284  .8195796168797317
    2020 10 15.047428433773351  10.42496627584661  41.60413852844226  43.86791409656552 1.2109163771769067
    2021 10 13.875063492902845  9.916045435633611   42.6709196149774 44.632429813153955  .8611524701988581
    2022 10 12.630198166054672  9.435613146023082  40.18387188565162  41.65315294705029  .9319953835291722
    2023 10 11.943106658334802  8.244642313306006 38.339864436712354  39.95017132162859   .942167355424693
    2024 10  8.805849565591332  5.104004519041044   36.2907493678204   37.7639001142544                  .
    2018 11              9.725              5.675  28.70745301122956 29.995353146322106  .3772994502370715
    2019 11 11.899825073190566  6.769774363324777  32.82033642141434  34.73145623810695  .6360540073729831
    2020 11 14.137163764483983    9.0248329977811  33.30670226819585  35.28980562476397  .9789095857358479
    2021 11 10.040631577645028  7.238704846020116 34.648925857816934  35.72812142450103  .6202492048979428
    2022 11 7.8529749517732155  5.813101617561455 32.445610982118964  33.16386462752958  .7083312727861943
    2023 11 6.8100833364683755  4.714870928536264 32.236969761459164 32.961761685913935  1.485751681630545
    2024 11  5.282375623833831 3.0903460561874656 30.044266498845996 30.739574483368138                  .
    2018 12                 14 7.9750000000000005  44.48625416168506 47.602878363128696  .5674514874845675
    2019 12  18.05720975077399 10.868087888463553  48.59402194475974  52.85734205477962  .9057615040531973
    2020 12  18.51236639340078 11.587765562267762 50.231965523702456  54.50054340257592  1.292301898747369
    2021 12 10.491399539248423  6.565649585548437  52.29146780444608  54.58491476100109  .9547937535895988
    2022 12 12.268441934351898  7.830361696603655  50.18800682562933  52.72686976361716 1.0382352774505745
    2023 12 12.653397019524489  8.023062746289707  48.40285851923488  50.96874439317373  .7108679896970983
    2024 12 10.728089904639276  5.268364250412569 46.919170199986496  49.78867077348445                  .
    2018 13               27.6             14.025 39.526141209919444  46.93729268677933  .2999620250230554
    2019 13 31.164065943449213 15.817669717415992  43.63214397835865  53.35956583818449  .5446477707974833
    2020 13 17.887702095495243   9.57165572949236 44.430548942443934 48.930319555391954  .8723276459268762
    2021 13  15.94120761556903   9.03609909054121  45.63227509411646 49.380792088362014  .5209015406016587
    2022 13  12.05484421636055  6.763935727259403  43.28041723077883  45.88423008320619  .6034972634601058
    2023 13  9.994355747150882  5.807391728719686  42.48867025798112  44.46519667516215 1.7342978107324887
    2024 13  6.726459462550847  3.590451914985578   41.6702756076013  43.07129778454005                  .
    end

    And, this is the code I used:

    Code:
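    * Drop 2024 (MR is missing for that year)
    drop if year==2024
    
    * Standardize each variable within province, using that province's own 2018-2023 mean and SD
    foreach var in URs UR EmpRatio LFratio MR {
        bysort provinceid : egen mean_`var' = mean(`var')
        by provinceid : egen sd_`var'  = sd(`var')
        gen st_`var' = (`var' - mean_`var') / sd_`var'
    }
    
    * Keep only the 2023 z-scores and form 3 clusters by k-means
    keep if year==2023
    
    cluster kmeans st_URs st_UR st_EmpRatio st_LFratio st_MR, k(3)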
    Thank you.

  • #2
    "However, when I compare the clusters in terms of unemployment rates and other variables, they do not seem to make much sense."

    Can you provide more details? How did you compare the clusters? Which command did you use? And what result would make sense to you?
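
One way to make such a comparison concrete is to tabulate the original variables by cluster. This is a minimal sketch, assuming the data and code from #1 have already been run so that only the 2023 observations and the st_* variables are in memory. The name grp3 and the seed 2023 are arbitrary choices made here, not anything from the thread.

```stata
* Assumption: data and st_* variables from #1 are in memory (2023 only).
* grp3 is a hypothetical name for the cluster solution and its grouping variable.
* start(krandom(2023)) fixes the seed used to pick the random starting centers.
cluster kmeans st_URs st_UR st_EmpRatio st_LFratio st_MR, k(3) name(grp3) start(krandom(2023))

* Cluster sizes
tabulate grp3

* Means and SDs of the original (unstandardized) variables by cluster
tabstat URs UR EmpRatio LFratio MR, by(grp3) statistics(mean sd n) format(%6.2f)

* Which provinces ended up in which cluster
sort grp3 provinceid
list provinceid grp3 URs UR LFratio, sepby(grp3) noobs
```

Without a seed, the default random starting centers can differ from run to run, so cluster membership may change between runs.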


    • #3
      I was essentially trying to update the results of an analysis conducted some time ago. The provinces were clustered using the same variables. In that analysis, for instance, the first cluster consisted of provinces with low unemployment rates, both overall and for nationals (variables URs and UR), and a high share of nationals in the total labor force (LFratio). The third cluster exhibited the opposite characteristics, while the second cluster fell somewhere in between. When I perform the clustering for the more recent period, I can't characterize the resulting clusters as clearly as was possible for the initial period (please see attached).
      Attached Files


      • #4
        If I understand this correctly, you want to compare 13 provinces (and territories of Canada???) on their z-scores for 2023 on 5 variables, where the z-scores are relative to each province's own means and SDs for 2018-23.

        Out of curiosity I plotted each province as a graphical profile using fabplot from the Stata Journal. https://journals.sagepub.com/doi/pdf...6867X211025838

        There's almost certainly a better order for provinces (not quasi-alphabetical) and for various measures too. Are you expecting to see 3 clusters here?

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input int year byte provinceid double(URs UR EmpRatio LFratio MR)
        2018  1              12.35              5.375 30.430211867698837  32.85178320571595  .5426973157231925
        2019  1 11.658029409924163  4.752437353991695  32.05721011261588 34.563086017470404  .8540473720531603
        2020  1 12.730342460007845   6.50335637452168  32.98489487756435  35.33847901234444 1.2306175747293717
        2021  1  9.373707955818615  5.202815308821132 35.584454873869184  37.22215777258736  .8819009515600033
        2022  1  7.474195346486567   4.23872446827472  33.57272579573623 34.746707227414305  .9457833968206055
        2023  1  6.604579937262361 3.5800130679269504 31.652628456022125 32.677683981134145  .8885859699462477
        2024  1  5.090213819352838 2.4773771650518834  29.18454629495786  29.98798770355475                  .
        2018  2             10.625              4.425 26.998325032513176  28.87121583197143  .5940687153860856
        2019  2 11.397258302443111   4.66878242750713 29.032677375995096 31.237413545175848  .9121220494100961
        2020  2 13.242822876800982 7.7737876140527895 29.517196291212766 31.377913671918506  1.298947538994027
        2021  2  11.90852428936289  7.309834126005734 31.311644147147188  32.94622398335078  .9521217777559889
        2022  2  9.748424419155176  5.425124159448911 32.467840274754536 34.023139685192085 1.0121008362618305
        2023  2  8.706860063900347  4.771663842221802  32.00446487555084  33.38401923574991  .7235251074953089
        2024  2  8.470093095659127  4.038823216793128 30.439935705145338  31.91363511954331                  .
        2018  3             19.525               9.95  32.62587421704779  36.50773498906684  .4806960307931351
        2019  3 16.732333084043468  8.486406062689364 36.519613444321934  40.13612004842655  .7836311622760433
        2020  3  17.33530510286297   9.53548513717227  39.98327067005958  43.75588863295183 1.1467203630463205
        2021  3  14.86652508913857   8.72206073749032  41.54229317597175 44.540586617802276  .8095905314965542
        2022  3 13.025327426107197  7.572034102818827 38.859036335091545   41.2954895820473  .8824793726417115
        2023  3 10.901821015618438  6.411935501982576 36.320405127467865 38.150683396789056 1.0653874159786303
        2024  3  9.453383184500643  5.107710168557733  34.32088114246889  35.96806943408799                  .
        2018  4 15.624999999999998  7.574999999999999   25.5584885946506 27.996957728718012 .48914706216077314
        2019  4  6.103323503305055  2.820793601511717 29.309978291176243  30.33462456995147  .8226533335650124
        2020  4 13.845712680557586  7.272316063577074 29.602060620201076 31.860637542960557 1.1860533229535113
        2021  4 10.981501806323811  6.054023107035276  30.76654127135598 32.469574683979396  .8614781319867313
        2022  4  9.965853056743356  5.512639219405295 27.460159257237333 28.818376837250685  .9272286770499407
        2023  4  9.523446014091498  5.163216688633633 26.986186913184827  28.28670023268507  .9595429885795778
        2024  4  10.16489365797814  4.958525541005978 25.641048980966513  27.12706870461413                  .
        2018  5              7.525                3.6  27.85262324011244 29.034797300317262  .5513727154745637
        2019  5   9.27417248733516  4.107987592393874  29.80467038175889  31.50183250357534   .841617497375321
        2020  5 10.770267374889949  5.901530412045825 31.059182700522953  32.75389797536409 1.2137075521057565
        2021  5  9.833514783744137  5.230685934875252  33.24829609868632 34.945558846499154  .8612021965910904
        2022  5 7.6804407594422175  4.377384809000454  33.42635338639271   34.6222983883267  .9209674940522476
        2023  5  6.452390849083533 3.4001033404796894  32.46960006427522  33.52902376932861  .9344399861571077
        2024  5  5.472160522648359 2.4234543885742403 30.492457768997195  31.47589866384184                  .
        2018  6                 14              8.075  35.81512042015576  38.28261563514906 .48882160806925656
        2019  6 13.979615610669871  8.353215825113447  39.17925352307709  41.74188033747022  .7955533408894084
        2020  6 16.482947217667768 10.035485112995369  39.84376179482921 42.919674266851935  1.191071646303972
        2021  6 12.000707426927493  7.715183466652195  41.39910614910515   43.4152230535524  .8447515378475231
        2022  6   11.3074287647264  7.109505253296397 39.440409320414524  41.30717019204819  .9187472218222895
        2023  6 10.079168955886923   6.09419113350804   38.2748958525581  39.97110583366338  .9845643774622804
        2024  6 11.905536243177275  5.682197871566705  36.28526228143214  38.84859549727798                  .
        2018  7               15.4  9.149999999999999 39.059729811913954  41.94534814908255 .40170933558548927
        2019  7  5.731793014411348  3.583087040110987 41.851720604150906 42.805669395291446  .6926781030142755
        2020  7 16.196496586941862 10.601693232320926   42.4385110202734  45.27174727116201 1.0499162483552231
        2021  7  15.69096015318504 10.407712670312467  43.46470981261579 46.188436937589856  .7090543389948297
        2022  7 11.035023123035371 7.0255287929532395  41.21768246750106 43.075290595485185  .7883080724465613
        2023  7  7.413147720663639 4.7819804752729915  39.55979104990613  40.68401575225343 1.3024064841216072
        2024  7  7.979406440895296  3.880797939287378  36.35874146552447   37.9781642611691                  .
        2018  8             16.325              8.525 31.701885171453437  34.65706538462746  .4718150428682777
        2019  8  8.880533277374058  4.984529851973081  36.00476114430872  37.54422003047602   .782199287678461
        2020  8 14.689139692335464   8.57769505114198  36.54836957116239  39.16659820646748 1.1395496426608027
        2021  8 11.670833102952749  6.796071665082726 38.153209306230615  40.25883081258736  .7928810530675923
        2022  8 12.020921337463786  7.206251224814134  34.86741415354112  36.77553935083211  .8649305066444393
        2023  8  9.746570131689271 5.6168055739602565 31.875475723761355  33.33401541689895  1.096868091335806
        2024  8  7.311143154160881  3.435266416134927 29.059318720981913  30.27446303591142                  .
        2018  9               19.8             11.475  38.14744728508535   42.1072664702267 .36279232524139254
        2019  9 11.221446306755654 6.8409323059523555  37.01599128481703  38.84243541272009  .6213864187201215
        2020  9 14.062455446313017  9.221750564946015 29.108777643952116  30.74842190844095  .9716202453987731
        2021  9 11.402491651959423  7.599050651100434 32.530027886959004 33.926523613885195  .6221325532258855
        2022  9 11.260713621473545  7.429537539415032  32.58917665015485  33.99616198000607   .713867059014291
        2023  9 11.178906858870109  7.306630012287824  39.34563966576502  41.06096655607506  1.490878723937556
        2024  9 11.491316367861266  6.491295998264713  37.55082726335536 39.672143427002396                  .
        2018 10              18.45             12.075  40.26090916253474  43.40822119087513  .5034389131178759
        2019 10 20.750355507362947 13.731256604355586   41.8408836423139 45.546708474145284  .8195796168797317
        2020 10 15.047428433773351  10.42496627584661  41.60413852844226  43.86791409656552 1.2109163771769067
        2021 10 13.875063492902845  9.916045435633611   42.6709196149774 44.632429813153955  .8611524701988581
        2022 10 12.630198166054672  9.435613146023082  40.18387188565162  41.65315294705029  .9319953835291722
        2023 10 11.943106658334802  8.244642313306006 38.339864436712354  39.95017132162859   .942167355424693
        2024 10  8.805849565591332  5.104004519041044   36.2907493678204   37.7639001142544                  .
        2018 11              9.725              5.675  28.70745301122956 29.995353146322106  .3772994502370715
        2019 11 11.899825073190566  6.769774363324777  32.82033642141434  34.73145623810695  .6360540073729831
        2020 11 14.137163764483983    9.0248329977811  33.30670226819585  35.28980562476397  .9789095857358479
        2021 11 10.040631577645028  7.238704846020116 34.648925857816934  35.72812142450103  .6202492048979428
        2022 11 7.8529749517732155  5.813101617561455 32.445610982118964  33.16386462752958  .7083312727861943
        2023 11 6.8100833364683755  4.714870928536264 32.236969761459164 32.961761685913935  1.485751681630545
        2024 11  5.282375623833831 3.0903460561874656 30.044266498845996 30.739574483368138                  .
        2018 12                 14 7.9750000000000005  44.48625416168506 47.602878363128696  .5674514874845675
        2019 12  18.05720975077399 10.868087888463553  48.59402194475974  52.85734205477962  .9057615040531973
        2020 12  18.51236639340078 11.587765562267762 50.231965523702456  54.50054340257592  1.292301898747369
        2021 12 10.491399539248423  6.565649585548437  52.29146780444608  54.58491476100109  .9547937535895988
        2022 12 12.268441934351898  7.830361696603655  50.18800682562933  52.72686976361716 1.0382352774505745
        2023 12 12.653397019524489  8.023062746289707  48.40285851923488  50.96874439317373  .7108679896970983
        2024 12 10.728089904639276  5.268364250412569 46.919170199986496  49.78867077348445                  .
        2018 13               27.6             14.025 39.526141209919444  46.93729268677933  .2999620250230554
        2019 13 31.164065943449213 15.817669717415992  43.63214397835865  53.35956583818449  .5446477707974833
        2020 13 17.887702095495243   9.57165572949236 44.430548942443934 48.930319555391954  .8723276459268762
        2021 13  15.94120761556903   9.03609909054121  45.63227509411646 49.380792088362014  .5209015406016587
        2022 13  12.05484421636055  6.763935727259403  43.28041723077883  45.88423008320619  .6034972634601058
        2023 13  9.994355747150882  5.807391728719686  42.48867025798112  44.46519667516215 1.7342978107324887
        2024 13  6.726459462550847  3.590451914985578   41.6702756076013  43.07129778454005                  .
        end
        
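        drop if year==2024
        
        * z-scores within province, relative to each province's own 2018-2023 mean and SD
        foreach var in URs UR EmpRatio LFratio MR {
            bysort provinceid : egen mean_`var' = mean(`var')
            by provinceid : egen sd_`var'  = sd(`var')
            gen st_`var' = (`var' - mean_`var') / sd_`var'
        }
        
        list
        
        keep if year==2023
        
        keep provinceid st_*
        
        * one observation per province and variable, ready for plotting profiles
        reshape long st_, i(provinceid) j(varname) string
        
        * numeric version of varname, labelled in the original variable order
        label def which 1 URs 2 UR 3 EmpRatio 4 LFratio 5 MR
        
        encode varname, gen(which) label(which)
        
        label var st_ "z-score for 2023"
        
        * fabplot is the Stata Journal command linked above and must be installed first
        fabplot line st_ which, by(provinceid, compact edgelabel) xla(1/5, glw(thick) glp(solid) labsize(medsmall) tlc(none) valuelabel) xtitle("") frontopts(lw(thick)) xsc(r(0.9 5.1))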
        [Figure canada.png: profiles of 2023 z-scores on the five variables, one panel per province, drawn with fabplot]



        • #5
          Thank you very much, Nick! You're right—I didn’t first check for the right number of clusters to consider, as the initial clustering was done by someone else and I assumed it was correct. What I wanted to do was group the provinces into clusters so that I could compute certain parameters for the three province clusters, rather than for each province separately. The data are actually for Saudi Arabia.
          Last edited by Ema Davies; 17 May 2025, 12:46.


          • #6
            Wild guesses look smart if they pay off, and not otherwise. My only extra thought is standard: every clustering method needs to be balanced with some check on variability within clusters and on how far clusters really are coherent and isolated.
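
A check of this kind could be sketched with Stata's cluster stop, which reports the Calinski-Harabasz pseudo-F for a cluster solution. The sketch below assumes the 2023 data and st_* variables from #1 are in memory. The names ks2 to ks5, the range of k, and the seed are arbitrary choices made here. With only 13 provinces any such index is at best a rough guide.

```stata
* Assumption: 2023 data with st_* variables from #1 are in memory.
* ks2 ... ks5 are hypothetical names for the k-means solutions.
* Run from a do-file, since /// line continuation is not allowed interactively.
forvalues k = 2/5 {
    cluster kmeans st_URs st_UR st_EmpRatio st_LFratio st_MR, ///
        k(`k') name(ks`k') start(krandom(2023))
    * Calinski-Harabasz pseudo-F; larger values suggest more distinct clusters
    cluster stop ks`k', rule(calinski)
}
```

Comparing the pseudo-F values across k gives one indication of whether three clusters are better separated than two, four, or five, but it does not replace looking at the profiles directly.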


            • #7
              True. Thank you again! I haven't worked much with clustering before. Do you think it's fine to perform clustering without standardizing the variables, since all the data are ratios, and the clusters I obtain make more sense when I don't use the standardized data?


              • #8
                Being ratios is neither here nor there as it doesn't guarantee identical level or spread. Before standardization your data look like this:

                Code:
                . moments URs-MR

                  n = 78       mean       SD   skewness   kurtosis
                --------------------------------------------------
                     URs     12.707    4.393      1.460      6.813
                      UR      7.400    2.581      0.746      3.680
                EmpRatio     36.573    6.127      0.429      2.559
                 LFratio     38.884    6.846      0.485      2.446
                      MR      0.856    0.277      0.460      3.437
                where moments is from SSC and just a wrapper for summarize. If you don't standardize, results will just depend on whichever variables differ most in spread.
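
For comparison, here is a sketch of a different standardization from the one in #1: z-scores computed across the 13 provinces within 2023, rather than within each province over 2018-23. Nobody in the thread proposed this. It is shown only as an assumption-laden alternative that puts the variables on a common scale while keeping the differences in level between provinces. The z_ prefix, the name kz3, and the seed are arbitrary choices made here.

```stata
* Assumption: the -dataex- data from #1 have been read in afresh.
keep if year == 2023

* z-scores across provinces within 2023 (not within province over time)
foreach var in URs UR EmpRatio LFratio MR {
    egen z_`var' = std(`var')
}

* Each z_ variable should now have mean 0 and SD 1 across the 13 provinces
summarize z_*

* kz3 is a hypothetical name; the seed makes the starting centers reproducible
cluster kmeans z_URs z_UR z_EmpRatio z_LFratio z_MR, k(3) name(kz3) start(krandom(2023))

* Cluster means on the original scale
tabstat URs UR EmpRatio LFratio MR, by(kz3) statistics(mean n) format(%6.2f)
```

With the within-province z-scores of #1, a province with persistently high unemployment can still score near zero in 2023, so clusters need not line up with levels of the original variables.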


                • #9
                  Thank you very much!
