  • Manual cosine similarity

    Hi all,

    I have a dataset that looks as follows:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(abw afg usa deu ita fra zwe)
     .741454        . 1.461039 1.001267 .1943143  2.02331 .0398947
           .        . .3747925 .5385166 .0849144 5.877031 .0285709
           .        . .0738599 .7603363 .0226571  .857273 .0087179
           . 18.95614  .017786 .0143572 .0284465 .4198427        .
           .        . .8904798 2.828969 .0991883 2.881671 .1284243
           . 1.416539 .9290766 .4017515 .4203883 1.874652 2.869313
           .        . 1.856129 .6188649 .6568865  1.41993 .0031837
           .        . 1.663221 .1651224 .2606736 .1124315        .
           .        .  1.92557 1.944612 .2591313 1.008808 .0009152
           .        . .0404407 .0995575 .1108418 .3217571        .
           .        . .0013033 .0409434 .5997524 1.358656        .
           . .0251377  2.57753 1.003716 .3661837 .8305855        .
    .3067514        . 1.782345 .5529616 .5006197 1.255523        .
           .        . .1589082 .9553338 .3953655   2.2924 4.690866
           .        .   1.1629 2.232825 2.180375 1.266226        .
           .        . .7648983 1.305261 7.404485 .6355171 .0048988
           .        . .2617094 .0920584 .9729702 1.179764 .8724387
           .        .  .113905 .1272192 .2726196 .7323977 .5955424
    .3801196        . 1.088369 .1138973 .0531836 .4928856 .6728174
           .        . .8000037 .2941958 .1238046 .2989464 .0403317
    1.141611        .   .11981 .7181695 .1175315 .4400395 .0048387
           .        . .4342655  .050796 .0566788 .1774852        .
           . .0076935 .5061271 .0647355 .4225548 .9126765 3.744993
           .        . .1529715 1.965783  .374045  2.88408 .0625705
           . .0043245 .9494857 .8637257 .1228782 1.304866 .0030619
           . .1076971 .2602129 2.666663 .1702058 4.490593 .0576577
           . .0218489 1.902009 1.623241 1.287235 4.142213        .
           .        . .2119714 .9972145 .2254935 1.846789        .
    .0935104        . .6060638  1.79086 3.575171 3.780148        .
           .        . 1.340025 .9433815 .2623216 1.311821        .
           .        . 1.092371 .8651692  1.55267 2.848075        .
           . .0683902 .0974375 .8052937 .4458265 .4921557  .005009
    3.068242 .3677903 .6442152 .0396263 .0620669 .2324825        .
           . 186.5314 .1593217 .4151692 1.089158 .0069329        .
           .        . .1857729 .6487441 .2865114 .0144557        .
           . 15.13819 1.584602 1.245454 .2969113 .6216406 .1240486
           .        . .6258647 .7340963 .7491329 1.190264        .
           . 8.214746 2.394617 1.728331 .2686805 1.422262  .675965
           .        . .2575313 .1947751 .1036397 .0405289 5.377438
           .        . .3082071 .0903282 1.134846 .5631537        .
           .        . 1.143275 .2491348 .9700597 .7607582        .
           . .1424412 1.693527 1.138611 .7729867 1.026993 .0788777
           .        . .0491821 .6067317 .3270745 .1916944        .
           . .0203444 .2438468 1.235431 2.697737 .4870493 1.106419
           . .8592663 .0074969 .0943188 .6325406 .0974359 3.788121
           .        . 1.029636 .4133649 3.189991 .3606794 .2055814
           . 1.286439 .7130128 1.316607 .3106628 4.895243 .0115367
           . 48.26962 .1796718 .2006245 .6979135 .7854371 .0052506
    .4129845 24.06891 .4094014 .1742994  .393596 .5824391 .0159784
           . 1.090524 1.926085 .2968288 2.077862 .9794411 .0503139
           . .0535793 2.267175 .4127252 3.831798 1.365533 .0202268
           . .0535103 .8008014 .3808078 2.658148 .7953464 .0443448
           . 61.51415 .0673397 .3424704 .1450131 .0812548        .
           .  .064489 .5492141  .184209 .3024622 1.904064  60.4346
           . .4768863  .523383 .2307394 1.528315 .5063033 .1521745
           . .3387621 .5397737 .2221521 .4519938 1.480504 .0038099
           .        . .1402389  .186204 .6723604 .2317982 .6406377
           . 1.384142 .5452676 .4905596 .4065935 .6845423  .057511
    .2126988 111.5496 1.005027 .0708469 .1580074 .5228525 .3768404
           . 4.295271 1.180351 .0402458 .0468681  .199299        .
           . .0751144 .0321736 .2368587 .0631166 .0664572        .
           . 67.47327 5.913808 .3787304 .9611498 .3215212 3.535911
           . .0043353 .0566947 .1725989 .0990791 .2482376 .2203667
      .07305 87.26086 .1884062  .160036 .0923593  .254534 2.369925
           . .5291861 .8751645 .2205954 .6575344 .1599879 12.77804
           . 319.0004 1.366433 .1868918 2.440692 .0866971 .0331624
           . 6.019503 .5271338 .2114366 1.422726 .5053566  .001738
           . 33.54792 1.429191   .16761 3.203291 1.714285        .
           . 36.38383 1.372626   .19864 1.647523 .4732858 .4145338
           .  27.8777 .7125611 .1931473 1.229108 .2157007 1.392092
           . 1.406787 .5295416 .2899002 .5235879 .3659199 .0341965
           . 58.13401 .3824869 .2608263 2.900931 .2896835 .1112577
           . 213.9636 1.147745 1.176857 .9096384 1.349026 .0470453
           .        . .3383946 .4405811 1.379996 .4751141        .
     .020292        . .3086558 .9567099 1.956649 1.175295 .1286791
           . .1459899 .1647384 .3888537  .042949 .2182894 21.37134
           . 74.12194 .1096884  .135567 .0208399 .3694384        .
           . .0311671 .1732453 .3628366 .1231593 .3596264 5.364007
           .        . .1069594 .4019192  .019809 1.101421        .
    .8981674        . .1984629 .1791211 .0449014 .3983868        .
           .        . .0225512 .0816425 .0187901 .1500897        .
           .        . .0648765 .1670168 .1232585 .1377006        .
    .3300822 295.3576 .0975025 .2492127 .5656269 .1840935        .
    .3240971 137.0345 .1970647 .5997772 .2907727  .472592 .0394539
           . .0010924  1.55123 .3663657 .0501714 3.121303 .0059488
           .        . .1344173 2.002861 .1952207 .6481689 .9819955
           . .8984936   .04635 .6621236 .0078273 5.469687        .
           .        . .1480232 .2280057 .0156805 .9013392        .
           .        . 4.866586 .0902065 .0426395 1.569099 .1629231
           .  .008196 .8692521 .0397515 .8609402 .0738182        .
           .        . 9.264128 .0210019 .0379215 .8366598 1.701094
           . 1.411607  1.13722 .5042111 .4269869 1.636323 .0285019
           .        . .3401277 .8570619 .8802016 .6029181        .
           . .0970124 1.161609 .6876258  2.06582 1.009094 .0487779
           .        . .9980388 .5694365 3.174448 2.189889 .2605656
           .        . .7770497 1.135748 .5931596 .8208499        .
           .        . 1.696591 3.626454 .7012593 .9930473        .
           .        . 1.665486 .4902525 1.161186 .9620214   1.7741
           .        . .7692504 1.177317 .0191527 4.310259 .2765781
           .        . .2500556 1.093923 .3098362 1.733142        .
    end
    I would like to compute the cosine similarity between all unique pairs of countries. For a single pair, say abw and afg, the calculation is as follows:

    Code:
    egen abw_dot_afg = total(abw*afg)
    egen abw_dot_abw = total(abw*abw)
    egen afg_dot_afg = total(afg*afg)
    gen cosine_similarity_abw_afg = abw_dot_afg/sqrt(abw_dot_abw*afg_dot_afg)
    What I would like to do is extend this calculation (in a loop, maybe?) so that it covers all possible pairs of countries (variables), i.e. abw-afg; abw-usa; abw-deu; abw-ita; abw-fra; abw-zwe; afg-usa; afg-deu; afg-ita; ...

    Thank you

  • #2
    Wrapping the commands that calculate cosine similarity for one pair of variables in loops that cover all possible pairs is straightforward:
    Code:
    ds abw-zwe
    local vbles `r(varlist)'
    local n_vbles: word count `vbles'
    
    forvalues i = 1/`n_vbles' {
        forvalues j = `=`i'+1'/`n_vbles' {
            local v: word `i' of `vbles'
            local w: word `j' of `vbles'
            // CALCULATION OF COSINE SIMILARITY FOR `v' AND `w' GOES HERE
        }
    }
    That said, your code for cosine similarity is incorrect when the data contain missing values. When two countries have missing values in different observations, x_dot_y (abw_dot_afg in your example) is computed only over the observations where both variables are non-missing, whereas x_dot_x and y_dot_y (abw_dot_abw and afg_dot_afg) are each computed over every observation where that one variable is non-missing, which is a different, larger subset. You need to add -if !missing(abw, afg)- [or, if you are working inside the loops that I show above, -if !missing(`v', `w')-] to each of the lines that calculate the dot products.
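
    For illustration, here is a minimal sketch of how the loops and the -if !missing()- condition might fit together. The output names cos_`v'_`w' and the use of -tempvar- for the intermediate dot products are assumptions made for this example, not part of the original code.

    ```stata
    ds abw-zwe
    local vbles `r(varlist)'
    local n_vbles: word count `vbles'

    forvalues i = 1/`n_vbles' {
        forvalues j = `=`i'+1'/`n_vbles' {
            local v: word `i' of `vbles'
            local w: word `j' of `vbles'
            // Temporary holders for the three dot products (hypothetical names)
            tempvar vw vv ww
            // Use only observations where BOTH variables are non-missing
            egen double `vw' = total(`v'*`w') if !missing(`v', `w')
            egen double `vv' = total(`v'*`v') if !missing(`v', `w')
            egen double `ww' = total(`w'*`w') if !missing(`v', `w')
            gen double cos_`v'_`w' = `vw'/sqrt(`vv'*`ww')
            drop `vw' `vv' `ww'
        }
    }
    ```

    Each cos_`v'_`w' is constant across the observations used for that pair and missing elsewhere, because -egen- with an -if- condition leaves the excluded observations missing. A pair with no jointly observed rows ends up entirely missing.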


    • #3
      I think that what is wanted here can also be done with the built-in features documented at -help matrix dissimilarity- and then -help measure option-. The "angular" option appears to be what is requested:
      Code:
      // -variables- compares variables (countries); the default compares observations
      matrix dissimilarity D = abw-zwe, angular variables
      If the results need to be in a set of variables, -svmat- can create them from the matrix.
      As Clyde noted, there are a lot of missing value issues here, and that will perhaps affect the usefulness of this approach.
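
      As a rough sketch of the -svmat- step, assuming the matrix D from the command above already exists, the following lists it and copies it into new variables. The stub sim_ is a hypothetical name chosen for this example.

      ```stata
      // Inspect the matrix; its dimensions show how many observations
      // or variables were compared
      matrix list D

      // Copy each column of D into a new variable: sim_1, sim_2, ...
      // Row i of D is stored in observation i of the dataset
      svmat double D, names(sim_)
      ```

      The new variables are filled only in the first rows of the dataset (one per matrix row), so they sit alongside the original data and do not line up with it observation by observation.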


      • #4
        Thank you all for the replies!
