  • Creating all pairs of IDs in a group and calculating cosine similarity - IDs appear more than once as data can't be reshaped wide.

    Hi,

    I am trying to calculate cosine similarity for a large, unbalanced dataset.

    My data has the structure:
    year group_id component weight
    2020 23 a 0.2
    2020 23 b 0.8
    2020 24 a 0.3
    2020 24 b 0.3
    2020 24 c 0.4
    2019 23 b 1
    2019 25 c 1
    Now I need to create all pairs of group IDs within a year and, from each group's list of components and their weights, calculate the cosine similarity of each pair.
    I checked How can I create all pairs within groups? | Stata FAQ (ucla.edu) and Stata | FAQ: Expanding datasets to all possible pairs,
    but they do not seem to work in my case, as they would need each group ID to appear only once in a given year (group).
    I could potentially achieve that by reshaping the data wide so that each component becomes a variable. I tried, but the list of components is very large and changes over the years, so it did not work.

    Any idea how I could go about creating all pairs of group IDs in a year and calculating the cosine similarity for each pair?

    Thank you for any ideas on how to deal with this!
    Last edited by Da GXHI; 26 Jun 2022, 06:50.

  • #2
    I don't know what you mean when you say "...they would need the same group id to appear only once...," or "convert component to variable." Note also that "did not work" is not a popular description on StataList or other similar settings, as it doesn't tell anything about *what* the problem is, and thus leaves us trying to read your mind.

    Leaving aside those uncertainties, here is some code that will generate observations containing each possible pairing of groupId values that occurs within each year, which to my understanding is what you are saying you want:

    Code:
    clear
    input year groupId str1 component weight
    2020     23     a     0.2
    2020     23     b     0.8
    2020     24     a     0.3
    2020     24     b     0.3
    2020     24     c     0.4
    2019     23     b     1
    2019     25     c     1
    end
    //
    preserve
      rename (groupId component weight) =2
      tempfile file2
      save `file2'
    restore  
    //
    joinby year using `file2'
    // Eliminate self pairs and re-ordered pairs, which I presume you don't want.
    drop if (groupId >= groupId2)
    //
    order year groupId groupId2 // Facilitate inspection of the list
    sort groupId
    list
    I don't know anything about cosine similarity, but searching on / stata "cosine similarity" / yields a reasonable number of hits.
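
    For the similarity itself, here is a minimal sketch, assuming the measure wanted is the usual one: the dot product of two groups' weight vectors divided by the product of their Euclidean norms, with a component that a group lacks counted as weight 0. Joining on year and component, rather than on year alone, pairs up only the rows where both groups share a component, which are the only rows that contribute to the dot product. The variable names norm, prod, dot, and cosine are introduced here for illustration only.

    ```stata
    clear
    input year groupId str1 component weight
    2020     23     a     0.2
    2020     23     b     0.8
    2020     24     a     0.3
    2020     24     b     0.3
    2020     24     c     0.4
    2019     23     b     1
    2019     25     c     1
    end
    // Euclidean norm of each group's weight vector within a year
    bysort year groupId: egen double norm = total(weight^2)
    replace norm = sqrt(norm)
    //
    preserve
      rename (groupId weight norm) =2
      tempfile file2
      save `file2'
    restore
    // Match rows only where both groups have the same component in the same year
    joinby year component using `file2'
    // Eliminate self pairs and re-ordered pairs
    drop if (groupId >= groupId2)
    // Dot product = sum over shared components of weight * weight2
    gen double prod = weight * weight2
    collapse (sum) dot = prod (first) norm norm2, by(year groupId groupId2)
    gen double cosine = dot / (norm * norm2)
    list
    ```

    Pairs of groups with no component in common never appear in the result; under the definition assumed above their similarity is 0.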



    • #3
      @ #2 Thank you!
      By "did not work" I meant that I could not reshape wide, as the number of unique components is larger than the maximum number of observations that Stata SE allows.
      I tried your code as well, but it does not solve my issue, as the number of pairs of group IDs in my data exceeds the maximum number of observations in Stata SE. Any ideas on how I can split the data and still get all possible pairs of groups within a year would be welcome.
      I will try year by year. For a two-year period, for instance, I have these numbers of unique values:


      Variable        Obs   Unique      Mean   Min       Max   Label
      -------------------------------------------------------------------------------------------------------------------------
      group_id    1748403     1403   2340720     1   9099600
      component   1748403    66511         .     .         .
      -------------------------------------------------------------------------------------------------------------------------
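
      The year-by-year idea could be sketched roughly as follows. This is only a sketch under assumptions: it reuses the toy data and the variable name groupId from #2 (with the real data, the input block would be replaced by loading the actual file), it assumes cosine similarity means the dot product divided by the product of the Euclidean norms, and the names norm, prod, dot, cosine, long, other, and results are made up for illustration. Each pass loads one year, matches rows on component so that only shared components are paired, collapses to one row per pair, and appends that year's pairs to a results file. Whether the intermediate join fits in memory for the real data is not something this sketch can guarantee.

      ```stata
      clear
      input year groupId str1 component weight
      2020     23     a     0.2
      2020     23     b     0.8
      2020     24     a     0.3
      2020     24     b     0.3
      2020     24     c     0.4
      2019     23     b     1
      2019     25     c     1
      end
      tempfile long other results
      save `long'
      //
      levelsof year, local(years)
      local first = 1
      foreach y of local years {
          // Load a single year only
          use if year == `y' using `long', clear
          bysort groupId: egen double norm = total(weight^2)
          replace norm = sqrt(norm)
          preserve
            rename (groupId weight norm) =2
            save `other', replace
          restore
          // Pair rows that share a component within this year
          joinby year component using `other'
          drop if (groupId >= groupId2)
          // Skip years in which no two groups share a component
          if _N == 0 continue
          gen double prod = weight * weight2
          collapse (sum) dot = prod (first) norm norm2, by(year groupId groupId2)
          gen double cosine = dot / (norm * norm2)
          // Stack this year's pairs onto the pairs from earlier years
          if !`first' append using `results'
          save `results', replace
          local first = 0
      }
      use `results', clear
      list
      ```

      After the collapse there is at most one row per pair of groups per year, so the stored result is far smaller than the full cross of component rows. Pairs with no shared component are absent and have similarity 0 under the assumed definition.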
