Creating all pairs of Ids in a group and calculating cosine similarity - Ids appear more than once as data can't be reshaped wide.

Da GXHI

Join Date: Nov 2020

Posts: 17
#1

26 Jun 2022, 05:55

Hi,

I am trying to calculate cosine similarity for a large, unbalanced dataset.

My data have the structure:

year   group Id   component   weight
2020   23         a           0.2
2020   23         b           0.8
2020   24         a           0.3
2020   24         b           0.3
2020   24         c           0.4
2019   23         b           1
2019   25         c           1

Now I need to create all pairs of group Ids within a year and, from the given list and weights of components, calculate the cosine similarity of each pair.
I checked "How can I create all pairs within groups? | Stata FAQ" (ucla.edu) and "Stata | FAQ: Expanding datasets to all possible pairs",
but they do not seem to work in my case, as they would need the same group Id to appear only once in a given year (group).
I could potentially achieve that if I were to reshape the data wide and turn each component into a variable. I tried, but because the list of components is very large and changes over the years, it did not work.

Any idea how I could go about creating all pairs of group Ids in a year and calculating the cosine similarity for each pair?

Thank you for any ideas on how to deal with this!
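
For reference, the quantity being asked about can be written out as follows. This is the standard definition of cosine similarity applied to the layout above; the symbols (i and j for group Ids, c for component, t for year, w for weight) are notation introduced here for illustration, and a component that a group does not list is assumed to carry weight zero.

```latex
\[
  \cos_t(i,j) \;=\;
  \frac{\sum_{c} w_{ict}\, w_{jct}}
       {\sqrt{\sum_{c} w_{ict}^{2}}\;\sqrt{\sum_{c} w_{jct}^{2}}}
\]

% Worked example with the sample rows: year 2020, groups 23 and 24.
% Shared components are a and b; c appears only in group 24.
\[
  \cos_{2020}(23,24)
  = \frac{0.2 \cdot 0.3 + 0.8 \cdot 0.3}
         {\sqrt{0.2^{2} + 0.8^{2}}\;\sqrt{0.3^{2} + 0.3^{2} + 0.4^{2}}}
  = \frac{0.30}{\sqrt{0.68}\,\sqrt{0.34}}
  \approx 0.624
\]
```

Only components shared by both groups contribute to the numerator, so the sum can be computed from the long data without a wide reshape.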

Last edited by Da GXHI; 26 Jun 2022, 06:50.
Tags: by group, cosine similarity, pairs, panel data
Mike Lacy

Join Date: Apr 2014

Posts: 2411
#2

26 Jun 2022, 08:43

I don't know what you mean when you say "...they would need the same group id to appear only once...," or "convert component to variable." Note also that "did not work" is not a popular description on StataList or other similar settings, as it doesn't tell anything about *what* the problem is, and thus leaves us trying to read your mind.

Leaving aside those uncertainties, here is some code that will generate observations containing each possible pairing of groupId that occurs within every year, which to my understanding is what you are saying you want:

Code:

clear
input year groupId str1 component weight
2020 23 a 0.2
2020 23 b 0.8
2020 24 a 0.3
2020 24 b 0.3
2020 24 c 0.4
2019 23 b 1
2019 25 c 1
end
//
preserve
rename (groupId component weight) =2
tempfile file2
save `file2'
restore
//
joinby year using `file2'
// Eliminate self pairs and re-ordered pairs, which I presume you don't want.
drop if (groupId >= groupId2)
//
order year groupId groupId2
// Facilitate inspection of the list
sort groupId
list

I don't know anything about cosine similarity, but searching on / stata "cosine similarity" / yields a reasonable number of hits.
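
One possible way to get from the pairing idea to the similarity itself is sketched below. It is not the code posted above: it joins on year and component together, so only rows where both groups share a component are paired, and it assumes each (year, groupId, component) combination occurs at most once. The variable names norm, prod, dot and cosine are made up for this illustration.

```stata
clear
input year groupId str1 component weight
2020 23 a 0.2
2020 23 b 0.8
2020 24 a 0.3
2020 24 b 0.3
2020 24 c 0.4
2019 23 b 1
2019 25 c 1
end

// Euclidean norm of each group's weight vector within a year
bysort year groupId: egen double norm = total(weight^2)
replace norm = sqrt(norm)

// Second copy with renamed variables; component keeps its name
// because it serves as a join key together with year
preserve
rename (groupId weight norm) =2
tempfile file2
save `file2'
restore

// Pair rows only where both groups list the same component in the same year
joinby year component using `file2'

// Drop self pairs and re-ordered pairs
drop if groupId >= groupId2

// Dot product over shared components, then divide by the two norms
gen double prod = weight * weight2
collapse (sum) dot = prod (first) norm norm2, by(year groupId groupId2)
gen double cosine = dot / (norm * norm2)

list year groupId groupId2 cosine
```

With the sample rows this should leave a single pair, 2020 / 23 / 24, with a cosine of roughly 0.624. Pairs that share no component (such as 23 and 25 in 2019) produce no row at all; their similarity is zero by definition and would have to be filled in separately if explicit zeros are needed.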
2 likes
Da GXHI

Join Date: Nov 2020

Posts: 17
#3

28 Jun 2022, 04:39

@ #2 Thank you!
By "didn't work" I meant that I couldn't reshape wide, as the number of unique components is larger than the maximal number of observations that Stata SE allows.
I tried your code as well, but it doesn't solve my issue, as the number of pairs of group Ids in my data exceeds the maximal number of observations in Stata SE. Any idea on how I can split the data and still get all possible pairs of groups within a year would be welcome.
I will try year by year. For a two-year period, for instance, I have this number of unique values:

Variable         Obs   Unique      Mean   Min       Max   Label
-----------------------------------------------------------------
group_id     1748403     1403   2340720     1   9099600
component    1748403    66511         .     .         .
-----------------------------------------------------------------
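
A year-by-year split along the lines mentioned above might look like the following sketch. It is an untested suggestion for data of this size, not a confirmed solution: it pairs rows on year and component together (so group pairs with no shared component never become observations), processes one year at a time, and appends the collapsed results. The toy data and the name groupId follow the earlier example (the real data appear to use group_id); norm, prod, dot and cosine are names invented here, and each (year, groupId, component) combination is assumed to occur at most once.

```stata
clear
input year groupId str1 component weight
2020 23 a 0.2
2020 23 b 0.8
2020 24 a 0.3
2020 24 b 0.3
2020 24 c 0.4
2019 23 b 1
2019 25 c 1
end

// Norm of each group's weight vector within a year, computed once up front
bysort year groupId: egen double norm = total(weight^2)
replace norm = sqrt(norm)

levelsof year, local(years)
tempfile full other results
save `full'

// Empty file that collects the per-year results
clear
save `results', emptyok

foreach y of local years {
    // Load a single year only
    use if year == `y' using `full', clear

    // Renamed copy of the same year to join against
    preserve
    rename (groupId weight norm) =2
    save `other', replace
    restore

    joinby year component using `other'
    drop if groupId >= groupId2

    // collapse fails on an empty dataset, so skip years without shared components
    if _N == 0 {
        continue
    }

    gen double prod = weight * weight2
    collapse (sum) dot = prod (first) norm norm2, by(year groupId groupId2)
    gen double cosine = dot / (norm * norm2)

    append using `results'
    save `results', replace
}

use `results', clear
list year groupId groupId2 cosine
```

After the collapse there is at most one row per pair of groups per year, so the result is bounded by the number of group pairs rather than by the number of component combinations. The intermediate joinby result can still be large when a single component is shared by many groups, in which case the same loop could be run over blocks of components within a year and the partial dot products summed afterwards.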