Multiply list of variables pairwise based on name suffix

Da GXHI

Join Date: Nov 2020

Posts: 17
#1

Multiply list of variables pairwise based on name suffix

04 Jul 2022, 08:26

I have a dataset that looks like this:
year id weight1_a wight1_b weight1_c weight2_a weight2_b weight2_c

2020 1 0.1 0.2 0.7 0.4 0.3 0.3

2020 2

2020 3

2020 4

I want to create a new variable
x_dot_y=weight1_a*weight2_a+ weight1_b*weight2_b+weight1_c*weight2_c
This may sound straightforward, but I have a very large number of weight variables (40000) so I would appreciate if someone can share any idea on how to write this in a more compact way.
I tried foreach but I am not able to get the list of a, b , c correct or so as it yields an incorrect value.

Any help is highly appreciated.

Last edited by Da GXHI; 04 Jul 2022, 08:31. Reason: Pairwise multiplicatiom
Tags: None
Fei Wang

Join Date: Oct 2021

Posts: 726
#2

04 Jul 2022, 09:03

Could you let us know the general formula of x_dot_y? What terms are beyond the three terms in #1? If x_dot_y = weight1_a*weight2_a + weight1_b*weight2_b + weight1_c*weight2_c + ... + weight1_z*weight2_z, then there are 52 weights. What do the 40000 weight variables look like, and what should the full formula of x_dot_y be?
Comment
Da GXHI

Join Date: Nov 2020

Posts: 17
#3

04 Jul 2022, 09:13

I am trying to calculate cosine similarity. x_dot_y is part of the multiplication of weights of the two different vectors I am comparing, refer to #2 in How to calculate cosine similarity? - Statalist.
because my data contains more than 40000 compoonents for each variable, I had to split the dataset and now this represents only part of it. the full formula loiks more like:
x_dot_y = weight1_str1*weight2_str1 + weight1_str2*weight2_str2 + weight1_str3*weight2_str3 + ... + weight1_str40000*weight2_400000
where the str1-str40000 are alphanumeric values, that do not follow a special pattern
Comment

Fei Wang

Join Date: Oct 2021
Posts: 726

04 Jul 2022, 09:39

I showed an example below, with weight1_str1, weight2_str1, ..., weight1_str5, weight2_str5.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(id year weight1_str1 weight1_str2 weight1_str3 weight1_str4 weight1_str5 weight2_str1 weight2_str2 weight2_str3 weight2_str4 weight2_str5)
1 2020 .13590056 .13270818  .4087105 .6251604   .3005246 .06694305 .3764437 .56622654 .1736788  .9537622
2 2020  .3325624  .8745816 .29777426  .634899 .007538023 .07229638 .7677861  .1778325 .6107705 .08674778
3 2020  .4675523  .2468877  .6794177 .9944572   .6701369  .6817465 .7551366 .11399896 .5754215  .8949648
4 2020 .51608807   .043255  .7124024 .7497677   .4241406 .08804953 .4476188  .5955869 .3678161  .5890286
end

list, ab(12)

     +-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
     | id   year   weight1_str1   weight1_str2   weight1_str3   weight1_str4   weight1_str5   weight2_str1   weight2_str2   weight2_str3   weight2_str4   weight2_str5 |
     |-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
  1. |  1   2020       .1359006       .1327082       .4087105       .6251604       .3005246        .066943       .3764437       .5662265       .1736788       .9537622 |
  2. |  2   2020       .3325624       .8745816       .2977743        .634899        .007538       .0722964       .7677861       .1778325       .6107705       .0867478 |
  3. |  3   2020       .4675523       .2468877       .6794177       .9944572       .6701369       .6817465       .7551366        .113999       .5754215       .8949648 |
  4. |  4   2020       .5160881        .043255       .7124024       .7497677       .4241406       .0880495       .4476188       .5955869       .3678161       .5890286 |
     +-----------------------------------------------------------------------------------------------------------------------------------------------------------------+

reshape long weight1 weight2, i(id year) j(str) string
bys id year: egen x_dot_y = total(weight1*weight2)
reshape wide weight1 weight2, i(id year) j(str) string

list id year x_dot_y

     +----------------------+
     | id   year    x_dot_y |
     |----------------------|
  1. |  1   2020   .6856836 |
  2. |  2   2020    1.13692 |
  3. |  3   2020    1.75462 |
  4. |  4   2020   1.014708 |
     +----------------------+

Reshaping may take a while if you have many weight variables in a single dataset.

ADD: If you only care about the final x_dot_y, then you may not reshape back. The code below would save some time.

Code:

reshape long weight1 weight2, i(id year) j(str) string
bys id year: egen x_dot_y = total(weight1*weight2)
keep id year x_dot_y
duplicates drop

Last edited by Fei Wang; 04 Jul 2022, 10:04.

Comment

Da GXHI

Join Date: Nov 2020

Posts: 17
#5

04 Jul 2022, 09:59

Many thanks!Unofrtunately reshapimg long does not seem to work for me as the dataset is too large and I get the error "op. sys. refuses to provide memory", I would need to split the dataset further but that creates lots of work. I was hoping there is some way to identify variable pairs and multiply them in the wide format
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#6

04 Jul 2022, 10:09

Code:

gen x_dot_y = 0 forvalues i = 1/40000 { replace x_dot_y = x_dot_y + weight1_str`i'*weight2_str`i' }

does it in wide layout.

That said, you might have better luck reshaping the data set by using the -greshape- command that is part of the -gtools- package by Mauricio Caceres Bravo, available from SSC. It is faster, and, I think, also uses less memory.
Comment
Da GXHI

Join Date: Nov 2020

Posts: 17
#7

04 Jul 2022, 10:29

#5 Thanks. the challange is that str1-str40000 are alphanumeric values, that do not follow a special pattern. Is there any way I could extract the list of all them from the variable name so as to use the foreach? I did try fastshape and greshape, but they do not seem to work either.
Comment

Fei Wang

Join Date: Oct 2021
Posts: 726

04 Jul 2022, 10:33

How about this?

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(id year weight1_str1 weight1_str2 weight1_str3 weight1_str4 weight1_str5 weight2_str1 weight2_str2 weight2_str3 weight2_str4 weight2_str5)
1 2020 .13590056 .13270818  .4087105 .6251604   .3005246 .06694305 .3764437 .56622654 .1736788  .9537622
2 2020  .3325624  .8745816 .29777426  .634899 .007538023 .07229638 .7677861  .1778325 .6107705 .08674778
3 2020  .4675523  .2468877  .6794177 .9944572   .6701369  .6817465 .7551366 .11399896 .5754215  .8949648
4 2020 .51608807   .043255  .7124024 .7497677   .4241406 .08804953 .4476188  .5955869 .3678161  .5890286
end

qui d weight1_*, varlist

local wlist = r(varlist)
local wlist: subinstr local wlist "weight1_" "", all

gen x_dot_y = 0
foreach l of local wlist {
    replace x_dot_y = x_dot_y + weight1_`l'*weight2_`l'
}

Comment

Da GXHI

Join Date: Nov 2020

Posts: 17
#9

05 Jul 2022, 04:47

#8 That works perfectly. Thank you, thank you, thank you!
Comment

year	id	weight1_a	wight1_b	weight1_c	weight2_a	weight2_b	weight2_c
2020	1	0.1	0.2	0.7	0.4	0.3	0.3
2020	2
2020	3
2020	4

Announcement

Multiply list of variables pairwise based on name suffix

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment