Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Multiply list of variables pairwise based on name suffix

    I have a dataset that looks like this:
    year id weight1_a wight1_b weight1_c weight2_a weight2_b weight2_c
    2020 1 0.1 0.2 0.7 0.4 0.3 0.3
    2020 2
    2020 3
    2020 4
    I want to create a new variable
    x_dot_y=weight1_a*weight2_a+ weight1_b*weight2_b+weight1_c*weight2_c
    This may sound straightforward, but I have a very large number of weight variables (40000) so I would appreciate if someone can share any idea on how to write this in a more compact way.
    I tried foreach but I am not able to get the list of a, b , c correct or so as it yields an incorrect value.

    Any help is highly appreciated.
    Last edited by Da GXHI; 04 Jul 2022, 08:31. Reason: Pairwise multiplicatiom

  • #2
    Could you let us know the general formula of x_dot_y? What terms are beyond the three terms in #1? If x_dot_y = weight1_a*weight2_a + weight1_b*weight2_b + weight1_c*weight2_c + ... + weight1_z*weight2_z, then there are 52 weights. What do the 40000 weight variables look like, and what should the full formula of x_dot_y be?

    Comment


    • #3
      I am trying to calculate cosine similarity. x_dot_y is part of the multiplication of weights of the two different vectors I am comparing, refer to #2 in How to calculate cosine similarity? - Statalist.
      because my data contains more than 40000 compoonents for each variable, I had to split the dataset and now this represents only part of it. the full formula loiks more like:
      x_dot_y = weight1_str1*weight2_str1 + weight1_str2*weight2_str2 + weight1_str3*weight2_str3 + ... + weight1_str40000*weight2_400000
      where the str1-str40000 are alphanumeric values, that do not follow a special pattern

      Comment


      • #4
        I showed an example below, with weight1_str1, weight2_str1, ..., weight1_str5, weight2_str5.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(id year weight1_str1 weight1_str2 weight1_str3 weight1_str4 weight1_str5 weight2_str1 weight2_str2 weight2_str3 weight2_str4 weight2_str5)
        1 2020 .13590056 .13270818  .4087105 .6251604   .3005246 .06694305 .3764437 .56622654 .1736788  .9537622
        2 2020  .3325624  .8745816 .29777426  .634899 .007538023 .07229638 .7677861  .1778325 .6107705 .08674778
        3 2020  .4675523  .2468877  .6794177 .9944572   .6701369  .6817465 .7551366 .11399896 .5754215  .8949648
        4 2020 .51608807   .043255  .7124024 .7497677   .4241406 .08804953 .4476188  .5955869 .3678161  .5890286
        end
        
        list, ab(12)
        
             +-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
             | id   year   weight1_str1   weight1_str2   weight1_str3   weight1_str4   weight1_str5   weight2_str1   weight2_str2   weight2_str3   weight2_str4   weight2_str5 |
             |-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
          1. |  1   2020       .1359006       .1327082       .4087105       .6251604       .3005246        .066943       .3764437       .5662265       .1736788       .9537622 |
          2. |  2   2020       .3325624       .8745816       .2977743        .634899        .007538       .0722964       .7677861       .1778325       .6107705       .0867478 |
          3. |  3   2020       .4675523       .2468877       .6794177       .9944572       .6701369       .6817465       .7551366        .113999       .5754215       .8949648 |
          4. |  4   2020       .5160881        .043255       .7124024       .7497677       .4241406       .0880495       .4476188       .5955869       .3678161       .5890286 |
             +-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
        
        reshape long weight1 weight2, i(id year) j(str) string
        bys id year: egen x_dot_y = total(weight1*weight2)
        reshape wide weight1 weight2, i(id year) j(str) string
        
        list id year x_dot_y
        
             +----------------------+
             | id   year    x_dot_y |
             |----------------------|
          1. |  1   2020   .6856836 |
          2. |  2   2020    1.13692 |
          3. |  3   2020    1.75462 |
          4. |  4   2020   1.014708 |
             +----------------------+
        Reshaping may take a while if you have many weight variables in a single dataset.

        ADD: If you only care about the final x_dot_y, then you may not reshape back. The code below would save some time.

        Code:
        reshape long weight1 weight2, i(id year) j(str) string
        bys id year: egen x_dot_y = total(weight1*weight2)
        keep id year x_dot_y
        duplicates drop
        Last edited by Fei Wang; 04 Jul 2022, 10:04.

        Comment


        • #5
          Many thanks!Unofrtunately reshapimg long does not seem to work for me as the dataset is too large and I get the error "op. sys. refuses to provide memory", I would need to split the dataset further but that creates lots of work. I was hoping there is some way to identify variable pairs and multiply them in the wide format

          Comment


          • #6
            Code:
            gen x_dot_y = 0
            forvalues i = 1/40000 {
                replace x_dot_y = x_dot_y + weight1_str`i'*weight2_str`i'
            }
            does it in wide layout.

            That said, you might have better luck reshaping the data set by using the -greshape- command that is part of the -gtools- package by Mauricio Caceres Bravo, available from SSC. It is faster, and, I think, also uses less memory.

            Comment


            • #7
              #5 Thanks. the challange is that str1-str40000 are alphanumeric values, that do not follow a special pattern. Is there any way I could extract the list of all them from the variable name so as to use the foreach? I did try fastshape and greshape, but they do not seem to work either.

              Comment


              • #8
                How about this?

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input float(id year weight1_str1 weight1_str2 weight1_str3 weight1_str4 weight1_str5 weight2_str1 weight2_str2 weight2_str3 weight2_str4 weight2_str5)
                1 2020 .13590056 .13270818  .4087105 .6251604   .3005246 .06694305 .3764437 .56622654 .1736788  .9537622
                2 2020  .3325624  .8745816 .29777426  .634899 .007538023 .07229638 .7677861  .1778325 .6107705 .08674778
                3 2020  .4675523  .2468877  .6794177 .9944572   .6701369  .6817465 .7551366 .11399896 .5754215  .8949648
                4 2020 .51608807   .043255  .7124024 .7497677   .4241406 .08804953 .4476188  .5955869 .3678161  .5890286
                end
                
                qui d weight1_*, varlist
                
                local wlist = r(varlist)
                local wlist: subinstr local wlist "weight1_" "", all
                
                gen x_dot_y = 0
                foreach l of local wlist {
                    replace x_dot_y = x_dot_y + weight1_`l'*weight2_`l'
                }

                Comment


                • #9
                  #8 That works perfectly. Thank you, thank you, thank you!

                  Comment

                  Working...
                  X