Problem with vector correlation to create relatedness measure

Laura Irrgang

Join Date: Dec 2022

Posts: 4
#1

17 Dec 2022, 09:56

Hello all,

I am currently trying to create a technological relatedness measure using International Patent Classification Codes (IPC). I have already created vectors for each firm in a given year which represent the number of IPC classes the firm has filed patents in.

The technological relatedness measure I am trying to recreate is:

\[ \mathrm{TechRel}_{ij} = \frac{f_i f_j'}{\sqrt{(f_i f_i')(f_j f_j')}} \]

where f_i and f_j are the vectors of firms i and j, and the prime denotes the transposed vector. TechRel is the uncentered correlation between the two vectors: it equals 1 if the patent activities coincide and 0 if the vectors are orthogonal.
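
To make the measure concrete, here is a small worked example with made-up vectors over three IPC classes. The numbers are hypothetical and not taken from my data, and the example assumes the measure is the cross product of the two vectors divided by the square root of the product of their self products:

```latex
% Hypothetical firm vectors (patent counts in three IPC classes)
f_i = (1,\ 2,\ 0), \qquad f_j = (2,\ 0,\ 1)

% Cross product and self products
f_i f_j' = 1\cdot 2 + 2\cdot 0 + 0\cdot 1 = 2, \qquad
f_i f_i' = 1^2 + 2^2 + 0^2 = 5, \qquad
f_j f_j' = 2^2 + 0^2 + 1^2 = 5

% Uncentered correlation
\mathrm{TechRel}_{ij} = \frac{2}{\sqrt{5 \cdot 5}} = 0.4
```

If both firms had the same vector the ratio would be 1, and if they never patented in the same class the numerator, and hence the measure, would be 0.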

My data currently looks like this:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double fyear long(gvkey gvkey2) float(vec1sum vec2sum vec1sum2 vec2sum2) str12 d1
2012  10016 10000 0 0 0 0 "1000010016"
2012  10115 10000 0 0 0 0 "1000010115"
2012  10000 10453 0 0 0 0 "1000010453"
2012  10000 10519 0 0 0 0 "1000010519"
2012  10860 10000 0 0 0 0 "1000010860"
2012  10983 10000 0 0 0 0 "1000010983"
2012 110566 10000 0 0 0 0 "10000110566"
end

Since I have patents in 184 IPC classes in my dataset I only included the first two variables of my vectors in this example.

I don't really know how to compute this measure. In the end I want to have the variable technological relatedness for each firm-pair.

Any help would be greatly appreciated!

Last edited by Laura Irrgang; 17 Dec 2022, 09:58.
Clyde Schechter

Join Date: Apr 2014

Posts: 30169
#2

17 Dec 2022, 11:49

The obstacle here is that you have your data in wide layout, which makes this computation very difficult, and, I would think, with 184 IPC classes, not at all feasible. In this regard, your situation is like so many in Stata: going to long layout makes simple that which is very difficult in wide layout. The way in which the variables are named is also a barrier to -reshape-ing the data to long layout.

There is an unclarity in the way you have presented your data that cautions me against giving you specific advice on how to do this. Your vec#sum# variables contain two numbers. It is unclear whether, for example, vec1sum2 refers to IPC class 1 and the firm identified in gvkey2, or to IPC class 2 and the firm identified in gvkey. Nevertheless, since your first two variables lack final numbers, and you also coded the two firms as gvkey (no number) and gvkey2, I'm going to guess that vec1sum2 refers to gvkey2 and IPC class 1. In other words, I'm assuming that in vec#sum#, the first # refers to the IPC class and ranges from 1 to 184, and the second # refers to which firm and is always 1 (or blank in your original variable names) or 2.

It is also unclear from your description whether you want to compute this measure separately for each fyear, or once across all fyears for each firm. I will here guess it's the former.

Finally, I assume that for each combination of gvkey and gvkey2 in your data there is only a single observation for any given fyear, or perhaps none at all. This is not guesswork on my part: it is a necessary condition for the calculations to make sense, and if it is not true in your data, it suggests that one of us misunderstands what is needed here.
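
One way to check this last assumption before running anything is sketched below on a few rows patterned on the example data. This is only an illustration: -isid- stops with an error if the listed variables do not uniquely identify the observations, and -duplicates report- tabulates how many surplus copies exist.

```stata
* A few firm-pair rows in the original wide layout (identifiers only)
clear
input double fyear long(gvkey gvkey2)
2012 10016 10000
2012 10115 10000
2012 10000 10453
end

* Errors out if any fyear-gvkey-gvkey2 combination occurs more than once
isid fyear gvkey gvkey2

* Shows the number of copies per combination (all 1 here)
duplicates report fyear gvkey gvkey2
```

If -isid- fails on the real data, the duplicates need to be understood and resolved before computing anything.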

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double fyear long(gvkey gvkey2) float(vec1sum vec2sum vec1sum2 vec2sum2) str12 d1
2012  10016 10000 0 0 0 0 "1000010016"
2012  10115 10000 0 0 0 0 "1000010115"
2012  10000 10453 0 0 0 0 "1000010453"
2012  10000 10519 0 0 0 0 "1000010519"
2012  10860 10000 0 0 0 0 "1000010860"
2012  10983 10000 0 0 0 0 "1000010983"
2012 110566 10000 0 0 0 0 "10000110566"
end

// RENAME VARIABLE TO SIMPLIFY RESHAPING BY PUTTING THE IPC # AT THE END
rename gvkey gvkey1
rename (vec1sum vec2sum) =1
rename vec#sum# vecsum_#[2]_#[1]

reshape long vecsum_1_ vecsum_2_, i(gvkey1 gvkey2 fyear) j(ipc)
rename *_ *

by fyear gvkey1 gvkey2, sort: egen cross_product = total(vecsum_1*vecsum_2)
forvalues i = 1/2 {
    by fyear gvkey1 gvkey2: egen self_product_`i' = total(vecsum_`i'*vecsum_`i')
}
gen tech_rel = cross_product/sqrt(self_product_1 * self_product_2)

Note: If my assumptions about the numbers in the variable names are wrong, the code must be thoroughly changed, as the existing version will scramble the data. If my assumption about doing this separately by fyear is wrong, you can fix it simply by removing the references to fyear that occur after -by- throughout. (Do not remove the reference to fyear in the -reshape- command, however.)

Added: In the example data, only missing values are calculated for tech_rel because all of the values of the vec#sum* variables are zero, so the denominator is always zero. Presumably this is not the case in the real data set.
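
To see the calculation produce non-missing results, here is a minimal sketch on made-up data that is already in long layout. The counts are hypothetical, and the variable names are assumed to match those that exist after the -reshape- and -rename- steps. The second firm pair has all-zero vectors to show the missing-value case.

```stata
* Hypothetical long-layout data: one row per firm pair, year and IPC class
clear
input double fyear long(gvkey1 gvkey2 ipc) float(vecsum_1 vecsum_2)
2012 10000 10016 1 1 2
2012 10000 10016 2 2 0
2012 10000 10016 3 0 1
2012 10000 10453 1 0 0
2012 10000 10453 2 0 0
2012 10000 10453 3 0 0
end

* Numerator: sum over IPC classes of the products of the two firms' counts
by fyear gvkey1 gvkey2, sort: egen cross_product = total(vecsum_1*vecsum_2)

* Denominator terms: each firm's sum of squared counts
forvalues i = 1/2 {
    by fyear gvkey1 gvkey2: egen self_product_`i' = total(vecsum_`i'*vecsum_`i')
}

* Uncentered correlation; a zero denominator yields a missing value
gen tech_rel = cross_product/sqrt(self_product_1 * self_product_2)

* One row per pair is enough to inspect the result
list fyear gvkey1 gvkey2 tech_rel if ipc == 1, noobs
```

For the first pair the cross product is 2 and both self products are 5, so tech_rel is 2/5 = 0.4. For the second pair tech_rel is missing because the denominator is zero.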

Last edited by Clyde Schechter; 17 Dec 2022, 11:54.
Laura Irrgang

Join Date: Dec 2022

Posts: 4
#3

18 Dec 2022, 11:02

Thank you so much!

Since all of your assumptions were correct, the code worked perfectly.

I just had to change the second -rename- command to -rename (vec#sum) =1- and, at the end, drop all duplicates on fyear, gvkey1 and gvkey2 so that there is a single observation for each firm pair in a given fyear.
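
For anyone following along, the final de-duplication step might look roughly like the sketch below. The tech_rel values are made up for illustration, and this is one possible way of doing it rather than necessarily the exact commands I ran.

```stata
* Hypothetical long-layout result: tech_rel is repeated on every IPC row
clear
input double fyear long(gvkey1 gvkey2 ipc) float tech_rel
2012 10000 10016 1 .4
2012 10000 10016 2 .4
2012 10000 10453 1 .9
2012 10000 10453 2 .9
end

* Keep only the pair identifiers and the measure, then drop exact copies
keep fyear gvkey1 gvkey2 tech_rel
duplicates drop

* Confirm there is now one observation per firm pair and year
isid fyear gvkey1 gvkey2
list, noobs
```

Keeping only the identifiers and tech_rel first means -duplicates drop- can be used without the force option, since the remaining rows are exact copies of each other.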
Jeffery Bondzie

Join Date: Mar 2022

Posts: 1
#4

19 May 2023, 11:15

Hi Laura, I am trying to use your approach to create a cultural similarity measure. Could you please explain how you created the vectors for each firm in a given year? My data are M&A data with bidders and targets. For each firm and year, I have 124 cultural variables, which I intend to put into vectors and then apply the same formula you used to compute cultural relatedness.