  • Problem with vector correlation to create relatedness measure

    Hello all,

    I am currently trying to create a technological relatedness measure using International Patent Classification Codes (IPC). I have already created vectors for each firm in a given year which represent the number of IPC classes the firm has filed patents in.

    The technological relatedness measure which I am trying to recreate is:
    [Image: Screenshot 2022-12-17 172958.png, the formula for TechRel]

    where f represents the vectors of firms i and j (the prime indicates the transposed vector). TechRel should represent the uncentered correlation between the two vectors, taking the value 1 if the patent activities coincide and 0 if the vectors are orthogonal.
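
    Written out from this description, and assuming that f_i and f_j are row vectors of patent counts by IPC class (the notation is an assumption, not copied from the image), the measure would be:

    $$
    \mathrm{TechRel}_{ij} = \frac{f_i \, f_j'}{\sqrt{(f_i \, f_i')\,(f_j \, f_j')}}
    $$

    The numerator is the sum over IPC classes of the products of the two firms' counts, and each term under the square root is one firm's sum of squared counts.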

    My data currently looks like this:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double fyear long(gvkey gvkey2) float(vec1sum vec2sum vec1sum2 vec2sum2) str12 d1
    2012  10016  10000 0 0 0 0 "1000010016"
    2012  10115  10000 0 0 0 0 "1000010115"
    2012  10000  10453 0 0 0 0 "1000010453"
    2012  10000  10519 0 0 0 0 "1000010519"
    2012  10860  10000 0 0 0 0 "1000010860"
    2012  10983  10000 0 0 0 0 "1000010983"
    2012 110566  10000 0 0 0 0 "10000110566"
    end

    Since I have patents in 184 IPC classes in my dataset, I only included the first two variables of my vectors in this example.

    I don't really know how to compute this measure. In the end I want to have the variable technological relatedness for each firm-pair.

    Any help would be greatly appreciated!
    Last edited by Laura Irrgang; 17 Dec 2022, 09:58.

  • #2
    The obstacle here is that you have your data in wide layout, which makes this computation very difficult, and, I would think, with 184 IPC classes, not at all feasible. In this regard, your situation is like so many in Stata: going to long layout makes simple that which is very difficult in wide layout. The way in which the variables are named is also a barrier to -reshape-ing the data to long layout.

    There is an unclarity in the way you have presented your data that cautions me against giving you specific advice on how to do this. Your vec#sum# variables contain two numbers. It is unclear whether, for example, vec1sum2 refers to IPC class 1 and the firm identified in gvkey2, or to IPC class 2 and the firm identified in gvkey. Nevertheless, since your first two variables lack final numbers, and you also coded the two firms as gvkey (no number) and gvkey2, I'm going to guess that vec1sum2 refers to gvkey2 and IPC class 1. In other words, I'm assuming that in vec#sum#, the first # refers to the IPC class and ranges from 1 to 184, and the second # refers to which firm and is always 1 (or blank in your original variable names) or 2.

    It is also unclear from your description whether you want to compute this measure separately for each fyear, or once across all fyears for each firm. I will here guess it's the former.

    Finally, I assume that for each combination of gvkey and gvkey2 in your data there is only a single observation for any given fyear, or perhaps none at all. This is not guesswork on my part: it is a necessary condition for the calculations to make sense, and if it is not true in your data, it suggests that one of us misunderstands what is needed here.
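
    One way to check this uniqueness condition before going further is sketched below. It assumes the original variable names (gvkey, gvkey2, fyear) and that it is run on the wide data before any renaming.

    ```stata
    * Report how many surplus observations, if any, share the same pair and year.
    duplicates report fyear gvkey gvkey2

    * Stop with an error unless fyear, gvkey and gvkey2 jointly identify the observations.
    isid fyear gvkey gvkey2
    ```

    If -isid- exits with an error, the -reshape- command used later would fail for the same reason.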

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double fyear long(gvkey gvkey2) float(vec1sum vec2sum vec1sum2 vec2sum2) str12 d1
    2012  10016  10000 0 0 0 0 "1000010016"
    2012  10115  10000 0 0 0 0 "1000010115"
    2012  10000  10453 0 0 0 0 "1000010453"
    2012  10000  10519 0 0 0 0 "1000010519"
    2012  10860  10000 0 0 0 0 "1000010860"
    2012  10983  10000 0 0 0 0 "1000010983"
    2012 110566  10000 0 0 0 0 "10000110566"
    end
    
    //  RENAME VARIABLE TO SIMPLIFY RESHAPING BY PUTTING THE IPC # AT THE END
    rename gvkey gvkey1
    rename (vec1sum vec2sum) =1
    rename vec#sum# vecsum_#[2]_#[1]
    
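    //  GO TO LONG LAYOUT: ONE OBSERVATION PER FIRM PAIR, YEAR, AND IPC CLASS
    reshape long vecsum_1_ vecsum_2_, i(gvkey1 gvkey2 fyear) j(ipc)
    rename *_ *

    //  NUMERATOR: SUM OVER IPC CLASSES OF THE PRODUCT OF THE TWO FIRMS' COUNTS
    by fyear gvkey1 gvkey2, sort: egen cross_product = total(vecsum_1*vecsum_2)

    //  DENOMINATOR TERMS: EACH FIRM'S SUM OF SQUARED COUNTS
    forvalues i = 1/2 {
        by fyear gvkey1 gvkey2: egen self_product_`i' = total(vecsum_`i'*vecsum_`i')
    }

    //  UNCENTERED CORRELATION; MISSING WHEN EITHER FIRM'S VECTOR IS ALL ZEROS
    gen tech_rel = cross_product/sqrt(self_product_1 * self_product_2)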

    Note: If my assumptions about the numbers in the variable names are wrong, the code must be thoroughly changed, as the existing version will scramble the data. If my assumption about doing this separately by fyear is wrong, you can fix it simply by removing the references to fyear that occur after -by- throughout. (Do not remove the reference to fyear in the -reshape- command, however.)

    Added: In the example data, only missing values are calculated for tech_rel because all of the values of the vec#sum* variables are zero, so the denominator is always zero. Presumably this is not the case in the real data set.
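
    To see non-missing results, here is a minimal sketch on made-up data that is already in long layout. The firm identifiers and counts are hypothetical, chosen only so that the three pairs have orthogonal, proportional, and partially overlapping vectors.

    ```stata
    * Hypothetical data: 3 firm pairs, 2 IPC classes, one year.
    clear
    input double fyear long(gvkey1 gvkey2) byte ipc float(vecsum_1 vecsum_2)
    2012 1 2 1 1 0
    2012 1 2 2 0 1
    2012 3 4 1 2 1
    2012 3 4 2 4 2
    2012 5 6 1 1 1
    2012 5 6 2 1 0
    end

    * Same calculation as above: cross product over the root of the two self products.
    by fyear gvkey1 gvkey2, sort: egen cross_product = total(vecsum_1*vecsum_2)
    forvalues i = 1/2 {
        by fyear gvkey1 gvkey2: egen self_product_`i' = total(vecsum_`i'*vecsum_`i')
    }
    gen tech_rel = cross_product/sqrt(self_product_1 * self_product_2)

    * Show one line per firm pair.
    by fyear gvkey1 gvkey2: keep if _n == 1
    list gvkey1 gvkey2 tech_rel, noobs
    ```

    Working through the arithmetic by hand, the three pairs should come out as 0 (orthogonal), 1 (proportional), and 1/sqrt(2), roughly 0.707 (partial overlap).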
    Last edited by Clyde Schechter; 17 Dec 2022, 11:54.

    • #3
      Thank you so much!

      Since all of your assumptions were correct, the code worked perfectly.

      I just had to change the second rename command to rename (vec#sum) =1 and, at the end, drop all duplicates on fyear, gvkey1, and gvkey2 so that there is a single observation for each firm pair in a given fyear.
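
      The final step described here could look something like the sketch below. It assumes tech_rel has already been generated in long layout and that only the identifiers and the measure are needed afterwards; which variables to keep is an assumption.

      ```stata
      * Keep the pair-year identifiers and the measure, which is constant within each pair-year.
      keep fyear gvkey1 gvkey2 tech_rel

      * The remaining observations within a pair-year are identical, so drop the copies.
      duplicates drop

      * Confirm that there is now exactly one observation per firm pair and year.
      isid fyear gvkey1 gvkey2
      ```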

      • #4
        Hi Laura, I am trying to use your approach to create a cultural similarity measure. Can you please explain how you created the vectors for each firm in a given year? My data are M&A data with bidders and targets. For each firm per year, I have cultural variables (124 of them), and I intend to put them into vectors and use the same formula you used to compute the cultural relatedness.
