Computing a relatedness/similarity measure with patents

Doris Rivera

Join Date: Feb 2020
Posts: 172

Computing a relatedness/similarity measure with patents

30 Jun 2022, 08:34

Dear Statalist, I am working with patent data (EPO) and I would like to address a measure of the relatedness in green technologies with the other technologies in a given firm (person_id). Basically, each firm could appear in the sample in several years (or even several times in a given year if it patents more than once in the same year). The patent data is merged with the technological classification IPC and with other technological classification CPC (reason why you may find that a patent done by a single firm in a single IPC technology may have more than one row in the sample, check appln_id==143931 below).

I would need your help to count in how many patents (appln_id) a green technology appears with any other technology (co-occurrence). In my opinion this could be done with a matrix in which rows could be those technologies (ipc_cpc) with a green classification (green==1), while the columns would be all the technologies. The idea would be to put in the cell the number of times two technologies (row and column) appear in the dataset.

Another possible way (usually done in the literature) is to build two vectors (columns) in which the first one would be the green technologies and the second one would be all the technologies, something like the next table (assuming as green tech 3,5,6): in which you put the number of co-occurrence in another column for each combination.

Green tech	All tech	Co-occurrence
3	1
3	2
3	4
3	5
…	…
5	1
5	2

Is it possible to build something like this for each period (app_year)?

A further complication comes when merging back the obtained measure of similarity with the original dataset in the example below. Any idea?

Please, if there is anyone who have dealt with something similar to this, or have any experience in working with patent data and have a better way of building this relatedness/similarity measure for firms, it will be highly appreciated some hints. Maybe Julio Raffo might want to share some ideas

In the dataex: person_id is firm id; app_year is year; appln_id is patent id; ipc2 cpc ipc_cpc are the technological classifications; green divides patents between green or non-green.

Many thanks in advance!

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long person_id int app_year long appln_id str13 ipc2 str15(cpc_class ipc_cpc) float green
 1 2000 212411 "H04L012/28"  "H04W  92/02"    "H04L012/28"  0
 1 2000 212411 "H04L012/56"  "H04W  84/12"    "H04L012/56"  0
 1 2000 212411 "H04L029/06"  "H04W  36/03"    "H04L029/06"  0
 1 2000 212411 "H04W036/08"  "H04W  88/08"    "H04W036/08"  0
 1 2000 212411 "H04W056/00"  "H04W  56/00"    "H04W056/00"  0
 1 2000 212411 "H04W084/12"  "H04W  92/02"    "H04W084/12"  0
 1 2000 212411 "H04W088/08"  "H04W  84/18"    "H04W088/08"  0
 1 2000 212411 "H04W092/02"  "H04L  29/06"    "H04W092/02"  0
 1 2000 215797 "G09F009/00"  "H05B  45/385"   "G09F009/00"  0
 1 2000 215797 "H01L033/00"  "Y02B  20/30"    "Y02B020/30"  1
 1 2000 215797 "H05B033/08"  "Y02B  20/30"    "Y02B020/30"  1
 1 2000 215797 "H05B037/02"  "Y02B  20/30"    "Y02B020/30"  1
 1 2000 216053 "H04L012/66"  "H04M2207/20"    "H04L012/66"  0
 1 2000 216053 "H04M003/00"  "H04M   7/0009"  "H04M003/00"  0
 1 2000 216053 "H04M007/00"  "H04M   7/12"    "H04M007/00"  0
 1 2000 216053 "H04M007/06"  "H04M2207/20"    "H04M007/06"  0
 1 2000 216053 "H04M007/12"  "H04L  12/66"    "H04M007/12"  0
 1 2000 216053 "H04M011/00"  "H04M2207/18"    "H04M011/00"  0
 1 2002  31572 "G06F001/32"  "G06F   1/3228"  "G06F001/32"  0
 1 2002  31572 "G06F001/32"  "G06F   1/3203"  "G06F001/32"  0
 1 2002  31572 "G06F001/32"  "G06F   1/3287"  "G06F001/32"  0
 1 2002  31572 "G06F001/32"  "G06F   1/3296"  "G06F001/32"  0
 1 2002  31572 "G06F001/32"  "Y02D  10/00"    "Y02D010/00"  1
 1 2002  31572 "G06F001/32"  "G06F   1/324"   "G06F001/32"  0
 1 2003   3347 "H04B007/005" "H04W  52/16"    "H04B007/005" 0
 1 2003   3347 "H04B007/26"  "Y02D  30/70"    "Y02D030/70"  1
 1 2003   3347 "H04J001/00"  "Y02D  30/70"    "Y02D030/70"  1
 1 2003   3347 "H04J003/16"  "H04W  52/26"    "H04J003/16"  0
 1 2003   3347 "H04L012/56"  "H04W  28/22"    "H04L012/56"  0
 1 2003   3347 "H04W028/22"  "H04W  52/267"   "H04W028/22"  0
 1 2003   3347 "H04W036/16"  "H04W  52/12"    "H04W036/16"  0
 1 2003   3347 "H04W052/00"  "H04W  52/0229"  "H04W052/00"  0
 1 2003   3347 "H04W052/02"  "H04W  52/0216"  "H04W052/02"  0
 1 2003   3347 "H04W052/16"  "H04B   7/2628"  "H04W052/16"  0
 1 2003   3347 "H04W052/26"  "H04W  36/16"    "H04W052/26"  0
20 2003 143931 "H04B001/16"  "G06F   1/3271"  "H04B001/16"  0
20 2003 143931 "H04B001/16"  "Y02D  30/70"    "Y02D030/70"  1
20 2003 143931 "H04B001/16"  "H04W  52/0287"  "H04B001/16"  0
20 2003 143931 "H04B001/16"  "G06F   1/3215"  "H04B001/16"  0
20 2003 143931 "H04B001/16"  "Y02D  10/00"    "Y02D010/00"  1
20 2003 143931 "H04B001/16"  "G06F   1/3259"  "H04B001/16"  0
20 2003 220700 "H05B041/292" "Y02B  20/00"    "Y02B020/00"  1
20 2003 220700 "H05B041/292" "Y10S 315/07"    "H05B041/292" 0
20 2003 220700 "H05B041/292" "Y10S 315/05"    "H05B041/292" 0
21 2003 143931 "G06F001/32"  "G06F   1/3209"  "G06F001/32"  0
21 2003 220700 "H05B041/24"  "H05B  41/2928"  "H05B041/24"  0
21 2007 136847 "A61L009/20"  "H01J  61/70"    "A61L009/20"  0
21 2007 136847 "A61L009/20"  "H01J  61/125"   "A61L009/20"  0
21 2007 136847 "A61L009/20"  "H01J  65/042"   "A61L009/20"  0
21 2007 136847 "A61N005/06"  "C02F   1/32"    "A61N005/06"  0
21 2007 136847 "C02F001/32"  "A61L   2/10"    "C02F001/32"  1
21 2007 136847 "H01J061/12"  "A61L   9/20"    "H01J061/12"  0
48 2001     10 "B01J023/46"  "C07F  15/004"   "B01J023/46"  1
48 2001     10 "B01J023/46"  "C07F  15/0053"  "B01J023/46"  1
48 2001     10 "B01J023/46"  "C07F  15/008"   "B01J023/46"  1
48 2001     10 "B01J031/24"  "B01J2531/80"    "B01J031/24"  0
48 2001     10 "C07B053/00"  "C07F  15/0013"  "C07B053/00"  0
48 2001     10 "C07B057/00"  "B01J2531/821"   "C07B057/00"  0
48 2001     10 "C07B061/00"  "B01J2531/822"   "C07B061/00"  0
48 2001     10 "C07C043/23"  "C07C 303/28"    "C07C043/23"  0
48 2001     10 "C07C067/303" "C07F   9/5072"  "C07C067/303" 0
48 2001     10 "C07C069/34"  "B01J  31/2409"  "C07C069/34"  0
48 2001     10 "C07C303/28"  "C07F   9/572"   "C07C 303/28" 0
48 2001     10 "C07C309/63"  "C07F   9/65515" "C07C 309/63" 0
48 2001     10 "C07C309/65"  "C07F   9/5027"  "C07C 309/65" 0
48 2001     10 "C07D321/10"  "B01J2531/0266"  "C07D 321/10" 0
48 2001     10 "C07F009/50"  "C07D 321/10"    "C07F009/50"  0
48 2001     10 "C07F009/572" "B01J2531/827"   "C07F009/572" 0
48 2001     10 "C07F009/655" "B01J2231/645"   "C07F009/655" 0
48 2001     10 "C07F015/00"  "B01J2231/641"   "C07F015/00"  0
end

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#2

30 Jun 2022, 08:50

I didn't understand your explanation. But if all you need is an indicator for whether or not a patent application (appln_id) contains both green and non-green technologies, you can do that with

Code:

assert !missing(green) by appln_id (green), sort: gen byte has_both = (green[1] != green[_N])
Comment
Doris Rivera

Join Date: Feb 2020

Posts: 172
#3

30 Jun 2022, 10:09

Dear Clyde Schechter, thanks very much for your help. Unfortunately this is not what I need.
Briefly, I need to know if a firm's technology is close (similar) to a Green technology. For that, I need to count the number of times two given technologies (variable ipc_cpc) appear in each of the different patents (appln_id). This assume that if two technologies appear too much in many patents, it is said that those two technologies are related, since firms will use them in combinations many times.
Thus, I want to perform a measure at the firm level (for each period) of how similar/related is a given technology of a firm to a green technology. For that I need to first count how many times in combinations of two, two technologies (one of them Green) appear in the data for each year.

I think that building a matrix with rows as non-green technologies and columns as green technologies would help me to know the specific combinations. But do not know how to count them as it is right now in the dataset. However, I have read that other possible way in the literature is to build two variables: one for the green tech and other for the non-green tech as I show in the table in #1, and then add a third variable which is the number of time such a combination appear in a given year.

For instance, in appln_id=212411 (first one), non of the technologies are green. In the second patent, technology "Y02B020/30" (Green) appears with "G09F009/00" which would imply a count of 1 for such a combination. If this combination appear in another patent, it would count as 2... And this should be done for all the patent looking for combinations with green patents, and counting how many times they appear in that year.

I am very sorry for the mess in my explanation, I really hope now is easier to understand.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#4

30 Jun 2022, 12:20

Well, I think I understand it a little bit better. Basically, for each pair of technologies that appear anywhere in your data set, you want to identify the frequency with which the co-occur in the same application.

My problem now, is that I am completely confused by how we identify a distinct technology. You have three variables, ipc2, cpc_class, and ipc_cpc. But I cannot discern what the relationships among them are. Does the combination of ipc2 and cpc_class identify a single distinct technology? (In the example data, when two observations agree on ipc2 and cpc_class they also agree on ipc_cpc. But I don't want to presume this is always the case. Also, in #3 you refer to technology "Y02B020/30." This value is found only in variable ipc_cpc. But when I examine the observations that have ipc_cpc == "Y02B020/30,", they have different values of ipc2. So is it only ipc_cpc that defines a technology, and ipc2 and cpc_class are irrelvant? Or is there something more complicated going on? Please clarify for me how the variables are used to identify a single technology.
Comment
Doris Rivera

Join Date: Feb 2020

Posts: 172
#5

30 Jun 2022, 14:16

Dear Clyde Schechter, you have touched a good point here. Basically, to identify green tech you need both, IPC and CPC codes. So, there may be patents that are not listed as green using their IPC codes, but instead they can be classified as green using CPC (CPC codes are more dissaggregated classification, but unfortunately, the OECD environmental classification is done with the two codes).

So, what I did was to merge IPC with CPC using the appln_id. However, as there is not other variable that helps to fine tune the identification of cases, I was forced to used m:m merging. So, I am sure that each time I redo the merge, the IPC and CPC in a given patent (ex. appln_id = 215797) will change possition within the patent (something I will have to solve).

I have built the ipc_cpc variable as the codes in the ipc but only changing those ipc codes in the lines in which cpc are green. I understand this is not the best way, since for instance, it could be that those substituted ipc codes might also serve as co-occurrences with green tech. Maybe the best way could be to add a different observation for those CPC codes in each patent (maybe appending the cpc observations to a given patent instead of merge).

In any case, lets assume for now that my ipc_cpc is the good one for identifying the technology of patents.

Well, I think I understand it a little bit better. Basically, for each pair of technologies that appear anywhere in your data set, you want to identify the frequency with which the co-occur in the same application

It is pretty much as you say, but notice that a given technology should not be repeated in a given patent (here it happens because my bad merging process). So, teoretically, two technologies can only be within a patent once, or cero otherwise. So, the idea is to observe in how many patents they appear 1 and sum all those 1s.

I hope this will help you to understand what am I dealing with. I am very grateful for your help with this complicated issue!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#6

30 Jun 2022, 14:41

I was forced to used m:m merging

OK, this means that what you have is data salad. -merge m:m- produces results that are meaningless and useless in almost all circumstances. I have been using Stata pretty much every day since 1994 and only once have I encountered a situation where -merge m:m- would produce something usable. And even then, there was a better way to do it. So basically, you should write on the blackboard 500 times "I will not use merge m:m, ever."

More to the point, it means that there is no point trying to go further with this data set. It is already corrupted and nothing good will come of working with it.

Let's get the data fixed up. Why don't you post examples from the data sets that you merged to create what you showed in your example, and explain what the variables are and how they are supposed to match up, and we can figure out the right way to combine them to get correct results. After we've done that, we can figure out how to get your similarity variable.
Comment

Doris Rivera

Join Date: Feb 2020
Posts: 172

01 Jul 2022, 03:26

Dear Clyde Schechter, I show you three dataex. The first one is the patent data, in which you usually have one line per patent (unless you have more than one firm working in the patent). The second dataex is the IPC codes, in which you have several observations per patent each listing one of the technologies the patent refers (the same code cannot be repeated in the same patent, and only appears once). The third dataex is the CPC one, in which it is basically the same, but with possibly more disaggregation of technologies which may result in even more rows per patent.

As you see, the only common variable is appln-id. However, some of the IPC codes are the same in the CPC ones (see appln_id=10 IPC “C07F009/572” and CPC “C07F009/572”) while some disaggregates more the IPC one (see appln_id=10 IPC “B01J031/24” and CPC “B01J031/2409”).

Since from the CPC codes I am only interested in codes “Y02” which is the code classifying a patent as green in the CPC data. Ideally, I think the best would be to append the observations CPC (Y02) codes only to the IPC data. And then build a tech variable with the different codes from IPC and CPC (Y02) within each patent. In this case, for each patent I will have all the IPC codes plus only the CPC (Y02) as other observations within the patent. And finally, merge this with the EPO patent data (using appln_id).

Because the structure of the IPC and CPC codes is X##X###/ (being X a letter and # a number) I use the following commands for eliminating spaces and for identify all CPC “Y02”:

Code:

gen ipc2 = ipc
replace ipc2 = subinstr(ipc2, " ", "", .)
gen new_cpc3 = substr(cpc_class,1,3)

I really hope this clarify how to build the technology variable for doing the relatedness measure.

EPO Patent data

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(appln_id person_id) str7 reg_code
    10 48 "DEA1C"
  3347  1 "FI1B1"
 31572  1 "FI1B1"
136847 21 "NL414"
143931 20 "DE600"
143931 21 "NL414"
212411  1 "FI1B1"
215797  1 "FI1B1"
216053  1 "FI1B1"
220700 20 "DE600"
220700 21 "NL414"
end

IPC Classifications

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long appln_id int(prio_year app_year) str13 ipc
    10 2000 2001 "B01J023/46" 
    10 2000 2001 "B01J031/24" 
    10 2000 2001 "C07B053/00" 
    10 2000 2001 "C07B057/00" 
    10 2000 2001 "C07B061/00" 
    10 2000 2001 "C07C 303/28"
    10 2000 2001 "C07C 309/63"
    10 2000 2001 "C07C 309/65"
    10 2000 2001 "C07C043/23" 
    10 2000 2001 "C07C067/303"
    10 2000 2001 "C07C069/34" 
    10 2000 2001 "C07D 321/10"
    10 2000 2001 "C07F009/50" 
    10 2000 2001 "C07F009/572"
    10 2000 2001 "C07F009/655"
    10 2000 2001 "C07F015/00" 
  3347 2002 2003 "H04B007/005"
  3347 2002 2003 "H04B007/26" 
  3347 2002 2003 "H04J001/00" 
  3347 2002 2003 "H04J003/16" 
  3347 2002 2003 "H04L012/56" 
  3347 2002 2003 "H04W028/22" 
  3347 2002 2003 "H04W036/16" 
  3347 2002 2003 "H04W052/00" 
  3347 2002 2003 "H04W052/02" 
  3347 2002 2003 "H04W052/16" 
  3347 2002 2003 "H04W052/26" 
 31572 2001 2002 "G06F001/32" 
136847 2006 2007 "A61L009/20" 
136847 2006 2007 "A61N005/06" 
136847 2006 2007 "C02F001/32" 
136847 2006 2007 "H01J061/12" 
143931 2002 2003 "G06F001/32" 
143931 2002 2003 "H04B001/16" 
212411 1999 2000 "H04L012/28" 
212411 1999 2000 "H04L012/56" 
212411 1999 2000 "H04L029/06" 
212411 1999 2000 "H04W036/08" 
212411 1999 2000 "H04W056/00" 
212411 1999 2000 "H04W084/12" 
212411 1999 2000 "H04W088/08" 
212411 1999 2000 "H04W092/02" 
215797 1999 2000 "G09F009/00" 
215797 1999 2000 "H01L033/00" 
215797 1999 2000 "H05B033/08" 
215797 1999 2000 "H05B037/02" 
216053 1999 2000 "H04L012/66" 
216053 1999 2000 "H04M003/00" 
216053 1999 2000 "H04M007/00" 
216053 1999 2000 "H04M007/06" 
216053 1999 2000 "H04M007/12" 
216053 1999 2000 "H04M011/00" 
220700 2002 2003 "H05B041/24" 
220700 2002 2003 "H05B041/292"
end

CPC Classifications

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long appln_id str15 cpc_class
    10 "B01J  31/2409" 
    10 "B01J2231/641"  
    10 "B01J2231/645"  
    10 "B01J2531/0266" 
    10 "B01J2531/80"   
    10 "B01J2531/821"  
    10 "B01J2531/822"  
    10 "B01J2531/827"  
    10 "C07C 303/28"   
    10 "C07D 321/10"   
    10 "C07F   9/5027" 
    10 "C07F   9/5072" 
    10 "C07F   9/572"  
    10 "C07F   9/65515"
    10 "C07F  15/0013" 
    10 "C07F  15/004"  
    10 "C07F  15/0053" 
    10 "C07F  15/008"  
  3347 "H04B   7/2628" 
  3347 "H04W  28/22"   
  3347 "H04W  36/16"   
  3347 "H04W  52/0216" 
  3347 "H04W  52/0229" 
  3347 "H04W  52/12"   
  3347 "H04W  52/16"   
  3347 "H04W  52/26"   
  3347 "H04W  52/267"  
  3347 "Y02D  30/70"   
 31572 "G06F   1/3203" 
 31572 "G06F   1/3228" 
 31572 "G06F   1/324"  
 31572 "G06F   1/3287" 
 31572 "G06F   1/3296" 
 31572 "Y02D  10/00"   
136847 "A61L   2/10"   
136847 "A61L   9/20"   
136847 "C02F   1/32"   
136847 "H01J  61/125"  
136847 "H01J  61/70"   
136847 "H01J  65/042"  
143931 "G06F   1/3209" 
143931 "G06F   1/3215" 
143931 "G06F   1/3259" 
143931 "G06F   1/3271" 
143931 "H04W  52/0287" 
143931 "Y02D  10/00"   
143931 "Y02D  30/70"   
212411 "H04L  29/06"   
212411 "H04W  36/03"   
212411 "H04W  56/00"   
212411 "H04W  84/12"   
212411 "H04W  84/18"   
212411 "H04W  88/08"   
212411 "H04W  92/02"   
215797 "H05B  45/385"  
215797 "Y02B  20/30"   
216053 "H04L  12/66"   
216053 "H04M   7/0009" 
216053 "H04M   7/12"   
216053 "H04M2207/18"   
216053 "H04M2207/20"   
220700 "H05B  41/2928" 
220700 "Y02B  20/00"   
220700 "Y10S 315/05"   
220700 "Y10S 315/07"   
end

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30355

01 Jul 2022, 11:42

So, I think you have it basically right: the first step is to append the IPC and CPC code data, then weed out duplicate codes, and then finally merge that with the EPO data.

Now, the structure of the EPO data is a problem: the person_id and reg_code are not really that material to what you are trying to do here. But because there can be multiple person_ids and reg_codes for the same application, you end up looking at an impossible merge. It could be done with -joinby-, but then each person_id will end up with a full copy of all of the techs that belong to the appln_id, which is going to make the data set large, and will get in the way of analyses for which the application is the unit of analysis. So here, I think the better solution is to -reshape- the EPO data to wide layout so there is just one observation per appln_id. Then you have a legitimate 1:m -merge-. Putting this all together, I come out with:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(appln_id person_id) str7 reg_code
    10 48 "DEA1C"
  3347  1 "FI1B1"
 31572  1 "FI1B1"
136847 21 "NL414"
143931 20 "DE600"
143931 21 "NL414"
212411  1 "FI1B1"
215797  1 "FI1B1"
216053  1 "FI1B1"
220700 20 "DE600"
220700 21 "NL414"
end
tempfile epo_data
save `epo_data'

* Example generated by -dataex-. To install: ssc install dataex
clear
input long appln_id int(prio_year app_year) str13 ipc
    10 2000 2001 "B01J023/46"
    10 2000 2001 "B01J031/24"
    10 2000 2001 "C07B053/00"
    10 2000 2001 "C07B057/00"
    10 2000 2001 "C07B061/00"
    10 2000 2001 "C07C 303/28"
    10 2000 2001 "C07C 309/63"
    10 2000 2001 "C07C 309/65"
    10 2000 2001 "C07C043/23"
    10 2000 2001 "C07C067/303"
    10 2000 2001 "C07C069/34"
    10 2000 2001 "C07D 321/10"
    10 2000 2001 "C07F009/50"
    10 2000 2001 "C07F009/572"
    10 2000 2001 "C07F009/655"
    10 2000 2001 "C07F015/00"
  3347 2002 2003 "H04B007/005"
  3347 2002 2003 "H04B007/26"
  3347 2002 2003 "H04J001/00"
  3347 2002 2003 "H04J003/16"
  3347 2002 2003 "H04L012/56"
  3347 2002 2003 "H04W028/22"
  3347 2002 2003 "H04W036/16"
  3347 2002 2003 "H04W052/00"
  3347 2002 2003 "H04W052/02"
  3347 2002 2003 "H04W052/16"
  3347 2002 2003 "H04W052/26"
 31572 2001 2002 "G06F001/32"
136847 2006 2007 "A61L009/20"
136847 2006 2007 "A61N005/06"
136847 2006 2007 "C02F001/32"
136847 2006 2007 "H01J061/12"
143931 2002 2003 "G06F001/32"
143931 2002 2003 "H04B001/16"
212411 1999 2000 "H04L012/28"
212411 1999 2000 "H04L012/56"
212411 1999 2000 "H04L029/06"
212411 1999 2000 "H04W036/08"
212411 1999 2000 "H04W056/00"
212411 1999 2000 "H04W084/12"
212411 1999 2000 "H04W088/08"
212411 1999 2000 "H04W092/02"
215797 1999 2000 "G09F009/00"
215797 1999 2000 "H01L033/00"
215797 1999 2000 "H05B033/08"
215797 1999 2000 "H05B037/02"
216053 1999 2000 "H04L012/66"
216053 1999 2000 "H04M003/00"
216053 1999 2000 "H04M007/00"
216053 1999 2000 "H04M007/06"
216053 1999 2000 "H04M007/12"
216053 1999 2000 "H04M011/00"
220700 2002 2003 "H05B041/24"
220700 2002 2003 "H05B041/292"
end
tempfile ipc
save `ipc'

* Example generated by -dataex-. To install: ssc install dataex
clear
input long appln_id str15 cpc_class
    10 "B01J  31/2409"
    10 "B01J2231/641"  
    10 "B01J2231/645"  
    10 "B01J2531/0266"
    10 "B01J2531/80"  
    10 "B01J2531/821"  
    10 "B01J2531/822"  
    10 "B01J2531/827"  
    10 "C07C 303/28"  
    10 "C07D 321/10"  
    10 "C07F   9/5027"
    10 "C07F   9/5072"
    10 "C07F   9/572"  
    10 "C07F   9/65515"
    10 "C07F  15/0013"
    10 "C07F  15/004"  
    10 "C07F  15/0053"
    10 "C07F  15/008"  
  3347 "H04B   7/2628"
  3347 "H04W  28/22"  
  3347 "H04W  36/16"  
  3347 "H04W  52/0216"
  3347 "H04W  52/0229"
  3347 "H04W  52/12"  
  3347 "H04W  52/16"  
  3347 "H04W  52/26"  
  3347 "H04W  52/267"  
  3347 "Y02D  30/70"  
 31572 "G06F   1/3203"
 31572 "G06F   1/3228"
 31572 "G06F   1/324"  
 31572 "G06F   1/3287"
 31572 "G06F   1/3296"
 31572 "Y02D  10/00"  
136847 "A61L   2/10"  
136847 "A61L   9/20"  
136847 "C02F   1/32"  
136847 "H01J  61/125"  
136847 "H01J  61/70"  
136847 "H01J  65/042"  
143931 "G06F   1/3209"
143931 "G06F   1/3215"
143931 "G06F   1/3259"
143931 "G06F   1/3271"
143931 "H04W  52/0287"
143931 "Y02D  10/00"  
143931 "Y02D  30/70"  
212411 "H04L  29/06"  
212411 "H04W  36/03"  
212411 "H04W  56/00"  
212411 "H04W  84/12"  
212411 "H04W  84/18"  
212411 "H04W  88/08"  
212411 "H04W  92/02"  
215797 "H05B  45/385"  
215797 "Y02B  20/30"  
216053 "H04L  12/66"  
216053 "H04M   7/0009"
216053 "H04M   7/12"  
216053 "H04M2207/18"  
216053 "H04M2207/20"  
220700 "H05B  41/2928"
220700 "Y02B  20/00"  
220700 "Y10S 315/05"  
220700 "Y10S 315/07"  
end
tempfile cpc
save `cpc'

use `cpc', clear
keep if substr(cpc_class, 1, 3) == "Y02"
rename cpc_class ipc
gen code_type = "cpc"
append using `ipc'
replace ipc = subinstr(ipc, " ", "", .)
replace code_type = "ipc" if missing(code_type)
gsort appln_id  ipc prio_year app_year -code_type
by appln_id ipc: assert prio_year == prio_year[1] | missing(prio_year)
by appln_id ipc: assert app_year == app_year[1] | missing(app_year)
by appln_id ipc: keep if _n == 1
tempfile combined_ipc_cpc_codes
save `combined_ipc_cpc_codes'

use `epo_data', clear
by appln_id (person_id), sort: gen _j = _n
reshape wide person_id reg_code, i(appln_id) j(_j)
merge 1:m appln_id using `combined_ipc_cpc_codes', keep(master match) nogenerate

There are a few loose ends here. The ipc code file has these prio_year and app_year variables that are absent in the cpc_code files. I have no idea if these variables are of any importance to you. Life is simpler if they are not, and in that case, I recommend you just drop them from the ipc_code data before the -append- takes place. If you need them, then the above code keeps them, after first verifying that they are constant for all observations of the same appln_id, but it means that for technologies that were found only in the cpc code file, those variables are missing. Next, you refer to different levels of disaggregation in the ipc and cpc codes, but I wasn't sure how to generalize your example. So I have done nothing with that, which may mean that there are some apparently different codes in the resulting file that are really the same thing, just at different levels of "aggregation." As I am unclear exactly what this "aggregation" is about, I have just left it at that and I hope that you can clean that up.

Try this and see if it looks right to you.

Once you have a final data set, post back with a -dataex- from that and we can get down to resolving the problem that started this thread off.

Comment

Doris Rivera

Join Date: Feb 2020
Posts: 172

01 Jul 2022, 11:45

Dear Clyde Schechter, here you will find the dataset in which I append CPC and IPC, and merge with EPO. Now I have all the technologies in the variable ipc_cpc (only listing as green those CPC "Y02"). I have tried hard to come out with a solution to the relatedness variable, but I do not have the Stata knowledge to continue by my own.

Basically my idea is first to expand the dataset by appln_id (number of obs in a patent -1). The hardest part is to create a variable with the first obs as the next technology classification (ipc_cpc) within a patent. Unfortunately I do know how to continue to fill the second obsetvation of variable (x1). The idea is to have two variables ipc_cpc and x1 similar in fashion to the table I show in #1.

Can you help me solve how to continue from here?
Here is my rudimentary code and the data:

Code:

sort person_id app_year appln_id ipc_cpc
bys person_id app_year appln_id (ipc_cpc): gen pat_obs1 = _n
bys person_id app_year appln_id (ipc_cpc): gen pat_obs2 = _N
sum pat_obs2
return list 
local c = r(max)

forvalues x = 1/`c' {
          expand `x'-1 if pat_obs2==`x'
          }

gen x1 = ""    
forvalues x = 1/`c' {
bys person_id app_year appln_id (ipc_cpc): replace x1 = ipc_cpc[`x'] if pat_obs2==`x' & x1[1]==""
}

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long person_id int app_year long appln_id str11(ipc2 ipc_cpc) float green
 1 2000 212411 "H04L012/28"  "H04L012/28"  0
 1 2000 212411 "H04L012/56"  "H04L012/56"  0
 1 2000 212411 "H04L029/06"  "H04L029/06"  0
 1 2000 212411 "H04W036/08"  "H04W036/08"  0
 1 2000 212411 "H04W056/00"  "H04W056/00"  0
 1 2000 212411 "H04W084/12"  "H04W084/12"  0
 1 2000 212411 "H04W088/08"  "H04W088/08"  0
 1 2000 212411 "H04W092/02"  "H04W092/02"  0
 1 2000 215797 "G09F009/00"  "G09F009/00"  0
 1 2000 215797 "H01L033/00"  "H01L033/00"  0
 1 2000 215797 "H05B033/08"  "H05B033/08"  0
 1 2000 215797 "H05B037/02"  "H05B037/02"  0
 1 2000 215797 ""            "Y02B020/30"  1
 1 2000 216053 "H04L012/66"  "H04L012/66"  0
 1 2000 216053 "H04M003/00"  "H04M003/00"  0
 1 2000 216053 "H04M007/00"  "H04M007/00"  0
 1 2000 216053 "H04M007/06"  "H04M007/06"  0
 1 2000 216053 "H04M007/12"  "H04M007/12"  0
 1 2000 216053 "H04M011/00"  "H04M011/00"  0
 1 2002  31572 "G06F001/32"  "G06F001/32"  0
 1 2002  31572 ""            "Y02D010/00"  1
 1 2003   3347 "H04B007/005" "H04B007/005" 0
 1 2003   3347 "H04B007/26"  "H04B007/26"  0
 1 2003   3347 "H04J001/00"  "H04J001/00"  0
 1 2003   3347 "H04J003/16"  "H04J003/16"  0
 1 2003   3347 "H04L012/56"  "H04L012/56"  0
 1 2003   3347 "H04W028/22"  "H04W028/22"  0
 1 2003   3347 "H04W036/16"  "H04W036/16"  0
 1 2003   3347 "H04W052/00"  "H04W052/00"  0
 1 2003   3347 "H04W052/02"  "H04W052/02"  0
 1 2003   3347 "H04W052/16"  "H04W052/16"  0
 1 2003   3347 "H04W052/26"  "H04W052/26"  0
 1 2003   3347 ""            "Y02D030/70"  1
20 2003 143931 "G06F001/32"  "G06F001/32"  0
20 2003 220700 "H05B041/24"  "H05B041/24"  0
21 2003 143931 "H04B001/16"  "H04B001/16"  0
21 2003 143931 ""            "Y02D010/00"  1
21 2003 143931 ""            "Y02D030/70"  1
21 2003 220700 "H05B041/292" "H05B041/292" 0
21 2003 220700 ""            "Y02B020/00"  1
21 2007 136847 "A61L009/20"  "A61L009/20"  0
21 2007 136847 "A61N005/06"  "A61N005/06"  0
21 2007 136847 "C02F001/32"  "C02F001/32"  0
21 2007 136847 "H01J061/12"  "H01J061/12"  0
48 2001     10 "B01J023/46"  "B01J023/46"  0
48 2001     10 "B01J031/24"  "B01J031/24"  0
48 2001     10 "C07B053/00"  "C07B053/00"  0
48 2001     10 "C07B057/00"  "C07B057/00"  0
48 2001     10 "C07B061/00"  "C07B061/00"  0
48 2001     10 "C07C043/23"  "C07C043/23"  0
48 2001     10 "C07C067/303" "C07C067/303" 0
48 2001     10 "C07C069/34"  "C07C069/34"  0
48 2001     10 "C07C303/28"  "C07C303/28"  0
48 2001     10 "C07C309/63"  "C07C309/63"  0
48 2001     10 "C07C309/65"  "C07C309/65"  0
48 2001     10 "C07D321/10"  "C07D321/10"  0
48 2001     10 "C07F009/50"  "C07F009/50"  0
48 2001     10 "C07F009/572" "C07F009/572" 0
48 2001     10 "C07F009/655" "C07F009/655" 0
48 2001     10 "C07F015/00"  "C07F015/00"  0
end

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30355

#10

01 Jul 2022, 12:21

Well, I don't think I would start that way, so I'll show you what I think is a better approach. The gist of it is to form a new data set consisting of all distinct pairs of patents. Then we can count for each patent, the number of distinct applications it occurs in, and for each pair, the number of applications that they both appear in. As I am still a bit uncertain exactly what you want, this may be off base, but I don't think it is:

Code:

isid appln_id ipc_cpc, sort
keep ipc_cpc green appln_id
preserve
rename (ipc_cpc appln_id green) =2
tempfile copy
save `copy'
restore
rename (ipc_cpc appln_id green) =1

cross using `copy' // FORM ALL POSSIBLE PAIRS
drop if ipc_cpc1 >= ipc_cpc2 // ELIMINATE SELF PAIRS AND PAIRS IN REVERSE ORDER
egen ipc_cpc_pair = concat(ipc_cpc1 ipc_cpc2), punct(",")  // CREATE A PAIR ID
duplicates drop

by ipc_cpc1, sort: egen patents_with_ipc_cpc1 = nvals(appln_id1)

gen byte same_appln = (appln_id1 == appln_id2)
isid ipc_cpc_pair appln_id1 appln_id2, sort
by ipc_cpc_pair: egen co_occurrences_this_pair = total(same_appln)
order ipc_cpc_pair patents_with_ipc_cpc1 co_occurrences_this_pair, first

Note: The -egen, nvals()- function is not part of official Stata; rather it is found in Nick Cox's -egenmore- package, available from SSC.

Comment

Doris Rivera

Join Date: Feb 2020

Posts: 172
#11

02 Jul 2022, 12:33

Dear Clyde Schechter, first thing first, thank you very much for your help because this is pretty much what I was looking for. However, let me comment you a couple of things for clarification:

First, my real data is really long (+16 million obs) and when trying the code it gives error

Code:

cross using `copy' // FORM ALL POSSIBLE PAIRS sum of expand values exceed 1,099,511,627,775 The dataset may not contain more than 1,099,511,627,775 observations.

I am thinking if it would be possible to do this analysis but for each year app_year (in fact what a I need) and store the results and then try to combine the collected similarity measures by years and merge that to the original dataset. If not, I have seen that the command "cross" does every combination of all the ipc_cpc codes. Is it possible to build using this command (cross) only for those combinations that appears in the patent? In fact, this is what is needed to build the similarity measure. This way it will need to build a lower number of combinations. Do you think is possible any of these solution? Or do you know a better solution for this.

Second, I do not fully understand why in the following tab the appln_id=10 have only 120 “1s” when it should be 240 (16*15 combinations of 1x1 from that patent). Also, look that appln_id=31572 have 5 “1s” but it should have only one combination of ipc_cpc (G06F001/32 and Y02D010/00). Also, why is it that “co_occurrences_this_pair” have also “2s”? Should not these duplications be eliminated from the line (duplicates drop)? Maybe I am looking at the wrong tab, but it seems strange to me. Any idea of what is happening?

Code:

tab appln_id1 co_occurrences_this_pair | co_occurrences_this_pair appln_id1 | 0 1 2 | Total -----------+---------------------------------+---------- 10 | 658 120 0 | 778 3347 | 171 87 0 | 258 31572 | 34 5 2 | 41 136847 | 203 6 0 | 209 143931 | 65 9 2 | 76 212411 | 112 37 0 | 149 215797 | 85 10 0 | 95 216053 | 127 15 0 | 142 220700 | 15 3 0 | 18 -----------+---------------------------------+---------- Total | 1,470 292 4 | 1,766
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30355

#12

02 Jul 2022, 14:14

Is it possible to build using this command (cross) only for those combinations that appears in the patent?

I don't think so. I don't even understand what "the patent" means in this context. The whole idea is to form all pairs of technologies, and then identify how many patents the pair co-occurs in. I don't know how many distinct technologies are in your data set, but if we call that number T, in the end we need a data set containing all distinct pairs of technologies (order irrelevant), so the resulting data set must contain T*(T-1)/2 observations. Moreover, this is a best-case scenario, that applies only if each pair occurs only in a single year. If you do it separately by year, then multiply that by the average number of years in which a given pair of technologies will both exist in some patent in a year. Now, T is less than the size of the data set, to be clear. But before proceeding, you should find out what T is, and do that calculation to see if this is feasible at all.

Assuming that it is feasible, here's a different way to get a data set of co-occurrences that will be memory-sparing. (The price one pays for this is that it will be very slow. I have tried to, within the constraint of minimizing memory use, make it as fast as I can, but you're going to set your expectations for a very long run-time on this.)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear*
input long person_id int app_year long appln_id str11(ipc2 ipc_cpc) float green
 1 2000 212411 "H04L012/28"  "H04L012/28"  0
 1 2000 212411 "H04L012/56"  "H04L012/56"  0
 1 2000 212411 "H04L029/06"  "H04L029/06"  0
 1 2000 212411 "H04W036/08"  "H04W036/08"  0
 1 2000 212411 "H04W056/00"  "H04W056/00"  0
 1 2000 212411 "H04W084/12"  "H04W084/12"  0
 1 2000 212411 "H04W088/08"  "H04W088/08"  0
 1 2000 212411 "H04W092/02"  "H04W092/02"  0
 1 2000 215797 "G09F009/00"  "G09F009/00"  0
 1 2000 215797 "H01L033/00"  "H01L033/00"  0
 1 2000 215797 "H05B033/08"  "H05B033/08"  0
 1 2000 215797 "H05B037/02"  "H05B037/02"  0
 1 2000 215797 ""            "Y02B020/30"  1
 1 2000 216053 "H04L012/66"  "H04L012/66"  0
 1 2000 216053 "H04M003/00"  "H04M003/00"  0
 1 2000 216053 "H04M007/00"  "H04M007/00"  0
 1 2000 216053 "H04M007/06"  "H04M007/06"  0
 1 2000 216053 "H04M007/12"  "H04M007/12"  0
 1 2000 216053 "H04M011/00"  "H04M011/00"  0
 1 2002  31572 "G06F001/32"  "G06F001/32"  0
 1 2002  31572 ""            "Y02D010/00"  1
 1 2003   3347 "H04B007/005" "H04B007/005" 0
 1 2003   3347 "H04B007/26"  "H04B007/26"  0
 1 2003   3347 "H04J001/00"  "H04J001/00"  0
 1 2003   3347 "H04J003/16"  "H04J003/16"  0
 1 2003   3347 "H04L012/56"  "H04L012/56"  0
 1 2003   3347 "H04W028/22"  "H04W028/22"  0
 1 2003   3347 "H04W036/16"  "H04W036/16"  0
 1 2003   3347 "H04W052/00"  "H04W052/00"  0
 1 2003   3347 "H04W052/02"  "H04W052/02"  0
 1 2003   3347 "H04W052/16"  "H04W052/16"  0
 1 2003   3347 "H04W052/26"  "H04W052/26"  0
 1 2003   3347 ""            "Y02D030/70"  1
20 2003 143931 "G06F001/32"  "G06F001/32"  0
20 2003 220700 "H05B041/24"  "H05B041/24"  0
21 2003 143931 "H04B001/16"  "H04B001/16"  0
21 2003 143931 ""            "Y02D010/00"  1
21 2003 143931 ""            "Y02D030/70"  1
21 2003 220700 "H05B041/292" "H05B041/292" 0
21 2003 220700 ""            "Y02B020/00"  1
21 2007 136847 "A61L009/20"  "A61L009/20"  0
21 2007 136847 "A61N005/06"  "A61N005/06"  0
21 2007 136847 "C02F001/32"  "C02F001/32"  0
21 2007 136847 "H01J061/12"  "H01J061/12"  0
48 2001     10 "B01J023/46"  "B01J023/46"  0
48 2001     10 "B01J031/24"  "B01J031/24"  0
48 2001     10 "C07B053/00"  "C07B053/00"  0
48 2001     10 "C07B057/00"  "C07B057/00"  0
48 2001     10 "C07B061/00"  "C07B061/00"  0
48 2001     10 "C07C043/23"  "C07C043/23"  0
48 2001     10 "C07C067/303" "C07C067/303" 0
48 2001     10 "C07C069/34"  "C07C069/34"  0
48 2001     10 "C07C303/28"  "C07C303/28"  0
48 2001     10 "C07C309/63"  "C07C309/63"  0
48 2001     10 "C07C309/65"  "C07C309/65"  0
48 2001     10 "C07D321/10"  "C07D321/10"  0
48 2001     10 "C07F009/50"  "C07F009/50"  0
48 2001     10 "C07F009/572" "C07F009/572" 0
48 2001     10 "C07F009/655" "C07F009/655" 0
48 2001     10 "C07F015/00"  "C07F015/00"  0
end

isid ipc_cpc appln_id
tempfile original
save `original', replace

keep ipc_cpc app_year
duplicates drop
tempfile copy
save `copy'
rename ipc_cpc ipc_cpc_1
joinby app_year using `copy'
rename ipc_cpc ipc_cpc_2
drop if ipc_cpc_2 <= ipc_cpc_1
gen original = `"`original'"'

capture program drop one_pair_year
program define one_pair_year
    local code1 = ipc_cpc_1[1]
    local code2 = ipc_cpc_2[1]
    local app_year = app_year[1]
    local original = original[1]
    use ipc_cpc appln_id app_year using `original' if app_year == `app_year' ///
        & inlist(ipc_cpc, `"`code1'"', `"`code2'"'), clear
    by appln_id, sort: egen has_code1 = max(ipc_cpc == `"`code1'"')
    by appln_id: egen has_code2 = max(ipc_cpc == `"`code2'"')
    keep if has_code1 & has_code2
    by appln_id: keep if _n == 1
    count
    local co_occurrences = r(N)
    clear
    set obs 1
    gen ipc_cpc_1 = `"`code1'"'
    gen ipc_cpc_2 = `"`code2'"'
    gen app_year = `app_year'
    gen co_ocurrences = `co_occurrences'
    exit
end

runby one_pair_year, by(ipc_cpc_1 ipc_cpc_2 app_year) status

-runby- is written by Robert Picard and me, and is available from SSC.

The slow points in this will be the -use- and -save- commands, -joinby-, and -runby-. -runby- will give you a periodic progress report showing how much it has finished and an estimate of the remaining time.

Note that if this -joinby- command gives you an error message similar to what you got from -cross-, I would conclude that this problem simply cannot be solved in Stata with a data set this large, because at no point will -joinby- require more observations than the minimum number needed as described in my first paragraph in this post.

At the end of this code the data in active memory will consist of pairs of ipc/cpc codes and years in which either code in the pair appears in such patents, followed by the number of patents from that app_year in which both members of the pair are found.

Second, I do not fully understand why in the following tab...

I don't think that table makes sense. There is nothing meaningful about the number of co-occurrences per patent. We are calculating co-occurrences of pairs of technology. With the revised code, you won't even be able to run a table per patent, because the final data set doesn't even have any patents in it. It's just a count of co-occurrences for each technology pair in each year.

Comment

Doris Rivera

Join Date: Feb 2020

Posts: 172
#13

02 Jul 2022, 16:23

Dear Clyde Schechter, thanks again for all your assistance, help, and patience. And thanks for this second code. Unfortunately, as you commented in #12, joinby command gives error:

Code:

. joinby app_year using `copy' op. sys. refuses to provide memory

I have re-done this analysis only for a single country just to see if with a smaller sample it run, but the same error appears. In fact, after the -duplicates drop- in your last code (before joinby) I have a considerable smaller number of obs (787300), but still the error appears.
I cannot figure out how people working with patent data (PATSTAT or EPO) then can handle this in Stata. Maybe it is because they focus on the aggregation at regional level (even though relatedness measures are done by co-occurrences in technologies in their papers). Maybe Julio Raffo, that work with patent data would like to share his opinion.
Anyway, I thank you very much for your help.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#14

02 Jul 2022, 16:40

Well, actually, this error message isn't quite the same thing. It's not saying that the resulting data set would be too large for Stata to work with. It's saying that the operating system won't give you enough memory to create it. This might be solved by moving to a computer with more RAM.

Another thought would be to run just one year's worth of data, and if that works out, do that over again, one year at a time, and then append all the results together at the end.

It's also possible there is a more memory-conserving way to do this that I've overlooked. I'll be interested to see if Julio Raffo responds with something more workable.
Comment

Announcement

Computing a relatedness/similarity measure with patents

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment