Dear Statalist, I am working with patent data (EPO) and I would like to address a measure of the relatedness in green technologies with the other technologies in a given firm (person_id). Basically, each firm could appear in the sample in several years (or even several times in a given year if it patents more than once in the same year). The patent data is merged with the technological classification IPC and with other technological classification CPC (reason why you may find that a patent done by a single firm in a single IPC technology may have more than one row in the sample, check appln_id==143931 below).
I would need your help to count in how many patents (appln_id) a green technology appears with any other technology (co-occurrence). In my opinion this could be done with a matrix in which rows could be those technologies (ipc_cpc) with a green classification (green==1), while the columns would be all the technologies. The idea would be to put in the cell the number of times two technologies (row and column) appear in the dataset.
Another possible way (usually done in the literature) is to build two vectors (columns) in which the first one would be the green technologies and the second one would be all the technologies, something like the next table (assuming as green tech 3,5,6): in which you put the number of co-occurrence in another column for each combination.
Is it possible to build something like this for each period (app_year)?
A further complication comes when merging back the obtained measure of similarity with the original dataset in the example below. Any idea?
Please, if there is anyone who have dealt with something similar to this, or have any experience in working with patent data and have a better way of building this relatedness/similarity measure for firms, it will be highly appreciated some hints. Maybe Julio Raffo might want to share some ideas
In the dataex: person_id is firm id; app_year is year; appln_id is patent id; ipc2 cpc ipc_cpc are the technological classifications; green divides patents between green or non-green.
Many thanks in advance!
I would need your help to count in how many patents (appln_id) a green technology appears with any other technology (co-occurrence). In my opinion this could be done with a matrix in which rows could be those technologies (ipc_cpc) with a green classification (green==1), while the columns would be all the technologies. The idea would be to put in the cell the number of times two technologies (row and column) appear in the dataset.
Another possible way (usually done in the literature) is to build two vectors (columns) in which the first one would be the green technologies and the second one would be all the technologies, something like the next table (assuming as green tech 3,5,6): in which you put the number of co-occurrence in another column for each combination.
| Green tech | All tech | Co-occurrence |
| 3 | 1 | |
| 3 | 2 | |
| 3 | 4 | |
| 3 | 5 | |
| … | … | |
| 5 | 1 | |
| 5 | 2 |
A further complication comes when merging back the obtained measure of similarity with the original dataset in the example below. Any idea?
Please, if there is anyone who have dealt with something similar to this, or have any experience in working with patent data and have a better way of building this relatedness/similarity measure for firms, it will be highly appreciated some hints. Maybe Julio Raffo might want to share some ideas

In the dataex: person_id is firm id; app_year is year; appln_id is patent id; ipc2 cpc ipc_cpc are the technological classifications; green divides patents between green or non-green.
Many thanks in advance!
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input long person_id int app_year long appln_id str13 ipc2 str15(cpc_class ipc_cpc) float green 1 2000 212411 "H04L012/28" "H04W 92/02" "H04L012/28" 0 1 2000 212411 "H04L012/56" "H04W 84/12" "H04L012/56" 0 1 2000 212411 "H04L029/06" "H04W 36/03" "H04L029/06" 0 1 2000 212411 "H04W036/08" "H04W 88/08" "H04W036/08" 0 1 2000 212411 "H04W056/00" "H04W 56/00" "H04W056/00" 0 1 2000 212411 "H04W084/12" "H04W 92/02" "H04W084/12" 0 1 2000 212411 "H04W088/08" "H04W 84/18" "H04W088/08" 0 1 2000 212411 "H04W092/02" "H04L 29/06" "H04W092/02" 0 1 2000 215797 "G09F009/00" "H05B 45/385" "G09F009/00" 0 1 2000 215797 "H01L033/00" "Y02B 20/30" "Y02B020/30" 1 1 2000 215797 "H05B033/08" "Y02B 20/30" "Y02B020/30" 1 1 2000 215797 "H05B037/02" "Y02B 20/30" "Y02B020/30" 1 1 2000 216053 "H04L012/66" "H04M2207/20" "H04L012/66" 0 1 2000 216053 "H04M003/00" "H04M 7/0009" "H04M003/00" 0 1 2000 216053 "H04M007/00" "H04M 7/12" "H04M007/00" 0 1 2000 216053 "H04M007/06" "H04M2207/20" "H04M007/06" 0 1 2000 216053 "H04M007/12" "H04L 12/66" "H04M007/12" 0 1 2000 216053 "H04M011/00" "H04M2207/18" "H04M011/00" 0 1 2002 31572 "G06F001/32" "G06F 1/3228" "G06F001/32" 0 1 2002 31572 "G06F001/32" "G06F 1/3203" "G06F001/32" 0 1 2002 31572 "G06F001/32" "G06F 1/3287" "G06F001/32" 0 1 2002 31572 "G06F001/32" "G06F 1/3296" "G06F001/32" 0 1 2002 31572 "G06F001/32" "Y02D 10/00" "Y02D010/00" 1 1 2002 31572 "G06F001/32" "G06F 1/324" "G06F001/32" 0 1 2003 3347 "H04B007/005" "H04W 52/16" "H04B007/005" 0 1 2003 3347 "H04B007/26" "Y02D 30/70" "Y02D030/70" 1 1 2003 3347 "H04J001/00" "Y02D 30/70" "Y02D030/70" 1 1 2003 3347 "H04J003/16" "H04W 52/26" "H04J003/16" 0 1 2003 3347 "H04L012/56" "H04W 28/22" "H04L012/56" 0 1 2003 3347 "H04W028/22" "H04W 52/267" "H04W028/22" 0 1 2003 3347 "H04W036/16" "H04W 52/12" "H04W036/16" 0 1 2003 3347 "H04W052/00" "H04W 52/0229" "H04W052/00" 0 1 2003 3347 "H04W052/02" "H04W 52/0216" "H04W052/02" 0 1 2003 3347 "H04W052/16" "H04B 7/2628" "H04W052/16" 0 1 2003 3347 "H04W052/26" "H04W 36/16" "H04W052/26" 0 20 2003 143931 "H04B001/16" "G06F 1/3271" "H04B001/16" 0 20 2003 143931 "H04B001/16" "Y02D 30/70" "Y02D030/70" 1 20 2003 143931 "H04B001/16" "H04W 52/0287" "H04B001/16" 0 20 2003 143931 "H04B001/16" "G06F 1/3215" "H04B001/16" 0 20 2003 143931 "H04B001/16" "Y02D 10/00" "Y02D010/00" 1 20 2003 143931 "H04B001/16" "G06F 1/3259" "H04B001/16" 0 20 2003 220700 "H05B041/292" "Y02B 20/00" "Y02B020/00" 1 20 2003 220700 "H05B041/292" "Y10S 315/07" "H05B041/292" 0 20 2003 220700 "H05B041/292" "Y10S 315/05" "H05B041/292" 0 21 2003 143931 "G06F001/32" "G06F 1/3209" "G06F001/32" 0 21 2003 220700 "H05B041/24" "H05B 41/2928" "H05B041/24" 0 21 2007 136847 "A61L009/20" "H01J 61/70" "A61L009/20" 0 21 2007 136847 "A61L009/20" "H01J 61/125" "A61L009/20" 0 21 2007 136847 "A61L009/20" "H01J 65/042" "A61L009/20" 0 21 2007 136847 "A61N005/06" "C02F 1/32" "A61N005/06" 0 21 2007 136847 "C02F001/32" "A61L 2/10" "C02F001/32" 1 21 2007 136847 "H01J061/12" "A61L 9/20" "H01J061/12" 0 48 2001 10 "B01J023/46" "C07F 15/004" "B01J023/46" 1 48 2001 10 "B01J023/46" "C07F 15/0053" "B01J023/46" 1 48 2001 10 "B01J023/46" "C07F 15/008" "B01J023/46" 1 48 2001 10 "B01J031/24" "B01J2531/80" "B01J031/24" 0 48 2001 10 "C07B053/00" "C07F 15/0013" "C07B053/00" 0 48 2001 10 "C07B057/00" "B01J2531/821" "C07B057/00" 0 48 2001 10 "C07B061/00" "B01J2531/822" "C07B061/00" 0 48 2001 10 "C07C043/23" "C07C 303/28" "C07C043/23" 0 48 2001 10 "C07C067/303" "C07F 9/5072" "C07C067/303" 0 48 2001 10 "C07C069/34" "B01J 31/2409" "C07C069/34" 0 48 2001 10 "C07C303/28" "C07F 9/572" "C07C 303/28" 0 48 2001 10 "C07C309/63" "C07F 9/65515" "C07C 309/63" 0 48 2001 10 "C07C309/65" "C07F 9/5027" "C07C 309/65" 0 48 2001 10 "C07D321/10" "B01J2531/0266" "C07D 321/10" 0 48 2001 10 "C07F009/50" "C07D 321/10" "C07F009/50" 0 48 2001 10 "C07F009/572" "B01J2531/827" "C07F009/572" 0 48 2001 10 "C07F009/655" "B01J2231/645" "C07F009/655" 0 48 2001 10 "C07F015/00" "B01J2231/641" "C07F015/00" 0 end

Comment