Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Computing a relatedness/similarity measure with patents

    Dear Statalist, I am working with patent data (EPO) and I would like to address a measure of the relatedness in green technologies with the other technologies in a given firm (person_id). Basically, each firm could appear in the sample in several years (or even several times in a given year if it patents more than once in the same year). The patent data is merged with the technological classification IPC and with other technological classification CPC (reason why you may find that a patent done by a single firm in a single IPC technology may have more than one row in the sample, check appln_id==143931 below).

    I would need your help to count in how many patents (appln_id) a green technology appears with any other technology (co-occurrence). In my opinion this could be done with a matrix in which rows could be those technologies (ipc_cpc) with a green classification (green==1), while the columns would be all the technologies. The idea would be to put in the cell the number of times two technologies (row and column) appear in the dataset.

    Another possible way (usually done in the literature) is to build two vectors (columns) in which the first one would be the green technologies and the second one would be all the technologies, something like the next table (assuming as green tech 3,5,6): in which you put the number of co-occurrence in another column for each combination.
    Green tech All tech Co-occurrence
    3 1
    3 2
    3 4
    3 5
    5 1
    5 2
    Is it possible to build something like this for each period (app_year)?

    A further complication comes when merging back the obtained measure of similarity with the original dataset in the example below. Any idea?

    Please, if there is anyone who have dealt with something similar to this, or have any experience in working with patent data and have a better way of building this relatedness/similarity measure for firms, it will be highly appreciated some hints. Maybe Julio Raffo might want to share some ideas
    In the dataex: person_id is firm id; app_year is year; appln_id is patent id; ipc2 cpc ipc_cpc are the technological classifications; green divides patents between green or non-green.

    Many thanks in advance!

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long person_id int app_year long appln_id str13 ipc2 str15(cpc_class ipc_cpc) float green
     1 2000 212411 "H04L012/28"  "H04W  92/02"    "H04L012/28"  0
     1 2000 212411 "H04L012/56"  "H04W  84/12"    "H04L012/56"  0
     1 2000 212411 "H04L029/06"  "H04W  36/03"    "H04L029/06"  0
     1 2000 212411 "H04W036/08"  "H04W  88/08"    "H04W036/08"  0
     1 2000 212411 "H04W056/00"  "H04W  56/00"    "H04W056/00"  0
     1 2000 212411 "H04W084/12"  "H04W  92/02"    "H04W084/12"  0
     1 2000 212411 "H04W088/08"  "H04W  84/18"    "H04W088/08"  0
     1 2000 212411 "H04W092/02"  "H04L  29/06"    "H04W092/02"  0
     1 2000 215797 "G09F009/00"  "H05B  45/385"   "G09F009/00"  0
     1 2000 215797 "H01L033/00"  "Y02B  20/30"    "Y02B020/30"  1
     1 2000 215797 "H05B033/08"  "Y02B  20/30"    "Y02B020/30"  1
     1 2000 215797 "H05B037/02"  "Y02B  20/30"    "Y02B020/30"  1
     1 2000 216053 "H04L012/66"  "H04M2207/20"    "H04L012/66"  0
     1 2000 216053 "H04M003/00"  "H04M   7/0009"  "H04M003/00"  0
     1 2000 216053 "H04M007/00"  "H04M   7/12"    "H04M007/00"  0
     1 2000 216053 "H04M007/06"  "H04M2207/20"    "H04M007/06"  0
     1 2000 216053 "H04M007/12"  "H04L  12/66"    "H04M007/12"  0
     1 2000 216053 "H04M011/00"  "H04M2207/18"    "H04M011/00"  0
     1 2002  31572 "G06F001/32"  "G06F   1/3228"  "G06F001/32"  0
     1 2002  31572 "G06F001/32"  "G06F   1/3203"  "G06F001/32"  0
     1 2002  31572 "G06F001/32"  "G06F   1/3287"  "G06F001/32"  0
     1 2002  31572 "G06F001/32"  "G06F   1/3296"  "G06F001/32"  0
     1 2002  31572 "G06F001/32"  "Y02D  10/00"    "Y02D010/00"  1
     1 2002  31572 "G06F001/32"  "G06F   1/324"   "G06F001/32"  0
     1 2003   3347 "H04B007/005" "H04W  52/16"    "H04B007/005" 0
     1 2003   3347 "H04B007/26"  "Y02D  30/70"    "Y02D030/70"  1
     1 2003   3347 "H04J001/00"  "Y02D  30/70"    "Y02D030/70"  1
     1 2003   3347 "H04J003/16"  "H04W  52/26"    "H04J003/16"  0
     1 2003   3347 "H04L012/56"  "H04W  28/22"    "H04L012/56"  0
     1 2003   3347 "H04W028/22"  "H04W  52/267"   "H04W028/22"  0
     1 2003   3347 "H04W036/16"  "H04W  52/12"    "H04W036/16"  0
     1 2003   3347 "H04W052/00"  "H04W  52/0229"  "H04W052/00"  0
     1 2003   3347 "H04W052/02"  "H04W  52/0216"  "H04W052/02"  0
     1 2003   3347 "H04W052/16"  "H04B   7/2628"  "H04W052/16"  0
     1 2003   3347 "H04W052/26"  "H04W  36/16"    "H04W052/26"  0
    20 2003 143931 "H04B001/16"  "G06F   1/3271"  "H04B001/16"  0
    20 2003 143931 "H04B001/16"  "Y02D  30/70"    "Y02D030/70"  1
    20 2003 143931 "H04B001/16"  "H04W  52/0287"  "H04B001/16"  0
    20 2003 143931 "H04B001/16"  "G06F   1/3215"  "H04B001/16"  0
    20 2003 143931 "H04B001/16"  "Y02D  10/00"    "Y02D010/00"  1
    20 2003 143931 "H04B001/16"  "G06F   1/3259"  "H04B001/16"  0
    20 2003 220700 "H05B041/292" "Y02B  20/00"    "Y02B020/00"  1
    20 2003 220700 "H05B041/292" "Y10S 315/07"    "H05B041/292" 0
    20 2003 220700 "H05B041/292" "Y10S 315/05"    "H05B041/292" 0
    21 2003 143931 "G06F001/32"  "G06F   1/3209"  "G06F001/32"  0
    21 2003 220700 "H05B041/24"  "H05B  41/2928"  "H05B041/24"  0
    21 2007 136847 "A61L009/20"  "H01J  61/70"    "A61L009/20"  0
    21 2007 136847 "A61L009/20"  "H01J  61/125"   "A61L009/20"  0
    21 2007 136847 "A61L009/20"  "H01J  65/042"   "A61L009/20"  0
    21 2007 136847 "A61N005/06"  "C02F   1/32"    "A61N005/06"  0
    21 2007 136847 "C02F001/32"  "A61L   2/10"    "C02F001/32"  1
    21 2007 136847 "H01J061/12"  "A61L   9/20"    "H01J061/12"  0
    48 2001     10 "B01J023/46"  "C07F  15/004"   "B01J023/46"  1
    48 2001     10 "B01J023/46"  "C07F  15/0053"  "B01J023/46"  1
    48 2001     10 "B01J023/46"  "C07F  15/008"   "B01J023/46"  1
    48 2001     10 "B01J031/24"  "B01J2531/80"    "B01J031/24"  0
    48 2001     10 "C07B053/00"  "C07F  15/0013"  "C07B053/00"  0
    48 2001     10 "C07B057/00"  "B01J2531/821"   "C07B057/00"  0
    48 2001     10 "C07B061/00"  "B01J2531/822"   "C07B061/00"  0
    48 2001     10 "C07C043/23"  "C07C 303/28"    "C07C043/23"  0
    48 2001     10 "C07C067/303" "C07F   9/5072"  "C07C067/303" 0
    48 2001     10 "C07C069/34"  "B01J  31/2409"  "C07C069/34"  0
    48 2001     10 "C07C303/28"  "C07F   9/572"   "C07C 303/28" 0
    48 2001     10 "C07C309/63"  "C07F   9/65515" "C07C 309/63" 0
    48 2001     10 "C07C309/65"  "C07F   9/5027"  "C07C 309/65" 0
    48 2001     10 "C07D321/10"  "B01J2531/0266"  "C07D 321/10" 0
    48 2001     10 "C07F009/50"  "C07D 321/10"    "C07F009/50"  0
    48 2001     10 "C07F009/572" "B01J2531/827"   "C07F009/572" 0
    48 2001     10 "C07F009/655" "B01J2231/645"   "C07F009/655" 0
    48 2001     10 "C07F015/00"  "B01J2231/641"   "C07F015/00"  0
    end

  • #2
    I didn't understand your explanation. But if all you need is an indicator for whether or not a patent application (appln_id) contains both green and non-green technologies, you can do that with
    Code:
    assert !missing(green)
    by appln_id (green), sort: gen byte has_both = (green[1] != green[_N])

    Comment


    • #3
      Dear Clyde Schechter, thanks very much for your help. Unfortunately this is not what I need.
      Briefly, I need to know if a firm's technology is close (similar) to a Green technology. For that, I need to count the number of times two given technologies (variable ipc_cpc) appear in each of the different patents (appln_id). This assume that if two technologies appear too much in many patents, it is said that those two technologies are related, since firms will use them in combinations many times.
      Thus, I want to perform a measure at the firm level (for each period) of how similar/related is a given technology of a firm to a green technology. For that I need to first count how many times in combinations of two, two technologies (one of them Green) appear in the data for each year.

      I think that building a matrix with rows as non-green technologies and columns as green technologies would help me to know the specific combinations. But do not know how to count them as it is right now in the dataset. However, I have read that other possible way in the literature is to build two variables: one for the green tech and other for the non-green tech as I show in the table in #1, and then add a third variable which is the number of time such a combination appear in a given year.

      For instance, in appln_id=212411 (first one), non of the technologies are green. In the second patent, technology "Y02B020/30" (Green) appears with "G09F009/00" which would imply a count of 1 for such a combination. If this combination appear in another patent, it would count as 2... And this should be done for all the patent looking for combinations with green patents, and counting how many times they appear in that year.

      I am very sorry for the mess in my explanation, I really hope now is easier to understand.

      Comment


      • #4
        Well, I think I understand it a little bit better. Basically, for each pair of technologies that appear anywhere in your data set, you want to identify the frequency with which the co-occur in the same application.

        My problem now, is that I am completely confused by how we identify a distinct technology. You have three variables, ipc2, cpc_class, and ipc_cpc. But I cannot discern what the relationships among them are. Does the combination of ipc2 and cpc_class identify a single distinct technology? (In the example data, when two observations agree on ipc2 and cpc_class they also agree on ipc_cpc. But I don't want to presume this is always the case. Also, in #3 you refer to technology "Y02B020/30." This value is found only in variable ipc_cpc. But when I examine the observations that have ipc_cpc == "Y02B020/30,", they have different values of ipc2. So is it only ipc_cpc that defines a technology, and ipc2 and cpc_class are irrelvant? Or is there something more complicated going on? Please clarify for me how the variables are used to identify a single technology.

        Comment


        • #5
          Dear Clyde Schechter, you have touched a good point here. Basically, to identify green tech you need both, IPC and CPC codes. So, there may be patents that are not listed as green using their IPC codes, but instead they can be classified as green using CPC (CPC codes are more dissaggregated classification, but unfortunately, the OECD environmental classification is done with the two codes).

          So, what I did was to merge IPC with CPC using the appln_id. However, as there is not other variable that helps to fine tune the identification of cases, I was forced to used m:m merging. So, I am sure that each time I redo the merge, the IPC and CPC in a given patent (ex. appln_id = 215797) will change possition within the patent (something I will have to solve).

          I have built the ipc_cpc variable as the codes in the ipc but only changing those ipc codes in the lines in which cpc are green. I understand this is not the best way, since for instance, it could be that those substituted ipc codes might also serve as co-occurrences with green tech. Maybe the best way could be to add a different observation for those CPC codes in each patent (maybe appending the cpc observations to a given patent instead of merge).

          In any case, lets assume for now that my ipc_cpc is the good one for identifying the technology of patents.

          Well, I think I understand it a little bit better. Basically, for each pair of technologies that appear anywhere in your data set, you want to identify the frequency with which the co-occur in the same application
          It is pretty much as you say, but notice that a given technology should not be repeated in a given patent (here it happens because my bad merging process). So, teoretically, two technologies can only be within a patent once, or cero otherwise. So, the idea is to observe in how many patents they appear 1 and sum all those 1s.

          I hope this will help you to understand what am I dealing with. I am very grateful for your help with this complicated issue!

          Comment


          • #6
            I was forced to used m:m merging
            OK, this means that what you have is data salad. -merge m:m- produces results that are meaningless and useless in almost all circumstances. I have been using Stata pretty much every day since 1994 and only once have I encountered a situation where -merge m:m- would produce something usable. And even then, there was a better way to do it. So basically, you should write on the blackboard 500 times "I will not use merge m:m, ever."

            More to the point, it means that there is no point trying to go further with this data set. It is already corrupted and nothing good will come of working with it.

            Let's get the data fixed up. Why don't you post examples from the data sets that you merged to create what you showed in your example, and explain what the variables are and how they are supposed to match up, and we can figure out the right way to combine them to get correct results. After we've done that, we can figure out how to get your similarity variable.

            Comment


            • #7
              Dear Clyde Schechter, I show you three dataex. The first one is the patent data, in which you usually have one line per patent (unless you have more than one firm working in the patent). The second dataex is the IPC codes, in which you have several observations per patent each listing one of the technologies the patent refers (the same code cannot be repeated in the same patent, and only appears once). The third dataex is the CPC one, in which it is basically the same, but with possibly more disaggregation of technologies which may result in even more rows per patent.

              As you see, the only common variable is appln-id. However, some of the IPC codes are the same in the CPC ones (see appln_id=10 IPC “C07F009/572” and CPC “C07F009/572”) while some disaggregates more the IPC one (see appln_id=10 IPC “B01J031/24” and CPC “B01J031/2409”).

              Since from the CPC codes I am only interested in codes “Y02” which is the code classifying a patent as green in the CPC data. Ideally, I think the best would be to append the observations CPC (Y02) codes only to the IPC data. And then build a tech variable with the different codes from IPC and CPC (Y02) within each patent. In this case, for each patent I will have all the IPC codes plus only the CPC (Y02) as other observations within the patent. And finally, merge this with the EPO patent data (using appln_id).

              Because the structure of the IPC and CPC codes is X##X###/ (being X a letter and # a number) I use the following commands for eliminating spaces and for identify all CPC “Y02”:

              Code:
              gen ipc2 = ipc
              replace ipc2 = subinstr(ipc2, " ", "", .)
              gen new_cpc3 = substr(cpc_class,1,3)
              I really hope this clarify how to build the technology variable for doing the relatedness measure.

              EPO Patent data
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input long(appln_id person_id) str7 reg_code
                  10 48 "DEA1C"
                3347  1 "FI1B1"
               31572  1 "FI1B1"
              136847 21 "NL414"
              143931 20 "DE600"
              143931 21 "NL414"
              212411  1 "FI1B1"
              215797  1 "FI1B1"
              216053  1 "FI1B1"
              220700 20 "DE600"
              220700 21 "NL414"
              end
              IPC Classifications
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input long appln_id int(prio_year app_year) str13 ipc
                  10 2000 2001 "B01J023/46" 
                  10 2000 2001 "B01J031/24" 
                  10 2000 2001 "C07B053/00" 
                  10 2000 2001 "C07B057/00" 
                  10 2000 2001 "C07B061/00" 
                  10 2000 2001 "C07C 303/28"
                  10 2000 2001 "C07C 309/63"
                  10 2000 2001 "C07C 309/65"
                  10 2000 2001 "C07C043/23" 
                  10 2000 2001 "C07C067/303"
                  10 2000 2001 "C07C069/34" 
                  10 2000 2001 "C07D 321/10"
                  10 2000 2001 "C07F009/50" 
                  10 2000 2001 "C07F009/572"
                  10 2000 2001 "C07F009/655"
                  10 2000 2001 "C07F015/00" 
                3347 2002 2003 "H04B007/005"
                3347 2002 2003 "H04B007/26" 
                3347 2002 2003 "H04J001/00" 
                3347 2002 2003 "H04J003/16" 
                3347 2002 2003 "H04L012/56" 
                3347 2002 2003 "H04W028/22" 
                3347 2002 2003 "H04W036/16" 
                3347 2002 2003 "H04W052/00" 
                3347 2002 2003 "H04W052/02" 
                3347 2002 2003 "H04W052/16" 
                3347 2002 2003 "H04W052/26" 
               31572 2001 2002 "G06F001/32" 
              136847 2006 2007 "A61L009/20" 
              136847 2006 2007 "A61N005/06" 
              136847 2006 2007 "C02F001/32" 
              136847 2006 2007 "H01J061/12" 
              143931 2002 2003 "G06F001/32" 
              143931 2002 2003 "H04B001/16" 
              212411 1999 2000 "H04L012/28" 
              212411 1999 2000 "H04L012/56" 
              212411 1999 2000 "H04L029/06" 
              212411 1999 2000 "H04W036/08" 
              212411 1999 2000 "H04W056/00" 
              212411 1999 2000 "H04W084/12" 
              212411 1999 2000 "H04W088/08" 
              212411 1999 2000 "H04W092/02" 
              215797 1999 2000 "G09F009/00" 
              215797 1999 2000 "H01L033/00" 
              215797 1999 2000 "H05B033/08" 
              215797 1999 2000 "H05B037/02" 
              216053 1999 2000 "H04L012/66" 
              216053 1999 2000 "H04M003/00" 
              216053 1999 2000 "H04M007/00" 
              216053 1999 2000 "H04M007/06" 
              216053 1999 2000 "H04M007/12" 
              216053 1999 2000 "H04M011/00" 
              220700 2002 2003 "H05B041/24" 
              220700 2002 2003 "H05B041/292"
              end
              CPC Classifications
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input long appln_id str15 cpc_class
                  10 "B01J  31/2409" 
                  10 "B01J2231/641"  
                  10 "B01J2231/645"  
                  10 "B01J2531/0266" 
                  10 "B01J2531/80"   
                  10 "B01J2531/821"  
                  10 "B01J2531/822"  
                  10 "B01J2531/827"  
                  10 "C07C 303/28"   
                  10 "C07D 321/10"   
                  10 "C07F   9/5027" 
                  10 "C07F   9/5072" 
                  10 "C07F   9/572"  
                  10 "C07F   9/65515"
                  10 "C07F  15/0013" 
                  10 "C07F  15/004"  
                  10 "C07F  15/0053" 
                  10 "C07F  15/008"  
                3347 "H04B   7/2628" 
                3347 "H04W  28/22"   
                3347 "H04W  36/16"   
                3347 "H04W  52/0216" 
                3347 "H04W  52/0229" 
                3347 "H04W  52/12"   
                3347 "H04W  52/16"   
                3347 "H04W  52/26"   
                3347 "H04W  52/267"  
                3347 "Y02D  30/70"   
               31572 "G06F   1/3203" 
               31572 "G06F   1/3228" 
               31572 "G06F   1/324"  
               31572 "G06F   1/3287" 
               31572 "G06F   1/3296" 
               31572 "Y02D  10/00"   
              136847 "A61L   2/10"   
              136847 "A61L   9/20"   
              136847 "C02F   1/32"   
              136847 "H01J  61/125"  
              136847 "H01J  61/70"   
              136847 "H01J  65/042"  
              143931 "G06F   1/3209" 
              143931 "G06F   1/3215" 
              143931 "G06F   1/3259" 
              143931 "G06F   1/3271" 
              143931 "H04W  52/0287" 
              143931 "Y02D  10/00"   
              143931 "Y02D  30/70"   
              212411 "H04L  29/06"   
              212411 "H04W  36/03"   
              212411 "H04W  56/00"   
              212411 "H04W  84/12"   
              212411 "H04W  84/18"   
              212411 "H04W  88/08"   
              212411 "H04W  92/02"   
              215797 "H05B  45/385"  
              215797 "Y02B  20/30"   
              216053 "H04L  12/66"   
              216053 "H04M   7/0009" 
              216053 "H04M   7/12"   
              216053 "H04M2207/18"   
              216053 "H04M2207/20"   
              220700 "H05B  41/2928" 
              220700 "Y02B  20/00"   
              220700 "Y10S 315/05"   
              220700 "Y10S 315/07"   
              end

              Comment


              • #8
                So, I think you have it basically right: the first step is to append the IPC and CPC code data, then weed out duplicate codes, and then finally merge that with the EPO data.

                Now, the structure of the EPO data is a problem: the person_id and reg_code are not really that material to what you are trying to do here. But because there can be multiple person_ids and reg_codes for the same application, you end up looking at an impossible merge. It could be done with -joinby-, but then each person_id will end up with a full copy of all of the techs that belong to the appln_id, which is going to make the data set large, and will get in the way of analyses for which the application is the unit of analysis. So here, I think the better solution is to -reshape- the EPO data to wide layout so there is just one observation per appln_id. Then you have a legitimate 1:m -merge-. Putting this all together, I come out with:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input long(appln_id person_id) str7 reg_code
                    10 48 "DEA1C"
                  3347  1 "FI1B1"
                 31572  1 "FI1B1"
                136847 21 "NL414"
                143931 20 "DE600"
                143931 21 "NL414"
                212411  1 "FI1B1"
                215797  1 "FI1B1"
                216053  1 "FI1B1"
                220700 20 "DE600"
                220700 21 "NL414"
                end
                tempfile epo_data
                save `epo_data'
                
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input long appln_id int(prio_year app_year) str13 ipc
                    10 2000 2001 "B01J023/46"
                    10 2000 2001 "B01J031/24"
                    10 2000 2001 "C07B053/00"
                    10 2000 2001 "C07B057/00"
                    10 2000 2001 "C07B061/00"
                    10 2000 2001 "C07C 303/28"
                    10 2000 2001 "C07C 309/63"
                    10 2000 2001 "C07C 309/65"
                    10 2000 2001 "C07C043/23"
                    10 2000 2001 "C07C067/303"
                    10 2000 2001 "C07C069/34"
                    10 2000 2001 "C07D 321/10"
                    10 2000 2001 "C07F009/50"
                    10 2000 2001 "C07F009/572"
                    10 2000 2001 "C07F009/655"
                    10 2000 2001 "C07F015/00"
                  3347 2002 2003 "H04B007/005"
                  3347 2002 2003 "H04B007/26"
                  3347 2002 2003 "H04J001/00"
                  3347 2002 2003 "H04J003/16"
                  3347 2002 2003 "H04L012/56"
                  3347 2002 2003 "H04W028/22"
                  3347 2002 2003 "H04W036/16"
                  3347 2002 2003 "H04W052/00"
                  3347 2002 2003 "H04W052/02"
                  3347 2002 2003 "H04W052/16"
                  3347 2002 2003 "H04W052/26"
                 31572 2001 2002 "G06F001/32"
                136847 2006 2007 "A61L009/20"
                136847 2006 2007 "A61N005/06"
                136847 2006 2007 "C02F001/32"
                136847 2006 2007 "H01J061/12"
                143931 2002 2003 "G06F001/32"
                143931 2002 2003 "H04B001/16"
                212411 1999 2000 "H04L012/28"
                212411 1999 2000 "H04L012/56"
                212411 1999 2000 "H04L029/06"
                212411 1999 2000 "H04W036/08"
                212411 1999 2000 "H04W056/00"
                212411 1999 2000 "H04W084/12"
                212411 1999 2000 "H04W088/08"
                212411 1999 2000 "H04W092/02"
                215797 1999 2000 "G09F009/00"
                215797 1999 2000 "H01L033/00"
                215797 1999 2000 "H05B033/08"
                215797 1999 2000 "H05B037/02"
                216053 1999 2000 "H04L012/66"
                216053 1999 2000 "H04M003/00"
                216053 1999 2000 "H04M007/00"
                216053 1999 2000 "H04M007/06"
                216053 1999 2000 "H04M007/12"
                216053 1999 2000 "H04M011/00"
                220700 2002 2003 "H05B041/24"
                220700 2002 2003 "H05B041/292"
                end
                tempfile ipc
                save `ipc'
                
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input long appln_id str15 cpc_class
                    10 "B01J  31/2409"
                    10 "B01J2231/641"  
                    10 "B01J2231/645"  
                    10 "B01J2531/0266"
                    10 "B01J2531/80"  
                    10 "B01J2531/821"  
                    10 "B01J2531/822"  
                    10 "B01J2531/827"  
                    10 "C07C 303/28"  
                    10 "C07D 321/10"  
                    10 "C07F   9/5027"
                    10 "C07F   9/5072"
                    10 "C07F   9/572"  
                    10 "C07F   9/65515"
                    10 "C07F  15/0013"
                    10 "C07F  15/004"  
                    10 "C07F  15/0053"
                    10 "C07F  15/008"  
                  3347 "H04B   7/2628"
                  3347 "H04W  28/22"  
                  3347 "H04W  36/16"  
                  3347 "H04W  52/0216"
                  3347 "H04W  52/0229"
                  3347 "H04W  52/12"  
                  3347 "H04W  52/16"  
                  3347 "H04W  52/26"  
                  3347 "H04W  52/267"  
                  3347 "Y02D  30/70"  
                 31572 "G06F   1/3203"
                 31572 "G06F   1/3228"
                 31572 "G06F   1/324"  
                 31572 "G06F   1/3287"
                 31572 "G06F   1/3296"
                 31572 "Y02D  10/00"  
                136847 "A61L   2/10"  
                136847 "A61L   9/20"  
                136847 "C02F   1/32"  
                136847 "H01J  61/125"  
                136847 "H01J  61/70"  
                136847 "H01J  65/042"  
                143931 "G06F   1/3209"
                143931 "G06F   1/3215"
                143931 "G06F   1/3259"
                143931 "G06F   1/3271"
                143931 "H04W  52/0287"
                143931 "Y02D  10/00"  
                143931 "Y02D  30/70"  
                212411 "H04L  29/06"  
                212411 "H04W  36/03"  
                212411 "H04W  56/00"  
                212411 "H04W  84/12"  
                212411 "H04W  84/18"  
                212411 "H04W  88/08"  
                212411 "H04W  92/02"  
                215797 "H05B  45/385"  
                215797 "Y02B  20/30"  
                216053 "H04L  12/66"  
                216053 "H04M   7/0009"
                216053 "H04M   7/12"  
                216053 "H04M2207/18"  
                216053 "H04M2207/20"  
                220700 "H05B  41/2928"
                220700 "Y02B  20/00"  
                220700 "Y10S 315/05"  
                220700 "Y10S 315/07"  
                end
                tempfile cpc
                save `cpc'
                
                use `cpc', clear
                keep if substr(cpc_class, 1, 3) == "Y02"
                rename cpc_class ipc
                gen code_type = "cpc"
                append using `ipc'
                replace ipc = subinstr(ipc, " ", "", .)
                replace code_type = "ipc" if missing(code_type)
                gsort appln_id  ipc prio_year app_year -code_type
                by appln_id ipc: assert prio_year == prio_year[1] | missing(prio_year)
                by appln_id ipc: assert app_year == app_year[1] | missing(app_year)
                by appln_id ipc: keep if _n == 1
                tempfile combined_ipc_cpc_codes
                save `combined_ipc_cpc_codes'
                
                use `epo_data', clear
                by appln_id (person_id), sort: gen _j = _n
                reshape wide person_id reg_code, i(appln_id) j(_j)
                merge 1:m appln_id using `combined_ipc_cpc_codes', keep(master match) nogenerate
                There are a few loose ends here. The ipc code file has these prio_year and app_year variables that are absent in the cpc_code files. I have no idea if these variables are of any importance to you. Life is simpler if they are not, and in that case, I recommend you just drop them from the ipc_code data before the -append- takes place. If you need them, then the above code keeps them, after first verifying that they are constant for all observations of the same appln_id, but it means that for technologies that were found only in the cpc code file, those variables are missing. Next, you refer to different levels of disaggregation in the ipc and cpc codes, but I wasn't sure how to generalize your example. So I have done nothing with that, which may mean that there are some apparently different codes in the resulting file that are really the same thing, just at different levels of "aggregation." As I am unclear exactly what this "aggregation" is about, I have just left it at that and I hope that you can clean that up.

                Try this and see if it looks right to you.

                Once you have a final data set, post back with a -dataex- from that and we can get down to resolving the problem that started this thread off.

                Comment


                • #9
                  Dear Clyde Schechter, here you will find the dataset in which I append CPC and IPC, and merge with EPO. Now I have all the technologies in the variable ipc_cpc (only listing as green those CPC "Y02"). I have tried hard to come out with a solution to the relatedness variable, but I do not have the Stata knowledge to continue by my own.

                  Basically my idea is first to expand the dataset by appln_id (number of obs in a patent -1). The hardest part is to create a variable with the first obs as the next technology classification (ipc_cpc) within a patent. Unfortunately I do know how to continue to fill the second obsetvation of variable (x1). The idea is to have two variables ipc_cpc and x1 similar in fashion to the table I show in #1.

                  Can you help me solve how to continue from here?
                  Here is my rudimentary code and the data:
                  Code:
                  sort person_id app_year appln_id ipc_cpc
                  bys person_id app_year appln_id (ipc_cpc): gen pat_obs1 = _n
                  bys person_id app_year appln_id (ipc_cpc): gen pat_obs2 = _N
                  sum pat_obs2
                  return list 
                  local c = r(max)
                  
                  forvalues x = 1/`c' {
                            expand `x'-1 if pat_obs2==`x'
                            }
                  
                  gen x1 = ""    
                  forvalues x = 1/`c' {
                  bys person_id app_year appln_id (ipc_cpc): replace x1 = ipc_cpc[`x'] if pat_obs2==`x' & x1[1]==""
                  }
                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input long person_id int app_year long appln_id str11(ipc2 ipc_cpc) float green
                   1 2000 212411 "H04L012/28"  "H04L012/28"  0
                   1 2000 212411 "H04L012/56"  "H04L012/56"  0
                   1 2000 212411 "H04L029/06"  "H04L029/06"  0
                   1 2000 212411 "H04W036/08"  "H04W036/08"  0
                   1 2000 212411 "H04W056/00"  "H04W056/00"  0
                   1 2000 212411 "H04W084/12"  "H04W084/12"  0
                   1 2000 212411 "H04W088/08"  "H04W088/08"  0
                   1 2000 212411 "H04W092/02"  "H04W092/02"  0
                   1 2000 215797 "G09F009/00"  "G09F009/00"  0
                   1 2000 215797 "H01L033/00"  "H01L033/00"  0
                   1 2000 215797 "H05B033/08"  "H05B033/08"  0
                   1 2000 215797 "H05B037/02"  "H05B037/02"  0
                   1 2000 215797 ""            "Y02B020/30"  1
                   1 2000 216053 "H04L012/66"  "H04L012/66"  0
                   1 2000 216053 "H04M003/00"  "H04M003/00"  0
                   1 2000 216053 "H04M007/00"  "H04M007/00"  0
                   1 2000 216053 "H04M007/06"  "H04M007/06"  0
                   1 2000 216053 "H04M007/12"  "H04M007/12"  0
                   1 2000 216053 "H04M011/00"  "H04M011/00"  0
                   1 2002  31572 "G06F001/32"  "G06F001/32"  0
                   1 2002  31572 ""            "Y02D010/00"  1
                   1 2003   3347 "H04B007/005" "H04B007/005" 0
                   1 2003   3347 "H04B007/26"  "H04B007/26"  0
                   1 2003   3347 "H04J001/00"  "H04J001/00"  0
                   1 2003   3347 "H04J003/16"  "H04J003/16"  0
                   1 2003   3347 "H04L012/56"  "H04L012/56"  0
                   1 2003   3347 "H04W028/22"  "H04W028/22"  0
                   1 2003   3347 "H04W036/16"  "H04W036/16"  0
                   1 2003   3347 "H04W052/00"  "H04W052/00"  0
                   1 2003   3347 "H04W052/02"  "H04W052/02"  0
                   1 2003   3347 "H04W052/16"  "H04W052/16"  0
                   1 2003   3347 "H04W052/26"  "H04W052/26"  0
                   1 2003   3347 ""            "Y02D030/70"  1
                  20 2003 143931 "G06F001/32"  "G06F001/32"  0
                  20 2003 220700 "H05B041/24"  "H05B041/24"  0
                  21 2003 143931 "H04B001/16"  "H04B001/16"  0
                  21 2003 143931 ""            "Y02D010/00"  1
                  21 2003 143931 ""            "Y02D030/70"  1
                  21 2003 220700 "H05B041/292" "H05B041/292" 0
                  21 2003 220700 ""            "Y02B020/00"  1
                  21 2007 136847 "A61L009/20"  "A61L009/20"  0
                  21 2007 136847 "A61N005/06"  "A61N005/06"  0
                  21 2007 136847 "C02F001/32"  "C02F001/32"  0
                  21 2007 136847 "H01J061/12"  "H01J061/12"  0
                  48 2001     10 "B01J023/46"  "B01J023/46"  0
                  48 2001     10 "B01J031/24"  "B01J031/24"  0
                  48 2001     10 "C07B053/00"  "C07B053/00"  0
                  48 2001     10 "C07B057/00"  "C07B057/00"  0
                  48 2001     10 "C07B061/00"  "C07B061/00"  0
                  48 2001     10 "C07C043/23"  "C07C043/23"  0
                  48 2001     10 "C07C067/303" "C07C067/303" 0
                  48 2001     10 "C07C069/34"  "C07C069/34"  0
                  48 2001     10 "C07C303/28"  "C07C303/28"  0
                  48 2001     10 "C07C309/63"  "C07C309/63"  0
                  48 2001     10 "C07C309/65"  "C07C309/65"  0
                  48 2001     10 "C07D321/10"  "C07D321/10"  0
                  48 2001     10 "C07F009/50"  "C07F009/50"  0
                  48 2001     10 "C07F009/572" "C07F009/572" 0
                  48 2001     10 "C07F009/655" "C07F009/655" 0
                  48 2001     10 "C07F015/00"  "C07F015/00"  0
                  end

                  Comment


                  • #10
                    Well, I don't think I would start that way, so I'll show you what I think is a better approach. The gist of it is to form a new data set consisting of all distinct pairs of patents. Then we can count for each patent, the number of distinct applications it occurs in, and for each pair, the number of applications that they both appear in. As I am still a bit uncertain exactly what you want, this may be off base, but I don't think it is:

                    Code:
                    isid appln_id ipc_cpc, sort
                    keep ipc_cpc green appln_id
                    preserve
                    rename (ipc_cpc appln_id green) =2
                    tempfile copy
                    save `copy'
                    restore
                    rename (ipc_cpc appln_id green) =1
                    
                    cross using `copy' // FORM ALL POSSIBLE PAIRS
                    drop if ipc_cpc1 >= ipc_cpc2 // ELIMINATE SELF PAIRS AND PAIRS IN REVERSE ORDER
                    egen ipc_cpc_pair = concat(ipc_cpc1 ipc_cpc2), punct(",")  // CREATE A PAIR ID
                    duplicates drop
                    
                    by ipc_cpc1, sort: egen patents_with_ipc_cpc1 = nvals(appln_id1)
                    
                    gen byte same_appln = (appln_id1 == appln_id2)
                    isid ipc_cpc_pair appln_id1 appln_id2, sort
                    by ipc_cpc_pair: egen co_occurrences_this_pair = total(same_appln)
                    order ipc_cpc_pair patents_with_ipc_cpc1 co_occurrences_this_pair, first
                    Note: The -egen, nvals()- function is not part of official Stata; rather it is found in Nick Cox's -egenmore- package, available from SSC.

                    Comment


                    • #11
                      Dear Clyde Schechter, first thing first, thank you very much for your help because this is pretty much what I was looking for. However, let me comment you a couple of things for clarification:

                      First, my real data is really long (+16 million obs) and when trying the code it gives error
                      Code:
                      cross using `copy' // FORM ALL POSSIBLE PAIRS
                      sum of expand values exceed 1,099,511,627,775
                          The dataset may not contain more than 1,099,511,627,775 observations.
                      I am thinking if it would be possible to do this analysis but for each year app_year (in fact what a I need) and store the results and then try to combine the collected similarity measures by years and merge that to the original dataset. If not, I have seen that the command "cross" does every combination of all the ipc_cpc codes. Is it possible to build using this command (cross) only for those combinations that appears in the patent? In fact, this is what is needed to build the similarity measure. This way it will need to build a lower number of combinations. Do you think is possible any of these solution? Or do you know a better solution for this.

                      Second, I do not fully understand why in the following tab the appln_id=10 have only 120 “1s” when it should be 240 (16*15 combinations of 1x1 from that patent). Also, look that appln_id=31572 have 5 “1s” but it should have only one combination of ipc_cpc (G06F001/32 and Y02D010/00). Also, why is it that “co_occurrences_this_pair” have also “2s”? Should not these duplications be eliminated from the line (duplicates drop)? Maybe I am looking at the wrong tab, but it seems strange to me. Any idea of what is happening?

                      Code:
                       tab appln_id1 co_occurrences_this_pair
                      
                                 |     co_occurrences_this_pair
                       appln_id1 |         0          1          2 |     Total
                      -----------+---------------------------------+----------
                              10 |       658        120          0 |       778 
                            3347 |       171         87          0 |       258 
                           31572 |        34          5          2 |        41 
                          136847 |       203          6          0 |       209 
                          143931 |        65          9          2 |        76 
                          212411 |       112         37          0 |       149 
                          215797 |        85         10          0 |        95 
                          216053 |       127         15          0 |       142 
                          220700 |        15          3          0 |        18 
                      -----------+---------------------------------+----------
                           Total |     1,470        292          4 |     1,766

                      Comment


                      • #12
                        Is it possible to build using this command (cross) only for those combinations that appears in the patent?
                        I don't think so. I don't even understand what "the patent" means in this context. The whole idea is to form all pairs of technologies, and then identify how many patents the pair co-occurs in. I don't know how many distinct technologies are in your data set, but if we call that number T, in the end we need a data set containing all distinct pairs of technologies (order irrelevant), so the resulting data set must contain T*(T-1)/2 observations. Moreover, this is a best-case scenario, that applies only if each pair occurs only in a single year. If you do it separately by year, then multiply that by the average number of years in which a given pair of technologies will both exist in some patent in a year. Now, T is less than the size of the data set, to be clear. But before proceeding, you should find out what T is, and do that calculation to see if this is feasible at all.

                        Assuming that it is feasible, here's a different way to get a data set of co-occurrences that will be memory-sparing. (The price one pays for this is that it will be very slow. I have tried to, within the constraint of minimizing memory use, make it as fast as I can, but you're going to set your expectations for a very long run-time on this.)

                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear*
                        input long person_id int app_year long appln_id str11(ipc2 ipc_cpc) float green
                         1 2000 212411 "H04L012/28"  "H04L012/28"  0
                         1 2000 212411 "H04L012/56"  "H04L012/56"  0
                         1 2000 212411 "H04L029/06"  "H04L029/06"  0
                         1 2000 212411 "H04W036/08"  "H04W036/08"  0
                         1 2000 212411 "H04W056/00"  "H04W056/00"  0
                         1 2000 212411 "H04W084/12"  "H04W084/12"  0
                         1 2000 212411 "H04W088/08"  "H04W088/08"  0
                         1 2000 212411 "H04W092/02"  "H04W092/02"  0
                         1 2000 215797 "G09F009/00"  "G09F009/00"  0
                         1 2000 215797 "H01L033/00"  "H01L033/00"  0
                         1 2000 215797 "H05B033/08"  "H05B033/08"  0
                         1 2000 215797 "H05B037/02"  "H05B037/02"  0
                         1 2000 215797 ""            "Y02B020/30"  1
                         1 2000 216053 "H04L012/66"  "H04L012/66"  0
                         1 2000 216053 "H04M003/00"  "H04M003/00"  0
                         1 2000 216053 "H04M007/00"  "H04M007/00"  0
                         1 2000 216053 "H04M007/06"  "H04M007/06"  0
                         1 2000 216053 "H04M007/12"  "H04M007/12"  0
                         1 2000 216053 "H04M011/00"  "H04M011/00"  0
                         1 2002  31572 "G06F001/32"  "G06F001/32"  0
                         1 2002  31572 ""            "Y02D010/00"  1
                         1 2003   3347 "H04B007/005" "H04B007/005" 0
                         1 2003   3347 "H04B007/26"  "H04B007/26"  0
                         1 2003   3347 "H04J001/00"  "H04J001/00"  0
                         1 2003   3347 "H04J003/16"  "H04J003/16"  0
                         1 2003   3347 "H04L012/56"  "H04L012/56"  0
                         1 2003   3347 "H04W028/22"  "H04W028/22"  0
                         1 2003   3347 "H04W036/16"  "H04W036/16"  0
                         1 2003   3347 "H04W052/00"  "H04W052/00"  0
                         1 2003   3347 "H04W052/02"  "H04W052/02"  0
                         1 2003   3347 "H04W052/16"  "H04W052/16"  0
                         1 2003   3347 "H04W052/26"  "H04W052/26"  0
                         1 2003   3347 ""            "Y02D030/70"  1
                        20 2003 143931 "G06F001/32"  "G06F001/32"  0
                        20 2003 220700 "H05B041/24"  "H05B041/24"  0
                        21 2003 143931 "H04B001/16"  "H04B001/16"  0
                        21 2003 143931 ""            "Y02D010/00"  1
                        21 2003 143931 ""            "Y02D030/70"  1
                        21 2003 220700 "H05B041/292" "H05B041/292" 0
                        21 2003 220700 ""            "Y02B020/00"  1
                        21 2007 136847 "A61L009/20"  "A61L009/20"  0
                        21 2007 136847 "A61N005/06"  "A61N005/06"  0
                        21 2007 136847 "C02F001/32"  "C02F001/32"  0
                        21 2007 136847 "H01J061/12"  "H01J061/12"  0
                        48 2001     10 "B01J023/46"  "B01J023/46"  0
                        48 2001     10 "B01J031/24"  "B01J031/24"  0
                        48 2001     10 "C07B053/00"  "C07B053/00"  0
                        48 2001     10 "C07B057/00"  "C07B057/00"  0
                        48 2001     10 "C07B061/00"  "C07B061/00"  0
                        48 2001     10 "C07C043/23"  "C07C043/23"  0
                        48 2001     10 "C07C067/303" "C07C067/303" 0
                        48 2001     10 "C07C069/34"  "C07C069/34"  0
                        48 2001     10 "C07C303/28"  "C07C303/28"  0
                        48 2001     10 "C07C309/63"  "C07C309/63"  0
                        48 2001     10 "C07C309/65"  "C07C309/65"  0
                        48 2001     10 "C07D321/10"  "C07D321/10"  0
                        48 2001     10 "C07F009/50"  "C07F009/50"  0
                        48 2001     10 "C07F009/572" "C07F009/572" 0
                        48 2001     10 "C07F009/655" "C07F009/655" 0
                        48 2001     10 "C07F015/00"  "C07F015/00"  0
                        end
                        
                        isid ipc_cpc appln_id
                        tempfile original
                        save `original', replace
                        
                        keep ipc_cpc app_year
                        duplicates drop
                        tempfile copy
                        save `copy'
                        rename ipc_cpc ipc_cpc_1
                        joinby app_year using `copy'
                        rename ipc_cpc ipc_cpc_2
                        drop if ipc_cpc_2 <= ipc_cpc_1
                        gen original = `"`original'"'
                        
                        capture program drop one_pair_year
                        program define one_pair_year
                            local code1 = ipc_cpc_1[1]
                            local code2 = ipc_cpc_2[1]
                            local app_year = app_year[1]
                            local original = original[1]
                            use ipc_cpc appln_id app_year using `original' if app_year == `app_year' ///
                                & inlist(ipc_cpc, `"`code1'"', `"`code2'"'), clear
                            by appln_id, sort: egen has_code1 = max(ipc_cpc == `"`code1'"')
                            by appln_id: egen has_code2 = max(ipc_cpc == `"`code2'"')
                            keep if has_code1 & has_code2
                            by appln_id: keep if _n == 1
                            count
                            local co_occurrences = r(N)
                            clear
                            set obs 1
                            gen ipc_cpc_1 = `"`code1'"'
                            gen ipc_cpc_2 = `"`code2'"'
                            gen app_year = `app_year'
                            gen co_ocurrences = `co_occurrences'
                            exit
                        end
                        
                        runby one_pair_year, by(ipc_cpc_1 ipc_cpc_2 app_year) status
                        -runby- is written by Robert Picard and me, and is available from SSC.

                        The slow points in this will be the -use- and -save- commands, -joinby-, and -runby-. -runby- will give you a periodic progress report showing how much it has finished and an estimate of the remaining time.

                        Note that if this -joinby- command gives you an error message similar to what you got from -cross-, I would conclude that this problem simply cannot be solved in Stata with a data set this large, because at no point will -joinby- require more observations than the minimum number needed as described in my first paragraph in this post.

                        At the end of this code the data in active memory will consist of pairs of ipc/cpc codes and years in which either code in the pair appears in such patents, followed by the number of patents from that app_year in which both members of the pair are found.

                        Second, I do not fully understand why in the following tab...
                        I don't think that table makes sense. There is nothing meaningful about the number of co-occurrences per patent. We are calculating co-occurrences of pairs of technology. With the revised code, you won't even be able to run a table per patent, because the final data set doesn't even have any patents in it. It's just a count of co-occurrences for each technology pair in each year.

                        Comment


                        • #13
                          Dear Clyde Schechter, thanks again for all your assistance, help, and patience. And thanks for this second code. Unfortunately, as you commented in #12, joinby command gives error:
                          Code:
                          . joinby app_year using `copy'
                          op. sys. refuses to provide memory
                          I have re-done this analysis only for a single country just to see if with a smaller sample it run, but the same error appears. In fact, after the -duplicates drop- in your last code (before joinby) I have a considerable smaller number of obs (787300), but still the error appears.
                          I cannot figure out how people working with patent data (PATSTAT or EPO) then can handle this in Stata. Maybe it is because they focus on the aggregation at regional level (even though relatedness measures are done by co-occurrences in technologies in their papers). Maybe Julio Raffo, that work with patent data would like to share his opinion.
                          Anyway, I thank you very much for your help.

                          Comment


                          • #14
                            Well, actually, this error message isn't quite the same thing. It's not saying that the resulting data set would be too large for Stata to work with. It's saying that the operating system won't give you enough memory to create it. This might be solved by moving to a computer with more RAM.

                            Another thought would be to run just one year's worth of data, and if that works out, do that over again, one year at a time, and then append all the results together at the end.

                            It's also possible there is a more memory-conserving way to do this that I've overlooked. I'll be interested to see if Julio Raffo responds with something more workable.

                            Comment

                            Working...
                            X