  • How to efficiently replicate this indicator in a huge dataset?

    Dear Statalist, I am wondering if you can help me do the analysis in the loop below in a more efficient (faster) way. For a group of cited patents I want to generate an indicator that tells me whether the firms in _n (a patent may be developed by several firms, captured in several variables) appear in any other observation within that group (_n+1, _n+2, …). So, if even one firm among the firm variables in _n is present among the firm variables in _n+x, I assign a 1. So, let’s say that I have:
            Firm1  Firm2  Firm3  Firm4
    _n        3      6      8      9
    _n+1      1      2      4      5
    _n+2      6      7     11     12
    The loop then creates several variables that assign a 1 if that observation has a firm in common with another observation. Following the previous table, for _n and _n+1 it creates a variable (in _n) that is missing in that observation, because there are no common firms between _n and _n+1. But for _n and _n+2 it creates another variable that equals 1, because the same firm appears in Firm2[_n] and Firm1[_n+2]. Then, to collect this information in a single variable, I use -egen, rowtotal()- in the second part of the loop.

    Note that the command is not comparing firm1[_n] with firm1[_n+x], but firm1[_n] with firm1, firm2, firm3, firm4, … in [_n+x], and the same for firm2[_n], and so on.

    Since I am using a dataset with more than 7 million observations, there are around 40 firm variables, and some groups have around 2,000 observations, this loop is endless. Can you please suggest a more efficient way to know whether, within each group of cited patents, any citing firm in [_n] is also present in [_n+x]?

    Below you can see the loop and a small sample of the data.

    Thanks in advance!

    Code:
    * count the citing firm variables
    ds citing_firm_id*
    local nwords : word count `r(varlist)'
    display `nwords'
    display wordcount("`r(varlist)'")

    * largest value of the within-group counter, used to bound the lead k
    sum ccc if cited_firm_id1!=. & citing_firm_id1!=.
    local max_k = r(max)

    forvalues k = 1/`max_k' {
        forvalues x = 1/`nwords' {
            forvalues y = 1/`nwords' {
                sort cited_appln_id citing_year citing_appln_id
                * flag a match between firm variable x in _n and firm variable y in _n+k
                bys cited_appln_id: gen count4_`x'_`y'_`k' = 1 if citing_firm_id`x'==citing_firm_id`y'[_n+`k'] & !missing(citing_firm_id`x',citing_firm_id`y'[_n+`k'])
            }
        }
        * collapse the pairwise flags for this lead into a single indicator
        egen self_cit_v_k`k' = rowtotal(count4_*), missing
        drop count4_*
    }
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(citing_appln_id cited_appln_id) float(citing_firm_id1 citing_firm_id2 citing_firm_id3 citing_firm_id4 citing_firm_id5 citing_firm_id6 citing_firm_id7 citing_firm_id8 citing_firm_id9 citing_firm_id10 cited_firm_id1 citing_year ccc)
    273682949  522 2931813       .       . . . . . . . . 2931813 2010  1
    315557296  522 1709046 2931813       . . . . . . . . 2931813 2010  2
    274384595  522 2931813       .       . . . . . . . . 2931813 2011  3
    333687126  522 2931813       .       . . . . . . . . 2931813 2012  4
    334378122  522 2931813       .       . . . . . . . . 2931813 2012  5
    336160680  522 2931813 4073560       . . . . . . . . 2931813 2013  6
    337102271  522 2931813       .       . . . . . . . . 2931813 2013  7
    337102281  522 2931813       .       . . . . . . . . 2931813 2013  8
    352267359  522 2931813       .       . . . . . . . . 2931813 2013  9
    379237343  522 1394838       .       . . . . . . . . 2931813 2013 10
    380717448  522  296197       .       . . . . . . . . 2931813 2014 11
    381189080  522 2931813       .       . . . . . . . . 2931813 2014 12
    412500492  522  908975       .       . . . . . . . . 2931813 2015 13
    415658469  522  908975       .       . . . . . . . . 2931813 2015 14
    418707239  522 2931813       .       . . . . . . . . 2931813 2015 15
    419016602  522 2931813       .       . . . . . . . . 2931813 2015 16
    438831616  522  908975       .       . . . . . . . . 2931813 2016 17
    439633449  522  908975       .       . . . . . . . . 2931813 2016 18
    474527945  522  918273       .       . . . . . . . . 2931813 2018 19
    524645129  522 2931813       .       . . . . . . . . 2931813 2020 20
     56840315  540  296791 3091119 3096471 . . . . . . .  296791 2009  1
    273300774  540  296791       .       . . . . . . . .  296791 2010  2
       208669 1186  140411       .       . . . . . . . .  140411 2003  1
     15909980 1186  557078       .       . . . . . . . .  140411 2003  2
     15999323 1186  140411  238661       . . . . . . . .  140411 2003  3
     16031254 1186  981997       .       . . . . . . . .  140411 2004  4
    336876707 1186  140411  238661       . . . . . . . .  140411 2011  5
    end

  • #2
    I'm finding it difficult to understand what you are trying to describe, but I'll start with one simple but crucial idea:

    As you may know, any algorithm with a *sort* in the middle of a nest of loops will take a very long time even for a modest-sized sample. To the best of my understanding, none of the variables on which you are sorting are modified within the nested loop, so you should be able to move the -sort- outside the nested loops. A related matter is that a simple "by" rather than "bys" should work. I think Stata is smart enough not to perform the sort implied by "bys" if a data set is already sorted, but I'd nevertheless avoid that and just use "by" because the data are already sorted. The lesson here is that sorting is computationally expensive in all computer languages, and being careful about its use is important.
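
    For example, here is a minimal sketch of that restructuring, reusing the locals and variable names from the loop in #1 (I have not tested it on your data):
    Code:
    sort cited_appln_id citing_year citing_appln_id   // sort once, outside the loops
    forvalues k = 1/`max_k' {
        forvalues x = 1/`nwords' {
            forvalues y = 1/`nwords' {
                by cited_appln_id: gen count4_`x'_`y'_`k' = 1 if ///
                    citing_firm_id`x'==citing_firm_id`y'[_n+`k'] & ///
                    !missing(citing_firm_id`x', citing_firm_id`y'[_n+`k'])
            }
        }
        egen self_cit_v_k`k' = rowtotal(count4_*), missing
        drop count4_*
    }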

    My second comment is more generic: I would wonder if your conceptualization of a data structure and its implementation might not be the best for your problem. (I could be wrong since I don't understand the substance of your problem very well.) It sounds like the patent application is your unit of analysis, but I'd think that maybe the firm or firm-pair might be a better choice. Regardless of that, I also have the thought that a long data layout, with observations defined by pairs of firms (citing/cited), would likely be better. That's a Stata-specific suggestion, based on the idea that long layouts are almost always easier for Stata and its user than are wide ones. It's quite unusual, especially in Stata, to need a triply nested loop, so that makes me suspicious that there's a conceptual problem here.

    Related to that is that a more substantive and simpler description of your problem might help people understand it well enough to suggest an alternative approach. In context, I'll offer advice I commonly suggest: Show your description to a colleague, even one not particularly expert about what you are doing, in order to get help creating a more comprehensible description for people here.

    Finally, I'd note that what you are doing sounds like it is a social network problem, as I believe it centers around ties between firms and their linkage through patents. If so, the community-contributed package -nwcommands- (-net describe nwcommands-ado-) may eventually prove useful, as it embodies a range of commands to efficiently handle network data. However, I would not suggest you look at that right now. That's a suggestion that might be relevant if you continue to have problems with this data set.



    • #3
      Dear Mike Lacy, you have pointed me to some very interesting and useful suggestions I did not know about (not sorting within a loop if possible, and using "by" instead of "bys").
      In fact this is an intermediate step for what I need (I did not explain it so as not to make the post too long).
      Briefly, for each cited patent (group) I need to count citations, discarding those citations coming from the same citing firm. This is why I created this indicator (with the loop) to tell me whether, among the different observations within each cited patent (between _n and _n+1, _n and _n+2, ...), any two observations share the same citing firm.
      The problem comes from the fact that each patent (observation) can be developed by more than one firm (which is captured in different variables: citing_firm_id1, citing_firm_id2, ...).
      So, what I need to know is whether there is a common firm in the lists of citing firms of any two observations within the cited patent group, in order to use this information afterwards in the actual counting of patents. I thought that maybe a command (like "distinct" or another one I do not know) might be faster than building this loop. Doing simple maths, it takes approx. 1 day for this loop to finish 1 of its nearly 2,000 rounds (something I cannot afford). This is why I am kindly asking for your help.



      • #4
        First, let me say that I think I was wrong about the -sort- inside the nested loops slowing down your code. In many languages it would, but I did some small simulations and it appears that Stata avoids re-sorting data that is already sorted, so the -sort- inside the loops would never happen. I still would not code that way.

        Anyway, I don't understand what you are doing any better than before. I don't know if I will be able to help you, but here are some questions that might help me or someone else understand what you are doing and what you want.

        1. What is your unit of analysis? Is it the patent application, or is it the firm? What set of variables distinctly identifies each observation?
        2. When you say "each patent (group)," I'm somewhat confused. Do you mean: "Patent applications are my unit of analysis. There are multiple observations pertaining to each patent application, which I call 'groups.'"?
        3. I don't know what it means to say that "each patent...can be made by more than one firm... ." Firms don't "make" observations. Perhaps you mean something like: "Each patent can be linked to more than one firm. Each time a firm is linked to (perhaps "cites") a patent, I have a variable giving the name of the firm that did that." I don't know why you see this as a "problem."
        4. I don't know what you mean by "a common firm," or what you mean by "list of citing firms."
        5. I don't know what you have in mind in referring to _n, _n+1, ... The abbreviation "_n" denotes an observation number in Stata, but you mean something else.
        6. The statement "I created this indicator (loop)..." is confusing to me because an indicator cannot be a loop.
        7. I don't know what you mean by "discarding those citations ..." Perhaps you mean something like "I don't want to count these citations more than once."

        Other people may understand what you're doing, but I'm confused. As I tried to say above, it would be easier if you gave a more substantive description of your data and your goals. Much of the difficulty for me here is that your data is described such that a person would need to know something about patent applications and citing/cited features of them. However, only a few people who participate here are likely to do research on such matters. By describing your data in a way that requires that expertise, you have unfortunately limited the number of people who could help here. That's why I suggested you get some help from a colleague to describe your data. As I said previously, I would think of your data in social network terms, which is more generic than "patent applications," but which might be just as opaque to you as patent application practices are to many of the rest of us <grin>.

        One simple thing that might help here would be if you told us the meaning of each of your variables.



        • #5
          Dear Mike Lacy, thanks for taking the time to help me; I really appreciate it. I thought that keeping things simple would be helpful for you. In the end, what I need is something mechanical: within a specific group (cited_appln_id), how do I identify whether any of the firms in [citing_firm_id1, citing_firm_id2, citing_firm_id3, ...] in _n is also in the group of firms [citing_firm_id1, citing_firm_id2, citing_firm_id3, ...] in any of the following observations (_n+1, _n+2, _n+3, ...). That said, I explain some of the points you asked about before, in case it helps you or anyone else better understand my problem.

          A patent (invention) can be developed by several firms, which is captured in the dataset as different firm variables [citing_firm_id1, citing_firm_id2, citing_firm_id3, ...]. And in that document, other patents are cited (the cited patent is my group for doing the counting).

          1. My unit of analysis is the patent. Each observation is a single patent.
          2. The counting needs to be done for each group, which in this case is the cited patent (cited_appln_id).
          3. A patent (invention) can be made by a single firm or by more than one firm. So, in one observation you may have a single firm (captured by citing_firm_id1) or more than one firm (citing_firm_id1, citing_firm_id2, ...). This is the problem, because I have to involve several variables to determine what I want (whether any firm in the group of firm variables [citing_firm_id1, citing_firm_id2, citing_firm_id3, ...] in _n is present in the group of variables [citing_firm_id1, citing_firm_id2, citing_firm_id3, ...] in _n+x, within the group of cited patents).
          4. A common firm means that any of the firms appearing in [citing_firm_id1, citing_firm_id2, citing_firm_id3, ...] in _n also appears in [citing_firm_id1, citing_firm_id2, citing_firm_id3, ...] but in _n+x (I do not know how to explain this better).
          5. As in the loop, groups are sorted by a unique combination of variables, so _n is the current observation and _n+1 is the next one within the group.
          6. The indicator is created through this loop.
          7. This means that I won't count that citation. Let's imagine I have a group of cited patents with 3 observations. If the first one and the last one have the same citing firm (which is what I check in the loop), I won't use that observation in the subsequent counting.

          Maybe looking at the table and its explanation in #1 will help you understand what I need. The problem is that the loop is endless.



          • #6
            This description should be helpful. I'm going to step aside here, and perhaps someone else can now come in and solve your problem.



            • #7
              Dear Mike Lacy, thanks anyway for your time and advice!

              I was wondering whether, in Stata 16, this could be implemented using Python (maybe that would be faster?), even though I know nothing about Python and I understand this is not a Python forum.
              I am sure there might be a Stata command for checking differences among groups of variables between any two observations: that is, whether citing_firm_id1[_n] differs from citing_firm_id1[_n+1], citing_firm_id2[_n+1], citing_firm_id3[_n+1], ...; whether citing_firm_id2[_n] differs from citing_firm_id1[_n+1], citing_firm_id2[_n+1], citing_firm_id3[_n+1], ...; and so on.
              In fact this is what the loop above does, but it is really slow with more than 7 million observations.

              I will really appreciate your help!



              • #8
                I don't know what you will be doing with your results once you finally get them. So my comments here may not be useful. But I think the bulk of the problem arises from the fact that you are starting with a data set organized in ways that are suboptimal for Stata's approach to data management and trying to build from it another data set that is still more alien to the Stata approach. Consequently, Stata is not making it easy for you at all.

                Again, with the caveat that I don't know where you are going from here, I would approach this entirely differently. The first thing I would do is convert the starting data set into a long layout:
                Code:
                * one observation per citing application x firm slot; drop the empty slots
                reshape long citing_firm_id, i(citing_appln_id ccc) j(seq)
                drop if missing(citing_firm_id)
                and work with that one going forward. This organization is much more amenable to Stata's data management facilities. Since your data set has 7,000,000 observations to start out, -reshape- is going to be slow. You can greatly improve on its speed by using one of the user-contributed commands -tolong- or -greshape- instead. The former is available from SSC and the latter is part of the -gtools- suite of commands, also available from SSC. Either one of these will do the -reshape- faster, and the syntax is the same as Stata's -reshape- command.
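
                For example, the -greshape- version would look like this (just a sketch; it mirrors the -reshape- call above, since the syntax is the same):
                Code:
                * same operation using -greshape- from the -gtools- suite (install once from SSC)
                ssc install gtools
                greshape long citing_firm_id, i(citing_appln_id ccc) j(seq)
                drop if missing(citing_firm_id)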

                Once you have this organization, finding redundant citing_firm_id's from observations in higher-numbered ccc's for the same cited_appln_id is also pretty straightforward:
                Code:
                * set aside a copy keyed by cited patent group, counter, and citing firm
                preserve
                keep cited_appln_id ccc citing_firm_id
                tempfile holding
                save `holding'
                restore
                * for each observation, pick up any later (higher-ccc) observation in the
                * same cited patent group with the same citing firm
                rangejoin ccc 1 . using `holding', by(cited_appln_id citing_firm_id)
                For this you will need to install the -rangejoin- command, by Robert Picard, available from SSC. And -rangejoin-, in turn, requires -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, again available from SSC.

                At this point you now have each citing firm associated with a new variable ccc_U, which indicates the higher ccc values in which that same citing firm appears. This is now a doubly-long data set, and it contains all of the information you are seeking*, though organized rather differently. But it is likely that this data organization, too, will serve you better in Stata for whatever you want to do with it, than the one you were seeking.
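
                If all you need from this is a single flag at the original observation level, something along these lines should do it (a sketch, not tested, and it keeps only the keys and the flag):
                Code:
                * 1 if the same citing firm reappears at a higher ccc within the cited patent group
                gen byte later_match = !missing(ccc_U)
                collapse (max) later_match, by(cited_appln_id ccc)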

                This approach will be appreciably faster than what you have done, and if you can make use of the data organized in this way, you will probably save time further down the line as well. By the way, while it will be faster, it will not be fast. But you can expect this to run in minutes or hours rather than hours or days.

                *Added: Actually, this gives you more information than what you were generating: you now know not only that there is some duplication of citing_firm_id (and where it occurs), you also now know which citing_firm_id(s) are duplicated.
                Last edited by Clyde Schechter; 20 Dec 2022, 12:08.



                • #9
                  Dear Clyde Schechter , thanks a lot for your help! Without making this post too long: this data structure came from an attempt to merge two previous datasets without using m:m (you advised me once not to use m:m and I always keep it in mind).

                  That said, I got an error when reshaping because a citing patent may cite several other patents (cited_appln_id), so there is not a unique combination of citing_appln_id and ccc. Sorry, I did not notice this when putting together the dataex example.

                  The main approach (after this step) is to count the number of citing_appln_id within each group of cited_appln_id, but not including those observations in which the citing firms (associated with each citing_appln_id) are the same. This is why, as a step prior to this counting, I created the indicator with that loop, to know whether the citing firms of two observations are the same.

                  Of course, having this info in a single citing_firm variable is better, because then I can express the condition of not using the same citing firm more easily in the counting (without generating such an indicator). However, I am afraid that not having a single citing_appln_id per observation (within a group of cited_appln_id) may double count those citing patents with more than one firm (a single patent will have several observations within the cited patent group). Not only that: if a patent with two firms is in two observations (instead of one), I will have two different firms, and when asking Stata to count the number of citing_appln_id it may bias the counting even more. Let's imagine that we have two citing_appln_id (1 and 2) in a group of cited_appln_id. Imagine that citing_appln_id 1 is developed by firms A and B, and 2 is developed by firm C. If I count as it is right now (each citing_appln_id is one observation within the cited_appln_id group), I will obtain 1 for citing_appln_id=1 (because the other observation within the group is from another firm, which is C). And for citing_appln_id 2, the count is also 1 (because there is only one other observation, and its firms are different from C).

                  If I do the same, but this time having all citing firms in a single variable, then I will have (within the same cited_appln_id group) 3 observations: two from citing_appln_id 1 (with firms A and B) and a third observation for citing_appln_id 2 (with firm C). If I count the observations with different citing firms, the count will be different (biased): each of the three observations will get a count of 2, which is not right, because firms A and B should effectively count as a single unit. Notice that in this example the observation itself does not enter its own count.
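
                  (To make that concern concrete: in the long layout, I imagine the count would have to tag each citing patent once per group before counting, something like the sketch below. This is only a sketch and it does not yet handle the shared-firm exclusion.)
                  Code:
                  * count each citing patent once per cited patent group,
                  * so a multi-firm patent is not counted once per firm row
                  egen byte one_per_patent = tag(cited_appln_id citing_appln_id)
                  bysort cited_appln_id: egen n_citing_patents = total(one_per_patent)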

                  Sorry if I made this too long, but I wanted to explain it to you as best I can.



                  • #10
                    this data structure came from an intent of merging two previous datasets without using m:m (you advice me once not to use m:m and I keep it always in mind).
                    Ah, well I probably should have told you to reshape those data sets to long first, and then it would have resulted in a long layout in the first place.

                    I'm afraid I find this all very confusing. It's hard for me to keep straight in my head citing patents and cited patents and citing firms and it all sort of blurs together, probably because it all sounds very similar. I guess if I worked with these things regularly, as you do, it would be easier for me to visualize what is needed. Anyway, I didn't really understand your explanation. Can you post a new data example that exhibits the phenomena you are referring to and indicate how you would want the results to come out in that data example?



                    • #11
                      I'm not entirely certain I understand your problem either, but perhaps this code might help. I second the suggestion to reshape your data to long form first -- if this is already the case at a previous point in your code, you may not even need to do the reshape step below.

                      Code:
                      * one row per citing application x firm slot, dropping the empty slots
                      reshape long citing_firm_id, i(cited_appln_id cited_firm_id1 ccc) j(num)
                      drop if missing(citing_firm_id)
                      * how many rows does each citing firm have within a cited patent group?
                      bys cited_appln_id citing_firm_id: gen num_times_citing = _N
                      Then, whenever num_times_citing is more than 1, that is a situation where, for the group defined by cited_appln_id, there are multiple instances of the same citing_firm_id.

                      A quick way of seeing the results:

                      Code:
                      . egen byte tag = tag(cited_appln_id citing_firm_id)
                      . li cited_appln_id cited_firm_id1 citing_firm_id num_times_citing if tag, noobs ab(20) sepby(cited_appln_id)
                      
                        +---------------------------------------------------------------------+
                        | cited_appln_id   cited_firm_id1   citing_firm_id   num_times_citing |
                        |---------------------------------------------------------------------|
                        |            522          2931813           296197                  1 |
                        |            522          2931813           908975                  4 |
                        |            522          2931813           918273                  1 |
                        |            522          2931813          1394838                  1 |
                        |            522          2931813          1709046                  1 |
                        |            522          2931813          2931813                 13 |
                        |            522          2931813          4073560                  1 |
                        |---------------------------------------------------------------------|
                        |            540           296791           296791                  2 |
                        |            540           296791          3091119                  1 |
                        |            540           296791          3096471                  1 |
                        |---------------------------------------------------------------------|
                        |           1186           140411           140411                  3 |
                        |           1186           140411           238661                  2 |
                        |           1186           140411           557078                  1 |
                        |           1186           140411           981997                  1 |
                        +---------------------------------------------------------------------+
                      Last edited by Hemanshu Kumar; 20 Dec 2022, 19:15.



                      • #12
                        Dear Clyde Schechter and Hemanshu Kumar , sorry for the delay in answering, but I was working out how to take a sample that covers what I think will give you an idea of the problem. First, Clyde’s suggestion is interesting, but I cannot reshape to long with any of the suggested commands (memory problems with -tolong- and -greshape-, and -reshape- tells me that there is not a unique combination of citing_appln_id and ccc, which is expected because a patent can cite several patents in the same document and ccc is a counter based on cited patents). That said, I managed to put the data in long format using the -append- command (see the new dataex below).

                        Now that the data are in long format, I think I can go directly to the real task (without the intermediate step of knowing whether any of the firms among the citing firm variables in _n is also among the citing firm variables in _n+x).

                        What I need is to count, for each group of cited patents (cited_appln_id) in each citing_year, how many other citing patents there are. “Other” means that those patents should come neither from the same cited firm (captured in the cited_firm_id* variables) nor from the same citing firm. The problem is the patents with more than one firm. In the wide format (as the data were structured before) this was easy to see (all citing firms were in different variables of the same row, but the loop was endless). Now, a patent with more than one firm occupies several rows, so I do not want to double count those rows in the counting.

                        Looking at cited_appln_id==1226 in 2012 (see the table below), the first two observations are the same citing patent (citing_appln_id), so for the counting these two observations should act as a single observation. For the first observation, there are 3 other citing patents in the same cited patent group in the same year (observations 3, 4 and 5). Since the second observation is in fact the same patent, it must not be counted for the first observation. The count for the second observation should be the same, since they are the same citing patent. For the third observation, the count is 2: only the last observation and the first/second pair (which acts as a single observation) come from different citing and cited firms, taking the third observation as reference.
                        Notice that if the citing firm in the second observation were 3657, then neither the first nor the second observation should be counted for the third observation: since observations 1 and 2 are the same citing patent and act as a single observation, the firm in the third observation would then be among the firms of that first patent (observations 1 and 2).
                        citing_appln_id  cited_appln_id  citing_firm_id1  cited_firm_id1  citing_year  ccc  Count
                              332431744            1226           697362          140411         2012   19      3
                              332431744            1226          1904196          140411         2012   20      3
                              338133576            1226             3657          140411         2012   21      2
                              338133577            1226             3657          140411         2012   22      2
                              341003673            1226          1325123          140411         2012   23      3
                        The same happens for cited_appln_id==42. There, observations 2 and 3 are the same citing patent, so they should count as 1 for any other citing patent in the same group (cited patent and year), and should not count at all for the observations of that same citing patent (here, observations 2 and 3 themselves).

                        Summarizing: the counting should be done within cited patent and year, not counting those other citing patents that come from the same citing or cited firms. And if a citing patent within the group and year appears in several observations (because the patent has more than one firm), it should count only as 1 for other citing patents (with different citing and cited firms), while not counting at all towards its own observations.
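
                        In case it makes the goal more concrete, something along these lines is roughly what I imagine (only a sketch built with -joinby-; it covers the same-patent and shared-citing-firm exclusions, but not yet the cited-firm/self-citation rule, and citing patents left with no valid partner drop out instead of getting a count of 0):
                        Code:
                        * pair each row with the other rows in the same cited patent and year
                        preserve
                        keep cited_appln_id citing_year citing_appln_id citing_firm_id1
                        rename (citing_appln_id citing_firm_id1) (other_appln_id other_firm_id)
                        tempfile others
                        save `others'
                        restore
                        joinby cited_appln_id citing_year using `others'
                        * never count the citing patent against itself
                        drop if other_appln_id == citing_appln_id
                        * exclude a partner patent if any of its firm rows matches any of ours
                        gen byte same_firm = other_firm_id == citing_firm_id1
                        bysort cited_appln_id citing_year citing_appln_id other_appln_id: egen byte drop_pair = max(same_firm)
                        drop if drop_pair
                        * count each remaining partner patent once per citing patent
                        egen byte first_pair = tag(cited_appln_id citing_year citing_appln_id other_appln_id)
                        bysort cited_appln_id citing_year citing_appln_id: egen count_others = total(first_pair)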

                        I really hope it is much clearer now. Please, let me know if any other explanation is needed. And thanks a lot for your help!!!

                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input double citing_appln_id float citing_year double cited_appln_id long citing_pub_date float(citing_firm_id1 cited_firm_id1 cited_firm_id2 cited_firm_id3 ccc)
                         15914251 2003    1 20030709   20671 2125445       . .  1
                         15937040 2003    1 20031126  540830 2125445       . .  2
                         16122608 2006    1 20060531 3449971 2125445       . .  3
                         16300347 2007    1 20070516 2537543 2125445       . .  4
                         55416440 2009    1 20090211 1663023 2125445       . .  5
                         15971185 2004    2 20040630 1416041  395785 1913710 .  1
                        507275964 2019    2 20190710 2024639  395785 1913710 .  2
                         15832997 2001    4 20010725   77933   77933       . .  1
                         15818506 2002    4 20020313 2079773   77933       . .  2
                           177563 2004    4 20040728 1432953   77933       . .  3
                            40340 2008    8 20081022 1982247  188973       . .  1
                        424539427 2015    9 20150715 1211494 1861960       . .  1
                           315410 2008   21 20080213 2207387  778539 2281836 .  1
                        417362469 2015   21 20150902 1164943  778539 2281836 .  2
                        417363086 2015   21 20150902 1164943  778539 2281836 .  3
                         16122009 2006   23 20060503  888369 1663881       . .  1
                        406381978 2014   23 20140917 2860620 1663881       . .  2
                         16402844 2009   37 20090401 3457296 1611455       . .  1
                         55321597 2009   38 20090107 1153579 1153579       . .  1
                           308298 2007   42 20070124  441693  698218       . .  1
                         16331494 2007   42 20071031    1301  698218       . .  2
                         16331494 2007   42 20071031 2764638  698218       . .  3
                        341312005 2013   45 20130605  841007  841007       . .  1
                        381189080 2014   52 20140820 2931813 3142237       . .  1
                        267633559 2011   53 20110119 2282952  476769       . .  1
                        274645935 2010   55 20101020 3273351 3273351       . .  1
                           219419 2004   56 20041229  723786  718348       . .  1
                         55402958 2010   58 20100414  707578  855696 1911535 .  1
                        411462882 2014   58 20141119 1915882  855696 1911535 .  2
                        415014423 2015   58 20150617 1322452  855696 1911535 .  3
                            65413 2009   59 20090311  352268  352268       . .  1
                         16296111 2007   61 20070228 1663023  698218       . .  1
                        407212071 2014   61 20140806 1663023  698218       . .  2
                         16096647 2005   62 20051130  580131  621731       . .  1
                        417626862 2015   62 20151111 1722279  621731       . .  2
                        408934947 2015   63 20150701  558120 3418455       . .  1
                           230261 2007   75 20071010  971192  438061       . .  1
                         16400793 2009   75 20090218  439120  438061       . .  2
                           411326 2010   75 20100113 2650367  438061       . .  3
                         55065299 2010   75 20100421 2650367  438061       . .  4
                        352907144 2013   75 20130918  192823  438061       . .  5
                        409346005 2013   75 20131204 2650367  438061       . .  6
                        450737356 2016   75 20160928 1781211  438061       . .  7
                        273682949 2010  522 20100825 2931813 2931813       . .  1
                        315557296 2010  522 20101124 1709046 2931813       . .  2
                        315557296 2010  522 20101124 2931813 2931813       . .  3
                        274384595 2011  522 20110810 2931813 2931813       . .  4
                        333687126 2012  522 20121031 2931813 2931813       . .  5
                        334378122 2012  522 20121205 2931813 2931813       . .  6
                        336160680 2013  522 20130220 2931813 2931813       . .  7
                        336160680 2013  522 20130220 4073560 2931813       . .  8
                        337102271 2013  522 20130320 2931813 2931813       . .  9
                        337102281 2013  522 20130320 2931813 2931813       . . 10
                        352267359 2013  522 20130925 2931813 2931813       . . 11
                        379237343 2013  522 20130626 1394838 2931813       . . 12
                        380717448 2014  522 20140806  296197 2931813       . . 13
                        381189080 2014  522 20140820 2931813 2931813       . . 14
                        412500492 2015  522 20150520  908975 2931813       . . 15
                        415658469 2015  522 20150909  908975 2931813       . . 16
                        418707239 2015  522 20151209 2931813 2931813       . . 17
                        419016602 2015  522 20151216 2931813 2931813       . . 18
                        438831616 2016  522 20161012  908975 2931813       . . 19
                        439633449 2016  522 20161026  908975 2931813       . . 20
                        474527945 2018  522 20181128  918273 2931813       . . 21
                        524645129 2020  522 20200805 2931813 2931813       . . 22
                         56840315 2009  540 20090617  296791  296791       . .  1
                         56840315 2009  540 20090617 3091119  296791       . .  2
                         56840315 2009  540 20090617 3096471  296791       . .  3
                        273300774 2010  540 20100714  296791  296791       . .  4
                              644 2009  636 20091118    5661    5661 1195552 .  1
                              644 2009  636 20091118 1195552    5661 1195552 .  2
                         56755306 2010  636 20100414    5661    5661 1195552 .  3
                         56755306 2010  636 20100414 1195552    5661 1195552 .  4
                        409742446 2014  636 20140305 2569580    5661 1195552 .  5
                         16978768 1993  737 19930811   69690   11335       . .  1
                         16978768 1993  737 19930811 2135375   11335       . .  2
                         17302359 1998  737 19981216 1686986   11335       . .  3
                         17302361 1998  737 19981118 1686986   11335       . .  4
                           208669 2003 1186 20031022  140411  140411       . .  1
                         15909980 2003 1186 20030102  557078  140411       . .  2
                         15999323 2003 1186 20031022  140411  140411       . .  3
                         15999323 2003 1186 20031022  238661  140411       . .  4
                         16031254 2004 1186 20040102  981997  140411       . .  5
                        336876707 2011 1186 20111221  140411  140411       . .  6
                        336876707 2011 1186 20111221  238661  140411       . .  7
                         15824384 2002 1226 20020605  140411  140411       . .  1
                         15902830 2002 1226 20020911  140411  140411       . .  2
                         15902831 2002 1226 20020911  140411  140411       . .  3
                         15903340 2002 1226 20020925  140411  140411       . .  4
                         15945196 2003 1226 20031217  538054  140411       . .  5
                         15959096 2004 1226 20040324  140411  140411       . .  6
                         16284755 2006 1226 20060712    3657  140411       . .  7
                           308274 2007 1226 20070207  538063  140411       . .  8
                           312748 2007 1226 20071219  538063  140411       . .  9
                         55060965 2009 1226 20090311    3657  140411       . . 10
                         56671571 2009 1226 20090722  505020  140411       . . 11
                        315728818 2010 1226 20100811    3657  140411       . . 12
                        320769678 2010 1226 20101117 1928613  140411       . . 13
                        324145471 2010 1226 20101215    3657  140411       . . 14
                        321288832 2011 1226 20110209 1928613  140411       . . 15
                        326067183 2011 1226 20110309    3657  140411       . . 16
                        328004893 2011 1226 20110427    3657  140411       . . 17
                        328004917 2011 1226 20110525    3657  140411       . . 18
                        332431744 2012 1226 20120704  697362  140411       . . 19
                        332431744 2012 1226 20120704 1904196  140411       . . 20
                        338133576 2012 1226 20120201    3657  140411       . . 21
                        338133577 2012 1226 20120321    3657  140411       . . 22
                        341003673 2012 1226 20120801 1325123  140411       . . 23
                        413982151 2014 1226 20140430 1928613  140411       . . 24
                        418419198 2014 1226 20141008 1928613  140411       . . 25
                        409998359 2015 1226 20150701 1928613  140411       . . 26
                        438491636 2015 1226 20151007  540780  140411       . . 27
                        438491664 2015 1226 20151007  540780  140411       . . 28
                        420725092 2016 1226 20160210 1322929  140411       . . 29
                        421528978 2016 1226 20160309 1322929  140411       . . 30
                        423239555 2016 1226 20160504 1322929  140411       . . 31
                        424539747 2016 1226 20160622 1322929  140411       . . 32
                        424539751 2016 1226 20160622 1322929  140411       . . 33
                        426257185 2016 1226 20160810 1322929  140411       . . 34
                        426257189 2016 1226 20160810 1322929  140411       . . 35
                        437284626 2016 1226 20160831 1322929  140411       . . 36
                        438143296 2016 1226 20160921 1322929  140411       . . 37
                        438408396 2016 1226 20160928 1322929  140411       . . 38
                        438663425 2016 1226 20161005 1322929  140411       . . 39
                        447230001 2016 1226 20160622 3149826  140411       . . 40
                        458143228 2016 1226 20161207 1928613  140411       . . 41
                        445024421 2017 1226 20170405 1322929  140411       . . 42
                        445024432 2017 1226 20170405 1322929  140411       . . 43
                        478230320 2017 1226 20171018 3149826  140411       . . 44
                        478351862 2017 1226 20171025 1928613  140411       . . 45
                        477661080 2018 1226 20180110 3149826  140411       . . 46
                        483707380 2018 1226 20180418 1928613  140411       . . 47
                        487374497 2018 1226 20180509  228839  140411       . . 48
                        487374497 2018 1226 20180509 3149826  140411       . . 49
                        489799033 2018 1226 20180711 3149826  140411       . . 50
                        490336834 2018 1226 20180711 1928613  140411       . . 51
                        493704409 2018 1226 20180926 1928613  140411       . . 52
                        484506355 2019 1226 20190410 1322929  140411       . . 53
                        496995569 2019 1226 20190109 1928613  140411       . . 54
                        499795795 2019 1226 20190206 1928613  140411       . . 55
                        502703402 2019 1226 20190327 1928613  140411       . . 56
                        510963642 2019 1226 20190904 1928613  140411       . . 57
                        517137855 2019 1226 20191218 1928613  140411       . . 58
                        497704351 2020 1226 20200205 1322929  140411       . . 59
                        498068991 2020 1226 20200212 1322929  140411       . . 60
                        519627349 2020 1226 20200212 3149826  140411       . . 61
                        522905490 2020 1226 20200422 3149826  140411       . . 62
                        523245625 2020 1226 20200506 1928613  140411       . . 63
                        527225683 2020 1226 20200812 1928613  140411       . . 64
                        528222773 2020 1226 20200916 1928613  140411       . . 65
                        530694846 2020 1226 20200923 1928613  140411       . . 66
                        540195705 2021 1226 20210317 1928613  140411       . . 67
                        542422974 2021 1226 20210414 1928613  140411       . . 68
                        end
                        Last edited by Doris Rivera; 22 Dec 2022, 07:22.



                        • #13
                          I am more confused than before.

                          Looking at your example table (not the -dataex-), I think I understand why the first and second observations should only be counted as 1, because they have the same citing_appln_id. But why shouldn't the third and fourth also be counted as one, since they have the same citing_firm_id?

                          Where am I going wrong here?



                          • #14
                            Dear Clyde Schechter, obs 3 and 4 should count as two separate observations for other citing patents with different citing firms. That is, for obs 1, the valid observations in the counting are obs 3, 4 and 5. Why? Because these are citing_appln_ids different from obs 1, coming from citing firms different from the ones in the patent in obs 1 (notice that this patent has two firms involved: one in obs 1 and the second in obs 2), and because the citing firms in obs 3, 4 and 5 are also different from the cited firms in obs 1 and 2. The same happens for obs 2. However, for obs 3, the valid observations for the counting are obs 5 and (obs 1 or 2, because they are a single patent). And the same happens for obs 4. For obs 5, the valid observations for the counting are (obs 1 or 2), 3 and 4 (that is, 3 in total).

                            The idea is not to count the other rows within cited_appln_id in each year, but to count the other citing_appln_ids (knowing that a citing patent can appear in several observations when it has more than one firm, and should then count as a single patent), provided they come from different citing firms and cited firms.

                            If we had a 6th observation in which the citing firm were the cited firm (140411), then we would not do the counting for it, because it would be a self-citation; its count would be missing and it should not count towards any other citing patent within the group. The same would happen if obs 5 had a second cited firm equal to its citing firm (1325123). That would also be a self-citation within the same row (the citing firm is one of the cited firms).
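
                            (The self-citation check itself seems simple in the long layout; I imagine something like the sketch below, assuming up to three cited_firm_id* variables as in the dataex.)
                            Code:
                            * flag rows where the citing firm is one of the cited firms
                            gen byte self_cite = inlist(citing_firm_id1, cited_firm_id1, cited_firm_id2, cited_firm_id3)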

                            Thanks a lot for your time and help Clyde!



                            • #15
                              I'm sorry, but I'm just getting more confused. I think I need to give up at this point. I'm sorry. I hope somebody else can make sense of this.
