Comparing data in the same field of two rows in a group

Michael Risch

Join Date: Jul 2014

Posts: 23
#1

Comparing data in the same field of two rows in a group

23 Mar 2025, 15:50

Hi all -

I'm doing a project where I need to compare the data of a single field, but by grouped rows (so something like matchit won't really work unless I am misunderstanding it). I've looked for solutions here, but have not gotten many hits. It may well be my search is using the wrong words, so sorry if this is a repeat. Basically, I have multiple patents in a "family" and I want to see if the patents have the same inventors.

So a simplified dataset would be:
patno,familyno,inventor
1,1,A
2,1,A
3,1,B
4,2,C
5,2,C

I want to test all familyno of 1, 2, etc. and have stata report, tag, etc., all patno with an inventor that matches another with the same familyno.

Problem 1: doing the comparison at all
Problem 2: doing a fuzzy comparison
Problem 3: it can be multiple match (that is patno 1 and patno 2 could have both A and B and I want to see if both are the same)

FWIW, this is a followon of this discussion https://www.statalist.org/forums/for...ation-probleim
I have done that consolidation, but the counting methods don't work as well (I don't think) because a) fuzzy matching on names and b) not every family has all or none

Thoughts appreciated.

Last edited by Michael Risch; 23 Mar 2025, 16:25.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

23 Mar 2025, 18:44

I think if I were tackling this problem, the first thing I would do is get past the fuzzy matching part. So I would create a new data set that contains two variables, each containing all of the distinct values of inventor.

Code:

keep inventor duplicates drop clonevar inventor2 = inventor

Then I would use -matchit inventor inventor2- so that you get each name paired up with all its credible aliases. Then based on their similarity scores, calculated by -matchit-, and some judgment, I would then pick a single "reference" spelling for each such name and edit down the data set so that variable inventor contains all the values of inventor from the original data set, and inventor2 contains the corresponding reference spelling. I would save this data set, let's say we call it reference_names.dta, and then

Code:

use original_data_set, clear merge m:1 inventor using reference_names

From that point on, I would use inventor2 as the "real" name of the inventor. That way you can forget about the complication of different spellings of the same name and just focus on writing code that will do the comparisons within families that you want. Now, I have to say that from your description, I have no idea what kind of comparison you want to do here and what the results you want would look like. Perhaps if you showed what the results you want to see based on the example you showed in #1 would look like it would be possible to advise further.

Also, I don't understand what you mean by "patno 1 and patno 2 could have both A and B." In your example schema there is only one variable for inventor. So do you mean that the variable inventor could contain, in the same observation, both A and B? If so, how are they separated from each other within the variable? Or are they just run together? How would you know how to separate them from each other? Again, probably better to show than to try to describe in words.

Last edited by Clyde Schechter; 23 Mar 2025, 18:46.
Comment
Michael Risch

Join Date: Jul 2014

Posts: 23
#3

23 Mar 2025, 19:16

Thanks! This is very helpful. To complicate things, the data is stored first, mi, last so I will have to concatenate.

As for your second question, the gist is that this is table data with patno as the key. So, if a patent has multiple inventors, both (or all 3, or 5 or 50) will show up for the same patent in a 1:many merge

Here's an expanded example.

patno,familyno,inventor
1,1,A
2,1,A
2,1,D
3,1,B
4,2,C
4,2,E
5,2,C
5,2,E

What I want to know is whether two patents in the same family have ALL the same inventors (like Patno 4 and 5, in family 2) or if the inventors are different (like Patno 1 and 3 and family 1) and ideally are different a little (that is, share at least one common inventor but not ALL inventors) (like Patno1 and 2 in family 1)

The project is to identify whether inventor groups have an effect on certain aspects of patenting (that I haven't shown here). I want to know which patents came from the same inventors and which came from different ones within the same family.

Hope that clarifies!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

23 Mar 2025, 22:08

I think this, at least in principle, will do the job:

Code:

tempfile copy save `copy' rangejoin patno 1 . using `copy', by(familyno) suffix(2) rename (patno inventor) =1 drop if missing(patno2) order familyno, first capture program drop one_patent_pair program define one_patent_pair levelsof inventor1, local(list1) levelsof inventor2, local(list2) forvalues i = 1/2 { local list`i': list sort list`i' } gen byte all_same_inventors = `"`list1'"' == `"`list2'"' local overlap: list list1 & list2 gen byte some_common_inventors = (`"`overlap'"' != "") keep familyno patno* all_same_inventors some_common_inventors keep in 1 exit end runby one_patent_pair, by(familyno patno1 patno2) status

I do not work directly with patent data in my own work, but from the limited exposure to this kind of data I have had here on Statalist, I have the impression that these data sets are typically enormous. Consequently, this code may encounter obstacles in terms of overflowing available memory, or perhaps just taking far too long to run. One thing that you can do to avoid these limitations is to prune the data set down to just the variable absolutely necessary for this calculation: familyno, patno, and inventor. In addition, rather than using the inventor names, I suggest you assign each inventor a unique integer code (which will probably have to be a long, not a mere int storage type). If the number of distinct inventors is 65,536 or less, -encode- will do this for you. If the number is greater than that, use -egen, group()- instead. Then keep, in a separate .dta file, a crosswalk between the names and the integer codes for later reference if you will need the names. Then merge the pruned data set on inventor with the crosswalk file, and get rid of the names. Rename the code inventor for use above. (The above code will accept strings or numbers, but for reasons of memory and speed I'm urging you to use numbers.)

-runby- is written by Robert Picard and me, and is available from SSC.

The result of this will be a data set in memory containing one observation for each pair of patents within the same family and a new variable indicating whether that pair has all the same others, and another new variable indicating whether that pair has some overlapping authors. As -runby- iterates through the data set, it will display periodic progress reports showing how many patent-pairs have been processed so far, and the estimated time remaining to completion of the job.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#5

24 Mar 2025, 13:51

For reasons independent of this thread, I had occasion to check something in the PDF documentation for the -levelsof- command, and I incidentally saw that it states that it returns a sorted list of values of the variable. That being the case, the code in #4 contains an unnecessary block:

Code:

forvalues i = 1/2 { local list`i': list sort list`i' }

The above can be removed from the code, and doing so might speed up execution noticeably in a very large data set.
Comment
Michael Risch

Join Date: Jul 2014

Posts: 23
#6

24 Mar 2025, 15:51

Thank you! I'll give it a try. I have 723000 patents (fewer families obviously) x ? inventors/patent. We'll see how it goes. I've got 32MB RAM precisely because I use these types of datasets, so I'll show this to my admin people whenever they balk at why I need the computer I do!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#7

24 Mar 2025, 15:59

32 MB? That's not big at all. Did you mean 32GB?

Also, I just noticed that in #2, I forgot to tell you how to get -rangejoin-, which is not part of official Stata. It is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also from SSC.

Last edited by Clyde Schechter; 24 Mar 2025, 16:02.
Comment
Michael Risch

Join Date: Jul 2014

Posts: 23
#8

24 Mar 2025, 16:02

Originally posted by Clyde Schechter View Post

32 MB? That's not big at all. Did you mean 32GB?

They lied to me! Haha, yes, GB, which I realize is also not that big in the scheme. But at least I'm not stuck in the 1990s
Comment
Michael Risch

Join Date: Jul 2014

Posts: 23
#9

27 Apr 2025, 19:08

Clyde Schechter OK, so I finally got around to doing this. I didn't encode the inventor name variables yet, but I thought I would see if it would go without it as the set isn't THAT huge (about 2m lines with inventors). Rangejoin has been running about 60 minutes now. If this is just going to be a long process then so be it, but the temp files haven't updated in some 40 minutes (there are like 6 of them - 5 are about the size of my original data, one is about twice that), so I don't know what it might be doing - maybe it's all in RAM or pagefile.

My question is whether rangejoin is the step that's supposed to take a long time - I thought it just basically duplicated all the items.

Edited to add - it finished about an hour later. runby is running now - I'll just let it go all night! Thanks again.

Last edited by Michael Risch; 27 Apr 2025, 20:03.
Comment
Michael Risch

Join Date: Jul 2014

Posts: 23
#10

28 Apr 2025, 05:45

OK, 8 hours on, runby is still running. But I wonder if it is working? Here is the status for the first few - the errors match the count all the way down, and there are 0 saved to date.

elapsed ----------- by-groups ---------- ------- observations ------ time
time count errors no-data processed saved remaining
------------------------------------------------------------------------------------
00:00:01 61 61 0 487 0 31:08:27
00:00:02 126 126 0 940 0 32:10:16
00:00:03 170 170 0 1,209 0 37:29:41
00:00:04 253 253 0 1,909 0 31:39:30
Comment
Michael Risch

Join Date: Jul 2014

Posts: 23
#11

28 Apr 2025, 06:49

Data set was totally empty. I'm going to run again up to the point of running runby to make sure the data I'm working with matches what the program is looking for.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#12

28 Apr 2025, 08:54

Once you have done that, I suggest picking out a small number of familyno patno1 patno2 groups of observations, keeping just those, and doing the -runby- command with them, but adding the -verbose- option (and removing -status-). That way -runby- will let you see error messages emanating from program one_patent_pair. That should prove helpful in figuring out what's going wrong.

One you've debugged everything on this smaller data set, then re-run using the full data set, putting back the -status- option and removing -verbose-.
Comment
Michael Risch

Join Date: Jul 2014

Posts: 23
#13

28 Apr 2025, 17:15

Good news - just a typo. Running it now and hopefully will work! Thanks for your help.Turns out speed and RAM weren't the problems, but the pagefile took like 100GB and filled my disk the first few times I ran rangejoin. I pruned a bunch of variables I don't need and it went a lot faster
Comment
Michael Risch

Join Date: Jul 2014

Posts: 23
#14

28 Apr 2025, 17:54

Nevermind my last post if you saw it. It worked! My problem was that my pruned data didn't include all observations. When I pruned to include the whole family, it worked like a charm. Thanks! This is really great stuff.

Last edited by Michael Risch; 28 Apr 2025, 18:43.
Comment

Announcement

Comparing data in the same field of two rows in a group

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment