Calculate how many words separated by a comma match between two string variables

Dongwuk Kim

Join Date: Oct 2021
Posts: 9

Calculate how many words separated by a comma match between two string variables

10 Oct 2021, 20:32

gameid	year	credits1	credits2
1	1996	Aleksandr Kravchuk,Alexey Menshikov,Alexey Menshikov,Rostislav Garbar	George Bain,Ted Regulski
2	1995	David H. Tolley,David H. Tolley,Steven Metcalf,Steven Metcalf	James Tripp,Paul Margrave,Steven Metcalf

Hi everyone,

I have a dataset of people's credit for a video game portrayed in the table above. Each name is separated by a comma, and it varies from 5 people to 1500people.
Is there a way to see how many names are in common between two variables of credits1 and credits 1?
I was searching through the forum but could not find any topic about finding a number of matching multiple words separated by a comma.

Thank you!

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#2

10 Oct 2021, 21:10

This is made more complicated because the same name can appear more than once within credits1 or credits2. I will assume that, for example, in gameid 2, 1995, you want to count only one occurrence in common. That is the fact that Steven Metcalf appears twice in credits 1 does not cause his occurrence in credits2 to count as two joint occurrences.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte gameid int year str70 credits1 str40 credits2 1 1996 "Aleksandr Kravchuk,Alexey Menshikov,Alexey Menshikov,Rostislav Garbar " "George Bain,Ted Regulski" 2 1995 "David H. Tolley,David H. Tolley,Steven Metcalf,Steven Metcalf " "James Tripp,Paul Margrave,Steven Metcalf" end split credits1, parse(",") gen(c1) split credits2, parse(",") gen(c2) reshape long c1, i(gameid year) replace c1 = trim(itrim(c1)) by gameid year c1, sort: keep if _n == 1 drop _j reshape long c2, i(gameid year c1) replace c2 = trim(itrim(c2)) by gameid year c1 c2, sort: keep if _n == 1 by gameid year: egen wanted = total(c1 == c2 & !missing(c1, c2)) by gameid year, sort: keep if _n == 1 drop c1 c2 _j

Note: The code crucially assumes that there is only one observation for any combination of gameid and year. The commands applyng the trim() and itrim() functions to c1 and c2 are not strictly necessary in your data, but will be if sometimes the commas are followed by spaces. And, evidently, the code will not account for misspellings. Text data of this kind, especially when dealing with names, is often riddled with inconsistencies like those, and there is only so much one can do to work around them.

In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Last edited by Clyde Schechter; 10 Oct 2021, 21:19.
1 like
Comment

Ali Atia

Join Date: May 2020
Posts: 737

10 Oct 2021, 22:09

An alternative approach, which retains Clyde's assumption that duplicated names do not count twice. Also, I assume duplicated names are always adjacent as they are in your example.

Code:

gen nodupscredits1 = ustrregexra(ustrregexra(credits1,"(.*)(,\1)+",",$1"),"(,){2,}|^,","$1")
gen nodupscredits2 = ustrregexra(ustrregexra(credits2,"(.*)(,\1)+",",$1"),"(,){2,}|^,","$1")
gen matches = ((strlen(nodupscredits1)-strlen(subinstr(nodupscredits1,",","",.)))+1)-((strlen(ustrregexra(ustrregexra(nodupscredits1,subinstr(nodupscredits2,",","|",.),""),"((,+)$)|(^(,+))",""))-strlen(subinstr(ustrregexra(ustrregexra(nodupscredits1,subinstr(nodupscredits2,",","|",.),""),"((,+)$)|(^(,+))",""),",","",.)))+1)
drop nodups*

Last edited by Ali Atia; 10 Oct 2021, 22:15.

Comment

Dongwuk Kim

Join Date: Oct 2021

Posts: 9
#4

11 Oct 2021, 20:15

Thank you Clyde and Ali! I'll try your methods and tell you how it goes!

Also, I tried using dataex but stata said:

input float(gameid year) strL(creat_credits suits_credits)
data width (7061 chars) exceeds max linesize. Try specifying fewer variables
r(1000);

so I couldn't use dataex....
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#5

11 Oct 2021, 21:20

Yes, for this particular data, -dataex- will fail. It doesn't handle long string variables very well because it tries to write each observation on a single line in the Results window. However, your particular example could have been done with -dataex-: you can see in my earlier post exactly how it looks when you do that. To accomplish that, you can just create your example in the Data Editor in Stata, and then run -dataex-. Anyway, please do use -dataex- in the more general situation going forward.
Comment
Dongwuk Kim

Join Date: Oct 2021

Posts: 9
#6

20 Oct 2021, 13:12

I'm having trouble with Stata keeps crashing when I am using both of your methods. I am guessing because the strL is extremely long. I am currently collapsing different levels of names from each video game credit, to a single year in a company with multiple games and then finally a year in the industry with multiple companies.
Clyde's method crashed in the single video game credits level when I tried to reshape the long string variable with a length of about 5000. Ali's method worked for individual game levels but is crashing when the string variable is at the company level or industry level. I waited overnight but it was still non-responsive ( the longest length I have found is 979,232).
Do you have any tips on dealing with this issue?

Last edited by Dongwuk Kim; 20 Oct 2021, 13:15.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#7

20 Oct 2021, 13:56

-reshape- can be very slow with very large data sets. I would not conclude that Stata has crashed just because it is still hanging on a -reshape- command after an overnight wait. You might try replacing the -reshape long- commands with Rafal Raciborski's -tolong-, which is available from SSC and is consdierably faster than -reshape-. I don't know, however, if it is compatible with strL variables.
Comment

Announcement

Calculate how many words separated by a comma match between two string variables

Comment

Comment

Comment

Comment

Comment

Comment