Treating duplicate IDs in my data

Maria Silverado

Join Date: Aug 2024

Posts: 7
#1

16 Feb 2025, 17:29

Hello,

I have the following data from the BoardEx, Thomson Reuters people link file on WRDS.

It's clear that this is the same person, Craig Smith. However, he has 2 different DirectorID (202943 and 310278) as well as 2 different PersonID (16148674 and 14606). How do I create/give him one unique individual ID?

DirectorID  directorname        PERSONID  score  dup_personid  dup_directorid  OWNER
202943      craig smith         16148674  1      0             1               SMITH CRAIG R
202943      craig smith         14606     1      1             1               SMITH CRAIG R
310278      doctor craig smith  14606     1      1             0               SMITH CRAIG R. M.D.
202943      craig smith         14606     1      1             1               SMITH CRAIG ROBERT M.D.

Thank you for your help!
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#2

17 Feb 2025, 01:17

If you think one of the ids is wrong, you can change it using the replace command:

Code:

// look at all occurrences of DirectorID == 310278
// decide if you think they are all wrong
list DirectorID directorname PERSONID score OWNER if DirectorID == 310278

// if that is the case, you correct it:
replace DirectorID = 202943 if DirectorID == 310278

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Andrew Musau

Join Date: Oct 2014

Posts: 10213
#3

17 Feb 2025, 03:22

Assuming #1 refers to a general problem rather than a specific issue related to one pair of entries in the dataset, you can consider using fuzzy matching, where you match separately on 'directorname' and 'OWNER' and then specify similarity cutoffs for both variables to get matches. See

Code:

ssc describe matchit
Andrew Musau

Join Date: Oct 2014

Posts: 10213
#4

17 Feb 2025, 03:32

Here are some examples where the latter specifies two criteria, similar to your case:

https://www.statalist.org/forums/for...s-of-companies

https://www.statalist.org/forums/for...order-to-group

Finally, I would add that if I were confronted with this problem today, I would simply provide the list of names to an AI chatbot and ask it to group the names. It would do this faster and more efficiently than any fuzzy matching code you could write.

Last edited by Andrew Musau; 17 Feb 2025, 03:51.
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#5

17 Feb 2025, 09:05

Originally posted by Andrew Musau:

Finally, I would add that if I were confronted with this problem today, I would simply provide the list of names to an AI chatbot and ask it to group the names. It would do this faster and more efficiently than any fuzzy matching code you could write.

No doubt it would be faster and more efficient. But if you were asked to explain and defend the grouping result, what could you say? In fact, if the number of instances is large, would you trust that it had been done "correctly"? (I use scare quotes here because I don't know quite what "correct" even means in the context of fuzzy matching.)

Actually, the whole premise of the original post leaves me queasy. Craig and Smith are pretty common given names and surnames, and Robert is a common middle name. I would not be quick to presume that two records with the name Craig R Smith refer to the same person. Even knowing that both are MDs wouldn't persuade me; a quick Google search for "Craig R Smith MD" turns up at least four clearly different persons on just the first two pages. Now perhaps the database O.P. is using is created from a well-defined population that reduces the number of possibilities, but to satisfy me, that would have to be marked out in other variables. Do these Craig R Smith MDs all have the same date of birth (or I suppose I'd settle for the same age if no birthdate is available)? What about a residential zip code? Something else?
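
As a rough illustration of the kind of check described above, here is a minimal Stata sketch. It assumes the file has some additional identifying variable; birthyear is a hypothetical name, and the three rows are made up for the example, not taken from the original data. The name key (first two words of OWNER) is also only an assumption about how one might group candidate duplicates.

```stata
clear
* Made-up illustration data; birthyear is a hypothetical variable
input long DirectorID str30 OWNER int birthyear
202943 "SMITH CRAIG R"       1950
202943 "SMITH CRAIG R"       1950
310278 "SMITH CRAIG R. M.D." 1962
end

* Crude name key: surname plus first given name (an assumption, not a rule)
gen name_key = word(OWNER, 1) + " " + word(OWNER, 2)

* 1 if every record with the same name key has the same birthyear, else 0
* (a missing birthyear sorts last and therefore counts as a disagreement)
bysort name_key (birthyear): gen byte dob_agrees = birthyear[1] == birthyear[_N]

* Treat records as one person only if both name key and birthyear match
egen long person_uid = group(name_key birthyear)

list DirectorID OWNER birthyear dob_agrees person_uid, sepby(name_key)
```

In this toy example dob_agrees is 0 for all three rows, which would be a signal not to merge the two DirectorID values on name alone.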
Andrew Musau

Join Date: Oct 2014

Posts: 10213
#6

17 Feb 2025, 11:00

Originally posted by Clyde Schechter:

No doubt it would be faster and more efficient. But if you were asked to explain and defend the grouping result, what could you say? In fact, if the number of instances is large, would you trust that it had been done "correctly." (Using scarequotes here because I don't know quite what correct even means in the context of fuzzy matching.)

I agree with this criticism, Clyde, and it's a valid concern—especially about interpretability and trust in the results. However, the same critique applies if the categorization were done manually, as human judgment is also subjective and prone to inconsistencies.

One significant advantage of using AI over fuzzy matching is its ability to leverage extensive knowledge bases, enabling it to recognize context-specific variations more accurately. For example, if locals use different names for the same place, an AI model trained on geographic or cultural data might correctly group them, whereas traditional fuzzy matching would likely miss these semantic connections.

While early AI models were prone to errors, the technology has rapidly advanced, with models becoming more context-aware and precise. I expect that as algorithms continue to improve, concerns about accuracy and defensibility will diminish. This may eventually make AI a more reliable tool for categorization than either manual grouping or fuzzy matching.
Maria Silverado

Join Date: Aug 2024

Posts: 7
#7

18 Feb 2025, 10:49

Thank you for your replies. By looking at the whole data and other factors, I have established that Craig Smith and Dr. Craig Smith are two different individuals. Thus, DirectorID is correct. The issue is with PersonID, because it says that Dr. Craig Smith and Craig Smith are the same person: 14606.
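
For reference, one way the situation described here could be flagged in Stata is sketched below. It uses the four rows from #1 and assumes, as stated above, that DirectorID is the identifier to trust. The variable names personid_conflict, directorid_multi, and person_uid are made up for the illustration.

```stata
clear
* The four rows shown in #1 (only the ID and name columns)
input long DirectorID str20 directorname long PERSONID
202943 "craig smith"        16148674
202943 "craig smith"        14606
310278 "doctor craig smith" 14606
202943 "craig smith"        14606
end

* 1 if a PERSONID value is attached to more than one DirectorID
bysort PERSONID (DirectorID): gen byte personid_conflict = DirectorID[1] != DirectorID[_N]

* 1 if a DirectorID value carries more than one PERSONID
bysort DirectorID (PERSONID): gen byte directorid_multi = PERSONID[1] != PERSONID[_N]

* Taking DirectorID as correct, build a compact person identifier from it
egen long person_uid = group(DirectorID)

list, sepby(DirectorID)
```

Here PERSONID 14606 is flagged as conflicting because it appears under both 202943 and 310278, while person_uid keeps the two directors apart.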