Partial String Matches

Tom Cordiez

Join Date: May 2020

Posts: 7
#1

Partial String Matches

15 Jun 2020, 14:10

Hi Statalist,

I have two datasets which I would like to merge. Looking at auction sales data.

Dataset A is comprehensive in terms of the information and variables it has.
Dataset B is limited in information (only has name, date_sale, and price).

I would like to add dataset B to dataset A, and, where an observation from B matches one of A I would like to complete the missing information in B from A.
Whilst I am confident I would be able to do this, I am faced with a problem. The names by which I would like to match observations are not consistent between datasets.

I'll illustrate this with an example.

Dataset A (name): Bowmore 1982 29 Year Old

Dataset B (name): Bowmore 1982 Vintage Release 29

Dataset A also contains strings that have very similar titles, e.g. "Bowmore 1964 Black 29 Year Old 1st Edition

In all cases the strings match on the basis of the distillery name (e.g. Bowmore), in almost all cases the years (1982) are contained in both strings, and in most cases the ages (29) are also included.

Is there a way to reliably match these observations? I believe I am looking for code that finds the "best partial match".

Thank you in advance!

edit: using STATA 15.1 - name is in str159 format.
Tags: partial match, string variables
Tom Cordiez

Join Date: May 2020

Posts: 7
#2

19 Jun 2020, 05:54

Problem has been solved.
Could admin kindly delete thread, or I am otherwise happy to share how I solved the problem should someone be interested.
Comment
Hang Lai

Join Date: Jan 2021

Posts: 1
#3

07 Jan 2021, 01:10

hi Tom, I am interested at how to solve this fuzzy match problem, since I am doing a project similar as yours.
Comment

Announcement

Partial String Matches

Comment

Comment