Hi Statalist,
I have two datasets which I would like to merge. Looking at auction sales data.
Dataset A is comprehensive in terms of the information and variables it has.
Dataset B is limited in information (only has name, date_sale, and price).
I would like to add dataset B to dataset A, and, where an observation from B matches one of A I would like to complete the missing information in B from A.
Whilst I am confident I would be able to do this, I am faced with a problem. The names by which I would like to match observations are not consistent between datasets.
I'll illustrate this with an example.
Dataset A (name): Bowmore 1982 29 Year Old
Dataset B (name): Bowmore 1982 Vintage Release 29
Dataset A also contains strings that have very similar titles, e.g. "Bowmore 1964 Black 29 Year Old 1st Edition
In all cases the strings match on the basis of the distillery name (e.g. Bowmore), in almost all cases the years (1982) are contained in both strings, and in most cases the ages (29) are also included.
Is there a way to reliably match these observations? I believe I am looking for code that finds the "best partial match".
Thank you in advance!
edit: using STATA 15.1 - name is in str159 format.
I have two datasets which I would like to merge. Looking at auction sales data.
Dataset A is comprehensive in terms of the information and variables it has.
Dataset B is limited in information (only has name, date_sale, and price).
I would like to add dataset B to dataset A, and, where an observation from B matches one of A I would like to complete the missing information in B from A.
Whilst I am confident I would be able to do this, I am faced with a problem. The names by which I would like to match observations are not consistent between datasets.
I'll illustrate this with an example.
Dataset A (name): Bowmore 1982 29 Year Old
Dataset B (name): Bowmore 1982 Vintage Release 29
Dataset A also contains strings that have very similar titles, e.g. "Bowmore 1964 Black 29 Year Old 1st Edition
In all cases the strings match on the basis of the distillery name (e.g. Bowmore), in almost all cases the years (1982) are contained in both strings, and in most cases the ages (29) are also included.
Is there a way to reliably match these observations? I believe I am looking for code that finds the "best partial match".
Thank you in advance!
edit: using STATA 15.1 - name is in str159 format.
Comment