Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Partial String Matches

    Hi Statalist,

    I have two datasets which I would like to merge. Looking at auction sales data.

    Dataset A is comprehensive in terms of the information and variables it has.
    Dataset B is limited in information (only has name, date_sale, and price).

    I would like to add dataset B to dataset A, and, where an observation from B matches one of A I would like to complete the missing information in B from A.
    Whilst I am confident I would be able to do this, I am faced with a problem. The names by which I would like to match observations are not consistent between datasets.

    I'll illustrate this with an example.

    Dataset A (name): Bowmore 1982 29 Year Old

    Dataset B (name): Bowmore 1982 Vintage Release 29

    Dataset A also contains strings that have very similar titles, e.g. "Bowmore 1964 Black 29 Year Old 1st Edition

    In all cases the strings match on the basis of the distillery name (e.g. Bowmore), in almost all cases the years (1982) are contained in both strings, and in most cases the ages (29) are also included.

    Is there a way to reliably match these observations? I believe I am looking for code that finds the "best partial match".

    Thank you in advance!

    edit: using STATA 15.1 - name is in str159 format.

  • #2
    Problem has been solved.
    Could admin kindly delete thread, or I am otherwise happy to share how I solved the problem should someone be interested.

    Comment


    • #3
      hi Tom, I am interested at how to solve this fuzzy match problem, since I am doing a project similar as yours.

      Comment

      Working...
      X