Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging data sets based on common words within 2 string variables

    Dear all,

    I am trying to match observations from 2 data sets based on common strings within 2 string variables.

    The data has the following structure:

    Data Set 1:
    Company Number Name
    A 1 Robert G., Diane S.
    A 2 Julia C.
    A 3 Robert G., Tim H.
    B 1 Brad D., Olaf E.
    B 2 Richard W., Albert C.

    Data Set 2:
    Company Description
    A Robert G. became CEO [...]
    A Diane S. was appointed [...]
    B Richard W. died [...]
    B Olaf E. was appointed [...]
    • Variable "Name" in data set 1 and variable "Description" in data set 2 are both strings
    • First the observations can be matched based on the variable "Company"
    • Then the goal is to find an identical string (one word) in the fields "Name" and "Description" based on which the two data sets can be merged in order to assign each name within variable "Description" in data set 2 to the correct "Number" from data set 1.
    • This common string would be a new variable in the merged dataset called "MatchName"
    • In Data Set 2, each name within "Description" only appears once for the respective company, but it may appear several times within "Name" in data set 1 but for different "Numbers" --> see example: Robert G. appears in data set 1 for Number 1 and 3; in data set 2 he only appears once, but in the merged data set he will appear twice (assigned to "Number" 1 and 3)
    The merged data set would then look like this:

    Company Number Name Description MatchName
    A 1 Robert G., Diane S. Robert G. became CEO [...] Robert G.
    A 1 Robert G., Diane S. Diane S. was appointed [...] Diane S.
    A 3 Robert G., Diane S. Robert G.
    B 1 Brad D., Olaf E. Olaf E. was appointed [...] Olaf E.
    B 2 Richard W., Albert C. Richard W. died [...] Richard W.

    I hope the example illustrates problem in an understandable manner - I would very much appreciate your support, thank you in advance!


  • #2
    you have not set out the data in a way that makes it easy for others to use and try code (read the FAQ please); further, at least part of your description confused me (in your third bullet you say you want just "one word", but that seems unlikely to be correct; so here are some thoughts on how I might go after such a task:

    1. in data set 1, use -split- to separate the names for observations that have more than one name (unless you know that you will always be matching on the first name only; in that case do as in 3 below)

    2. -reshape- data set one to long (so that there is only one name/observation)

    3. use string functions (e.g., strpos) to get the name from your "Description" variable and put it into a variable with the same variable name as you used in the first data set

    4. -merge- the two data sets

    5. -reshape- back if you want

    look at the help files for the command/functions mentioned above:
    Code:
    help split
    help reshape
    help strpos
    help merge

    Comment


    • #3
      Extending Rich's answer, the user-written matchit command available from SSC may prove helpful.
      Code:
      search matchit

      Comment

      Working...
      X