Merging data sets based on common words within 2 string variables

Diane Smith

Join Date: Aug 2018
Posts: 32

Merging data sets based on common words within 2 string variables

09 Sep 2018, 05:29

Dear all,

I am trying to match observations from 2 data sets based on common strings within 2 string variables.

The data has the following structure:

Data Set 1:

Company	Number	Name
A	1	Robert G., Diane S.
A	2	Julia C.
A	3	Robert G., Tim H.
B	1	Brad D., Olaf E.
B	2	Richard W., Albert C.

Data Set 2:

Company	Description
A	Robert G. became CEO [...]
A	Diane S. was appointed [...]
B	Richard W. died [...]
B	Olaf E. was appointed [...]

Variable "Name" in data set 1 and variable "Description" in data set 2 are both strings
First the observations can be matched based on the variable "Company"
Then the goal is to find an identical string (one word) in the fields "Name" and "Description" based on which the two data sets can be merged in order to assign each name within variable "Description" in data set 2 to the correct "Number" from data set 1.
This common string would be a new variable in the merged dataset called "MatchName"
In Data Set 2, each name within "Description" only appears once for the respective company, but it may appear several times within "Name" in data set 1 but for different "Numbers" --> see example: Robert G. appears in data set 1 for Number 1 and 3; in data set 2 he only appears once, but in the merged data set he will appear twice (assigned to "Number" 1 and 3)

The merged data set would then look like this:

Company	Number	Name	Description	MatchName
A	1	Robert G., Diane S.	Robert G. became CEO [...]	Robert G.
A	1	Robert G., Diane S.	Diane S. was appointed [...]	Diane S.
A	3	Robert G., Diane S.		Robert G.
B	1	Brad D., Olaf E.	Olaf E. was appointed [...]	Olaf E.
B	2	Richard W., Albert C.	Richard W. died [...]	Richard W.

I hope the example illustrates problem in an understandable manner - I would very much appreciate your support, thank you in advance!

Tags: data, merge, string

Rich Goldstein

Join Date: Mar 2014

Posts: 4470
#2

09 Sep 2018, 06:48

you have not set out the data in a way that makes it easy for others to use and try code (read the FAQ please); further, at least part of your description confused me (in your third bullet you say you want just "one word", but that seems unlikely to be correct; so here are some thoughts on how I might go after such a task:

1. in data set 1, use -split- to separate the names for observations that have more than one name (unless you know that you will always be matching on the first name only; in that case do as in 3 below)

2. -reshape- data set one to long (so that there is only one name/observation)

3. use string functions (e.g., strpos) to get the name from your "Description" variable and put it into a variable with the same variable name as you used in the first data set

4. -merge- the two data sets

5. -reshape- back if you want

look at the help files for the command/functions mentioned above:

Code:

help split help reshape help strpos help merge
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

09 Sep 2018, 12:06

Extending Rich's answer, the user-written matchit command available from SSC may prove helpful.

Code:

search matchit
Comment

Announcement

Merging data sets based on common words within 2 string variables

Comment

Comment