Comparing two string variables

Monica Muller

Join Date: Jul 2014

Posts: 226
#1

12 Jul 2016, 12:03

Hi,
I was wondering if there is a command in Stata that I can use to compare two strings. For example, two different variables hold the words "prog" and "program". Is there a way to create a new variable that compares the two and tells me numerically how similar they are, such as a command that calculates the percentage similarity between the values of two string variables in each observation?

Thanks

Nick Cox

Join Date: Mar 2014
Posts: 35724

12 Jul 2016, 12:27

Code:

. ssc describe strdist

---------------------------------------------------------------------------------------------
package strdist from http://fmwww.bc.edu/repec/bocode/s
---------------------------------------------------------------------------------------------

TITLE
      'STRDIST': module to calculate the Levenshtein distance, or edit distance, between stri
> ngs

DESCRIPTION/AUTHOR(S)
      
       strdist calculates the Levenshtein distance, or edit distance,
      between strings. It is implemented in Mata, and does not require
      a C plugin.
      
      KW: edit distance
      KW: Levenshtein distance
      KW: string comparison
      KW: data management
      
      Requires: Stata version 10
      
      Distribution-Date: 20121111
      
      Author: Michael Barker, Georgetown University
      Support: email [email protected]
      

INSTALLATION FILES                              (type net install strdist)
      strdist.ado
      strdist.sthlp
---------------------------------------------------------------------------------------------
(type -ssc install strdist- to install)
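
A minimal sketch of how the edit distance could be turned into a percentage similarity, assuming strdist has been installed from SSC. The variable names (a, b, dist, pct_similar) are illustrative, and dividing by the length of the longer string is an assumed normalization, not something strdist itself provides.

```stata
* Assumes: ssc install strdist
clear
input str10 a str10 b
"prog"  "program"
"alpha" "alp"
end

* Levenshtein distance between a and b within each observation
strdist a b, generate(dist)

* Assumed normalization: 100 means identical, 0 means nothing in common.
* The result is missing when both strings are empty (division by zero).
generate double pct_similar = 100 * (1 - dist / max(strlen(a), strlen(b)))

list
```

For "prog" and "program" the distance is 3 (three insertions) and the longer string has 7 characters, so this measure gives 100 * (1 - 3/7), or about 57.1.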

Monica Muller

Join Date: Jul 2014

Posts: 226
#3

12 Jul 2016, 13:11

Thank you very much Nick,

I tried to use the command but unfortunately it does not allow for weights.

I have a wide dataset that contains names and previous employers. Some people have the same name but are different people. I want to check for that by comparing their previous employers, but the employer names contain typos and are spelled slightly differently, so a direct comparison didn't help me. The following code is what I used, but I received the error "weights not allowed". Do you have an alternative suggestion? Thanks so much.

I also included an example of my data below.

Code:

forvalues i = 1/17 {
    forvalues j = 1/17 {
        strdist employername`i' employername`j'[_n+1] if name==name[_n+1] & employername`i' !="" & employername`j'[_n+1] != "", gen(similarity`i'_`j')
    }
}

Sample of Data: (I have 17 employernames but here I only included 3)

Code:

clear*
input str06 name str06 employername1 str06 employername2 str06 employername3
Amy alpha beta alpha
Amy alp alph bet
Amy gamma epsilon
John alpha delta
John del gamma
end

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

12 Jul 2016, 15:43

You confused Stata with the use of employername`j'[_n+1]. It mistook your subscripting _n+1 for weights. So the error message is not descriptive of the problem. The problem is that -strdist- compares the values of two different variables in the same observation. You are trying to get it to compare the values of two different observations of the same variable--which it cannot do.

I think this is one of those rare situations in Stata where you need to go to wide layout so that you can then make comparisons across variables.
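
A minimal sketch of one way the wide layout could be built, assuming strdist is installed from SSC and using a cut-down version of the sample data with three employer variables. The shortened stubs (emp1_p, ...), the record counter person, and the result names d12_i_j are all hypothetical; the thread does not show the code that was actually used.

```stata
clear
input str6 name str10 employername1 str10 employername2 str10 employername3
"Amy"  "alpha" "beta"    "alpha"
"Amy"  "alp"   "alph"    "bet"
"Amy"  "gamma" "epsilon" ""
"John" "alpha" "delta"   ""
"John" "del"   "gamma"   ""
end

* Shorten the stubs so the reshaped names stay readable:
* emp2_p3 will mean "employer 2 of record 3 for this name"
forvalues i = 1/3 {
    rename employername`i' emp`i'_p
}

* Number the records that share a name (order within a name is arbitrary)
sort name, stable
by name: generate person = _n

* One observation per name, one block of employer variables per record
reshape wide emp1_p emp2_p emp3_p, i(name) j(person)

* Compare every employer of record 1 with every employer of record 2,
* now that both sit in the same observation as separate variables
forvalues i = 1/3 {
    forvalues j = 1/3 {
        strdist emp`i'_p1 emp`j'_p2 if emp`i'_p1 != "" & emp`j'_p2 != "", generate(d12_`i'_`j')
    }
}
```

Comparing other pairs of records (1 with 3, 2 with 3, and so on) would need an extra pair of loops over the record numbers, which is one reason the number of generated variables grows quickly.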

Monica Muller

Join Date: Jul 2014

Posts: 226
#5

12 Jul 2016, 19:21

Thanks so much. I made the file wide and it worked perfectly. I also dropped all variables that contain only missing values, but I still have 900 variables. Is there a way to keep only the variables in which at least one observation is smaller than 10? I checked the results, and when the distance calculated by strdist is less than 10, the two names belong to the same person (their previous employers are the same).

Thanks

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#6

12 Jul 2016, 20:03

I think you need to show a small but representative sample of the data you now have. Don't try to show all 900 variables, but a small subset of them that illustrates some that you want to keep and some that you want to drop, and, of course, the generated similarity scores. I can't visualize what it looks like, so I can't advise how to solve your current problem.

Monica Muller

Join Date: Jul 2014

Posts: 226
#7

12 Jul 2016, 20:23

Sure! Here is an example of the similarity values. The first and third columns have values less than 10, so I want to keep those variables, but in columns 2 and 4 all values are above 10, so I want to drop those variables. Thanks.

Code:

clear*
input int sim1 int sim2 int sim3 int sim4
10 . 5 .
34 5 5 .
3 . 10 24
. 24 . 30
15 45 15 10
47 . 40 41
end

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#8

13 Jul 2016, 06:59

I think the simplest way to do it is this:

Code:

clear*
input int sim1 int sim2 int sim3 int sim4
10 . 5 .
34 5 5 .
3 . 10 24
. 24 . 30
15 45 15 10
47 . 40 41
end

foreach v of varlist sim* {
    sort `v'
    if `v'[1] >= 10 & !missing(`v'[1]) {
        drop `v'
    }
}

des

The logic is that when you sort on `v', the smallest value comes to the first observation. So if that one is >= 10, then they all are. I included an exception for the possibility that `v'[1] is missing (in which case `v' consists entirely of missing values), as you don't say what you want to do in that case. This code will retain `v' if everything is missing. If you want to drop a variable that is always missing, then just remove the & !missing(`v'[1]) part.
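
The same rule can also be sketched without sorting, which leaves the order of the observations untouched. This is an alternative under the same assumptions, not the code posted above: it uses summarize with the meanonly option, which stores the number of nonmissing values in r(N) and the smallest nonmissing value in r(min).

```stata
clear
input int sim1 int sim2 int sim3 int sim4
10 . 5 .
34 5 5 .
3 . 10 24
. 24 . 30
15 45 15 10
47 . 40 41
end

foreach v of varlist sim* {
    quietly summarize `v', meanonly
    * r(N) == 0 means the variable is entirely missing; such variables are kept,
    * matching the behavior of the sort-based version
    if r(N) > 0 & r(min) >= 10 {
        drop `v'
    }
}

describe
```

The r(N) > 0 check matters because r(min) is missing for an all-missing variable, and Stata treats missing as larger than any number, so r(min) >= 10 alone would be true. With these sample values only sim4 is dropped, since its smallest value is exactly 10.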

Thong Nguyen

Join Date: Oct 2015

Posts: 236
#9

13 Jul 2016, 07:12

Originally posted by Clyde Schechter

I think the simplest way to do it is this:

Code:

clear*
input int sim1 int sim2 int sim3 int sim4
10 . 5 .
34 5 5 .
3 . 10 24
. 24 . 30
15 45 15 10
47 . 40 41
end

foreach v of varlist sim* {
    sort `v'
    if `v'[1] >= 10 & !missing(`v'[1]) {
        drop `v'
    }
}

des

The logic is that when you sort on `v', the smallest value comes to the first observation. So if that one is >= 10, then they all are. I included an exception for the possibility that `v'[1] is missing (in which case `v' consists entirely of missing values), as you don't say what you want to do in that case. This code will retain `v' if everything is missing. If you want to drop a variable that is always missing, then just remove the & !missing(`v'[1]) part.

Dear Clyde,
There is just one part of your code that I do not understand:

Code:

if `v'[1] >= 10 & !missing(`v'[1]) {

What is the purpose of -if `v'[1]-? What does the [1] mean?

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#10

13 Jul 2016, 07:23

`v'[1] means the value of variable `v' in the first observation (which, after sorting on `v', will be the smallest value of `v').
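
A small illustration of explicit subscripting, using a made-up variable x with three values. The names x and prev are hypothetical and only serve the example.

```stata
clear
input int x
7
3
9
end

sort x

* x[1] is the value of x in the first observation: the smallest value after sorting
display x[1]

* _N is the number of observations, so x[_N] is the value in the last observation
display x[_N]

* _n is the current observation number, so x[_n-1] is the previous observation's value;
* it is missing in the first observation, which has no predecessor
generate prev = x[_n-1]
list
```

After sorting, the values are 3, 7, 9, so the two display commands show 3 and 9.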

Thong Nguyen

Join Date: Oct 2015

Posts: 236
#11

13 Jul 2016, 07:51

Thank you for your explanation, Clyde.

Monica Muller

Join Date: Jul 2014

Posts: 226
#12

13 Jul 2016, 20:26

Hi Clyde,

Thank you so much. It worked perfectly. I really appreciate your help.