  • Comparing two string variables

    Hi,
    I was wondering whether Stata has a command for comparing two strings. For example, two different variables hold the words "prog" and "program". Is there a way to create a new variable that compares the two and reports numerically how similar they are, such as the percentage similarity between each pair of observations in two string variables?

    Thanks

  • #2
    Code:
    . ssc describe strdist
    
    ---------------------------------------------------------------------------------------------
    package strdist from http://fmwww.bc.edu/repec/bocode/s
    ---------------------------------------------------------------------------------------------
    
    TITLE
          'STRDIST': module to calculate the Levenshtein distance, or edit distance, between stri
    > ngs
    
    DESCRIPTION/AUTHOR(S)
          
           strdist calculates the Levenshtein distance, or edit distance,
          between strings. It is implemented in Mata, and does not require
          a C plugin.
          
          KW: edit distance
          KW: Levenshtein distance
          KW: string comparison
          KW: data management
          
          Requires: Stata version 10
          
          Distribution-Date: 20121111
          
          Author: Michael Barker, Georgetown University
          Support: email [email protected]
          
    
    INSTALLATION FILES                              (type net install strdist)
          strdist.ado
          strdist.sthlp
    ---------------------------------------------------------------------------------------------
    (type -ssc install strdist- to install)
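
    A minimal sketch of how the distance could be turned into the percentage asked about in #1. The variable names a, b, dist, maxlen and similarity are made up for illustration, and dividing by the length of the longer string is only one possible normalization; strdist itself returns the raw edit distance.

    ```stata
    * install (or refresh) the package from SSC
    ssc install strdist, replace

    clear
    input str10 a str10 b
    "prog" "program"
    "beta" "bet"
    end

    * edit distance between a and b within each observation
    strdist a b, generate(dist)

    * rescale by the longer string to get a 0-100 score
    * (the result is missing if both strings are empty, since maxlen is 0)
    generate maxlen = max(strlen(a), strlen(b))
    generate similarity = 100 * (1 - dist / maxlen)
    list
    ```

    For "prog" and "program" the edit distance is 3 (three insertions), so under this particular normalization the score would be about 57.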


    • #3
      Thank you very much Nick,

      I tried to use the command but unfortunately it does not allow for weights.

      I have a wide dataset with names and previous employers. Some people share a name but are different people. I want to detect that by comparing their previous employers, but the employer names contain typos and are spelled slightly differently, so a direct comparison didn't help. Below is the code I used, which gave the error "weights not allowed". Do you have an alternative suggestion? Thanks so much.

      I also included an example of my data below.


      Code:
      forvalues i = 1/17 {
          forvalues j = 1/17 {
              strdist employername`i' employername`j'[_n+1] if name==name[_n+1] & employername`i' !="" & employername`j'[_n+1] != "", gen(similarity`i'_`j')
          }
      }

      Sample of Data: (I have 17 employernames but here I only included 3)

      Code:
      clear*
      input str06 name str06 employername1 str06 employername2 str06 employername3
      Amy  alpha  beta  alpha      
      Amy  alp  alph  bet        
      Amy  gamma  epsilon  
      John  alpha  delta  
      John  del  gamma
      end


      • #4
        You confused Stata with the use of employername`j'[_n+1]. It mistook your subscripting _n+1 for weights. So the error message is not descriptive of the problem. The problem is that -strdist- compares the values of two different variables in the same observation. You are trying to get it to compare the values of two different observations of the same variable--which it cannot do.

        I think this is one of those rare situations in Stata where you need to go to wide layout so that you can then make comparisons across variables.
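
        A rough sketch of what going wide could look like on the three-employer sample from #3. The names obs, person, emp1_ to emp3_ and dist1_1 and so on are hypothetical, the sample is retyped with quotes so that empty strings are explicit, and only the first two records per name are compared. It assumes strdist accepts an if qualifier, as its help file indicates.

        ```stata
        clear
        input str6 name str8 employername1 str8 employername2 str8 employername3
        "Amy"  "alpha" "beta"    "alpha"
        "Amy"  "alp"   "alph"    "bet"
        "Amy"  "gamma" "epsilon" ""
        "John" "alpha" "delta"   ""
        "John" "del"   "gamma"   ""
        end

        * number the records within each name, keeping the original order
        generate long obs = _n
        bysort name (obs): generate person = _n
        drop obs

        * rename the stubs so the reshape suffix stays unambiguous
        * (employername1 + suffix 1 would otherwise collide with employername11)
        forvalues k = 1/3 {
            rename employername`k' emp`k'_
        }
        reshape wide emp1_ emp2_ emp3_, i(name) j(person)

        * both records now sit in one observation, so no subscripts are needed
        forvalues i = 1/3 {
            forvalues j = 1/3 {
                strdist emp`i'_1 emp`j'_2 if emp`i'_1 != "" & emp`j'_2 != "", generate(dist`i'_`j')
            }
        }
        list name dist*
        ```

        With more than two records per name, the inner part would have to loop over pairs of record numbers as well, which is how the number of generated variables grows quickly.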


        • #5
          Thanks so much. I made the file wide and it worked perfectly. I also dropped all variables that contain only missing values, but I still have 900 variables. Is there a way to keep only the variables in which at least one observation is smaller than 10? I checked the results, and when the distance calculated by strdist is less than 10, the two names belong to the same person (their previous employers are the same).

          Thanks


          • #6
            I think you need to show a small but representative sample of the data you now have. Don't try to show all 900 variables, but a small subset of them that illustrates some that you want to keep and some that you want to drop, and, of course, the generated similarity scores. I can't visualize what it looks like, so I can't advise how to solve your current problem.


            • #7
              Sure! Here is an example of the similarity values. The first and third columns contain values less than 10, so I want to keep those variables, but in columns 2 and 4 all values are above 10, so I want to drop them. Thanks.

              Code:
              clear*
              input int sim1 int sim2 int sim3 int sim4
              sim1     Sim2    Sim3    Sim4
              10 . 5 . 34 5
              5 . 3 .  10
              24 . 24 . 30
              15 45 15 10
              47 . 40 41
              end


              • #8
                I think the simplest way to do it is this:

                Code:
                clear*
                input int sim1 int sim2 int sim3 int sim4
                sim1     Sim2    Sim3    Sim4
                10 . 5 . 34 5
                5 . 3 .  10
                24 . 24 . 30
                15 45 15 10
                47 . 40 41
                end
                
                foreach v of varlist sim* {
                    sort `v'
                    if `v'[1] >= 10 & !missing(`v'[1]) {
                        drop `v'
                    }
                }
                des

                The logic is that when you sort on `v', the smallest value comes to the first observation. So if that one is >= 10, then they all are. I included an exception for the possibility that `v'[1] is missing (in which case `v' consists entirely of missing values), as you don't say what you want to do in that case. This code will retain `v' if everything is missing. If you want to drop a variable that is always missing, just remove the & !missing(`v'[1]) part.
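
                The same rule can also be sketched without re-sorting the data, using the results that -summarize- leaves behind. This is an alternative to the sort-based loop, not the code from this post, and the three-variable dataset is made up purely for illustration.

                ```stata
                clear
                input int sim1 int sim2 int sim3
                10 45 .
                 5 20 .
                24 30 .
                end

                foreach v of varlist sim* {
                    quietly summarize `v'
                    * r(N) is the number of nonmissing values, r(min) their minimum
                    if r(N) > 0 & r(min) >= 10 {
                        drop `v'
                    }
                }
                describe
                ```

                Here sim1 is kept (its minimum is 5), sim2 is dropped (its minimum is 20), and sim3 is kept because it is entirely missing. Removing the r(N) > 0 condition would drop all-missing variables too, since a missing r(min) counts as larger than any number.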


                • #9
                  Originally posted by Clyde Schechter (#8)

                  Dear Clyde,
                  There is just one part of your code that I do not understand:
                  Code:
                   
                   if `v'[1] >= 10 & !missing(`v'[1]) {
                  What is the purpose of -if `v'[1]-? What does the [1] mean?


                  • #10
                    `v'[1] means the value of variable `v' in the first observation (which, after sorting on `v', will be the smallest value of `v').
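
                    A small illustration of explicit subscripting, using a made-up one-variable dataset:

                    ```stata
                    clear
                    input int sim1
                    24
                    5
                    47
                    end

                    sort sim1
                    display sim1[1]     // first observation after sorting: 5
                    display sim1[_N]    // last observation after sorting: 47
                    ```

                    _N is the number of observations, so sim1[_N] picks the last one, just as sim1[1] picks the first.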


                    • #11
                      Thank you for your explanation, Clyde.


                      • #12
                        Hi Clyde,

                        Thank you so much. It worked perfectly. I really appreciate your help.
