  • New package for phonetic string encoding and string distance/similarity metrics

    I've just pushed out a new package, -strutil-, that includes new tools for phonetic string encoding (e.g., alternatives to soundex and soundex_nara) and string similarity/distance metrics. Both the phoneticenc and strdist commands are wrappers around Java plugins that perform all of the work, and in both cases you can retrieve several different return values in a single call. The first example below shows some of the phonetic string encoding options available:

    Code:
    . sysuse auto.dta, clear
    (1978 Automobile Data)
    
    . phoneticenc make, caverphone1(cav1) caverphone2(cav2) col(kolner) dms(daitch) dblm(dblmeta) metap(metaphone) nys(nysiis) beiderm(bmencode) matchrating(mrating)
    
    . li make cav1 cav2 kolner daitch in 1
    
         +---------------------------------------------------------------------------------+
         | make              cav1         cav2                             kolner   daitch |
         |---------------------------------------------------------------------------------|
      1. | AMC Concord     AMKNKT   AMKNKTNNNN   06846472656565656565656565656565   064649 |
         +---------------------------------------------------------------------------------+
    
    . li make dblmeta metaphone nysiis mrating in 1
    
         +-------------------------------------------------------+
         | make            dblmeta   metaph~e   nysiis   mrating |
         |-------------------------------------------------------|
      1. | AMC Concord        AMKN       AMKK   ANCANC    AMCLNL |
         +-------------------------------------------------------+
    
    . li make bmencode in 1
    
         +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      1. |                                                                                                  make                                                                                                            |
         |                                                                                                  AMC Concord                                                                                                     |
         |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
         | bmencode                                                                                                                                                                                                         |
         | amgzonkordnulnulnulnulnulnulnulnulnulnulnulnul|amgzonzordnulnulnulnulnulnulnulnulnulnulnulnul|amkonkordnulnulnulnulnulnulnulnulnulnulnulnul|amkonkurdnulnulnulnulnulnulnulnulnulnulnulnul|amkontsordnulnulnuln.. |
         +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    There are also several different string distance and similarity metrics. Some of the algorithms also let users control the size of the n-grams used to estimate the distances:

    Code:
    . sysuse census, clear
    (1980 Census data by state)
    
    . keep state state2
    
    . // Get all of the different distance and similarity metrics
    . strdist state state2, coss(cosine_sim) cosd(cosine_dist) damerau(dam)            ///
    > jaccards(jaccard_sim) jaccardd(jaccard_dist) lev(levenshtein)                    ///
    > longsubstr(longsubstring) met(metriclcs) ngramd(ngram_distance) ngramc(4)        ///
    > normlevs(normlev_similarity) normlevd(normlev_distance) qgramd(qgram_dist)       ///
    > qgramc(4) dices(sorensen_similarity) diced(sorensen_distance)                    ///
    > jarowinklers(jw_sim) jarowinklerd(jw_dist)
    
    . // Get the Jaro only metrics
    . strdist state state2, jarowinklers(jaro_sim) jarowinklerd(jaro_dist) jarowinklerc("-1")
    
    . // Describe the data set
    . desc
    
    Contains data from C:\Program Files (x86)\Stata14\ado\base/c/census.dta
      obs:            50                          1980 Census data by state
     vars:            20                          6 Apr 2014 15:43
     size:         8,000
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    state           str14   %-14s                 State
    state2          str2    %-2s                  Two-letter state abbreviation
    cosine_sim      double  %10.0g                Cosine String Similarity
    cosine_dist     double  %10.0g                Cosine String Distance
    dam             double  %10.0g                Damerau String Distance
    jaccard_sim     double  %10.0g                Jaccard String Similarity
    jaccard_dist    double  %10.0g                Jaccard String Distance
    jw_sim          double  %10.0g                Jaro Winkler String Similarity
    jw_dist         double  %10.0g                Jaro Winkler String Distance
    levenshtein     double  %10.0g                Levenshtein String Distance
    longsubstring   double  %10.0g                Longest Common Substring Distance
    metriclcs       double  %10.0g                Bakkelund String Distance
    ngram_distance  double  %10.0g                N-Gram String Distance
    normlev_simil~y double  %10.0g                Normalized Levenshtein String Similarity
    normlev_dista~e double  %10.0g                Normalized Levenshtein String Distance
    qgram_dist      double  %10.0g                Q-Gram String Distance
    sorensen_simi~y double  %10.0g                Sorensen Dice String Similarity
    sorensen_dist~e double  %10.0g                Sorensen Dice String Distance
    jaro_sim        double  %10.0g                Jaro String Similarity
    jaro_dist       double  %10.0g                Jaro String Distance
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sorted by:
         Note: Dataset has changed since last saved.
    
    . // Display some of the metrics alongside their respective strings
    . li state state2 jw_dist jaro_dist jw_sim jaro_sim in 1/5, ab(40)
    
         +---------------------------------------------------------------------+
         | state        state2     jw_dist   jaro_dist      jw_sim    jaro_sim |
         |---------------------------------------------------------------------|
      1. | Alabama      AL       .19047624   .19047624   .80952376   .80952376 |
      2. | Alaska       AK       .44444442   .39999998   .55555558   .60000002 |
      3. | Arizona      AZ       .21428573   .21428573   .78571427   .78571427 |
      4. | Arkansas     AR       .19999999   .19999999   .80000001   .80000001 |
      5. | California   CA       .21333331   .21333331   .78666669   .78666669 |
         +---------------------------------------------------------------------+
    
    . li state state2 dam jaccard* levenshtein in 1/5, ab(40)
    
         +----------------------------------------------------------------------+
         | state        state2   dam   jaccard_sim   jaccard_dist   levenshtein |
         |----------------------------------------------------------------------|
      1. | Alabama      AL         5             0              1             5 |
      2. | Alaska       AK         4             0              1             4 |
      3. | Arizona      AZ         5             0              1             5 |
      4. | Arkansas     AR         6             0              1             6 |
      5. | California   CA         8             0              1             8 |
         +----------------------------------------------------------------------+
    
    . li state state2 longsubstring metriclcs norm*  in 1/5, ab(40)
    
         +-----------------------------------------------------------------------------------------+
         | state        state2   longsubstring   metriclcs   normlev_similarity   normlev_distance |
         |-----------------------------------------------------------------------------------------|
      1. | Alabama      AL                   5   .71428571            .28571429          .71428571 |
      2. | Alaska       AK                   4   .66666667            .33333333          .66666667 |
      3. | Arizona      AZ                   5   .71428571            .28571429          .71428571 |
      4. | Arkansas     AR                   6         .75                  .25                .75 |
      5. | California   CA                   8          .8                   .2                 .8 |
         +-----------------------------------------------------------------------------------------+
    
    . li state state2 ngram* qgram* sorensen* in 1/5, ab(40)
    
         +---------------------------------------------------------------------------------------------+
         | state        state2   ngram_distance   qgram_dist   sorensen_similarity   sorensen_distance |
         |---------------------------------------------------------------------------------------------|
      1. | Alabama      AL             .2857143            4                     0                   1 |
      2. | Alaska       AK            .16666667            3                     0                   1 |
      3. | Arizona      AZ            .14285715            4                     0                   1 |
      4. | Arkansas     AR                  .25            5                     0                   1 |
      5. | California   CA                   .2            7                     0                   1 |
         +---------------------------------------------------------------------------------------------+
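    Judging from the five rows listed above, normlev_distance appears to be the Levenshtein distance divided by the length of the longer string (e.g., 5/7 = .714 for Alabama/AL), and normlev_similarity appears to be one minus that value. This is an inference from the output rather than documented behaviour. A minimal sketch for checking it, assuming the strdist example above has been run and using a hypothetical variable name, normlev_check:

    ```stata
    * Sketch only: assumes levenshtein and normlev_distance already exist
    * in memory from the strdist call above. ustrlen() requires Stata 14+.
    * normlev_check is a hypothetical name used only for this comparison.
    generate double normlev_check = levenshtein / max(ustrlen(state), ustrlen(state2))

    * Compare the hand-computed value with the one returned by strdist
    list state state2 levenshtein normlev_distance normlev_check in 1/5, ab(40)
    ```

    If the two columns disagree on other data, the plugin normalizes differently from what is assumed here.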
    The package can be installed using:

    Code:
    net inst strutil, from("http://wbuchanan.github.io/StataStringUtilities/")
    If you notice any bugs and/or have any questions, feel free to submit issues to the project repository.

  • #2
    I've installed the package above and tried the demonstration code, but I got the result below. I'm using Stata 15 on a remote server. Has anything similar happened to other users? The data loaded correctly.

    strdist state state2, coss(cosine_sim) cosd(cosine_dist) damerau(dam) jaccards(jaccard_sim) jaccardd(jaccard_dist)

    java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.stata.Javacall.load(Javacall.java:132)
    at com.stata.Javacall.load(Javacall.java:92)
    Caused by: java.lang.NoClassDefFoundError: org/paces/Stata/MetaData/Meta
    at org.paces.Stata.StringUtils.Similarity.DistanceMetrics.<init>(DistanceMetrics.java:169)
    at org.paces.Stata.StringUtils.StringUtilities.distance(StringUtilities.java:53)
    ... 6 more
    r(5100);

    Any advice would be welcome. Thanks, Matthew
    Last edited by Matt Gibbons; 28 Feb 2018, 23:19.


    • #3
      Matt Gibbons
      Looks like you need the StataJavaUtilities binary as well. It is a set of classes I put together to unify things across Stata 13 and 14; it provides classes for retrieving the data so that you don't have to regularly use the lower-level methods of the Java API. That said, if you could create an issue in the repository, it will remind me to add the installation requirements to the package page.


      • #4
        I think one of the problems was that I am working on a remote server due to the size of the datasets. However, I got the programs working and found them useful.


        • #5
          Dear William,

          Thank you very much for publishing this module, it looks very useful!

          How does one go about installing StataJavaUtilities in Windows? I found a page from Oracle where I could download a JRE version but it does not yet work.

          Also, while installing your package it told me that strdist already existed (I had installed it earlier). Thus, I replaced the copy on my machine with your version. Is yours the same as the original strdist?

          Thank you very much and all the best
          Leon
          Last edited by Leon Schmidt; 15 May 2020, 03:59.


          • #6
            Leon Schmidt
            you can get the StataJavaUtilities binary here: https://github.com/wbuchanan/StataJa...aUtilities.jar


            You’ll need to make sure it is available on the classpath that Stata checks. I’m not sure which strdist package you had installed previously, so I can’t say for sure whether they are the same. A while ago there started to be some issues with the Java-based packages I wrote: when I updated StataJavaUtilities, other packages could still carry an older/different version of that library. So I decoupled things, and I now compile the other packages on the assumption that StataJavaUtilities will be available on the classpath, which avoids those problems.
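
            One way to see which strdist is currently on the ado-path, and which installed package it came from, is sketched below. It uses only the built-in which and ado commands; what they print depends on what is installed on the machine, so no output is shown.

            ```stata
            * Show the path of the strdist.ado that Stata would run,
            * plus its version header if the file has one
            which strdist

            * List installed packages whose description mentions strdist
            ado dir, find(strdist)
            ```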


            • #7
              Thank you very much! How do I know the classpath that Stata checks? I typed
              Code:
              query java
              and then copied the JavaUtilities into the "java_home" path but it didn't work. I'm not sure if this procedure is correct though.


              • #8
                Leon Schmidt
                That definitely isn't the place to put it. When Stata initializes the JVM it uses paths that are more specific to Stata to define what to add to the classpath. If you do:

                Code:
                ls `"`c(sysdir_plus)'"'
                You should see a bunch of directories that all start with just a single letter and maybe some named "style" or "jar". If you don't see a directory there named "jar", create that directory and then add the StataJavaUtilities.jar file there. (Also, sorry for the delay getting back to you.)
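
                The steps above can be sketched in Stata itself. This is only a sketch: it assumes StataJavaUtilities.jar has already been downloaded into the current working directory, and the local macro name jardir is just illustrative.

                ```stata
                * Sketch only: assumes StataJavaUtilities.jar sits in the current
                * working directory. jardir is an illustrative macro name.
                local jardir `"`c(sysdir_plus)'jar"'

                * Create the jar directory; capture ignores the error if it already exists
                capture mkdir `"`jardir'"'

                * Copy the jar into place, overwriting any older copy
                copy "StataJavaUtilities.jar" `"`jardir'/StataJavaUtilities.jar"', replace

                * Stop with an error if the file did not end up there
                confirm file `"`jardir'/StataJavaUtilities.jar"'
                ```

                Since the classpath is set when Stata initializes the JVM, restarting Stata afterwards is probably the safest way to make sure the new jar is picked up.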


                • #9
                  FYI, the longsubstring command didn't seem to work, but longsubseq, which I found on a related page, seems to.


                  • #10
                    Matt Gibbons
                    If you submit an issue in the GitHub repository I should be able to update the documentation fairly quickly/easily.
