
  • compare, calculate distance for strings made up of binary digits


    Hello,

    I am working with variables that consist of strings made up of binary digits, for example “11011”, “10000”, and “01010”.

    I need to compute various distance or similarity measures between two or more such strings. So far, I have managed to split each string into 5 columns (variables), convert them to numeric digits, export them to Excel, and calculate cosine similarities there. I have also tried strdist, but it can only calculate the Levenshtein distance between two strings at a time.

    Is there an easier way to compute distances or cosine similarities among multiple strings of binary digits? Thanks!

    Best,
    Henry

  • #2
    I think you are using the -strdist- command that is from SSC. But William Buchanan has a string utility package that includes a -strdist- that is not restricted to the Levenshtein distance--it can do a large array of distance measures. You can download the entire package from https://github.com/wbuchanan/StataStringUtilities. It is restricted to two variables at a time, but you can write a loop to cover all pairs among your variable list.

    If you had provided example data, I could show you code for such a loop, but absent that, I can't say anything more specific. If you need help creating such a loop, please post back, using the -dataex- command to show example data.

    If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
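
    For readers who want a starting point, here is a minimal sketch of a loop over all pairs of variables. It is not the -strdist- command from either package; it uses only built-in Stata string functions to compute a Hamming distance and a cosine similarity. The variable names (s1, s2, s3), the example values, and the fixed length of 5 are assumptions made for illustration only.

    ```stata
    * Hypothetical example data: three string variables, each a 5-digit binary string
    clear
    input str5 s1 str5 s2 str5 s3
    "11011" "10000" "01010"
    "11100" "11100" "00011"
    end

    * Assumed variable names and assumed common string length
    local vars s1 s2 s3
    local len 5
    local k : word count `vars'
    local kminus1 = `k' - 1

    * Loop over every unordered pair (a, b) in the variable list
    forvalues i = 1/`kminus1' {
        local first = `i' + 1
        forvalues j = `first'/`k' {
            local a : word `i' of `vars'
            local b : word `j' of `vars'

            * Hamming distance: number of positions where the digits differ
            * both_*: number of positions where both strings have a 1
            generate ham_`a'_`b' = 0
            generate both_`a'_`b' = 0
            forvalues p = 1/`len' {
                quietly replace ham_`a'_`b' = ham_`a'_`b' + (substr(`a', `p', 1) != substr(`b', `p', 1))
                quietly replace both_`a'_`b' = both_`a'_`b' + (substr(`a', `p', 1) == "1" & substr(`b', `p', 1) == "1")
            }

            * Number of 1s in each string: length minus length with the 1s removed
            generate ones_a = strlen(`a') - strlen(subinstr(`a', "1", "", .))
            generate ones_b = strlen(`b') - strlen(subinstr(`b', "1", "", .))

            * Cosine similarity of the two 0/1 vectors
            generate cos_`a'_`b' = both_`a'_`b' / sqrt(ones_a * ones_b)
            drop ones_a ones_b
        }
    }

    list
    ```

    In the first row, "11011" versus "10000" differ in three positions and share a single 1, so the Hamming distance is 3 and the cosine similarity is 1/sqrt(4*1) = 0.5. The cosine similarity is missing whenever one of the strings contains no 1s, because the denominator is zero.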

    When asking for help with code, always show example data. When showing example data, always use -dataex-.



    • #3
      Hi Clyde, thanks for the information!
      Last edited by Henry Lewis; 16 Sep 2021, 19:05.



      • #4
        I just got the package mentioned by Clyde installed and running, after a few attempts. I initially ran into some problems, so I will share the solution here in the hope that it makes things easier for others who want to use it.
        First, install the package from within Stata using
        Code:
        net inst strutil, from("http://wbuchanan.github.io/StataStringUtilities/")
        Then add the following to the adopath:
        Code:
        adopath + "C:\ado\plus/StataJavaUtilities.jar"
        Download the StataJavaUtilities.jar file from:
        Code:
        https://github.com/wbuchanan/StataJavaUtilities/blob/master/target/StataJavaUtilities.jar
        and paste it into the folder: "C:\ado\plus"
        This should get the package up and running in Stata!
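
        A quick way to check the result, sketched with two built-in Stata commands. It assumes the package's distance command is named strdist, as described in #2. If the SSC strdist is also installed, the second command shows which of the two Stata will actually run.

        ```stata
        * List the locations Stata searches for ado-files; the entry added above should appear
        adopath

        * Report the file that Stata will run when strdist is called
        which strdist
        ```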
