  • Dealing with nasty string variables

    Hello Stata users,

    I have a very ugly string variable that I would like to clean up.

    Sometimes, the observations look like this:

    > DIPTHEROIDS
    > DIPTEROIDS
    > DIPHTEROIDS

    These should all tabulate under one name, but they don't because the spellings differ slightly.

    Other times I have observations that look like this:

    > *CANCELLED* ACHROMOBACTER XYLISOXIDANS
    > ACHROMOBACTER XYLISOXIDANS *CANCELLED*

    Here I have the name of a bacterium or fungus plus a cancelled marker (at either the beginning or the end). In all cases, I only want to keep the organism's name.

    As a final product, I would like to tabulate the variable and see a clean list after fixing the random misspellings and word swaps.

    I was thinking of using -replace- with a condition such as strpos(var, "Diptheroids"), but that seems very tedious when I am dealing with so many organisms and so many slight spelling errors, so I am looking for any possible shortcuts.

    Thank you!

  • #2
    Unfortunately, this is a frequent but tough issue to deal with, and there is no direct way to fix it.

    You should start by taking a look at this Stata Journal article (Herrin and Poen, 2008) http://www.stata-journal.com/sjpdf.h...iclenum=dm0039 , since it describes some functions to begin with (itrim(), proper()) and gives hints on what to do if you have other identifiers in your data.

    The soundex() function might help for the DIPTHEROIDS case specifically, but it won't work for the second case you described.
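
    As a rough illustration of the soundex() idea, here is a minimal sketch using the three spellings from the first post. The variable names organism and sdx are made up for the example. Keep in mind that a Soundex code is only one letter plus three digits, so unrelated organisms with similar beginnings can share a code; list the spellings behind each code before recoding anything.

    ```stata
    * Sketch only: organism and sdx are hypothetical variable names
    clear
    input str40 organism
    "DIPTHEROIDS"
    "DIPTEROIDS"
    "DIPHTEROIDS"
    end
    * One Soundex code per observation; similar-sounding spellings share a code
    generate sdx = soundex(organism)
    * Tabulating the code instead of the raw string collapses the variants
    tabulate sdx
    * Inspect which spellings sit behind each code before standardizing them
    sort sdx organism
    list sdx organism, sepby(sdx)
    ```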

    You should also search past topics on spelling issues.

    In any case, expect long and patient work to clean your data.
    Best,
    Charlie


    • #3
      For the cancelled portion:
      Code:
      clear
      input str40 var
      "*CANCELLED* ACHROMOBACTER XYLISOXIDANS"
      "ACHROMOBACTER XYLISOXIDANS *CANCELLED*"
      end
      * Delete every occurrence of the marker, then trim the leftover blanks
      replace var = strtrim(subinstr(var,"*CANCELLED*","",.))


      • #4
        I would also take a look at -strgroup- from SSC.
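
        A minimal sketch of how that might look, assuming the generate() and threshold() options described in the -strgroup- help file. The variable names organism and grp are made up, and the 0.25 threshold is an arbitrary starting point rather than a recommendation; check the resulting groups by eye, since a loose threshold can merge genuinely different organisms.

        ```stata
        * Sketch only: organism and grp are hypothetical names, 0.25 is arbitrary
        ssc install strgroup
        clear
        input str40 organism
        "DIPTHEROIDS"
        "DIPTEROIDS"
        "DIPHTEROIDS"
        "ACHROMOBACTER XYLISOXIDANS"
        end
        * Put strings whose normalized edit distance is within the threshold in one group
        strgroup organism, generate(grp) threshold(0.25)
        * Review the groups before choosing one spelling per group
        sort grp organism
        list grp organism, sepby(grp)
        ```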


        • #5
          Great.. Many thanks to you all. Going to give it a shot today ;P


          • #6
            Nestor Mojica, I've just pushed out a package for user testing that includes a number of phonetic string encoding algorithms (as well as string similarity/distance algorithms). If nothing else, you may want to consider the phoneticenc command in the package to generate the different phonetically encoded strings simultaneously, then use whichever one or more of them best fit the kinds of strings in your data.

            Code:
            net inst strutil, from("http://wbuchanan.github.io/StataStringUtilities/")

            It includes all of the phonetic string encoding algorithms implemented in the Apache Commons Codec library, except the Soundex and Refined Soundex algorithms. The package also has a command that provides several string similarity/distance algorithms through a single interface, so you can compute several metrics with one command issued once instead of running a separate command for each metric.


            • #7
              In addition to the excellent suggestions above, several recent solutions have relied on regular expressions (though the documentation is somewhat limited). Here's a recent thread that may be helpful: http://www.statalist.org/forums/foru...-strpos-regexr
              Stata/MP 14.1 (64-bit x86-64)
              Revision 19 May 2016
              Win 8.1
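
              For instance, the cancelled marker from the first post could be handled with one of the Unicode regular expression functions available from Stata 14. This is only a sketch: the variable name organism is made up, and the pattern assumes the marker is always written exactly as *CANCELLED*.

              ```stata
              * Sketch only: organism is a hypothetical variable name
              clear
              input str60 organism
              "*CANCELLED* ACHROMOBACTER XYLISOXIDANS"
              "ACHROMOBACTER XYLISOXIDANS *CANCELLED*"
              end
              * \* matches a literal asterisk; \s* absorbs blanks on either side of the marker
              replace organism = ustrregexra(organism, "\s*\*CANCELLED\*\s*", "")
              tabulate organism
              ```

              Folding the surrounding blanks into the pattern means no separate trimming step is needed when the marker sits at the start or end of the string.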


              • #8
                Carole J. Wilson, you can find more documentation on the underlying regular expression engine used by the ustrregex* functions here: http://userguide.icu-project.org/strings/regexp
