Dealing with nasty string variables

Nestor Mojica

Join Date: Apr 2015

Posts: 23
#1


05 May 2015, 03:34

Hello Stata users,

I have a very ugly string variable that I would like to clean up.

Sometimes, the observations look like this:

> DIPTHEROIDS
> DIPTEROIDS
> DIPHTEROIDS

These should all tabulate under one name, but they don't because of slightly different spellings.

Other times I have observations that look like this:

> *CANCELLED* ACHROMOBACTER XYLISOXIDANS
> ACHROMOBACTER XYLISOXIDANS *CANCELLED*

Here I have the name of a bacterium or fungus plus a cancelled flag (either at the beginning or at the end). In all cases, I only want to keep the organism's name.

As a final product, I would like to tabulate the variable and see a clean list after fixing the random misspellings and word swaps.

I was thinking of using replace together with strpos(var, "Diptheroids"), but that seems very tedious, especially when I am dealing with so many organisms and so many slight spelling errors. I am trying to come up with any possible shortcuts.

Thank you!
Charlie Joyez

Join Date: Dec 2014

Posts: 421
#2

05 May 2015, 04:18

Unfortunately, this is a frequent but tough issue to deal with, and there is no direct way to fix it.

You should start by taking a look at this Stata Journal article (Herrin and Poen, 2008) http://www.stata-journal.com/sjpdf.h...iclenum=dm0039 , since it describes some functions to start with (itrim(), proper()) and gives hints on what to do if you have other identifiers in your data.

The soundex() function might help with the DIPTHEROIDS case specifically, but it won't work for the second case you described.
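As a rough illustration of the soundex idea, here is a minimal sketch using the three spellings from post #1. The variable names organism, sdx and n_variants are made up for the example. Because a soundex code is built from the first letter and only a few following consonants, different organisms can end up with the same code, so treat shared codes as candidates to review rather than confirmed matches.

```stata
* Toy data: the three spellings from post #1 (variable names are hypothetical)
clear
input str40 organism
"DIPTHEROIDS"
"DIPTEROIDS"
"DIPHTEROIDS"
end

* soundex() returns a short phonetic code for each string
generate sdx = soundex(organism)

* Observations that share a code are candidate spelling variants
bysort sdx (organism): generate n_variants = _N
list organism sdx n_variants, sepby(sdx)
```

Once you have checked a group by eye, you can replace every member with the spelling you want to keep.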

You should also search in past topics about spelling issues.

Anyway, you should expect long and patient work to clean your data.
Best,
Charlie

Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#3

05 May 2015, 07:26

For the cancelled portion:

Code:

clear
input str40 var
"*CANCELLED* ACHROMOBACTER XYLISOXIDANS"
"ACHROMOBACTER XYLISOXIDANS *CANCELLED*"
end
* remove the flag wherever it occurs, then trim leading and trailing blanks
replace var = strtrim(subinstr(var,"*CANCELLED*","",.))


Dimitriy V. Masterov

Join Date: Mar 2014

Posts: 609
#4

05 May 2015, 16:21

I would also take a look at -strgroup- from SSC.
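A minimal sketch of how that might look on the examples from post #1. The option names below are as I recall them from the strgroup help file, and the 0.25 threshold and the variable names organism and grp are assumptions for illustration only, so check help strgroup after installing.

```stata
* One-time install from SSC
ssc install strgroup

* Toy data (variable names are hypothetical)
clear
input str40 organism
"DIPTHEROIDS"
"DIPTEROIDS"
"DIPHTEROIDS"
"ACHROMOBACTER XYLISOXIDANS"
end

* Group strings whose normalized edit distance is at or below the threshold
* (0.25 is an arbitrary starting value, not a recommendation)
strgroup organism, generate(grp) threshold(0.25)

* Inspect the groups before overwriting anything
sort grp organism
list grp organism, sepby(grp)
```

A lower threshold gives fewer false matches but misses more variants, so it is worth trying a few values and reviewing the groups each time.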
Nestor Mojica

Join Date: Apr 2015

Posts: 23
#5

05 May 2015, 21:39

Great.. Many thanks to you all. Going to give it a shot today ;P
wbuchanan

Join Date: Mar 2014

Posts: 1362
#6

11 Mar 2016, 03:50

Nestor Mojica, I've just pushed out a package for user testing that includes a number of phonetic string encoding algorithms (as well as string similarity/distance algorithms). If nothing else, you may want to try the phoneticenc command in the package: it generates the different phonetically encoded strings simultaneously, so you can use whichever one or more of them best fit the kinds of strings in your data.

Code:

net inst strutil, from("http://wbuchanan.github.io/StataStringUtilities/")

It includes all of the phonetic string encoding algorithms implemented in the Apache Commons Codec library except Soundex and Refined Soundex. The package also has a command that provides several string similarity/distance algorithms through a single interface, so you can compute several metrics with one call instead of running a separate command for each.
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#7

11 Mar 2016, 07:54

In addition to the excellent suggestions above, several solutions of late have relied on regular expressions (though documentation is somewhat limited). Here's a recent thread that may be helpful: http://www.statalist.org/forums/foru...-strpos-regexr

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
wbuchanan

Join Date: Mar 2014

Posts: 1362
#8

11 Mar 2016, 08:36

Carole J. Wilson, you can find more documentation on the regular expression engine underlying the ustrregex*() functions here: http://userguide.icu-project.org/strings/regexp
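For the cancelled flag from post #1, a regular expression version might look like the sketch below (Stata 14 or later). The pattern is an assumption that every flag is a single run of capital letters between two asterisks, and the variable names organism and clean are made up for the example.

```stata
* Toy data (variable names are hypothetical)
clear
input str60 organism
"*CANCELLED* ACHROMOBACTER XYLISOXIDANS"
"ACHROMOBACTER XYLISOXIDANS *CANCELLED*"
"DIPTHEROIDS"
end

* \* matches a literal asterisk; [A-Z]+ matches the flag text between them
generate clean = ustrregexra(organism, "\*[A-Z]+\*", "")

* Collapse repeated internal blanks, then trim both ends
replace clean = strtrim(stritrim(clean))
list organism clean
```

Compared with subinstr(), this also catches other flags of the same shape, which can be good or bad depending on what else appears between asterisks in the data.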