Hello Stata users,
I have a very ugly string variable that I would like to clean up.
Sometimes, the observations look like this:
> DIPTHEROIDS
> DIPTEROIDS
> DIPHTEROIDS
These should all tabulate under one name, but they don't because of slightly different spellings.
Other times, I have observations that look like this:
> *CANCELLED* ACHROMOBACTER XYLISOXIDANS
> ACHROMOBACTER XYLISOXIDANS *CANCELLED*
Here I have the name of a bacterium or fungus plus a cancelled marker (either at the beginning or at the end). In all cases, I only want to keep the organism's name.
As a final product, I would like to tabulate the variable and see a clean list after fixing the random misspellings and word swaps.
I was thinking of using replace with a strpos(var, "Diptheroids") condition, but that seems very tedious, especially when I am dealing with so many organisms and so many slight spelling errors. Are there any shortcuts?
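To make the manual approach concrete, here is a rough sketch of what I have in mind. The variable name organism is made up for illustration, and DIPHTHEROIDS is just the target spelling I picked; I am assuming a Stata version that has strtrim() and stritrim().

```stata
* Sketch only: "organism" is a hypothetical name for my string variable.

* Remove the cancelled marker wherever it appears (. = replace all occurrences),
* then drop leading/trailing blanks and collapse repeated internal blanks.
replace organism = subinstr(organism, "*CANCELLED*", "", .)
replace organism = strtrim(stritrim(organism))

* Map each known misspelling to one target spelling (my choice of target).
replace organism = "DIPHTHEROIDS" if inlist(organism, "DIPTHEROIDS", "DIPTEROIDS", "DIPHTEROIDS")

tabulate organism
```

The first two lines handle every cancelled case at once, but the misspellings would still need one such replace line per organism, and that is the part I would like to avoid.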
Thank you!