detect typos in long string variable

Ylenia Curci

Join Date: Sep 2017

Posts: 72
#1


18 Apr 2023, 07:15

Hello,
I have a very long string variable and I need to create a dummy indicating whether that variable contains a certain term. However, the term is misspelled most of the time, and I need to detect its different variations across observations. I was thinking of using a string distance command to detect the variations against a threshold, as in the strgroup command. Is there any way to do so in Stata?
Thank you!
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

18 Apr 2023, 11:10

My thinking here -- based on limited experience -- is that a ready-made solution to what you want will be hard to find. (I hope I'm wrong.) My impression is that the various string distance user-written commands are oriented to comparing two strings, not to doing a "fuzzy" version of -strpos()-, as I would describe what you want. (See also the commands -jarowinkler- and -strdist-, available via -ssc-).

I think that what you want would be a useful tool, so I hope some good ideas turn up here. I have a few thoughts, which I have not tested, but I'll put them out to get something started.

1) Could you usefully look for an individual word that is diagnostic? If so, here's a slow but possibly useful approach using the -strdist- command, which I can crudely illustrate as follows:

Code:

local threshold = 3   // placeholder: choose a distance threshold by experimenting with -strdist-
local YourWord = "whatever"
gen nwords = wordcount(YourLongString)
summ nwords
local max = r(max)
gen byte found = 0
forval i = 1/`max' {
   // compare the i-th word of each string with the target word
   quiet gen curword = word(YourLongString, `i')
   quiet strdist curword "`YourWord'", gen(dist)
   quiet replace found = (dist < `threshold') if !found & (`i' <= nwords)
   drop curword dist
}

2) You might take a look at the -txttool- package. See -net describe dm0077- It is supposed to simplify/clean up bodies of text, among other things.

3) Is it possible to come up with a good list of the (mis)spellings of the term or *part* of the term of interest? If so, I have had some luck with things as simple as the following:

Code:

local TermList `""this" "that thing" "something" "the other""'
gen byte found = 0
foreach term of local TermList {
   quiet replace found = (strpos(YourLongString, `"`term'"') > 0) if !found
}

The business with compound double quotes can be tricky to see onscreen, hence my use of the red font. Something like 3) could use the built-in -inlist()- function, but it's limited to a list of 10 strings.
Ylenia Curci

Join Date: Sep 2017

Posts: 72
#3

19 Apr 2023, 13:07

Thank you Mike for your suggestions. I was wondering: what about splitting my long string into variables (hundreds and hundreds... and hundreds), using spaces as the separator, and then running the string distance command over all the observations? Is it feasible? Will it take forever?
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#4

19 Apr 2023, 14:33

You certainly could do this, and as regards timing, it's easy enough to try some experiments on smaller numbers of observations and see what happens.

However, if you are splitting it up at the spaces, that's splitting into words, right? If so, you'd be searching individual words for (near) matches, and that's what my first code fragment would do, but without having to split up the string. Perhaps you have a different idea of what you would want to do if you split the words into different variables? All I can guess is that you would want to use strdist() to get the string distance between your target word across all observations but for each variable. I would not think that would give you very useful information, and I don't think it would be faster than my first suggestion, which looks across all the words, but within observations. Your problem is interesting, so I'm curious to see where this goes for you.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#5

19 Apr 2023, 14:40

The more I think about it, the more I think my first approach should work. Can you try it on a few observations and report what happens? That approach could, by the way, be modified to look for string distances using a target term that is more than one word long, in case that matters in your situation.

Mike Lacy

Join Date: Apr 2014
Posts: 2416
#6

19 Apr 2023, 18:19

I tried my first suggested coding approach on some example data. It had several mistakes, and infelicities. (If you post in the future, please help us help you by supplying example data we can work with per the StataList FAQ.)

Try the following code, which works on a small sample for me. You should try a sample of say 10 or maybe 100 observations out of your data set. It *will* be slow, but in my experience, *all* code involving string distances is slow. If it works, we might be able to find ways to speed it up, but the first step is to get something that works.

Code:

local YourWord = "whatever ...."
// Split text into words and go to long layout.  This will be slow, but not outrageous.
// It will not work if the max. number of words in some string exceeds the
// -maxvar- setting on your Stata.
gen nwords = wordcount(YourLongString)
summ nwords
di "Max words in a string = " r(max) ", maxvar setting = " c(maxvar)
//
split YourLongString, gen(w) parse(" ")
drop YourLongString
reshape long w, i(id) j(seq)
//
// Each observation is now a word from one of your string observations.
// Find distance of each to your target word.
strdist w "`YourWord'", gen(dist)
// List words and string ids if the distance is close enough.
local close = 5 // a Levenshtein distance of 5
list id w if dist < `close'
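
The original goal was a dummy per text rather than a listing. A minimal sketch of that last step, assuming the long-layout data produced by the code above (one word per row, -id- identifying the original observation, -dist- created by -strdist-); the variable names hit and found are hypothetical, and the cutoff of 5 is only the illustrative value used above:

```stata
* Flag words that are close to the target, then carry the flag to the whole text.
local close = 5                           // illustrative cutoff; tune on your own data
gen byte hit = dist < `close'             // missing dist counts as "not close"
bysort id: egen byte found = max(hit)     // 1 if any word in the text is close enough

* Optional: go back to one row per original observation.
bysort id: keep if _n == 1
keep id found
```

Taking the maximum of the word-level flag within -id- gives 1 as soon as a single word falls under the cutoff, which is the "fuzzy -strpos()-" behaviour discussed earlier in the thread.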


Ylenia Curci

Join Date: Sep 2017

Posts: 72
#7

20 Apr 2023, 03:58

Hi Mike,
I tried and I get the following error (bordereau is my target term)
strdist w "`YourWord'", gen(dist)
"bordereau invalid name
Am I messing with the quotes? Thanks!
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#8

20 Apr 2023, 08:47

Your use of quotes looks fine to me, so the problem comes from some other feature of your code.

To diagnose this, I'd need to see a sample of 3 observations for the relevant variables here, using -dataex- to do this per the StataList FAQ. (I know your string variable is long, so you'll want to edit it down to, say, 50 characters or so in your example.) In addition, I'd need to see the exact syntax you used and exactly how Stata responded, starting at "gen nwords = wordcount(YourLongString)". Copy/paste this material from the Stata results window to your posting here. Currently, I have no idea where "bordereau ..." etc. came from, so seeing variables, variable types, and so forth is essential.

Ylenia Curci

Join Date: Sep 2017
Posts: 72
#9

21 Apr 2023, 02:33

Ehh, life would be too easy! My data are stored in a secured server, I have to use my fingerprint to access them and I cannot export even aggregated info!

I have a similar database on my laptop but still I cannot share the texts, but I guess a sample after the cutting, splitting and reshaping will not be an issue...

I tried with convention as the target term and I get the same error. Here are the code and the feedback from Stata; everything looks fine until the string distance step.

Thank you!

Code:

gen nwords = wordcount(text)

keep if nw<500
(171 observations deleted)

split tex, gen(w) parse(" ")
variables created as string:

drop text

reshape long w, i(filenamebr) j(seq)

 drop if w==""
(1,062 observations deleted)

local YourWord = "conven"

strdist w "`YourWord'", gen(dist)
"conven invalid name

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str96 filenamebr str31 w
"accord2008assurances" "convention"              
"accord2008assurances" "collective"              
"accord2008assurances" "nationale"               
"accord2008assurances" "personnel"               
"accord2008assurances" "agences"                 
"accord2008assurances" "generales"               
"accord2008assurances" "assurances"              
"accord2008assurances" "septembre"               
"accord2008assurances" "avenant"                 
"accord2008assurances" "septembre"               
"accord2008assurances" "etendue"                 
"accord2008assurances" "arrete"                  
"accord2008assurances" "mai"                     
"accord2008assurances" "jorf"                    
"accord2008assurances" "juin"                    
"accord2008assurances" "textes"                  
"accord2008assurances" "attaches"                
"accord2008assurances" "accord"                  
"accord2008assurances" "novembre"                
"accord2008assurances" "relatif"                 
"accord2008assurances" "egalite"                 
"accord2008assurances" "salariale"               
"accord2008assurances" "femmes"                  
"accord2008assurances" "hommes
etendu"          
"accord2008assurances" "arrete"                  
"accord2008assurances" "juin"                    
"accord2008assurances" "jorf"                    
"accord2008assurances" "juin"                    
"accord2008assurances" "

textes"              
"accord2008assurances" "attaches
accord"        
"accord2008assurances" "novembre"                
"accord2008assurances" "relatif"                 
"accord2008assurances" "egalite"                 
"accord2008assurances" "salariale"               
"accord2008assurances" "femmes"                  
"accord2008assurances" "hommes
idcc
"          
"accord2008assurances" "
signataires
fait"     
"accord2008assurances" "
fait"                  
"accord2008assurances" "paris"                   
"accord2008assurances" "novembre"                
"accord2008assurances" "
organisations"         
"accord2008assurances" "employeurs"              
"accord2008assurances" "
federation"            
"accord2008assurances" "nationale"               
"accord2008assurances" "syndicats"               
"accord2008assurances" "agents"                  
"accord2008assurances" "generaux"                
"accord2008assurances" "assurances"              
"accord2008assurances" "agea"                    
"accord2008assurances" "
organisations"         
"accord2008assurances" "syndicales"              
"accord2008assurances" "salaries"                
"accord2008assurances" "
cfdt"                  
"accord2008assurances" "sn"                      
"accord2008assurances" "cftc"                    
"accord2008assurances" "cfe"                     
"accord2008assurances" "cgc
numero"             
"accord2008assurances" "bo
"                    
"accord2008assurances" "
liste"                 
"accord2008assurances" "conventions"             
"accord2008assurances" "texte"                   
"accord2008assurances" "rattache
convention"    
"accord2008assurances" "collective"              
"accord2008assurances" "nationale"               
"accord2008assurances" "personnel"               
"accord2008assurances" "agences"                 
"accord2008assurances" "generales"               
"accord2008assurances" "assurances"              
"accord2008assurances" "septembre"               
"accord2008assurances" "avenant"                 
"accord2008assurances" "septembre"               
"accord2008assurances" "etendue"                 
"accord2008assurances" "arrete"                  
"accord2008assurances" "mai"                     
"accord2008assurances" "jorf"                    
"accord2008assurances" "juin"                    
"accord2008assurances" "
preambule
article
en"
"accord2008assurances" "vigueur"                 
"accord2008assurances" "etendu

les"           
"accord2008assurances" "partenaires"             
"accord2008assurances" "sociaux"                 
"accord2008assurances" "branche"                 
"accord2008assurances" "rappellent"              
"accord2008assurances" "attachement"             
"accord2008assurances" "principe"                
"accord2008assurances" "egalite"                 
"accord2008assurances" "remuneration"            
"accord2008assurances" "definie"                 
"accord2008assurances" "article"                 
"accord2008assurances" "code"                    
"accord2008assurances" "travail"                 
"accord2008assurances" "femmes"                  
"accord2008assurances" "hommes"                  
"accord2008assurances" "travail"                 
"accord2008assurances" "parcours"                
"accord2008assurances" "professionnel"           
"accord2008assurances" "valeura"                 
"accord2008assurances" "fin"                     
"accord2008assurances" "elements"                
"accord2008assurances" "composant"               
end


Mike Lacy

Join Date: Apr 2014

Posts: 2416
#10

21 Apr 2023, 11:36

I took your example data from above, which I presume was meant to show what your data would look like after the -reshape- command. I fixed a few items such as

Code:

"accord2008assurances" "hommes
etendu"
// fixed
"accord2008assurances" "hommesetendu"

I assumed the line break for this and similar items in your example was just an accident, and not intended. I then ran just the two important lines of code:

Code:

local YourWord = "conven"
strdist w "`YourWord'", gen(dist)

When I did this, I had no errors. I checked the results:

Code:

. list if dist <=4

     +------------------------------------------+
     |           filenamebr            w   dist |
     |------------------------------------------|
  1. | accord2008assurances         code      3 |
  2. | accord2008assurances   convention      4 |
  3. | accord2008assurances       hommes      4 |
  4. | accord2008assurances          cfe      4 |
     +------------------------------------------+

This looks correct to me. Perhaps this has to do with a difference in your version of Stata or your version of -strdist-? I'm using Stata 15.1, and

Code:

-which strdist-

shows

Code:

*! version 1.2 09dec2017 Michael D Barker Felix Pöge

When I looked inside -strdist-, I saw that it has some differences for different versions of Stata, e.g., v. 12, so there might be a problem there if you have an older Stata version.

Beyond those things, I'd suggest you reinstall -strdist-. If the problem still happens, you can run the program (as I just did) with -set trace on-. I think the trace output will be small enough that you can post the whole thing, but if not, you can contact me through the Stata message system and arrange for you to send me the output offline.
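
The reinstall-and-trace steps suggested above could look roughly like this. It is only a sketch: it assumes -strdist- was installed from SSC, and it reuses the variable w and the target "conven" from the example in this thread.

```stata
* Remove the installed copy, fetch the current one from SSC, and clear
* any old copy of the program that Stata still holds in memory.
ado uninstall strdist
ssc install strdist, replace
discard
which strdist                  // shows the version line of the installed file

* Rerun the failing command with tracing switched on.
local YourWord = "conven"
set trace on
capture noisily strdist w "`YourWord'", gen(dist)
set trace off
```

Using -capture noisily- keeps a do-file running after an error, so that -set trace off- is still reached.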
Ylenia Curci

Join Date: Sep 2017

Posts: 72
#11

25 Apr 2023, 02:13

Hi Mike, I have the same version of Stata that you have. I got rid of the strdist version I had (1.0) and installed the one you are using. Here is the error I get:

<istmt>: 3499 matalev1var() not found

And here the trace

--------------------------------------------------------------------------------- begin strdist ---
- syntax anything [if] [in] , [Generate(name)]
- gettoken first remain : anything , qed(isstring1)
- if `"`first'"' == "" error 102
= if `"w"' == "" error 102
- gettoken second remain : remain , qed(isstring2)
- if `"`second'"' == "" error 102
= if `"conven"' == "" error 102
- if `"`remain'"' != "" error 103
= if `""' != "" error 103
- if `isstring1' & `isstring2' strdist0var `if' `in' , first(`"`first'"') second(`"`second'"') gen(
> `generate')
= if 0 & 1 strdist0var , first(`"w"') second(`"conven"') gen(dist)
- else {
- local strscalar ""
- if `isstring1' {
= if 0 {
local strscalar = `"`first'"'
local first ""
}
- else if `isstring2' {
= else if 1 {
- local strscalar = `"`second'"'
= local strscalar = `"conven"'
- local second ""
- }
- strdist12var `first' `second' `if' `in' , m(`"`strscalar'"') gen(`generate')
= strdist12var w , m(`"conven"') gen(dist)
-------------------------------------------------------------------------- begin strdist12var ---
- syntax varlist(min=1 max=2 string) [if] [in] , [Match(string)] [GENerate(name)]
- marksample touse , strok
- if `"`generate'"' == "" local generate "strdist"
= if `"dist"' == "" local generate "strdist"
- confirm new variable `generate'
= confirm new variable dist
- tokenize "`varlist'"
= tokenize "w"
- if "`2'"=="" mata: matalev1var("`1'" , `"`match'"' , "`generate'" , "`touse'")
= if ""=="" mata: matalev1var("w" , `"conven"' , "dist" , "__000000")
<istmt>: 3499 matalev1var() not found
---------------------------------------------------------------------------- end strdist12var ---
}
----------------------------------------------------------------------------------- end strdist ---

Thank you!
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#12

25 Apr 2023, 12:28

Hi Ylenia,
I'm sorry to say that I don't know enough to tell what the problem is, but I think that the authors of -strdist- might easily diagnose it by looking at your trace listing. Something simple is going wrong, so that the Mata function -matalev1var-, which is defined and plainly visible within -strdist.ado-, is not being properly recognized in your installation of -strdist-. I have two suggestions for you. Suggestion 2), abandoning -strdist-, is probably the better one <grin>.

1) First, you could try the newer version of -strdist- that the authors mention, namely -ustrdist-. It should have been installed when you installed -strdist-. Check that it's available with -help ustrdist-. There's a possibility that there is some strange problem that is fixed in that version of the program.

Code:

local YourWord = "conven"
ustrdist w "`YourWord'", gen(dist)

If you still get an error, I'd advise you to clean up the little problems with your data example above, and send that and the trace listing to one of the authors, who have offered their names and addresses in the help file. Even presuming you take my suggestion 2) (which I recommend), knowing your problem might be useful to the authors.

2) A quicker and maybe better solution might be found by using a different string distance program! Here are two you can install:

Code:

ssc describe matchit
ssc describe jarowinkler

Each of these has various options regarding string distance, but I'm no expert on those. Also, I think some of the string matching procedures in these programs depend on English spelling/pronunciation. However, I tried each of these programs with default choices and got sensible results on your data. Why don't you try these programs and report what happens? I would guess that finding the best choices of options in these programs might improve results for you, and I think you could determine that by trial and error.

Code:

gen yourword = "conven" // these programs require your search word be in a variable
replace w = lower(w) // -matchit- is case-sensitive
matchit w yourword, gen(simscore)
//
jarowinkler w yourword, gen(distjw)
// Note that both programs measure similarity, not distance, so larger scores are better.
// Examine the results, best matches at the top.
gsort -simscore
browse w yourword simscore distjw

Last edited by Mike Lacy; 25 Apr 2023, 12:52.
Ylenia Curci

Join Date: Sep 2017

Posts: 72
#13

26 Apr 2023, 01:54

Thank you Mike, jarowinkler works perfectly for my case, especially because of the possibility of using pwinkler, as my target term is usually misspelled at the end (basically, it is a French word that not even French people know how to write, but they usually get the first 3 or 4 letters right :-) )
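
For readers with the same prefix-is-usually-right situation, here is a hedged sketch of how the pwinkler setting might be compared against the default. It assumes -jarowinkler- is installed from SSC, that w holds one lower-cased word per row as in the example data, and that pwinkler() takes a number (please confirm the exact syntax with -help jarowinkler-). The names target, jw_default and jw_prefix, the value 0.2, and the 0.85 cutoff are all hypothetical choices for illustration.

```stata
* The search term must be stored in a variable for -jarowinkler-.
gen target = "bordereau"

* Similarity with the program's default prefix weight ...
jarowinkler w target, gen(jw_default)

* ... and with a heavier weight on a shared prefix
* (Jaro-Winkler scaling factors conventionally stay at or below 0.25).
jarowinkler w target, gen(jw_prefix) pwinkler(0.2)

* Inspect the candidates, best matches first.
gsort -jw_prefix
list w jw_default jw_prefix if jw_prefix >= 0.85
```

A larger prefix weight rewards words that agree on their first few letters, which suits a term that is typically misspelled only at the end.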
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#14

26 Apr 2023, 09:28

Good to hear this worked. Questions similar to yours are fairly common on StataList, and having some advice for other people in a situation like yours would be useful. I'd encourage you to post a short summary and worked example to instruct other users once you have your final solution worked out. Copy/pasting text from the viewer window after a -help- command is an easy way I have found to get publicly available example text.