String: Regexm for a specific word in a string (1st)

Marvin Aliaga

Join Date: Feb 2015

Posts: 255
#1

String: Regexm for a specific word in a string (1st)

06 Jul 2018, 13:02

Hello everyone,

I am cleaning a string variable that contains cities. I am using different techniques to clean this variable such as soundex, regexm, substr. At this point, I would like to replace city equals "East Elmhurst" if the first word equals "E" and the 4 last letters of the second word equals "URST". You can see my attempt below. However it doesn't work since my codes look at the entire string instead of a specific word (1st, 2nd, etc). For example my code changes ELMURST for EAST ELMHURST and this is not correct. Is there a way to say to use the regexm expression but only for a specific word in the string or what other codes could be used?

Code:

replace city="EAST ELMHURST" if regexm(city, "E") & substr((city),-4,.)=="URST" replace city="EAST ELMHURST" if regexm(city, "EAS") & regexm(city, "EL")

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str22 city "EAST ELMHURST NEW YORK" "EAST ELMHORST" "EST ELMURST" "EAST ELMHURST" "EAST NORWICK" "EAST MARION" "EAST MEADOW" "EAST EMHURST" "EAST HEMPSTEAD" "EAST NORTHPORT" "EAST NEW YORK" "EAST HAMPTON" "EASTON" "EAST NY" "ELMURST" "ELMHURST" "E ELMHURST" "E. ELMHURST" "GLENHURST" end

Thank you in advance,
Marvin
Tags: None
Igor Paploski

Join Date: Oct 2014

Posts: 174
#2

06 Jul 2018, 13:24

Are there new data being entered on the database or is this just a one-time cleaning step? If it's a one-time cleaning step, I suggest running - tab city - and manually coding it like:

Code:

replace city = "EAST ELMHURST" if (city=="E ELMHURST" | city=="EAST ELMHURST NEW YORK" | city=="...")

and so on. It's not simple, but since you are actually going every answer (which will repeat, so it can not be a huge amount of different variants), and can be sure that you you are coding all answers the way you want. If data will still be entered, I suggest training the staff collecting data, perhaps making this variable a categorical one with labels (not an open question).
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#3

06 Jul 2018, 13:25

I've always found the regular expressions functions complicated to use. Sometimes they are needed, but you can do this without them:

Code:

gen city2 = city replace city2 = "EAST ELMHURST" if /// substr(word(city, 1), 1, 1) == "E" & /// substr(word(city, 2), -3, .) == "RST"

Note: I've loosened up your condition from last four characters being URST to just the last 3 characters being RST. That way you can pick up the mis-spelling of EAST ELMHORST in observation 2, without introducing any false corrections, at least in your example.

Added: Crossed with #2 which presents a different perspective on the problem.
1 like
Comment
Marvin Aliaga

Join Date: Feb 2015

Posts: 255
#4

06 Jul 2018, 13:39

Thank you so much both. The documentation for substr is on the String manual, right? Or there is another manual for just substr with more examples, etc?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#5

06 Jul 2018, 13:43

It's in the [FN] volume of the manual.

The easiest way to get to the Stata manual section for any Stata command or function is to first run -help name_of_command_or_function-, and then look for the link to the manual section near the top of the page that opens in the Viewer. The link is in blue.
Comment

Marvin Aliaga

Join Date: Feb 2015
Posts: 255

06 Jul 2018, 14:07

hi Clyde Schechter but the substr part is very small. Anyhow, I modify your code to clean the ELMHURST entries.

Code:

replace city2 = "ELMHURST" if substr(word(city, 1), 1, 2) ///
== "EL" & substr(word(city, 1), -3, .) == "RST"
replace city2 = "ELMHURST" if substr(word(city, 1), 1, 2) ///
== "EL" & substr(word(city, 1), -3, .) == "URT"

I tried to combine these codes by the line below but it didn't work. Is that approach valid?

Code:

replace city2 = "ELMHURST" if substr(word(city, 1), 1, 2) ///
== "EL" & substr(word(city, 1), -3, .) == "[R|U][S|R]T"

Last edited by Marvin Aliaga; 06 Jul 2018, 14:32.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#7

06 Jul 2018, 15:06

No, -substr()- doesn't understand [R|U][S|R]T. That's regular-expression syntax, and -substr()- just works with regular old strings. To combine these into one, it would be:

Code:

replace city2 = "ELMHURST" if substr(word(city, 1), 1, 2) /// == "EL" & inlist(substr(word(city, 1), -3, .) , "RST", "URT")
Comment
Marvin Aliaga

Join Date: Feb 2015

Posts: 255
#8

09 Jul 2018, 07:21

Great! Thank you so much Clyde!
Comment

Marvin Aliaga

Join Date: Feb 2015
Posts: 255

12 Jul 2018, 15:20

Hi Clyde Schechter ,

I am sorry to go back to this post but this is related. Some of the entries for my string variable "Cities" have the word NY on it. For example, Rego Park NY o, BX NY, Queens NY. I would like to get rid off the NY word only if the string have the word NY and have more than 1 word. My attempt was

Code:

replace city = subinstr(city, "NY", "", .) if regexm(city,  "NY") & wordcount(city)>1

. This doesn't work since my code strings that contains NY in any part of the string (ex. SUNNY SIDE). I need a code that finds a word in the string that equals ""NY" and that wordcount >1. Adding one more complications, there are cases that the entire word New York is included ( WOODSIDE NEW YORK). Any thoughts?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str22 city
"EAST ELMHURST NEW YORK"
"EAST ELMHORST"         
"EST ELMURST"           
"EAST ELMHURST"         
"EAST NORWICK"          
"EAST MARION"           
"EAST MEADOW"           
"EAST EMHURST"          
"EAST HEMPSTEAD"        
"EAST NORTHPORT"        
"EAST NEW YORK"         
"EAST HAMPTON"          
"EASTON"                
"EAST NY"               
"ELMURST"               
"ELMHURST"              
"E ELMHURST"            
"E. ELMHURST"           
"GLENHURST"             
"ELMHURT"               
"ELMHURST"              
"SUFFOLK NY"            
"SADELBROOK NY"         
"ST. ALBANY"            
"STONY POINT"           
"SUNNYSIDE QUEENS"      
"SUNNY SIDE"            
"ULSTER NY"             
"WESTCHESTER NY"        
"WEST NY"               
"WEST NYACK"            
"WOODSIDE NY"           
"WATERBURY, NY"         
"WOODSIDE NEW YORK"     
"NEW YORK ELMHURST"     
end

----------

Comment

Igor Paploski

Join Date: Oct 2014

Posts: 174
#10

12 Jul 2018, 16:07

One thing that might be worth doing is cleaning the answers in general. For example, removing . and , from the cities names:

Code:

replace city = subinstr(city, ".", "", .) replace city = subinstr(city, ",", "", .)

I need a code that finds a word in the string that equals ""NY" and that wordcount >1.
Assuming NY will never be on the beginning of the city name, you could add a space to your subinstr argument:

From:

Code:

replace city = subinstr(city,"NY", "", .) if regexm(city,"NY") & wordcount(city)>1

To:

Code:

replace city = subinstr(city," NY", "", .) if regexm(city,"NY") & wordcount(city)>1

This will catch "WEST NYACK" though. I would work with exceptions now (do the code above and add if city != "WEST NYACK").

Adding one more complications, there are cases that the entire word New York is included ( WOODSIDE NEW YORK). Any thoughts?
Take for example EAST NEW YORK, EAST ELMHURST NEW YORK and WOODSIDE NEW YORK. I assume you only want remove NEW YORK from East Elmhurst? I would move forward adding exceptions again (the same as with West Nyack above). If you want to remove all NEW YORK, I would replace all its occurrences and run a similar code as the one for the first requirement of your post:

Code:

replace city = subinstr(city,"NEW YORK","NY", .) replace city = subinstr(city," NY", "", .) if regexm(city,"NY") & wordcount(city)>1 & city!="WEST NYACK"

Best;

Last edited by Igor Paploski; 12 Jul 2018, 16:08. Reason: Minor fixes.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30121

#11

12 Jul 2018, 16:43

Igor is right that this becomes very complicated quickly as more and more possibilities arise. You might actually be better off creating a new data set consisting just of the distinct values of the variable city, and then hand editing in a corresponding cleaned value, and then -merge-ing back to the original data.

That said, the following code will handle all the complications in your example:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str22 city
"EAST ELMHURST NEW YORK"
"EAST ELMHORST"         
"EST ELMURST"           
"EAST ELMHURST"         
"EAST NORWICK"          
"EAST MARION"           
"EAST MEADOW"           
"EAST EMHURST"          
"EAST HEMPSTEAD"        
"EAST NORTHPORT"        
"EAST NEW YORK"         
"EAST HAMPTON"          
"EASTON"                
"EAST NY"               
"ELMURST"               
"ELMHURST"              
"E ELMHURST"            
"E. ELMHURST"           
"GLENHURST"             
"ELMHURT"               
"ELMHURST"              
"SUFFOLK NY"            
"SADELBROOK NY"         
"ST. ALBANY"            
"STONY POINT"           
"SUNNYSIDE QUEENS"      
"SUNNY SIDE"            
"ULSTER NY"             
"WESTCHESTER NY"        
"WEST NY"               
"WEST NYACK"            
"WOODSIDE NY"           
"WATERBURY, NY"         
"WOODSIDE NEW YORK"     
"NEW YORK ELMHURST"     
end


gen _city = reverse(trim(itrim(upper(city))))
replace _city = substr(_city, 4, .) if substr(_city, 1, 3) == "YN " ///
    & !inlist(_city, reverse("EAST NY"), reverse("WEST NY"))
replace _city = subinstr(_city, reverse(" NEW YORK"), "", 1) ///
    if substr(_city, 1, length(" NEW YORK")) == reverse(" NEW YORK") ///
    & ! inlist(_city, reverse("EAST NEW YORK"), reverse("WEST NEW YORK"))
replace _city = subinstr(_city, ",", "", 1) if substr(_city, 1, 1) == ","
gen wanted = reverse(_city)
list city wanted, noobs clean

Note, I have put in special handling for EAST/WEST NEW YORK/NY because EAST NEW YORK is in fact a postal area in Brooklyn, and West New York is a city in New Jersey, so you should not reduce those to just East and West, respectively.

Again, I have avoided the use of regular expressions, though at this point it would arguably be better to use them. Still, it is worth knowing about the -reverse()- function, which is very useful for cleaning up the ends of strings.

Comment

Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#12

12 Jul 2018, 21:21

Other solution, which seems working. It might also require additional work for exceptional cases, such as "East NY" "NYABC NY" "ABC YORK" (if any).

Code:

egen a=ends(city), last replace a="NEW YORK" if a=="YORK" gen wanted=cond((a=="NY"| a=="NEW YORK") & wordcount(city)>1, subinstr(city, " "+a, "", .),city)
Comment
Marvin Aliaga

Join Date: Feb 2015

Posts: 255
#13

13 Jul 2018, 09:47

Thank you all so much for the response Romalpa Akzo Clyde Schechter Igor Paploski !

I decided to use Igor codes since it is a little bit more straight forward.

Code:

replace city = subinstr(CITY," NY", "", .) if regexm(CITY,"NY") & wordcount(CITY)>1 & /// !inlist(CITY, "WEST NYACK", "WEST NY", "EAST NY", "E NY BK", "E NY BK") replace city = subinstr(CITY," NEW YORK", "", .) if regexm(CITY,"NEW YORK") /// & wordcount(CITY)>2 & !inlist(CITY, "EAST NEW YORK", "WEST NEW YORK")

Clyde Schechter You code is awesome as well but I don't understand the rationale for reversing the order of the strings. I guess it has to do to with catching the NY word?

ou might actually be better off creating a new data set consisting just of the distinct values of the variable city, and then hand editing in a corresponding cleaned value, and then -merge-ing back to the original data.

Yes, I am doing this. I am only keeping distinct city values to make data cleaning easier. Once I have all my codes ready, I will run it on my complete dataset. I don't think merging is necessary.

Again, thank you all for making me be better at strings!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#14

13 Jul 2018, 12:34

...but I don't understand the rationale for reversing the order of the strings. I guess it has to do to with catching the NY word?

The reason for using reverse-ordered strings in my code is that while it is very easy in Stata to do something to the beginning of a string (-substr(myvar, 1, whatever)-), it is harder to do something to the end of a string, because you don't always know where in the string it begins. In your problem, the NY's and NEW YORK's of interest are all at the string's end, but the strings all have different lengths, so it is difficult to target them with the usual string functions. You can't even rely on -strpos()- to find them for you because, -strpos("ALBANY, NY", "NY")- will pick the NY in ALBANY, not the final NY.

By reversing the order of the strings, " NY" at the end becomes "YN " at the beginning and can be "attacked" by focusing on -substr(myvar 1, 3)-. So you do the surgery on the reversed string, and then at the very end, you just undo the reversal--which is easy because reverse() is its own inverse function.

I could have made a bit less use of -reverse()- by, for example, referring to "KROY WEN ". I didn't do that because:
1. I'm not all that good at spelling backwards and would likely have made mistakes.
2. It would, to me at least, be less obvious what the code was doing at first glance.
Comment

Romalpa Akzo

Join Date: Oct 2017
Posts: 369

#15

13 Jul 2018, 16:08

One more way to go, with -substr-, which might be lesser in dealing with exceptional cases.

Code:

gen cut = 3 if substr(city, -3, .) == " NY" & !inlist(city, "WEST NY", "EAST NY")
replace cut = 4 if substr(city, -4, .) == ", NY"
replace cut = 9 if substr(city, -9, .) == " NEW YORK" &!inlist(city, "EAST NEW YORK", "WEST NEW YORK")
gen wanted = substr(city, 1, length(city)-cut)

Announcement