Remove spaces from string if consecutive one letter characters or numbers

michael joe

Join Date: May 2015

Posts: 50
#1

13 Feb 2019, 21:13

Hi, how would I go about removing spaces from strings such as the following? 1 2 B L GROW A I M INC becomes 12 BL GROW AIM INC.

Last edited by michael joe; 13 Feb 2019, 21:17.
Nick Cox

Join Date: Mar 2014

Posts: 36070
#2

14 Feb 2019, 01:18

There is no easy solution obvious to me. You know that some internal spaces are correct, some not.

Code:

replace whatever = subinstr(whatever, "B L", "BL", .)
replace whatever = subinstr(whatever, "A I M", "AIM", .)

are the kinds of edit needed: you must still watch out for false positives.
michael joe

Join Date: May 2015

Posts: 50
#3

14 Feb 2019, 01:59

Hi Nick. Thanks for the help. I tried regexs() and regexm(), starting by closing the space in [A-Z][ ][A-Z] whenever it appears within [ ][A-Z][ ][A-Z][ ]. So if I had a word like W A S P, or words like W A S PROOF, I could remove the space between A and S and then work from there. Sadly, my many attempts with these functions failed. I am looking for help with performing this single task.
Nick Cox

Join Date: Mar 2014

Posts: 36070
#4

14 Feb 2019, 02:23

Your question still seems to be that in #1, so I can't add to my answer. As said, you know that some spaces are incorrect and must remove those, but there isn't code for "remove incorrect spaces".
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

14 Feb 2019, 06:45

As someone who has had a lot of experience with regular expressions in Perl, my approach to this problem would be to create a text file and use Perl to apply the changes to that file, then merge the results back into the Stata dataset. Stata's handling of regular expressions, while improved in the unicode version of the functions, still makes it difficult to implement the equivalent of

Code:

s/ (\d) (\d) / \1\2 /g

to turn "A 1 2 B 3 4 C" into "A 12 B 34 C" in Perl (but I would first refresh my memory of Perl, it's been a while and this example is untested).
Daniel Bela

Join Date: Apr 2014

Posts: 246
#6

14 Feb 2019, 07:55

Originally posted by William Lisowski

[...] Stata's handling of regular expressions, while improved in the unicode version of the functions, still makes it difficult to implement the equivalent of

Code:

s/ (\d) (\d) / \1\2 /g

to turn "A 1 2 B 3 4 C" into "A 12 B 34 C" in Perl (but I would first refresh my memory of Perl, it's been a while and this example is untested).

I slightly disagree; the Unicode regex engine Stata has used since version 14 seems quite comprehensive to me, and IMHO is a tremendous improvement over the former regex functions. I do agree, however, with anyone pointing out that users need much more documentation on the features of the engine (to my knowledge, we don't have any).

From trial and error, I can say that the engine even supports lookahead and lookbehind; this makes it easy to solve the task, if I understood it correctly, in one line:

Code:

version 14
clear
input str30(stringvar)
"1 2 B L GROW A I M INC"
"W H O CREATED T H I S MESS"
"R E G E X ROCK"
end
replace stringvar=ustrregexra(stringvar,"(?<![A-Z0-9][A-Z0-9])(?=[ ][A-Z0-9]( |$)) ","",0)
list

As an explanation: I understood the question as "remove any space character that (1) is followed by only single characters (and, maybe, consecutive white space) or END OF LINE, and (2) is not preceded by more than a single character".

Does this do the trick?
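Since some matches may still be false positives, as cautioned in #2, one option is to write the result to a new variable and list only the rows that changed. This is only a minimal sketch: the variable names cleaned and changed are made up for illustration, and the pattern is the same one as above.

Code:

version 14
clear
input str30(stringvar)
"1 2 B L GROW A I M INC"
"SUCH A BEAUTIFUL DAY"
end
* keep the original string and store the edited copy separately
generate cleaned = ustrregexra(stringvar,"(?<![A-Z0-9][A-Z0-9])(?=[ ][A-Z0-9]( |$)) ","",0)
* flag observations in which at least one space was removed
generate byte changed = cleaned != stringvar
list stringvar cleaned if changed

Only the first observation should be listed here, because no space in the second string meets both conditions.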

Kind regards
Bela

Last edited by Daniel Bela; 14 Feb 2019, 07:56. Reason: formatting
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

14 Feb 2019, 08:53

Daniel Bela - Many thanks. Your elegant code deserves my study; my knowledge of contemporary regular expression syntax is woefully incomplete.

I slightly disagree; the Unicode regex engine Stata has used since version 14 seems quite comprehensive to me, and IMHO is a tremendous improvement over the former regex functions. I do agree, however, with anyone pointing out that users need much more documentation on the features of the engine (to my knowledge, we don't have any).

I agree wholeheartedly with the entire quotation.

To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at
http://userguide.icu-project.org/strings/regexp

While writing this response, I discovered that whatever I did the other day that convinced me at that time that ustrregexra() did not support back references in the substitution string was incorrect. I would not need Perl to do what I hoped, leveraging my ancient understanding of regex syntax.

Code:

. clear

. set obs 1
number of observations (_N) was 0, now 1

. generate str20 text = "A 1 2 B 3 4 C"

. generate str20 new = ustrregexra(text, " (\d) (\d) "," \1\2 ")

. list, noobs

  +-----------------------------+
  |          text           new |
  |-----------------------------|
  | A 1 2 B 3 4 C   A 12 B 12 C |
  +-----------------------------+
Daniel Bela

Join Date: Apr 2014

Posts: 246
#8

14 Feb 2019, 09:03

Thanks, William Lisowski, for pointing me to that Statalist post mentioning the implementation of the ICU regex engine; I missed it until now. This is really helpful!

(And I also did not know until now that you can use Perl-like back-references in ustrregexra(); thanks for that as well!)

Regards
Bela
Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#9

14 Feb 2019, 10:32

consecutive one letter characters or numbers:

1 2 B L GROW A I M INC becomes 12 BL GROW AIM INC

The example implies that numbers and letters must stay separated, which makes the puzzle more complicated than its description. Daniel Bela's regular expression, while clever and enjoyable, does not yet handle this separation. I am curious whether the regex solution can be improved to do so, still in one line.

For now, the code below, a detour via -split-, gives good hints toward the desired target; the second condition in the if qualifier (the comparison of the 0*real() values) handles the number/letter separation noted above. They are only hints: the results would still need careful manual editing for any false deductions.

Code:

clear
input str39 stringvar
"1 2 B L GROW A I M INC"
"3 4 1 B L D G NUMBER LETTER SEPARATE"
"R O O M 1 4 5 KINGCROSS R O A D"
"SUCH A BEAUTIFUL DAY"
"OOPS, THIS I S A W R O N G ONE"
end

replace stringvar = trim(itrim(stringvar))
split stringvar, gen(v)
forval i = `r(nvars)'(-1)2 {
    replace v`i' = " "+ v`i' if length(v`i')*length(v`=`i'-1')>1 | 0*real(v`i') != 0*real(v`=`i'-1')
}
egen new_stringvar = concat(v*)
drop v*
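
On the one-line question raised above, here is a possible sketch. It is untested and assumes that ustrregexra() accepts a positive lookbehind in addition to the negative lookbehind used in #6. The idea is to remove a space only when it sits between two single letters or between two single digits, so that a digit run and a letter run are never merged. The variable name joined is hypothetical.

Code:

version 14
clear
input str39 stringvar
"1 2 B L GROW A I M INC"
"SUCH A BEAUTIFUL DAY"
end
* drop a space only if the token before it and the token after it are both
* single letters, or both single digits; the leading negative lookbehind
* ensures the token before the space is a single character
generate joined = ustrregexra(stringvar,"(?<![A-Z0-9][A-Z0-9])((?<=[A-Z]) (?=[A-Z]( |$))|(?<=[0-9]) (?=[0-9]( |$)))","")
list

If the pattern behaves as intended, the first string becomes 12 BL GROW AIM INC and the second is left unchanged. A run such as I S A W R O N G would still be joined, so the results need the same careful checking as before.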

michael joe

Join Date: May 2015
Posts: 50

#10

14 Feb 2019, 12:06

Thanks guys. I will try out your methods to see how they work. I actually came up with the following method before reading the above. It works for what I have as it turns out numbers don't need to be separated from non-numeric characters as long as they belong to the same consecutive spaced out character pattern. Like Romalpa's answer, I also took a detour with split, but hers looks much nicer than mine.

Code:

gen companyname2 = subinstr(companynamecrsp, " ", ".",.)
split companyname2, parse(.) gen(companies)

gen companyappend1=companies1

local i=2
local j=1
local k=3

foreach comps of varlist companies* {
    replace companyappend1=companyappend1+" "+companies`i' if strlen(companies`i')>1 & companies`i'!="&"
    replace companyappend1=companyappend1+companies`i' if strlen(companies`i')==1 & companies`i'!="&"
    replace companyappend1=companyappend1+companies`i' if strlen(companies`j')==1 & strlen(companies`k')==1 & companies`i'=="&"
    replace companyappend1=companyappend1+" "+companies`i' if strlen(companies`j')>1 & strlen(companies`k')>1 & companies`i'=="&"
    replace companyappend1=companyappend1+" "+companies`i' if strlen(companies`j')>1 & strlen(companies`k')==1 & companies`i'=="&"
    replace companyappend1=companyappend1+" "+companies`i' if strlen(companies`j')==1 & strlen(companies`k')>1 & companies`i'=="&"
    local i=`i'+1
}

drop companyname31 companies*


William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

01 Mar 2019, 14:43

Regarding post #7: The code I presented there is incorrect, and the description of what I thought I had accomplished is imperfect. Ignoring what I wrote above, let me state my current understanding.

Stata's ustrregexra() function supports "capture group" references in the substitution string. Capture groups are enclosed in parentheses in the regular expression being matched, and they are referenced in the substitution string as $1, $2, ... .

Code:

. clear

. set obs 1
number of observations (_N) was 0, now 1

. generate str20 text = "A 1 2 B 3 4 C"

. generate str20 new = ustrregexra(text, " (\d) (\d) "," $1$2 ")

. list, noobs

  +-----------------------------+
  |          text           new |
  |-----------------------------|
  | A 1 2 B 3 4 C   A 12 B 34 C |
  +-----------------------------+