  • Remove spaces from string if consecutive one letter characters or numbers

    Hi, how would I go about removing spaces from strings such as the following: "1 2 B L GROW A I M INC" becomes "12 BL GROW AIM INC"?
    Last edited by michael joe; 13 Feb 2019, 21:17.

  • #2
    There is no easy solution obvious to me. You know that some internal spaces are correct, some not.

    Code:
    replace whatever = subinstr(whatever, "B L", "BL", .) 
    replace whatever = subinstr(whatever, "A I M", "AIM", .)
    are the kinds of edit needed: you must still watch out for false positives.
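
    A minimal sketch of how such edits might be batched, assuming a hand-built list of spaced-out tokens (the two tokens below come from the example in #1; the variable name whatever is a placeholder, as above). Padding with blanks is one way to limit matches to whole tokens, so that "CLUB LTD" is not turned into "CLUBLTD"; it does not remove the need to check the results.

    ```stata
    * hypothetical list of spaced-out tokens known to be wrong
    foreach bad in "B L" "A I M" {
        * the corrected token is the same text with its spaces removed
        local good = subinstr("`bad'", " ", "", .)
        * pad with blanks so that only whole tokens match, then trim the padding
        replace whatever = trim(subinstr(" " + whatever + " ", " `bad' ", " `good' ", .))
    }
    ```

    Note that trim() also drops any leading or trailing blanks that were already in the variable.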



    • #3
      Hi Nick. Thanks for the help. I tried regexs() and regexm(), starting by closing the space between [A-Z][ ][A-Z] when it sits inside [ ][A-Z][ ][A-Z][ ]. So if I had a word like W A S P, or words like W A S PROOF, I could remove the space between A and S and then work from there. Sadly, my many attempts with these functions all failed. I am looking for help with performing this single task.



      • #4
        Your question still seems to be that in #1, so I can't add to my answer. As said, you know that some spaces are incorrect and must remove those, but there isn't code for "remove incorrect spaces".



        • #5
          As someone who has had a lot of experience with regular expressions in Perl, my approach to this problem would be to create a text file and use Perl to apply the changes to that file, then merge the results back into the Stata dataset. Stata's handling of regular expressions, while improved in the unicode version of the functions, still makes it difficult to implement the equivalent of
          Code:
          s/ (\d) (\d) / \1\2 /g
          to turn "A 1 2 B 3 4 C" into "A 12 B 34 C" in Perl (but I would first refresh my memory of Perl, it's been a while and this example is untested).



          • #6
            Originally posted by William Lisowski
            [...] Stata's handling of regular expressions, while improved in the unicode version of the functions, still makes it difficult to implement the equivalent of
            Code:
            s/ (\d) (\d) / \1\2 /g
            to turn "A 1 2 B 3 4 C" into "A 12 B 34 C" in Perl (but I would first refresh my memory of Perl, it's been a while and this example is untested).
            I slightly disagree; the Unicode regex engine Stata has used since version 14 seems quite comprehensive to me, and IMHO is a tremendous improvement over the former regex functions. I do agree, however, with anyone pointing out that users need much more documentation on the features of the engine (to my knowledge, we don't have any).

            From trial and error, I can say that the engine even supports lookahead and lookbehind; this makes it easy to solve the task, if I understood it correctly, in one line:
            Code:
            version 14
            clear
            input str30(stringvar)
            "1 2 B L GROW A I M INC"
            "W H O CREATED T H I S MESS"
            "R E G E X ROCK"
            end
            replace stringvar=ustrregexra(stringvar,"(?<![A-Z0-9][A-Z0-9])(?=[ ][A-Z0-9]( |$)) ","",0)
            list
            As an explanation: I understood the question as "remove any space character that (1) is followed by only single characters (and, maybe, consecutive white space) or END OF LINE, and (2) is not preceded by more than a single character".
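
            To make the pieces easier to read, the same pattern can be assembled from named parts. This is only a restatement of the call above, not a different method; the local macro names are made up for illustration.

            ```stata
            clear
            input str30(stringvar)
            "1 2 B L GROW A I M INC"
            "W H O CREATED T H I S MESS"
            end
            * the blank must not come right after two letters/digits, i.e. after a longer word
            local notafterword "(?<![A-Z0-9][A-Z0-9])"
            * starting at the blank: a blank, one letter/digit, then a blank or the end
            local beforesingle "(?=[ ][A-Z0-9]( |$))"
            * the trailing blank in the pattern is the character that gets deleted
            replace stringvar = ustrregexra(stringvar, "`notafterword'`beforesingle' ", "", 0)
            list
            ```

            Both lookarounds only test the context; the single blank at the end of the pattern is the only character that is actually matched and replaced by the empty string.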

            Does this do the trick?

            Kind regards
            Bela
            Last edited by Daniel Bela; 14 Feb 2019, 07:56. Reason: formatting



            • #7
              Daniel Bela - Many thanks. Your elegant code deserves my study; my knowledge of contemporary regular expression syntax is woefully incomplete.

              I slightly disagree; the Unicode regex engine Stata has used since version 14 seems quite comprehensive to me, and IMHO is a tremendous improvement over the former regex functions. I do agree, however, with anyone pointing out that users need much more documentation on the features of the engine (to my knowledge, we don't have any).
              I agree wholeheartedly with the entire quotation.

              To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at
              http://userguide.icu-project.org/strings/regexp
              While writing this response, I discovered that whatever I did the other day that convinced me at that time that ustrregexra() did not support back references in the substitution string was incorrect. I would not need Perl to do what I hoped, leveraging my ancient understanding of regex syntax.
              Code:
              . clear
              
              . set obs 1
              number of observations (_N) was 0, now 1
              
              . generate str20 text = "A 1 2 B 3 4 C"
              
              . generate str20 new = ustrregexra(text, " (\d) (\d) "," \1\2 ")
              
              . list, noobs
              
                +-----------------------------+
                |          text           new |
                |-----------------------------|
                | A 1 2 B 3 4 C   A 12 B 12 C |
                +-----------------------------+



              • #8
                Thanks, William Lisowski, for pointing me to that Statalist post mentioning the implementation of the ICU regex engine; I missed it until now. This is really helpful!

                (And: I also did not know until now that you can use Perl-like back-references in ustrregexra(); thanks for that as well!)

                Regards
                Bela



                • #9
                  consecutive one letter characters or numbers:

                  1 2 B L GROW A I M INC becomes 12 BL GROW AIM INC
                  The implication of separating numbers and letters (like in the example) makes the puzzle more complicated than its description. And Daniel Bela's regular expression, while being a tricky and enjoyable one, still does not solve for this separation. Then I am curious for any improvement of the regex solution: still 1-line coding?

                  For now, the code below, a detour via -split-, gives good hints toward the desired target. The second condition of the if qualifier (the part after |, shown in blue in the original post) handles the number/letter separation noted above. These are only hints: cautious edits to catch any false deductions would still be needed afterward.
                  Code:
                  clear
                  input str39 stringvar
                  "1 2 B L GROW A I M INC"                 
                  "3 4 1 B L D  G NUMBER   LETTER SEPARATE"
                  "R O O M 1 4 5  KINGCROSS R O A D"       
                  "SUCH A BEAUTIFUL DAY"                   
                  "OOPS, THIS I S A W R O N G  ONE"        
                  end
                  
                  replace stringvar = trim(itrim(stringvar))
                  split stringvar, gen(v)
                  
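                  * work from the last token back to the second; prefix a blank unless both
                  * neighbours are single characters of the same type (both numeric or both not)
                  forval i = `r(nvars)'(-1)2 {
                      replace v`i' = " "+ v`i' if length(v`i')*length(v`=`i'-1')>1 | 0*real(v`i') != 0*real(v`=`i'-1')
                  }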
                  
                  egen new_stringvar = concat(v*)
                  drop v*
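
                  On the question of a one-line regex that also keeps numbers and letters apart: one possible sketch, untested against real data, is to require that the single characters on both sides of the blank are of the same type. It assumes that the ICU engine accepts the fixed-width lookbehinds used here and that the text is uppercase; the local macro names are only there for readability.

                  ```stata
                  clear
                  input str39 stringvar
                  "1 2 B L GROW A I M INC"
                  "R O O M 1 4 5  KINGCROSS R O A D"
                  end
                  replace stringvar = trim(itrim(stringvar))
                  * a blank between two single letters: the previous token is exactly
                  * one letter and the next token is exactly one letter
                  local letters "(?<![^ ][A-Z])(?<=[A-Z]) (?=[A-Z]( |$))"
                  * the same idea for single digits
                  local digits "(?<![^ ][0-9])(?<=[0-9]) (?=[0-9]( |$))"
                  generate new_stringvar = ustrregexra(stringvar, "`letters'|`digits'", "")
                  list
                  ```

                  The intended result for the first line is 12 BL GROW AIM INC. As with the -split- approach, a run such as I S A W R O N G would still be merged into one word, so the output needs checking.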



                  • #10
                    Thanks guys. I will try out your methods to see how they work. I actually came up with the following method before reading the above. It works for what I have as it turns out numbers don't need to be separated from non-numeric characters as long as they belong to the same consecutive spaced out character pattern. Like Romalpa's answer, I also took a detour with split, but hers looks much nicer than mine.

                    Code:
                    gen companyname2 = subinstr(companynamecrsp, " ", ".",.)
                    split companyname2, parse(.) gen(companies)
                    
                    gen companyappend1=companies1
                    
                    local i=2
                    local j=1
                    local k=3
                    
                    foreach comps of varlist companies* { 
                    replace companyappend1=companyappend1+" "+companies`i' if strlen(companies`i')>1 & companies`i'!="&"
                    replace companyappend1=companyappend1+companies`i' if strlen(companies`i')==1 & companies`i'!="&"
                    replace companyappend1=companyappend1+companies`i' if strlen(companies`j')==1 & strlen(companies`k')==1 & companies`i'=="&" 
                    replace companyappend1=companyappend1+" "+companies`i' if strlen(companies`j')>1 & strlen(companies`k')>1 & companies`i'=="&" 
                    replace companyappend1=companyappend1+" "+companies`i' if strlen(companies`j')>1 & strlen(companies`k')==1 & companies`i'=="&"
                    replace companyappend1=companyappend1+" "+companies`i' if strlen(companies`j')==1 & strlen(companies`k')>1 & companies`i'=="&" 
                    local i=`i'+1
                    }
                    
                    drop companyname31 companies*



                    • #11
                      Regarding post #7: The code I presented there is incorrect, and the description of what I thought I had accomplished is imperfect. Ignoring what I wrote above, let me state my current understanding.

                      Stata's ustrregexra() function supports "capture group" references in the substitution string. Capture groups are surrounded with parentheses in the regular expression being matched, and they are referenced as $1, $2, ... in the substitution string.

                      Code:
                      . clear
                      
                      . set obs 1
                      number of observations (_N) was 0, now 1
                      
                      . generate str20 text = "A 1 2 B 3 4 C"
                      
                      . generate str20 new = ustrregexra(text, " (\d) (\d) "," $1$2 ")
                      
                      . list, noobs
                      
                        +-----------------------------+
                        |          text           new |
                        |-----------------------------|
                        | A 1 2 B 3 4 C   A 12 B 34 C |
                        +-----------------------------+
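
                      A possible extension of this example, offered as a sketch rather than something established in this thread: because the pattern above consumes the blank after the second digit, a longer run such as 1 2 3 cannot be collapsed in one pass. Moving the second digit into a lookahead, and using \b word boundaries so that multi-digit numbers are left alone, is one way around that. I am assuming here that \b behaves in Stata as the ICU documentation describes.

                      ```stata
                      clear
                      set obs 1
                      generate str20 text = "A 1 2 3 B 4 5 C"
                      * $1 puts back the captured digit and the blank after it is dropped;
                      * the lookahead checks for a following single digit without consuming it
                      generate str20 new = ustrregexra(text, "\b(\d) (?=\d\b)", "$1")
                      list, noobs
                      ```

                      Because the lookahead does not consume the next digit, that digit is still available as the start of the following match.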

