How can I match a carriage return/line feed using regular expressions?

Alex Gamma

Join Date: Mar 2014

Posts: 18
#1


22 Oct 2014, 17:22

Dear Statasticians,

I'm trying to match a pattern in a plain text file containing line breaks, e.g. from...

Authors
Mueller A. Candrian G. Kropotov JD. Ponomarev VA. Baschera GM.
Authors Full Name
Mueller, Andreas. Candrian, Gian. Kropotov, Juri D. Ponomarev, Valery A.

... I'm trying to extract the authors, i.e. the second line of the text. The word "Authors" is followed by a line break, then two blanks (which you probably can't see because they've been stripped by the forum software), then the authors, and the line ends with another line break.

I've tried to match this using

Code:

loc rgx (Authors[^!-~] )(.*)([^!-~])
replace authors = regexs(2) if regexm(abstract, "`rgx'")

but it returns an empty string.
How can I match the CR/LF characters? I've searched far and wide, but there doesn't seem to be any dedicated regex symbol for them in Stata.

Alex

Equipment: Mac OS 10.10, Stata 13
Nick Cox

Join Date: Mar 2014

Posts: 35703
#2

22 Oct 2014, 18:11

I don't know, but one strategy is simply to blank out "Authors", etc.
Daniel Bela

Join Date: Apr 2014

Posts: 246
#3

23 Oct 2014, 01:09

Hi Statalisters,

I'm also interested in an answer to this question. When encountering a comparable issue (reading in data from ODBC having to erase CR and LF characters), I also did not find a solution using regular expressions. It may be that Stata's regular expression engine is not capable of handling this, as it seems to be created to process values of a data set's variable (i.e. one-line strings) only.

I ended up using the -subinstr- function to replace ASCII codes 10 and 13 by spaces after reading in the data, and parsing the result (as Nick suggests) by ignoring the parts that should not be interpreted ("Authors" in your example).
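
A minimal sketch of that replacement step, assuming the text sits in a string variable called abstract (the name used earlier in this thread) and that every CR and LF should become a single space:

```stata
* Replace every carriage return (ASCII 13) and line feed (ASCII 10) with a space.
* The "." as the fourth argument asks subinstr() to replace all occurrences.
replace abstract = subinstr(abstract, char(13), " ", .)
replace abstract = subinstr(abstract, char(10), " ", .)
```

After this, a CR+LF pair becomes two spaces, so any pattern applied afterwards should allow for a variable number of blanks.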

As long as your input data is saved in a plain text file, you could also (instead of using -import delimited- or -insheet-) use -file read- to import the data line by line. This will take care of carriage returns and line feeds for you. All you have to do is ignore any line starting with "Authors".
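
A rough sketch of the line-by-line approach, under some assumptions: the file name ovid_export.txt is hypothetical, the file holds a single record, and the wanted text is the line directly after a line consisting only of "Authors". The locals grab and authors are illustrative names.

```stata
* Read the file one line at a time; -file read- drops the end-of-line characters.
tempname fh
local grab 0
file open `fh' using "ovid_export.txt", read text
file read `fh' line
while r(eof) == 0 {
    if `grab' {
        * Previous line was "Authors", so this line holds the author list
        local authors `"`line'"'
        local grab 0
    }
    if trim(`"`line'"') == "Authors" {
        local grab 1
    }
    file read `fh' line
}
file close `fh'
display trim(`"`authors'"')
```

Compound double quotes are used around the line contents so that quotation marks inside the text do not break the commands.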

Regards
Bela
daniel klein

Join Date: Mar 2014

Posts: 3857
#4

23 Oct 2014, 02:04

Maybe filefilter can be useful, too.
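
For instance, a sketch that normalizes Windows-style line ends (CR+LF) to Unix-style ones (LF) before the file is imported. Both file names are hypothetical; \W and \U are filefilter's codes for the Windows and Unix end-of-line sequences.

```stata
* Write a copy of the file in which every CR+LF is replaced by a single LF
filefilter ovid_export.txt ovid_lf.txt, from(\W) to(\U) replace
```

With only one line-break character left, a pattern applied later needs to match a single control character per line break.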

Best
Daniel
Alex Gamma

Join Date: Mar 2014

Posts: 18
#5

23 Oct 2014, 07:12

Hi all,

Thank you for your input. I think I solved the problem. The crucial bit of information is that, depending on your operating system and/or the source of your text file, a line break may be coded as either CR, LF, CR+LF or LF+CR (here). Although Mac OS X is supposed to use only LF, the file I have apparently uses a combination of both (it's an exported .txt file from an OVID search). That means I have to match two characters, not just one.

To match either of them, you can use the regex term

Code:

[^ -~]

which excludes (^) the range of printable ASCII characters (<space> to ~), so that it matches anything outside that range, including control characters like LF and CR. In the example from my first post

Authors
Mueller A. Candrian G. Kropotov JD. Ponomarev VA. Baschera GM.
Authors Full Name
Mueller, Andreas. Candrian, Gian. Kropotov, Juri D. Ponomarev, Valery A.

I can use

Code:

replace authors = regexs(2) if regexm(abstract, "(Authors[^ -~][^ -~] )(.*)([^ -~][^ -~]Authors)")

to extract only the second line:

Code:

     +--------------------------------------------------------------------+
     | authors                                                            |
     |--------------------------------------------------------------------|
  1. | Mueller A. Candrian G. Kropotov JD. Ponomarev VA. Baschera GM.     |
     +--------------------------------------------------------------------+

(Note that the "Authors" at the end of the regular expression is needed because otherwise Stata would match greedily to the very end of the text.)

Alex
Daniel Bela

Join Date: Apr 2014

Posts: 246
#6

23 Oct 2014, 09:31

Originally posted by Alex Gamma

[...]
I can use

Code:

replace authors = regexs(2) if regexm(abstract, "(Authors[^ -~][^ -~] )(.*)([^ -~][^ -~]Authors)")

to extract only the second line[...]

This looks like a good solution for your data, assuming that it consists only of plain ASCII characters. But be aware of a caveat: the printable range in your regular expression does not include extended character codes (128 to 255), such as those defined in ISO 8859-1 (closely related to Windows code page 1252), which is the character encoding Stata seems to use.

In short: accented characters, German umlauts and other extended characters fall outside the printable range in your expression, so [^ -~] would match them as if they were control characters, although they may be valid text (at least in other data sources).

Of course, your solution could be adapted to something like

Code:

[^`=char(32)'-`=char(255)']

to prevent this.

Regards
Bela
Alex Gamma

Join Date: Mar 2014

Posts: 18
#7

23 Oct 2014, 12:16

Thanks, Bela,

I was not aware that notation like

Code:

`=char(255)'

was possible. A precise solution for my case is then

Code:

replace authors = regexs(2) if regexm(abstract, "(Authors[`=char(13)'][`=char(10)'] )(.*)([`=char(13)'][`=char(10)']Authors)")

This matches the CR+LF pair that serves as the line break in my file.
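
For anyone who wants to try this without the original file, here is a small self-contained sketch. The one-observation test data and the helper variable eol are made up for illustration. The pattern is built by string concatenation rather than macro expansion, and the character class followed by + is meant to accept CR, LF, CR+LF or LF+CR alike.

```stata
* Build a one-observation test case with CR+LF line breaks (illustrative data)
clear
set obs 1
generate abstract = "Authors" + char(13) + char(10) + "  Mueller A. Candrian G." ///
    + char(13) + char(10) + "Authors Full Name"
generate authors = ""

* eol: a class holding CR and LF, repeated one or more times
generate eol = "[" + char(13) + char(10) + "]+"

* Group 2 captures the text between the first line break (plus leading blanks)
* and the line break that precedes the next "Authors"
replace authors = regexs(2) if regexm(abstract, "(Authors" + eol + " *)(.*)(" + eol + "Authors)")
list authors
```

Concatenating char(13) and char(10) into the pattern keeps the control characters out of the command line itself, which some may find easier to read and debug than the `=char()' notation.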

Alex