  • How can I match a carriage return/line feed using regular expressions?

    Dear Statasticians,

    I'm trying to match a pattern in a plain text file containing line breaks, e.g. from...

    Authors
    Mueller A. Candrian G. Kropotov JD. Ponomarev VA. Baschera GM.
    Authors Full Name
    Mueller, Andreas. Candrian, Gian. Kropotov, Juri D. Ponomarev, Valery A.


    ... I'm trying to extract the authors, i.e. the second line of the text. The word "Authors" is followed by a line break, then two blanks (which you probably can't see because the forum software has stripped them), then the authors, and the line ends with another line break.

    I've tried to match this using

    Code:
    loc rgx (Authors[^!-~]  )(.*)([^!-~])
    replace authors = regexs(2) if regexm(abstract, "`rgx'")
    but it returns an empty string.
    How can I match the CR/LF characters? I've searched far and wide, but there doesn't seem to be a dedicated regex symbol for them in Stata.

    Alex

    Equipment: Mac OS 10.10, Stata 13

  • #2
    I don't know, but one strategy is simply to blank out "Authors", etc.


    • #3
      Hi Statalisters,

      I'm also interested in an answer to this question. When I ran into a comparable issue (reading in data via ODBC and having to strip CR and LF characters), I did not find a solution using regular expressions either. It may be that Stata's regular expression engine cannot handle this, as it seems to have been designed only to process the values of a dataset's variables (i.e. one-line strings).

      I ended up using the -subinstr- function to replace ASCII codes 10 and 13 by spaces after reading in the data, and parsing the result (as Nick suggests) by ignoring the parts that should not be interpreted ("Authors" in your example).
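
      A minimal sketch of that replacement, assuming the text sits in a string variable called abstract (the name used in the first post) and that every CR and LF should become a single space:

      ```stata
      * char(13) is CR and char(10) is LF; "." replaces every occurrence
      replace abstract = subinstr(abstract, char(13), " ", .)
      replace abstract = subinstr(abstract, char(10), " ", .)
      ```

      After this, a CR+LF pair shows up as two consecutive spaces, so any pattern applied afterwards has to allow for the extra blanks.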

      As long as your input data is saved in a plain text file, you could also (instead of using -import delimited- or -insheet-) use -file read- to import the data line by line. This will take care of carriage returns and line feeds for you. All you have to do is ignore any line starting with "Authors".
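
      The line-by-line approach could look roughly like this. The file name ovid_export.txt is hypothetical, and the loop merely displays the lines it keeps; what to do with them is up to you:

      ```stata
      * Sketch only: read a text file one line at a time and skip label lines
      tempname fh
      file open `fh' using "ovid_export.txt", read text
      file read `fh' line
      while r(eof) == 0 {
          * macval() stops Stata from expanding any macro characters in the line
          if substr(`"`macval(line)'"', 1, 7) != "Authors" {
              display `"`macval(line)'"'
          }
          file read `fh' line
      }
      file close `fh'
      ```

      Note that this test also skips "Authors Full Name", since that line starts with "Authors" as well.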

      Regards
      Bela


      • #4
        Maybe filefilter can be useful, too.
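
        As a rough sketch with hypothetical file names: filefilter could rewrite every Windows line end (CR+LF, pattern \W) as a Unix line end (LF, pattern \U), so that only one character has to be matched afterwards. This assumes the file really uses CR+LF throughout.

        ```stata
        * Copy the file, converting CR+LF line ends to plain LF
        filefilter "ovid_export.txt" "ovid_export_lf.txt", from(\W) to(\U) replace
        ```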

        Best
        Daniel


        • #5
          Hi all,

          Thank you for your input. I think I solved the problem. The crucial bit of information is that, depending on your operating system and/or the source of your text file, a line break may be coded as CR, LF, CR+LF or LF+CR. Although Mac OS X is supposed to use only LF, the file I have apparently uses a combination of both (it's a .txt file exported from an OVID search). That means I have to match two characters, not just one.

          To match either of them, you can use the regex term
          Code:
           [^ -~]
          which negates (^) the range of printable ASCII characters (<space> to ~), so that it matches only characters outside that range, such as the control characters LF and CR. In the example from my first post

          Authors
          Mueller A. Candrian G. Kropotov JD. Ponomarev VA. Baschera GM.
          Authors Full Name
          Mueller, Andreas. Candrian, Gian. Kropotov, Juri D. Ponomarev, Valery A.


          I can use

          Code:
          replace authors = regexs(2) if regexm(abstract, "(Authors[^ -~][^ -~]  )(.*)([^ -~][^ -~]Authors)")
          to extract only the second line:

          Code:
               +--------------------------------------------------------------------+
               |                                                            authors |
               |--------------------------------------------------------------------|
            1. | Mueller A.  Candrian G.  Kropotov JD.  Ponomarev VA.  Baschera GM. |
               +--------------------------------------------------------------------+
          (Note that the "Authors" at the end of the regular expression is needed because otherwise Stata would match greedily to the very end of the text.)

          Alex


          • #6
            Originally posted by Alex Gamma
            [...]
            I can use

            Code:
            replace authors = regexs(2) if regexm(abstract, "(Authors[^ -~][^ -~]  )(.*)([^ -~][^ -~]Authors)")
            to extract only the second line[...]
            This looks like a good solution for your data, assuming that it consists only of plain ASCII characters. But be aware of a caveat: your regular expression does not allow for extended ASCII codes, such as those of ISO 8859-1 (which largely coincides with Windows code page 1252), with character codes 128 to 255. That seems to be the character encoding Stata uses.

            In short: accented characters, German umlauts and other extended ASCII characters fall outside the range <space> to ~, so [^ -~] would match them as if they were control characters, even though they may be valid text (at least in other data sources).

            Of course, your solution could be adapted to something like
            Code:
            [^`=char(32)'-`=char(255)']
            to prevent this.

            Regards
            Bela


            • #7
              Thanks, Bela,

              I was not aware that notation like
              Code:
              `=char(255)'
              was possible. A precise solution for my case is then

              Code:
              replace authors = regexs(2) if regexm(abstract, "(Authors[`=char(13)'][`=char(10)']  )(.*)([`=char(13)'][`=char(10)']Authors)")
              This matches the CR+LF pairing that is used as the line break characters in my file.
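
              For anyone unsure which line-break coding their own file uses, a quick check along these lines might help before hard-coding the pattern. It is only a sketch and assumes the text is held in a string variable called abstract:

              ```stata
              * Count observations that contain each candidate line-break coding
              count if strpos(abstract, char(13) + char(10)) > 0
              count if strpos(abstract, char(10) + char(13)) > 0
              count if strpos(abstract, char(13)) > 0
              count if strpos(abstract, char(10)) > 0
              ```

              A non-zero count in the first line together with a zero count in the second would point to CR+LF rather than LF+CR.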

              Alex
