-import Excel- with multiple lines in a cell

skolenik

Join Date: Mar 2014

Posts: 102
#1

-import Excel- with multiple lines in a cell

18 Feb 2016, 12:58

I have an Excel file where some of the cells have multiple lines. Does Stata retain the EOL character that separates them? I am inclined to think that it does: while -display- shows concatenated strings, -browse- shows the first string only, so Stata must be changing behavior when it encounters EOLs. If the EOL is indeed there, what is the ASCII code, and what's the best way to parse on it given that it is a special character that may be difficult to stick into the string functions?

-- Stas Kolenikov || http://stas.kolenikov.name
-- Principal Survey Scientist, Abt SRBI
-- Opinions stated in this post are mine only
Tags: Excel
wbuchanan

Join Date: Mar 2014

Posts: 1362
#2

18 Feb 2016, 16:51

I think you're looking for char(10) and char(13) which are the ASCII character codes for a new line and carriage return. You may be able to use regular expressions to handle this as well. I was trying to work on some more extensive regex extensions but I could put together something quickly if you needed some simple functions (e.g., regex replace that is POSIX compliant).
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#3

18 Feb 2016, 17:27

I believe that it's just ASCII decimal 10 (line feed) alone.
Attached Files

Data.xlsx (8.4 KB, 1 view)

Excel EOL Search.smcl (911 Bytes, 1 view)

Excel EOL Search.do (366 Bytes, 1 view)
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1362
#4

19 Feb 2016, 06:13

Joseph Coveney the link you mentioned also mentions the ASCII code 13 (carriage return) in addition to ASCII code 10 (new line). If skolenik is working with Unicode values, it may not perform as expected. However, there's a new way to deal with these types of issues now:

Code:

net inst jregex, from("http://wbuchanan.github.io/StataRegex/") // Can replace into a new variable jregex replace varwithlinecharacters newvariablewithresult, p("\n") rep(" ") m // Or replace the values in place jregex replace varwithlinecharacters, p("\n") rep(" ") m

The m option in this case turns on multiline mode in case there are multiple new line characters in the same field. It isn't tested for this specific usage, but there are some simple examples that show how it can be used as well as links to the Javadocs with more details about the different metacharacters that the program exposes.
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#5

19 Feb 2016, 08:59

As Joseph as hinted, the fix is simple, just replace all instances of char(10) with a space

Code:

gen svarfix = subinstr(svar,char(10)," ",.)

You can also do the same using regular expressions:

Code:

. * replace all newline, carriage return, and tab characters with a space . gen svarfix2 = ustrregexra(svar,"[\n\r\t]"," ")
1 like
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1362
#6

19 Feb 2016, 09:46

Robert Picard the first solution makes a strong assumption about the underlying encoding of the data. Your second solution is likely to be safer, but assumes that the regular expression processor developed by Stata uses a multiline mode search. I ran into issues with this when I was trying to piece together the Mata HTML library because I was scraping the element definitions from the W3C website. I think ustrregexra is what I ended up using in the end, but there were a mix of new line, carriage return, force feed, and several other different characters that all serve the same function when rendered.
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#7

19 Feb 2016, 10:10

I dont understand the issue of encoding, the character for a newline is the same in ASCII and UTF-8. In UTF-8, all non-ASCII characters are encoded using two to four bytes, each of these > 127 so you can't accidentally break a UTF-8 character by targeting char(10).

If you somehow managed to insert newlines and carriage returns into a string variable, then you can target them safely with subinstr():

Code:

. clear . set obs 1 number of observations (_N) was 0, now 1 . gen s = "a" + char(10) + char(13) + "b" + char(10) + char(13) + "c" . . gen s2 = subinstr(s,char(10),"",.) . replace s2 = subinstr(s2,char(13),"",.) (1 real change made) . dis length(s) 7 . dis length(s2) 3 . list s2 +-----+ | s2 | |-----| 1. | abc | +-----+

I also don't quite understand the multiline reference but this is how the regex equivalent works

Code:

. gen s3 = ustrregexra(s,"[\n\r]","") . dis length(s3) 3 . list s3 +-----+ | s3 | |-----| 1. | abc | +-----+ .
Comment
skolenik

Join Date: Mar 2014

Posts: 102
#8

19 Feb 2016, 11:35

Great, thank you guys wbuchanan Joseph Coveney Robert Picard .

-- Stas Kolenikov || http://stas.kolenikov.name
-- Principal Survey Scientist, Abt SRBI
-- Opinions stated in this post are mine only
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1362
#9

21 Feb 2016, 04:32

Robert Picard I apologize if what I mentioned previously implied that I was saying it would break anything. It was more to point out that there are several different line terminating characters in unicode. That said, I would add an additional metacharacter to the \n and \r characters that you mentioned previously: \R. The \R metacharacter should capture the other line terminating characters that are not captured by the line feed and carriage return characters corresponding to the \n and \r characters.
Comment
Matias Iberico

Join Date: Jul 2016

Posts: 4
#10

30 Mar 2023, 18:35

Hello. I ran into a big issue importing from an excel sheet that had LS (line separators; U2028) embedded randomly. In my case I was replacing with " <br> " instead of " ". Anyhow, Robert Picard, your excellent solution using [\n\r\t] was a life-saver. Thank you. I did try incorporating wbuchanan's suggested \R but then all Uppercase characters also got replaced as " "!! I'm not sure I fully understand what \R does. Is there any index to the unicode regex properties?

I found this from unicode but I didn't find it very helpful.
Also this https://medium.com/the-stata-guide/r...a-6e5c200ef27c
guide, but it did not include \n, \r, or \t.

- Matias
Comment
Matias Iberico

Join Date: Jul 2016

Posts: 4
#11

30 Mar 2023, 20:54

After some more digging I think I figured it out. Are these the ANSI C escape sequences? Still don't see "\R", however.
-Matias
Comment

Announcement

-import Excel- with multiple lines in a cell

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment