Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • -import Excel- with multiple lines in a cell

    I have an Excel file where some of the cells have multiple lines. Does Stata retain the EOL character that separates them? I am inclined to think that it does: while -display- shows concatenated strings, -browse- shows the first string only, so Stata must be changing behavior when it encounters EOLs. If the EOL is indeed there, what is the ASCII code, and what's the best way to parse on it given that it is a special character that may be difficult to stick into the string functions?
    -- Stas Kolenikov || http://stas.kolenikov.name
    -- Principal Survey Scientist, Abt SRBI
    -- Opinions stated in this post are mine only


  • #2
    I think you're looking for char(10) and char(13) which are the ASCII character codes for a new line and carriage return. You may be able to use regular expressions to handle this as well. I was trying to work on some more extensive regex extensions but I could put together something quickly if you needed some simple functions (e.g., regex replace that is POSIX compliant).

    Comment


    • #3
      I believe that it's just ASCII decimal 10 (line feed) alone.
      Attached Files

      Comment


      • #4
        Joseph Coveney the link you mentioned also mentions the ASCII code 13 (carriage return) in addition to ASCII code 10 (new line). If skolenik is working with Unicode values, it may not perform as expected. However, there's a new way to deal with these types of issues now:

        Code:
        net inst jregex, from("http://wbuchanan.github.io/StataRegex/")
        
        // Can replace into a new variable
        jregex replace varwithlinecharacters newvariablewithresult, p("\n") rep(" ") m 
        
        // Or replace the values in place
        jregex replace varwithlinecharacters, p("\n") rep(" ") m
        The m option in this case turns on multiline mode in case there are multiple new line characters in the same field. It isn't tested for this specific usage, but there are some simple examples that show how it can be used as well as links to the Javadocs with more details about the different metacharacters that the program exposes.

        Comment


        • #5
          As Joseph as hinted, the fix is simple, just replace all instances of char(10) with a space

          Code:
          gen svarfix = subinstr(svar,char(10)," ",.)
          You can also do the same using regular expressions:

          Code:
          . * replace all newline, carriage return, and tab characters with a space
          . gen svarfix2 = ustrregexra(svar,"[\n\r\t]"," ")

          Comment


          • #6
            Robert Picard the first solution makes a strong assumption about the underlying encoding of the data. Your second solution is likely to be safer, but assumes that the regular expression processor developed by Stata uses a multiline mode search. I ran into issues with this when I was trying to piece together the Mata HTML library because I was scraping the element definitions from the W3C website. I think ustrregexra is what I ended up using in the end, but there were a mix of new line, carriage return, force feed, and several other different characters that all serve the same function when rendered.

            Comment


            • #7
              I dont understand the issue of encoding, the character for a newline is the same in ASCII and UTF-8. In UTF-8, all non-ASCII characters are encoded using two to four bytes, each of these > 127 so you can't accidentally break a UTF-8 character by targeting char(10).

              If you somehow managed to insert newlines and carriage returns into a string variable, then you can target them safely with subinstr():

              Code:
              . clear
              
              . set obs 1
              number of observations (_N) was 0, now 1
              
              . gen s = "a" + char(10) + char(13) + "b" + char(10) + char(13) + "c"
              
              . 
              . gen s2 = subinstr(s,char(10),"",.)
              
              . replace s2 = subinstr(s2,char(13),"",.)
              (1 real change made)
              
              . dis length(s)
              7
              
              . dis length(s2)
              3
              
              . list s2
              
                   +-----+
                   |  s2 |
                   |-----|
                1. | abc |
                   +-----+
              I also don't quite understand the multiline reference but this is how the regex equivalent works
              Code:
              . gen s3 = ustrregexra(s,"[\n\r]","")
              
              . dis length(s3)
              3
              
              . list s3
              
                   +-----+
                   |  s3 |
                   |-----|
                1. | abc |
                   +-----+
              
              .

              Comment


              • #8
                Great, thank you guys wbuchanan Joseph Coveney Robert Picard .
                -- Stas Kolenikov || http://stas.kolenikov.name
                -- Principal Survey Scientist, Abt SRBI
                -- Opinions stated in this post are mine only

                Comment


                • #9
                  Robert Picard I apologize if what I mentioned previously implied that I was saying it would break anything. It was more to point out that there are several different line terminating characters in unicode. That said, I would add an additional metacharacter to the \n and \r characters that you mentioned previously: \R. The \R metacharacter should capture the other line terminating characters that are not captured by the line feed and carriage return characters corresponding to the \n and \r characters.

                  Comment


                  • #10
                    Hello. I ran into a big issue importing from an excel sheet that had LS (line separators; U2028) embedded randomly. In my case I was replacing with " <br> " instead of " ". Anyhow, Robert Picard, your excellent solution using [\n\r\t] was a life-saver. Thank you. I did try incorporating wbuchanan's suggested \R but then all Uppercase characters also got replaced as " "!! I'm not sure I fully understand what \R does. Is there any index to the unicode regex properties?

                    I found this from unicode but I didn't find it very helpful.
                    Also this https://medium.com/the-stata-guide/r...a-6e5c200ef27c
                    guide, but it did not include \n, \r, or \t.

                    - Matias

                    Comment


                    • #11
                      After some more digging I think I figured it out. Are these the ANSI C escape sequences? Still don't see "\R", however.
                      -Matias

                      Comment

                      Working...
                      X