  • How to use import delimited command with "new line" as delimiter?

    I tried, but failed, to use a newline as the delimiter with the import delimited command.
    How can I read a text file into Stata with one observation per line, regardless of how long each line is?

  • #2
    The delimiter specified in this command is used to define column delimiters. Have you read the Statalist FAQ and/or the help file for import delimited? Showing exactly what you tried and explaining what you are attempting to get for the outcome will make it much easier for others to help.



    • #3
      Code:
      copy "https://www.bing.com/" bing.html
      import delimited bing.html, delimiter("") encoding("utf-8") stringcols(_all) varnames(nonames) clear
      import delimited bing.html, delimiter("\r") encoding("utf-8") stringcols(_all) varnames(nonames) clear
      import delimited bing.html, delimiter("\r\n") encoding("utf-8") stringcols(_all) varnames(nonames) clear
      import delimited bing.html, delimiter("\n") encoding("utf-8") stringcols(_all) varnames(nonames) clear
      All of these failed. I want to read the file in with one observation per line, so the resulting dataset should contain only one column.

      The code that works:
      Code:
      clear
      copy "https://www.bing.com/" bing.html
      infix str v1 1-1024 using "bing.html", clear
      The problem is that I don't know the length of v1 in advance.

      How can I deal with that?

      In my real use case I am not reading HTML files; this is just an example.



      • #4
        The problem may be that the HTML does not include new line characters. Have you tried using insheet?

        Code:
        tempfile x
        copy "https://www.bing.com/" `x'.html
        insheet using `x'.html



        • #5
          It seems to me that you are imagining that "\r" has special meaning to Stata, when it does not.
          Code:
          . display `"\r"'
          \r
          Instead, consider the following.
          Code:
          . copy "https://www.bing.com/" bing.html
          
          . local dlm = char(10)
          
          . import delimited bing.html, delimiter(`"`dlm'"') ///
          >    encoding("utf-8") stringcols(_all) varnames(nonames) clear
          (1 var, 22 obs)
          
          . describe
          
          Contains data
            obs:            22                          
           vars:             1                          
           size:        91,301                          
          ------------------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ------------------------------------------------------------------------------------------------
          v1              strL    %9s                   
          ------------------------------------------------------------------------------------------------
          Sorted by: 
               Note: Dataset has changed since last saved.
          
          . list in 1/5
          
               +-----------------------------------------------------------------------------------------+
               | v1                                                                                      |
               |-----------------------------------------------------------------------------------------|
            1. | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/.. |
            2. | si_ST=new Date;                                                                         |
            3. | //]]></script><head><meta content="text/html; charset=utf-8" http-equiv="content-type.. |
            4. | _G={ST:(si_ST?si_ST:new Date),Mkt:"en-US",RTL:false,Ver:"11",IG:"DF1B392485A141BCB82D.. |
            5. | //]]></script><style type="text/css">html{overflow:auto}a,body{font-family:"Segoe UI".. |
               +-----------------------------------------------------------------------------------------+



          • #6
            Very good. char(10) solved it.
            Since "\t" is already accepted as the delimiter for tab (char(9)) in the import delimited command,
            why not also accept "\n" for char(10) and "\r" for char(13)? Thank you!

            insheet cannot deal with utf-8 encoding.


            Preferably, Stata should aim to include "encoding and decoding functionality" in the next version. Since it supports utf-8 now, it should not be difficult for it to become a versatile language in the near future.
            Last edited by Jimmy Yang; 11 Mar 2016, 23:36.



            • #7
              Well, the reason for not directly supporting char(10) {newline} and char(13) {return} as delimiters is that those two characters are interpreted by Stata as line end characters in the input text file, and are not available as delimiters that separate fields within lines. Since your objective was to read each text line into a single variable, any character that does not appear in your data would serve equally well as a delimiter for import delimited. Substitute char(12) {formfeed} for char(10) in my example for a demonstration. I originally chose char(10) because I knew Stata would not have any embedded within lines, and did not want to complicate the post by adding this explanation. But since you asked, this is "why not".
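The char(12) substitution described above can be sketched as follows. This is only a minimal illustration, not a tested recipe for every file: the file name demo_lines.txt and its three lines are made up for the example, and it assumes the data contain no formfeed characters and no double quotes that import delimited might bind or strip.

```stata
* Hypothetical demo file: three lines of different lengths, one with a
* comma and one with a tab, to show they are not treated as delimiters.
tempname fh
file open `fh' using "demo_lines.txt", write text replace
file write `fh' "alpha,beta" _n
file write `fh' "gamma" _tab "delta" _n
file write `fh' "a much longer third line; its length need not be known in advance" _n
file close `fh'

* Use formfeed, which does not occur in the data, as the only delimiter,
* so each text line lands in a single string variable.
local dlm = char(12)
import delimited using "demo_lines.txt", delimiter(`"`dlm'"') ///
    stringcols(_all) varnames(nonames) clear

describe
list
```

Because the delimiter never occurs in the file, every line should arrive whole in one variable, without the width having to be declared up front as infix requires.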

              Also, are you suggesting that Stata "include encoding and decoding functionality" for the insheet command? If so you will be disappointed, because help insheet reports that as of Stata 13 insheet is no longer an official part of Stata, having been superseded by import delimited.
              Last edited by William Lisowski; 12 Mar 2016, 19:21.



              • #8
                Thank you! Indeed, char(13) {return} results in a new line. It solved my problem!
