How to use import delimited command with "new line" as delimiter?

Jimmy Yang

Join Date: May 2015

Posts: 54
#1

11 Mar 2016, 00:32

I tried, but failed, to use a new line as the delimiter with the import delimited command.
How can I read a text file into Stata with one observation per line, regardless of the length of each line?
wbuchanan

Join Date: Mar 2014

Posts: 1361
#2

11 Mar 2016, 02:36

The delimiter specified in this command is used to define column delimiters. Have you read the Statalist FAQ and/or the help file for import delimited? Showing exactly what you tried and explaining what you are attempting to get for the outcome will make it much easier for others to help.
Jimmy Yang

Join Date: May 2015
Posts: 54

11 Mar 2016, 02:59

Code:

copy "https://www.bing.com/" bing.html
import delimited bing.html, delimiter("") encoding("utf-8") stringcols(_all) varnames(nonames) clear
import delimited bing.html, delimiter("\r") encoding("utf-8") stringcols(_all) varnames(nonames) clear
import delimited bing.html, delimiter("\r\n") encoding("utf-8") stringcols(_all) varnames(nonames) clear
import delimited bing.html, delimiter("\n") encoding("utf-8") stringcols(_all) varnames(nonames) clear

All of these failed. I want to read the file in with one observation per line, so that the resulting .dta file contains only one column.

The code that works:

Code:

clear
copy "https://www.bing.com/" bing.html
infix str v1 1-1024 using "bing.html", clear

The problem is that I don't know the length of v1 in advance.

How can I deal with that?

In my real application I am not reading HTML files; this is just an example.

wbuchanan

Join Date: Mar 2014

Posts: 1361
#4

11 Mar 2016, 03:54

The problem may be that the HTML does not include new line characters. Have you tried using insheet?

Code:

tempfile x
copy "https://www.bing.com/" `x'.html
insheet using `x'.html
William Lisowski

Join Date: Dec 2014
Posts: 10150

11 Mar 2016, 06:55

It seems to me that you are imagining that "\r" has special meaning to Stata, when it does not.

Code:

. display `"\r"'
\r
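
A quick way to check this, as a minimal sketch rather than anything official: compare the length of the literal string "\r" with the length of what char() returns. The comment lines use * so that the commands can also be typed interactively.

Code:

* "\r" is stored literally: a backslash followed by the letter r
display strlen("\r")
* char(13) {return} and char(10) {newline} are each a single character
display strlen(char(13))
display strlen(char(10))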

Instead, consider the following.

Code:

. copy "https://www.bing.com/" bing.html

. local dlm = char(10)

. import delimited bing.html, delimiter(`"`dlm'"') ///
>    encoding("utf-8") stringcols(_all) varnames(nonames) clear
(1 var, 22 obs)

. describe

Contains data
  obs:            22                          
 vars:             1                          
 size:        91,301                          
------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------
v1              strL    %9s                   
------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

. list in 1/5

     +-----------------------------------------------------------------------------------------+
     | v1                                                                                      |
     |-----------------------------------------------------------------------------------------|
  1. | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/.. |
  2. | si_ST=new Date;                                                                         |
  3. | //]]></script><head><meta content="text/html; charset=utf-8" http-equiv="content-type.. |
  4. | _G={ST:(si_ST?si_ST:new Date),Mkt:"en-US",RTL:false,Ver:"11",IG:"DF1B392485A141BCB82D.. |
  5. | //]]></script><style type="text/css">html{overflow:auto}a,body{font-family:"Segoe UI".. |
     +-----------------------------------------------------------------------------------------+

Jimmy Yang

Join Date: May 2015

Posts: 54
#6

11 Mar 2016, 23:17

Very good. char(10) solved it.
Since "\t" is accepted as the delimiter for tab (char(9)) in the import delimited command, why not also accept "\n" for char(10) and "\r" for char(13)? Thank you!

insheet cannot deal with UTF-8 encoding.

Preferably, Stata should aim to include encoding and decoding functionality in the next version. Since it supports UTF-8 now, it should not be difficult for it to become a versatile language in the near future.

Last edited by Jimmy Yang; 11 Mar 2016, 23:36.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

12 Mar 2016, 18:22

Well, the reason for not directly supporting char(10) {newline} and char(13) {return} as delimiters is that those two characters are interpreted by Stata as line end characters in the input text file, and are not available as delimiters that separate fields within lines. Since your objective was to read each text line into a single variable, any character that does not appear in your data would serve equally well as a delimiter for import delimited - substitute char(12) {formfeed} for char(10) in my example for a demonstration. I originally chose char(10) because I knew Stata would not find any embedded within lines, and did not want to complicate the post by adding this explanation. But since you asked, this is "why not".
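
As a minimal sketch of that substitution, assuming bing.html contains no formfeed characters, the earlier example becomes the following. The replace option on copy is added here only so that the example can be rerun, and the /// continuation means it should be run from a do-file.

Code:

copy "https://www.bing.com/" bing.html, replace
* assumption: char(12) {formfeed} never occurs in bing.html
local dlm = char(12)
import delimited bing.html, delimiter(`"`dlm'"') ///
    encoding("utf-8") stringcols(_all) varnames(nonames) clear
describe

If the assumption holds, describe should again report a single string variable, with one observation per line of the file.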

Also, are you suggesting that Stata "include encoding and decoding functionality" for the insheet command? If so you will be disappointed, because help insheet reports that as of Stata 13 insheet is no longer an official part of Stata, having been superseded by import delimited.

Last edited by William Lisowski; 12 Mar 2016, 19:21.
Spence Dan

Join Date: Jun 2018

Posts: 3
#8

05 Apr 2023, 21:18

Thank you! Indeed, char(13) {return} results in a new line. It solved my problem!