Import delimited gets leading spaces wrong

Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#1

29 Mar 2018, 15:19

I am using -import delimited- to import a file with whitespace separating values. As far as I can tell, any leading whitespace is interpreted as a delimiter and becomes a missing value. Is there any solution other than modifying the input file? Here is a simple demonstration:

Code:

. type test.raw
 1.00
2.00

. import delimited using test.raw,delimiter(whitespace,collapse)
(2 vars, 2 obs)

. list

     +---------+
     | v1   v2 |
     |---------|
  1. |  .    1 |
  2. |  2    . |
     +---------+

. version
version 14.2

Notice how the "1.00" in the first observation is preceded by a single space, which convinces Stata that there are two variables and that the first is missing. I really don't want to change the format of the input data, and no other program seems to take this interpretation of whitespace used as a delimiter.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

29 Mar 2018, 19:14

Oddly enough, being less explicit about the delimiter frees the import delimited command to work it out on its own, correctly - at least in Stata 15.1.

Code:

. type test.raw
 1.00
2.00

. import delimited test.raw
(1 var, 2 obs)

. list

     +----+
     | v1 |
     |----|
  1. |  1 |
  2. |  2 |
     +----+

. version
version 15.1
Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#3

30 Mar 2018, 05:32

Thanks. That works back to Stata version 13, but according to the help files it shouldn't:

By default, import delimited will check if the file is delimited by tabs or commas based on the first line of data. Specify delimiters("\t") to use a tab character, or specify delimiters("whitespace") to use whitespace as a delimiter.

I have a feeling support will not be withdrawn, though. If a documented bug is a feature, is an undocumented feature a bug?
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

30 Mar 2018, 06:18

From example 2 in the full documentation (version 15) we learn that the example I gave works only because there was just one field on each input line.

Code:

. type test2.raw
 1.00 42.00
2.00 28.00

. import delimited test2.raw
(1 var, 2 obs)

. list

     +-------------+
     |          v1 |
     |-------------|
  1. |  1.00 42.00 |
  2. |  2.00 28.00 |
     +-------------+

Back to the drawing board, unless indeed your actual data has just one field per line.

William Lisowski

Join Date: Dec 2014
Posts: 10150
#5

30 Mar 2018, 07:15

Here's an approach - read the lines into a single string variable and use the more robust split to get what you need.

Code:

. type test2.raw
 1.00 42.00
2.00 28.00

. import delimited test2.raw, delimiter("~") // read as single string variable
(1 var, 2 obs)

. split v1, generate(var) destring
variables born as string: 
var1  var2
var1: all characters numeric; replaced as byte
var2: all characters numeric; replaced as byte

. list

     +---------------------------+
     |          v1   var1   var2 |
     |---------------------------|
  1. |  1.00 42.00      1     42 |
  2. |  2.00 28.00      2     28 |
     +---------------------------+
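
The same approach can be sketched as a self-contained do-file. This is only a sketch under stated assumptions: it recreates the two-line test2.raw itself, assumes "~" never occurs in the data, and adds a few steps that are not in the session above (varnames(nonames) and stringcols(_all) so that the first line is kept as data and v1 stays a string even with one number per line, plus tab replacement and blank squeezing before split).

```stata
* Sketch only: recreate the two-line file from the thread.
tempname fh
file open `fh' using test2.raw, write text replace
file write `fh' " 1.00 42.00" _n
file write `fh' "2.00 28.00" _n
file close `fh'

* Assumes "~" never occurs in the data, so each line arrives whole in v1.
* varnames(nonames) keeps the first line as data; stringcols(_all) keeps
* v1 a string even when a line holds a single number.
import delimited test2.raw, delimiter("~") varnames(nonames) stringcols(_all) clear

* Turn tabs into blanks, then drop leading/trailing blanks and squeeze
* runs of blanks down to one.
replace v1 = subinstr(v1, char(9), " ", .)
replace v1 = trim(itrim(v1))

* split parses on blanks by default; destring converts the pieces.
split v1, generate(var) destring
drop v1
list
```

Dropping v1 at the end is optional; keep it if the raw line is useful for checking the parse.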


Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#6

18 Mar 2023, 07:39

I think Stata's confusion is brought about by the fact that the "whitespace" delimiter treats two consecutive spaces as a missing value. This is unexpected. It also has the effect that a leading space on a line becomes a missing value.
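
For comparison, free-format infile separates values on any run of blanks or tabs, so a leading or doubled blank does not produce a missing value. The following is a minimal sketch, assuming the file holds exactly two numeric fields per line; the file contents and the names v1 and v2 are illustrative, not taken from the real data.

```stata
* Sketch only: recreate a small file with a leading blank and a doubled blank.
tempname fh
file open `fh' using test2.raw, write text replace
file write `fh' " 1.00  42.00" _n
file write `fh' "2.00 28.00" _n
file close `fh'

* Free-format infile: values are separated by one or more blanks or tabs,
* so leading and repeated blanks do not create extra fields.
* Assumes exactly two numeric fields per line; v1 and v2 are illustrative names.
infile v1 v2 using test2.raw, clear
list
```

Note that free-format infile reads values in sequence and does not treat a line break as the end of an observation, so a line with a field missing would shift every value after it.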
