  • Import delimited gets leading spaces wrong

    I am using -import delimited- to import a file with whitespace separating values. As far as I can tell, any leading whitespace is interpreted as a delimiter and becomes a missing value. Is there any solution other than modifying the input file? Here is a simple demonstration:

    Code:
    . type test.raw
     1.00
    2.00
    
    . import delimited using test.raw,delimiter(whitespace,collapse)
    (2 vars, 2 obs)
    
    . list
    
         +---------+
         | v1   v2 |
         |---------|
      1. |  .    1 |
      2. |  2    . |
         +---------+
    
    . version
    version 14.2
    Notice how the "1.00" in the first observation is preceded by a single space, which convinces Stata that there are two variables, the first of which is missing. I really don't want to change the format of the input data, and no other program seems to interpret whitespace delimiters this way.

  • #2
    Oddly enough, being less explicit about the delimiter frees the import delimited command to work it out on its own, correctly - at least in Stata 15.1.
    Code:
    . type test.raw
     1.00
    2.00
    
    . import delimited test.raw
    (1 var, 2 obs)
    
    . list
    
         +----+
         | v1 |
         |----|
      1. |  1 |
      2. |  2 |
         +----+
    
    . version
    version 15.1


    • #3
      Thanks. That works back to Stata version 13, but according to the help files it shouldn't:

      By default, import delimited will check if the file is delimited by tabs or commas based on the
      first line of data. Specify delimiters("\t") to use a tab character, or specify
      delimiters("whitespace") to use whitespace as a delimiter.
      I have a feeling support will not be withdrawn, though. If a documented bug is a feature, is an undocumented feature a bug?


      • #4
        From example 2 in the full documentation (version 15) we learn that the example I gave works only because there was just one field on each input line.
        Code:
        . type test2.raw
         1.00 42.00
        2.00 28.00
        
        . import delimited test2.raw
        (1 var, 2 obs)
        
        . list
        
             +-------------+
             |          v1 |
             |-------------|
          1. |  1.00 42.00 |
          2. |  2.00 28.00 |
             +-------------+
        Back to the drawing board, unless indeed your actual data has just one field per line.


        • #5
          Here's an approach: read each line into a single string variable by naming a delimiter that never occurs in the data ("~" here), then use the more robust split command to get what you need.
          Code:
          . type test2.raw
           1.00 42.00
          2.00 28.00
          
          . import delimited test2.raw, delimiter("~") // read as single string variable
          (1 var, 2 obs)
          
          . split v1, generate(var) destring
          variables born as string: 
          var1  var2
          var1: all characters numeric; replaced as byte
          var2: all characters numeric; replaced as byte
          
          . list
          
               +---------------------------+
               |          v1   var1   var2 |
               |---------------------------|
            1. |  1.00 42.00      1     42 |
            2. |  2.00 28.00      2     28 |
               +---------------------------+


          • #6
            I think Stata's confusion arises because the "whitespace" delimiter treats two consecutive spaces as enclosing a missing value. This is unexpected, and it also means that a leading space on a line becomes a missing value.
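
            To make the two interpretations concrete, here is a small Python sketch (not Stata, and not a claim about how import delimited is actually implemented). It assumes the two lines of test.raw from post #1 and uses two hypothetical helper functions: one treats every single space as a delimiter, the other treats any run of whitespace as one delimiter and ignores leading blanks.

```python
def split_every_space(line):
    """Treat each single space as a delimiter; an empty field becomes None (missing)."""
    return [float(tok) if tok else None for tok in line.split(" ")]


def split_whitespace_runs(line):
    """Treat any run of whitespace as one delimiter; leading/trailing blanks are ignored."""
    return [float(tok) for tok in line.split()]


# The two lines of test.raw from post #1; the first has one leading space.
lines = [" 1.00", "2.00"]

naive = [split_every_space(s) for s in lines]
tolerant = [split_whitespace_runs(s) for s in lines]

print(naive)     # first line yields an empty (missing) field before 1.0
print(tolerant)  # one field per line
```

            Under the first rule the leading space produces an empty first field, so the first line has two fields and the second only one, which matches the v1/v2 pattern listed in post #1 once the short row is padded with a missing value. The second rule gives one field per line, which is what most whitespace-tolerant readers do.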
