  • Stata 14 -import delimited- different behavior from -insheet-

    The issue is related to the presence of the BOM (byte order mark) and was mentioned earlier in this thread.

    A 3-byte BOM {ef bb bf} is added automatically by MS Windows Notepad.exe and by many other programs that save data in UTF-8 encoding.
    Attached is an example of such a file, with some names written in Cyrillic letters.

    When the file is imported with import delimited, all variable names are correctly converted to lower case. When it is imported with insheet, however, the first variable name is left in mixed case, because insheet is confused by the presence of the BOM.

    1. Can the behavior of insheet replicate the behavior of import delimited in this case?

    Note also that if the encoding option is not specified, import delimited tries to interpret the BOM as part of the first variable name, which creates a variable name that is surprising for a novice user:
    Code:
    ïid             byte    %8.0g                 Id
    2. Can both insheet and import delimited be more tolerant of the presence of a BOM?
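
    To make the byte-level picture concrete, here is a minimal sketch in Python (not Stata, chosen only because it makes raw bytes easy to inspect). The header row below is a hypothetical stand-in for the first line of the attached file, not its actual contents. It shows that the three BOM bytes decode to "ï»¿" when read as Latin-1. A plausible guess, not confirmed in this thread, is that this is where the leading "ï" in the variable name comes from, with the other two characters dropped because they are not valid in a variable name.

```python
import codecs

# Hypothetical stand-in for the first line of the attached file:
# a UTF-8 BOM followed by a tab-separated header row.
raw = codecs.BOM_UTF8 + "Id\tFirstName\n".encode("utf-8")


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the 3-byte UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


# Decoding as Latin-1 maps each BOM byte (ef, bb, bf) to its own character.
as_latin1 = raw.decode("latin-1")

# The "utf-8-sig" codec drops a leading BOM if one is present.
as_utf8 = raw.decode("utf-8-sig")

print(has_utf8_bom(raw))                # True
print(repr(as_latin1.split("\t")[0]))   # 'ï»¿Id'
print(repr(as_utf8.split("\t")[0]))     # 'Id'
```

    An importer that is tolerant of the BOM would behave like the "utf-8-sig" decode above: check for the three bytes at the start of the file and skip them before parsing the header row.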

    Thank you, Sergiy Radyakin

    Stata/MP 14.1, build number 419, MS Windows


    Code:
    . clear all
    
    . 
    . version 14
    
    . 
    . insheet using "c:\temp\ukr_names.txt", tab // lower case is default, manual: "By default, all variable names are imported as lowercase."
    (2 vars, 6 obs)
    
    . describe
    
    Contains data
      obs:             6                          
     vars:             2                          
     size:           102                          
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Id              byte    %8.0g                 Id
    firstname       str16   %16s                  FirstName
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sorted by: 
         Note: Dataset has changed since last saved.
    
    . 
    . clear
    
    . import delimited "c:\temp\ukr_names.txt", delimiter(tab) encoding("utf-8") case(lower) // case option must be specified to convert variable names to lower case
    (2 vars, 6 obs)
    
    . describe
    
    Contains data
      obs:             6                          
     vars:             2                          
     size:           102                          
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    id              byte    %8.0g                 Id
    firstname       str16   %16s                  FirstName
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sorted by: 
         Note: Dataset has changed since last saved.
    
    . 
    end of do-file
    
    .

