The issue is related to the presence of the BOM (byte order mark) and was mentioned earlier in this thread.
A 3-byte BOM {ef bb bf} is added automatically by MS Windows Notepad.exe and many other programs saving data in UTF-8 encoding.
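To make the three bytes concrete, here is a small Python sketch (Python is used only for illustration; it is not part of the Stata session in this post). It builds a hypothetical header line, Id<tab>FirstName, prefixed with the UTF-8 BOM and decodes it three ways. The Latin-1 case is an assumption about how a BOM-unaware reader might see the bytes, not a statement about Stata's internals.

```python
import codecs

# Hypothetical header of a tab-delimited file saved as "UTF-8 with BOM".
raw_header = codecs.BOM_UTF8 + "Id\tFirstName\n".encode("utf-8")


def decode_header(raw: bytes, encoding: str) -> str:
    """Decode the raw bytes and return the first field of the header line."""
    return raw.decode(encoding).splitlines()[0].split("\t")[0]


print(raw_header[:3].hex(" "))                        # ef bb bf
print(repr(decode_header(raw_header, "utf-8-sig")))   # 'Id'        (BOM consumed by the codec)
print(repr(decode_header(raw_header, "utf-8")))       # '\ufeffId'  (BOM kept as U+FEFF)
print(repr(decode_header(raw_header, "latin-1")))     # 'ï»¿Id'     (BOM read as three characters)
```

In the last two cases the first field no longer equals "Id", which is one plausible way for the first column name to be treated differently from the others.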
Attached is an example of such a file, with some names written in Cyrillic letters.
When the file is imported with import delimited, all variable names are correctly converted to lower case. However, when it is imported with insheet, the first variable name is left in mixed case, because insheet is confused by the presence of the BOM.
1. Can insheet be made to behave the same way as import delimited in this case?
Note also that if the encoding option is not specified, import delimited interprets the BOM as part of the first variable name, creating a variable name that is surprising for a novice user:
Code:
ïid byte %8.0g Id
2. Can both insheet and import delimited be more tolerant of the presence of the BOM?
Thank you, Sergiy Radyakin
STATA MP 14.1 build number 419, MS Windows
Code:
. clear all
.
. version 14
.
. insheet using "c:\temp\ukr_names.txt", tab // lower case is default, manual: "By default, all variable names are imported as lowercase."
(2 vars, 6 obs)
. describe
Contains data
  obs:             6
 vars:             2
 size:           102
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
Id              byte    %8.0g                 Id
firstname       str16   %16s                  FirstName
--------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.
.
. clear
. import delimited "c:\temp\ukr_names.txt", delimiter(tab) encoding("utf-8") case(lower) // case option must be specified to convert variable names to lower case
(2 vars, 6 obs)
. describe
Contains data
  obs:             6
 vars:             2
 size:           102
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
id              byte    %8.0g                 Id
firstname       str16   %16s                  FirstName
--------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.
.
end of do-file
.
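Until the import commands handle the BOM themselves, one possible workaround is to strip it from a copy of the file before importing. The following is a minimal sketch in Python, run outside Stata; the function name strip_utf8_bom, the file names, and the sample row are all made up for illustration. It removes only a leading {ef bb bf} sequence and leaves every other byte, including the Cyrillic text, untouched.

```python
import codecs
import tempfile
from pathlib import Path


def strip_utf8_bom(src, dst) -> bool:
    """Copy src to dst without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False otherwise.
    """
    data = Path(src).read_bytes()
    had_bom = data.startswith(codecs.BOM_UTF8)
    if had_bom:
        data = data[len(codecs.BOM_UTF8):]
    Path(dst).write_bytes(data)
    return had_bom


# Demonstration on a throwaway file with a made-up sample row.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "with_bom.txt"
    dst = Path(tmp) / "no_bom.txt"
    src.write_bytes(codecs.BOM_UTF8 + "Id\tFirstName\n1\tТарас\n".encode("utf-8"))
    removed = strip_utf8_bom(src, dst)
    cleaned = dst.read_bytes()

print(removed)       # True
print(cleaned[:2])   # b'Id'
```

The cleaned copy can then be passed to insheet or import delimited. Whether this removes the mixed-case variable name under insheet has not been verified here.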