Illegal characters in Stata variable names

Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#1

07 Aug 2014, 23:28

The Stata manual is clear about what a variable name may be; see section 11.3 here:
http://www.stata.com/manuals13/u11.pdf
If you don't want to read the rather lengthy post below, just make sure you always follow the rules in section 11.3 of the manual.

Some other manuals reduce the definition or try to rephrase it in their own words. For example, here is the SAS interpretation (can you spot two mistakes?):
http://support.sas.com/documentation...a003103776.htm
Which evidently are still not fixed in 9.3 docs:
http://support.sas.com/documentation...9b9qsz4sp2.htm
Rephrasing like this is not a good idea. Why not link to the source?

Occasionally a dataset pops up with illegal variable names. A few user-written commands, including my own usespss (this bug was fixed in the Nov 2012 version), were to blame for retaining some disallowed characters in variable names.
Other software may be guilty of that too:
http://www.stata.com/statalist/archi.../msg00124.html

However, Stata itself seems to fail to validate variable names.

I believe the following code should stop with an error 198 at the first generate, not the last one:

Code:

version 13.0
clear all
sysuse auto
generate F`=char(186)'=32
generate C`=char(186)'=0
generate `=char(186)'F=99

186 could be preparation for unicode, but

Code:

generate K`=char(13)'=1

should never be valid.

The problem is, of course, not with generate. Other commands are equally affected (for example, egen and recode). This suggests that the most likely culprit is the syntax command.

Interestingly, older versions of Stata behaved differently. For example, Stata 5 used to strip the illegal characters from the variable name, and thus never created such malformed names. I believe the bug might be related to the total overhaul of the rename command, which happened somewhere around version 12:
http://www.stata.com/statalist/archi.../msg01146.html
or earlier changes, since another similar bug (allowed spaces) was discovered by Roger Newson and confirmed by Bill Gould in 2009:
http://www.stata.com/statalist/archi.../msg00016.html
(Or is the new bug a result of the fix to this one?)

I discovered this while debugging code that is supposed to convert foreign variable names to valid Stata variable names, and which stumbled upon the following cases (all the strings come from a file on disk and are simulated here with char()):

Code:

. mata strtoname(".N")
  _N

. mata strtoname("F`=char(186)'")
  Fº

. mata st_isname("F`=char(186)'")
  1

My expectation is that whatever st_isname() confirms, and whatever strtoname() returns, should be a valid Stata variable name that generate accepts.

The manual entries for st_isname() and strtoname() use the term "Stata name" and never mention "Stata variable name" directly; I am inferring that this is what they mean. If the same Mata functions are used internally by syntax, this could explain the above behavior.

IMHO: strtoname() and st_isname() should both be aware of
1) illegal characters for Stata variable names;
2) blacklisted (reserved) names: byte, long, etc., as shown in the manual.
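
A minimal sketch of what such a stricter check might look like, assuming the ASCII-only naming rules and the reserved-name list as documented in [U] 11.3 for Stata 13. The function name strict_isvarname() is hypothetical and not part of Stata; it only illustrates the two points above and is not how st_isname() is implemented.

```stata
mata:
// Hypothetical helper (not a built-in): stricter than st_isname().
// Assumes the Stata 13 rules: 1-32 characters, only A-Z, a-z, 0-9 and _,
// first character not a digit, and not a reserved name.
real scalar strict_isvarname(string scalar s)
{
    string rowvector reserved

    // Reserved names as listed in [U] 11.3 (str# is handled separately)
    reserved = ("_all", "_b", "byte", "_coef", "_cons", "double", "float",
                "if", "in", "int", "long", "_n", "_N", "_pi", "_pred",
                "_rc", "_skip", "strL", "using", "with")

    // Length must be between 1 and 32 characters
    if (strlen(s) < 1 | strlen(s) > 32) return(0)
    // Any character outside letters, digits and underscore is illegal
    if (regexm(s, "[^A-Za-z0-9_]")) return(0)
    // The first character must not be a digit
    if (regexm(s, "^[0-9]")) return(0)
    // str1, str2, ... are reserved storage types
    if (regexm(s, "^str[0-9]+$")) return(0)
    // Finally, reject the fixed list of reserved names
    return(!anyof(reserved, s))
}
end
```

With this definition, strict_isvarname("price") is accepted, while a name built as "F" + char(186), or the reserved word "byte", is rejected. Under Unicode-aware Stata the character test would need to be relaxed, since non-ASCII letters became legal there.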

Best, Sergiy Radyakin
Tags: bug, data, variable name

Rebecca Water

Join Date: Sep 2018

Posts: 44
#2

13 Dec 2018, 21:05

What happens if I give a variable the same name as a command, e.g. "save"?
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#3

14 Dec 2018, 08:05

Nothing bad happens. Try it:

Code:

clear
set obs 1
generate save=1
list

     +------+
     | save |
     |------|
  1. |    1 |
     +------+

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

14 Dec 2018, 08:52

And the issues described in #1 still seem to be present in the current version:

Code:

clear
set obs 1

foreach c of numlist 1/31 127 {
    mata assert(st_isname("_`=char(`c')'_`c'"))
    gen _`=char(`c')'_`c' = 1
}

des

Code:

Contains data
  obs:             1                          
 vars:            32                          
 size:           128                          
-------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------------------------------
__1            float   %9.0g                 
__2            float   %9.0g                 
__3            float   %9.0g                 
__4            float   %9.0g                 
__5            float   %9.0g                 
__6            float   %9.0g                 
__7            float   %9.0g                 
__8            float   %9.0g                 
_ _9            float   %9.0g                 
_
_10           float   %9.0g                 
__11           float   %9.0g                 
__12           float   %9.0g                 
_ _13           float   %9.0g                 
__14           float   %9.0g                 
__15           float   %9.0g                 
__16           float   %9.0g                 
__17           float   %9.0g                 
__18           float   %9.0g                 
_ _19  float   %9.0g                 
__20           float   %9.0g                 
__21           float   %9.0g                 
__22           float   %9.0g                 
__23           float   %9.0g                 
__24           float   %9.0g                 
__25           float   %9.0g                 
__26           float   %9.0g                 
__27           float   %9.0g                 
__28           float   %9.0g                 
__29           float   %9.0g                 
__30           float   %9.0g                 
__31           float   %9.0g                 
__127          float   %9.0g
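
To locate such names in a dataset that is already in memory, something along the following lines might help. This is a sketch only: it goes through Mata's st_varname() rather than a varlist, so control characters in the names do not interfere with parsing, and it assumes the ASCII-only rules, so legitimate Unicode names in Stata 14 or later would be reported as well. The variable bad is just an illustrative counter.

```stata
mata:
// Sketch: count and report variables whose names contain anything
// other than A-Z, a-z, 0-9 and underscore.
bad = 0
for (i = 1; i <= st_nvar(); i++) {
    if (regexm(st_varname(i), "[^A-Za-z0-9_]")) {
        // Print the position only; the name itself may not display cleanly
        printf("variable #%g has a suspicious name\n", i)
        bad++
    }
}
printf("%g suspicious name(s) found\n", bad)
end
```

Reporting the position rather than the name is deliberate, because names containing tabs or line breaks garble the output, as the describe listing above shows.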


Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#5

14 Dec 2018, 17:25

Bjarte Aagnes, note the statement version 13.0 at the top of my code. This issue predates the Unicode versions of Stata. With Unicode, the naming rules have changed, and you can now have variables named in Russian or Japanese. But for Stata 13.0 I believe that should have been illegal. System (control) characters (codes 0-31) should probably also be excluded from names regardless of version. And if you ask me, I'd say that only things you put in double quotes should be permitted to contain Unicode content, while all the syntax elements (commands, variable and macro names) should be English only. But other languages do it like Stata, so what do I know.
Have a good weekend!
