  • Illegal characters in Stata variable names

    The Stata manual is clear about what a variable name may be, see 11.3 here:
    http://www.stata.com/manuals13/u11.pdf
    If you don't want to read the rather lengthy post below, just make sure you always follow the rules in #11.3 of the manual.

    Some other manuals reduce the definition or try to rephrase it in their own words. For example, here is the SAS interpretation (can you spot two mistakes?):
    http://support.sas.com/documentation...a003103776.htm
    Which evidently are still not fixed in 9.3 docs:
    http://support.sas.com/documentation...9b9qsz4sp2.htm
    Rephrasing like this is not a good idea. Why not link to the source?


    Occasionally a dataset turns up with illegal variable names. A few user-written commands, including my own usespss (the bug was fixed in the Nov 2012 version), were to blame for retaining some disallowed characters in variable names.
    Other software may be guilty of that too:
    http://www.stata.com/statalist/archi.../msg00124.html
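    One way to catch such names early is to scan the dataset in memory right after an import. The following is only a minimal Mata sketch under stated assumptions, not an official tool: the function name suspect_varnames() is made up for illustration, and it applies the plain-ASCII reading of the rule (letters, digits, and underscores; no leading digit; at most 32 bytes). In a Unicode-aware Stata it would therefore also flag names that are legal there.

```stata
mata:
// Hypothetical helper (not part of Stata): returns the positions of the
// variables in memory whose names break the plain-ASCII naming rule.
real colvector suspect_varnames()
{
    real scalar    i, j, bad, ok
    real rowvector c
    real colvector idx

    idx = J(0, 1, .)
    for (i = 1; i <= st_nvar(); i++) {
        c   = ascii(st_varname(i))      // byte codes of the i-th name
        bad = (cols(c) > 32)            // longer than 32 bytes
        for (j = 1; j <= cols(c); j++) {
            ok = (c[j] >= 65 & c[j] <= 90) | (c[j] >= 97 & c[j] <= 122) | c[j] == 95
            ok = ok | (c[j] >= 48 & c[j] <= 57 & j > 1)   // digits, but not first
            if (!ok) bad = 1
        }
        if (bad) idx = idx \ i
    }
    return(idx)
}
end
```

    Calling mata: suspect_varnames() displays the positions of the flagged variables; an empty result means every name passed this check.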

    However, Stata itself seems to fail to validate variable names.

    I believe the following code should stop with an error 198 at the first generate, not the last one:
    Code:
    version 13.0
    clear all
    sysuse auto
    generate F`=char(186)'=32
    generate C`=char(186)'=0
    generate `=char(186)'F=99
    Accepting 186 could be preparation for Unicode, but
    Code:
    generate K`=char(13)'=1
    should never be valid.

    The problem is, of course, not with generate. Other commands are equally affected (for example, egen and recode). This makes the syntax command the most likely culprit.

    Interestingly, older versions of Stata behaved differently. For example, Stata 5 used to strip the illegal characters from the variable name, and thus never created such malformed names. I believe the bug might be related to the total overhaul of the rename command, which happened somewhere around version 12:
    http://www.stata.com/statalist/archi.../msg01146.html
    or earlier changes, since another similar bug (allowed spaces) was discovered by Roger Newson and confirmed by Bill Gould in 2009:
    http://www.stata.com/statalist/archi.../msg00016.html
    (Or is the new bug a result of the fix for this one?)


    The discovery came from debugging code that converts foreign variable names to valid Stata variable names, which stumbled upon the following cases (all strings come from a file on disk and are simulated here with char()):
    Code:
    . mata strtoname(".N")
      _N
    
    . mata strtoname("F`=char(186)'")
      Fº
    
    . mata st_isname("F`=char(186)'")
      1
    My expectation is that whatever st_isname() confirms and whatever strtoname() returns should be a valid Stata variable name that generate accepts.
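    To make that expectation concrete, here is a hedged sketch of a stricter conversion. The function strict_toname() is hypothetical (it is not part of Stata or Mata) and assumes the plain-ASCII rule: any byte outside A-Z, a-z, 0-9, and underscore becomes an underscore, a leading digit gets an underscore prefix, and the result is truncated to 32 bytes. It does not deal with reserved words or with two foreign names mapping to the same result.

```stata
mata:
// Hypothetical helper (not part of Stata): a stricter take on strtoname()
// that works on bytes, so control and high-bit characters never survive.
string scalar strict_toname(string scalar s)
{
    real rowvector c
    real scalar    j, ok

    c = ascii(s)                                  // byte codes of the input
    for (j = 1; j <= cols(c); j++) {
        ok = (c[j] >= 65 & c[j] <= 90) | (c[j] >= 97 & c[j] <= 122)
        ok = ok | (c[j] >= 48 & c[j] <= 57) | c[j] == 95
        if (!ok) c[j] = 95                        // replace with underscore
    }
    if (cols(c) == 0) return("_")                 // empty input
    if (c[1] >= 48 & c[1] <= 57) c = (95, c)      // no leading digit
    if (cols(c) > 32) c = c[1..32]                // at most 32 bytes
    return(char(c))
}
end
```

    With this sketch, strict_toname(".N") gives _N, the same as strtoname(), while strict_toname("F" + char(186)) gives F_ rather than keeping byte 186.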

    The manual entries for st_isname() and strtoname() use the term "Stata name" and never mention "Stata variable name" directly; I only infer that variable names are meant. If the syntax command uses the same Mata functions internally, that would explain the above behavior.

    IMHO: strtoname() and st_isname() should both be aware of
    1) illegal characters for Stata variable names;
    2) blacklisted (reserved) names: byte, long, etc as shown in the manual.
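    The two points above can be sketched as a stricter check. This is an illustration under assumptions, not how Stata does it internally: strict_isname() is a made-up name, the character rule is the plain-ASCII one, and the reserved-word list is my reading of 11.3 and should be verified against the manual.

```stata
mata:
// Hypothetical helper (not part of Stata): 1 if s passes a strict ASCII
// reading of the variable-name rules and is not reserved, 0 otherwise.
real scalar strict_isname(string scalar s)
{
    real rowvector   c
    real scalar      j, ok
    string rowvector reserved

    // Assumed reserved words (check against [U] 11.3 before relying on this).
    reserved = tokens("_all _b byte _coef _cons double float if in int long")
    reserved = reserved, tokens("_n _N _pi _pred _rc _skip strL using with")

    c = ascii(s)
    if (cols(c) < 1 | cols(c) > 32) return(0)     // 1 to 32 bytes
    if (c[1] >= 48 & c[1] <= 57) return(0)        // no leading digit
    for (j = 1; j <= cols(c); j++) {
        ok = (c[j] >= 65 & c[j] <= 90) | (c[j] >= 97 & c[j] <= 122)
        ok = ok | (c[j] >= 48 & c[j] <= 57) | c[j] == 95
        if (!ok) return(0)                        // illegal character
    }
    if (any(reserved :== s)) return(0)            // reserved word
    if (regexm(s, "^str[0-9]+$")) return(0)       // str# storage types
    return(1)
}
end
```

    Under these assumptions, strict_isname("byte") and strict_isname("K" + char(13)) return 0, while an ordinary name such as price returns 1. Command names are not on the reserved list, so a name like save also passes.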


    Best, Sergiy Radyakin

  • #2
    what happens if I call a variable like a command? E.g. "save" ?


    • #3
      Nothing bad happens. Try it:
      Code:
      clear
      set obs 1
      generate save=1
      list
      
           +------+
           | save |
           |------|
        1. |    1 |
           +------+


      • #4
        And the issue described in #1 still seems to be present in the current version:
        Code:
        clear
        set obs 1

        foreach c of numlist 1/31 127 {
            mata assert(st_isname("_`=char(`c')'_`c'"))
            gen _`=char(`c')'_`c' = 1
        }

        des
        Code:
        Contains data
          obs:             1                          
         vars:            32                          
         size:           128                          
        -------------------------------------------------------------------------------------------------------------------------------------
                      storage   display    value
        variable name   type    format     label      variable label
        -------------------------------------------------------------------------------------------------------------------------------------
        __1            float   %9.0g                 
        __2            float   %9.0g                 
        __3            float   %9.0g                 
        __4            float   %9.0g                 
        __5            float   %9.0g                 
        __6            float   %9.0g                 
        __7            float   %9.0g                 
        __8            float   %9.0g                 
        _ _9            float   %9.0g                 
        _
        _10           float   %9.0g                 
        __11           float   %9.0g                 
        __12           float   %9.0g                 
        _ _13           float   %9.0g                 
        __14           float   %9.0g                 
        __15           float   %9.0g                 
        __16           float   %9.0g                 
        __17           float   %9.0g                 
        __18           float   %9.0g                 
        _ _19  float   %9.0g                 
        __20           float   %9.0g                 
        __21           float   %9.0g                 
        __22           float   %9.0g                 
        __23           float   %9.0g                 
        __24           float   %9.0g                 
        __25           float   %9.0g                 
        __26           float   %9.0g                 
        __27           float   %9.0g                 
        __28           float   %9.0g                 
        __29           float   %9.0g                 
        __30           float   %9.0g                 
        __31           float   %9.0g                 
        __127          float   %9.0g


        • #5
          Bjarte Aagnes, note the statement version 13.0 at the top of my code. This issue predates the Unicode versions of Stata. With Unicode the naming rules have changed, and you can now have variables named in Russian or Japanese. But for Stata 13.0 I believe such names should have been illegal. Control characters (0-31) should probably also be excluded from names regardless of version. And if you ask me, I'd say that only things you put in double quotes should be permitted to contain Unicode content, while all the syntax elements (commands, variable names, and macro names) should be English only. But other languages do it like Stata, so what do I know.
          Have a good weekend!
