Stata variable with more than 32 characters

Kristoffer Bjarkefur

Join Date: Feb 2016

Posts: 53
#1

27 May 2019, 15:16

I know that variable names in Stata are limited to 32 characters, but I am currently trying to replicate a segment of Stata code that someone else wrote. I do not know who wrote it, but it is Stata code. I cannot run all of the code, because in a few places the author used variable names that are 33 characters long.

I have modified the example below, but the lines of code I cannot run try to do this (the underscores are characters 10, 20 and 30):

Code:

gen ABCDEFGHI_ABCDEFGHI_ABCDEFGHI_ABC = 0

The code I found is documented as if the person was able to run it. Does anyone know of any context where this could have been allowed? Or any settings that can be used so that too long variable names are truncated and no error is thrown? Or anything similar?
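For anyone who wants to reproduce the problem, here is a minimal sketch rather than a fix. It assumes Stata 14 or later, where c(namelenchar) reports the maximum name length in characters, and it uses the placeholder name from above, not the real one.

```stata
* Sketch: check the name length and reproduce the error
clear
set obs 1

* Number of characters in the placeholder name: displays 33
display strlen("ABCDEFGHI_ABCDEFGHI_ABCDEFGHI_ABC")

* Maximum name length, in characters, that this Stata reports
display c(namelenchar)

* capture noisily shows the error message but lets a do-file continue
capture noisily generate ABCDEFGHI_ABCDEFGHI_ABCDEFGHI_ABC = 0
display "return code: " _rc
```

A nonzero return code on the last line means the generate command was rejected.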

The indentation used does not seem to be the regular do-file editor size, so I think that the person who wrote this used some other kind of code editor. Could that be a clue?

Thanks for any advice!

PS. I know I can make the code run by just shortening the name, but what I am trying to work out for my team right now is whether there is any chance that this code - exactly as it is - could have created the results we are trying to replicate.

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

27 May 2019, 16:07

Are you working with the actual Stata do-files the original author created? Or are you working from a printed version, or a word processing document containing the code?

In the case that you do not have your hands on the actual do-file, I can imagine that the variable name may originally have contained special characters encoded in ISO Latin 1, which were allowed in variable names, and that somewhere in the process these were transliterated to a two-character representation - for example ö to oe.

The output of help varname tells us

An invalid UTF-8 sequence is allowed in the variable name and is counted as one character. This is mainly for
backward compatibility reasons. For example, capital letter "E" with a grave accent is encoded as char(200) in
ISO-Latin-1 encoding, which may appear in variable names of older versions of Stata, but char(200) alone is an
invalid UTF-8 sequence. See [U] 12.4.2.6 Advice for users of Stata 13 and earlier for details.

which suggests that under versions 13 and earlier, variable names were not limited to the letters a-z and A-Z.

If someone more knowledgeable than me doesn't add more to my guesswork, you might want to ask the question of Stata Technical Services.

Bjarte Aagnes

Join Date: Apr 2014

Posts: 785
#3

28 May 2019, 04:17

Kristoffer,

What version of Stata are you using?
What version of Stata was used to write the do-file?
Could you show the variable name as is?

Current version is 15 and UTF-8 is used from version 14.

If a UTF-8 file is opened in Stata 13 without prior conversion to "ANSI", some UTF-8 characters such as "Æ" will be interpreted as two characters:

* Stata 14/15 UTF-8

Code:

gen _ÆÆÆ = 1

vs

* Stata 14/15 do-file opened in 13 without conversion to "ANSI"

Code:

gen _Ã†Ã†Ã† = 1

You can open the do-file in an editor like Notepad++ and change the encoding.
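The difference between characters and bytes can be checked directly. This is a sketch that assumes Stata 14 or later and a do-file saved as UTF-8: ustrlen() counts Unicode characters and strlen() counts bytes.

```stata
* Name from the example above: one underscore plus three "Æ"

* Number of Unicode characters: displays 4
display ustrlen("_ÆÆÆ")

* Number of bytes: displays 7, because "Æ" takes two bytes in UTF-8
display strlen("_ÆÆÆ")
```

An editor or Stata version that assumes a single-byte encoding shows each of those seven bytes as its own character, which is how a name within the limit could appear longer than it is.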

Last edited by Bjarte Aagnes; 28 May 2019, 04:23.

Nick Cox

Join Date: Mar 2014

Posts: 35754
#4

28 May 2019, 04:52

This is all a bit mysterious, although the implication is that you can't or don't want to contact the authors.

People so far are guessing at a subtle explanation hinging on the particular characters used.

My guess is mundane: the authors edited their code after the event so that variable names shown made more sense to readers, without checking that the code would still run. But does this really matter? So long as you translate consistently to legal variable names that will run, the procedure will remain the same.

William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

28 May 2019, 06:15

While I agree in principle with Nick's suggestion to translate variable names to shorter ones, I can imagine complications:

if the code makes use of variable abbreviations

if the variable name is built up from local macros

if the local macros used to build up the variable name are created by levelsof from the values of string variables

if the variable names are meaningful and characters 28-32 are subsequently used as "j" in a reshape long command
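Two of these complications can be sketched with hypothetical names. The macro contents and the income_* variables below are made up for illustration and are not from the code being replicated; the sketch also assumes the default setting in which variable abbreviations are allowed.

```stata
* Sketch with hypothetical names
clear
set obs 1

* A name assembled from local macros: its length depends on the macro contents
local stub "ABCDEFGHI_ABCDEFGHI_ABCDEFGHI"
local suffix "_ABC"

* Length of the assembled name: displays 33
display strlen("`stub'`suffix'")
capture noisily generate `stub'`suffix' = 0

* Variable abbreviations: "income" matches both variables, so it is ambiguous
generate income_2012 = 1
generate income_2013 = 2
capture noisily summarize income
```

In the first case the over-long name never appears literally in the do-file, so a search-and-replace would miss it. In the second, renaming or shortening one variable can change which variables an abbreviation elsewhere in the code matches.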

I've been where Kristoffer is - wanting to verify that the code I have indeed is code that can reproduce the results I am shown, before I set about trying to understand the code. Especially given that the attempt to reproduce is likely running a different version of Stata as a different user with perhaps different system settings.

Once you change the code, your changes to code you only partly understand become a potential source of any problems you may have. Not a comfortable position to be in.

Nick Cox

Join Date: Mar 2014

Posts: 35754
#6

28 May 2019, 06:49

William makes very good points. But in principle too the dataset has to be available for replication so some of these points can be investigated.

The best replication, however, is to write independent code and get the same results! If replication means repeating errors in the original, you are not much better off making the same mistakes.

William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

28 May 2019, 07:57

Nick and I may have different meanings of replication in mind. I agree with Nick's ideal definition of the best replication. But I think that's not what Kristoffer has been tasked with, not yet.

Kristoffer's profile tells us his employment is in a research organization I'm familiar with, having budgets, management, staff and staff turnover, and constraints on how staff spend their time. I imagine he has managers who have optimistically given him the (common) task of applying code written some years ago by someone no longer with the organization to a new set of data. And that code may well have not been intended - e.g. documented - for subsequent reuse. He has not (yet!) been given the task of - and budget for - rewriting the code from the original specifications and then applying the new code to the original data to validate the new code (or discover problems in the old code!). I imagine someone bright in his management had what seemed like a good idea - "hey, let's see what happens if we run this data through the code from the 2013 paper on a similar topic to see if that technique yields decent results" - and has given the task to Kristoffer.

Under similar circumstances, I have followed the path Kristoffer is following: trying to be sure the code I have been given, with no recourse to the original author, will reproduce the results from before with minimal intervention or understanding on my part, as management has assumed it would. What I learn from that will let management determine what the next step is, and budget sufficient time and labor to accomplish it. Or to decide that it is not worth the effort of following up on.

Kristoffer Bjarkefur

Join Date: Feb 2016

Posts: 53
#8

28 May 2019, 12:26

Thanks everyone for your input!

I give William 4 Sherlock Holmes scores out of 5 possible. Not too far off!

I have access to the do-files and there are no non-standard characters in the name used. It is not my code, and there is a tiny risk the result of this replication gets a little controversial so I do not want to give away the real variable name as it is quite revealing. But there is nothing non-standard with that line of code apart from the length of the variable name.

I am using Stata 15 but I have an old installation of Stata 13 I am also using when testing commands that I write. The error is the same standard error for invalid name.

I was also thinking that the code was edited after the last time it was used. If I had a way to know whether the variable names were the only thing edited after the code was used, then it would not be a big issue. I have some of the data sets this code generates, so I can test whether they end up being different, but I do not have the data set this exact segment of code is generating.

It is still very helpful for me that you have confirmed that there is no obvious setting or context where this would run. I will bring it up with my team and then see what we do next.

Thank you all!