Interpretation of format width in the context of Unicode value labels in Stata 14

Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#1


19 Jan 2017, 03:50

Stata specifies a display format for every variable in the dataset. This format affects how the values are displayed on the screen and, in some cases, in output files. If the variable is numeric and has value labels for some or all of its values, the display of these labels is also affected by the variable format. A user can adjust the default formats to fit a particular need with the format command.

The manual for the format command contains the following sentence:

For example, %9.2f specifies the f format that is nine characters wide and has two digits following the decimal point.

Before version 14 of Stata, "nine characters wide" meant literally 9 bytes wide. In the Unicode context of Stata 14 this is no longer the same.

Compare for example the following presentation in the output window and the browser of the same dataset:

Here all variables are formatted with format %20.0g which according to the manual should provide the capacity for 20 characters. However, only the content in the data browser window seems to be formatted consistently with the manual, while the content in the output window does not obey the same rule and formats the values according to the byte width (Cyrillic letters occupy 2 bytes in the UTF-8 Unicode character encoding).
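To make the byte-versus-character distinction concrete, here is a minimal Python sketch. The sample strings are illustrative assumptions, not the labels from the dataset in this post, and the code says nothing about how Stata itself measures width.

```python
def char_and_byte_width(text):
    """Return (characters, UTF-8 bytes) for a string.

    len() counts Unicode code points, which matches the informal notion of
    "characters" here as long as no combining marks are involved.
    """
    return len(text), len(text.encode("utf-8"))


# Illustrative samples: Latin, Cyrillic, and Chinese text.
for label in ["Moscow", "Москва", "北京"]:
    chars, nbytes = char_and_byte_width(label)
    print(label, chars, nbytes)
# Moscow 6 6
# Москва 6 12
# 北京 2 6
```

A width counted in bytes therefore fits only half as many Cyrillic letters as a width counted in characters, and a third as many Chinese characters.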

If the format width is doubled, the text fits nicely in the output window, but the spacing in the browser window becomes unnecessarily wide (and the situation will be worse for languages that use 3- and 4-byte Unicode characters):

I wonder, what are the recommendations for an external program saving a dataset in Stata's .dta format? Should it apply the byte-widest or character-widest format width?

In practice, what do users prefer most commonly: browse or list?

Is there any "fit" format, one that will expand the column just enough to fit the widest label/value for the purposes of the list/browse commands? (I suppose not, based on the description of the existing formats.)

Thank you, Sergiy Radyakin
Tags: format, unicode

Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#2

19 Jan 2017, 07:30

Sergiy,

We will look into the -list- with value label display issue and get it fixed in a future update.

For your question 1:

All Stata formats (the formats saved in datasets) are byte based. At the storage level, string data are treated as byte streams rather than character arrays.

In terms of display, the Stata Results window is similar to a Unix terminal: it has a fixed number of columns, which is controlled by the -linesize- setting. A linesize of 80 means 80 columns, and each column is guaranteed to fit one ASCII letter or digit (0-9a-zA-Z) or one ASCII punctuation character. In Stata 14, Chinese, Japanese, and Korean (CJK) characters occupy two display columns; other Unicode characters occupy one display column. See 12.4.2.2 Displaying Unicode characters in the User's Guide for a detailed discussion.

Best
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#3

19 Jan 2017, 09:59

Dear Hua,

Thank you for your quick reaction and for providing this clarification. As I understand it, after the fix the display/list commands will produce output similar to browse for multibyte single-column characters, correct?

This leaves the situation where characters from single-column languages (e.g. English, Russian, etc.) are used together with characters from double-column languages (e.g. Khmer, Chinese, etc.). (Readers may also want to look through the earlier thread.)

My issue is that I have to specify the formats of the variables when producing a dataset externally (the "%w.df" format). From your clarification I see that the value w is neither characters nor bytes, but output columns (which in the browser window are equivalent to characters). Hence the procedure that determines the maximum width w has to become aware of these exceptional languages that occupy two columns. As I understand further, the proper ranges of Unicode characters would have to be hardcoded, and each string examined character by character against those ranges to determine the width of each letter (one or two output columns), which I imagine is rather slow.

I also fail to see the exact rules of what Stata is doing when displaying the content in browse:

(Here the left column is formatted with the format shown in the screenshot and the right column with %12.0f; both have identical content and value labels.)

For example, according to both the pixel rules and the format width there is enough space to display the full content in cases 3 and 5, but an ellipsis is shown instead, indicating that Stata probably starts truncating the values earlier. If double-column counting is applicable only in the output window (not in the browse window), then I should see the full text of the value label in observation 6 when I specify %6.0f (at least this works for Latin characters), while in fact Stata keeps showing the ellipsis all the way until I specify %9.0f (which I think is equal to 3+3*2, that is, 3 columns for 3 Latin characters and 3*2 columns for 3 Chinese characters). So while the characters don't have visible spacing between them, it seems Stata keeps accounting for the double columns for truncation purposes. Weird.

All in all it gets too complicated very quickly, and I feel there is hardly a way to be perfect here. But the topic (imho) deserves an entry in the Stata blog.

Thank you, Sergiy