Stata 14, Unicode, and extended ASCII.

Svend Juul

Join Date: Apr 2014

Posts: 515
#1

14 Apr 2015, 03:31

Stata 14 is a big step forward. But there is a problem with the switch to Unicode for languages that so far used extended ASCII for some characters (German, French, Spanish, Portuguese, Danish, Norwegian, Swedish, Icelandic, Polish, Turkish, etc.). With the improved saveold command, Stata 14 can generate a dataset which can be opened by Stata 11 to 13, but legibility is poor. This is described in the Unicode help. unicode translate lets you translate from extended ASCII to Unicode, but the reverse is not possible.

At least, this is how I understand the possibilities. The problem will arise when we cooperate with or teach persons who don't have Stata 14. Does anybody have a solution?

Actually, Stat/Transfer 13 does the trick; it can translate a Stata 14 dataset to Stata 13, including translation to extended ASCII. To my mind this means that it is hardly impossible to expand the capability of unicode translate or saveold to make some "reverse translation".

Joseph Coveney

Join Date: Apr 2014

Posts: 4542
#2

14 Apr 2015, 07:22

I recall reading somewhere that Microsoft Excel stores all of its string data in Unicode, even when it's ASCII or ANSI. If my recollection isn't off-base, then you might be able to try export excel from Stata 14 of the extended ASCII (ANSI) text, and then import excel in Stata 13, or odbc for the two earlier Stata releases.

Even if it does work, I realize that this isn't exactly what you're looking for, but it would offer a Stata-only solution to the problem. For production use, of course, I'd stick with Stat/Transfer 13.
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1884
#3

14 Apr 2015, 10:02

Dear Svend, could you please share an example dataset, which illustrates the problem? Thank you, Sergiy

Svend Juul

Join Date: Apr 2014
Posts: 515

14 Apr 2015, 11:25

Dear Sergiy,

In Stata 14 I generated a small dataset with these commands:

Code:

clear
input a b xø str5 string
1 1 1 "mænd"
2 2 2 "møer"
end
label variable a "Danish characters æøåÆØÅ"
label define b 1 "MÆND" 2 "MØER"
label values b b
numlabel , add
notes: Saved by Stata 14
save x14.dta , replace

In Stata 14 it displays all right:

Code:

. codebook , compact
Variable   Obs Unique  Mean  Min  Max  Label
-----------------------------------------------------------------------------------------
a            2      2   1.5    1    2  Danish characters æøåÆØÅ     
b            2      2   1.5    1    2 
xø           2      2   1.5    1    2 
string       2      2     .    .    . 
-----------------------------------------------------------------------------------------
. list , clean
       a         b   xø   string 
  1.   1   1. MÆND    1     mænd 
  2.   2   2. MØER    2     møer

But when I try to open the dataset in Stata 13, it is not possible:

Code:

. use "x14.dta", clear
.dta too modern
    File C:\abc\x14.dta is from a more recent version of Stata.  Type update query to determine whether a free
    update of Stata is available, and browse http://www.stata.com/ to determine if a new version is available.

What I miss is the opportunity to translate back from Unicode to extended ASCII. It ought to be possible. It must be possible.

Svend


Svend Juul

Join Date: Apr 2014
Posts: 515

14 Apr 2015, 11:33

The above is incomplete. In Stata 14 I also used the saveold command to generate a version readable by Stata 13:

Code:

. saveold x14a.dta , version(13)
(saving in Stata 13 format)
  note: variable name "xø" contains unicode and thus may not display well in Stata 13.
  note: variable label "Danish characters æøåÆØÅ" contains unicode and thus may not
        display well in Stata 13.
file x14a.dta saved

Opening it with Stata 13 gives this unsatisfactory result:

Code:

. use x14a.dta
. codebook, compact
Variable   Obs Unique  Mean  Min  Max  Label
-------------------------------------------------------------------------------------------------------------------
a            2      2   1.5    1    2  Danish characters Ã¦Ã¸Ã¥Ã†Ã˜Ã…
b            2      2   1.5    1    2 
xÃ¸          2      2   1.5    1    2 
string       2      2     .    .    . 
-------------------------------------------------------------------------------------------------------------------
. list , clean
       a          b   xÃ¸   string 
  1.   1   1. MÃ†ND     1    mÃ¦nd 
  2.   2   2. MÃ˜ER     2    mÃ¸er
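The garbled characters are consistent with Stata 13 reading each byte of a two-byte UTF-8 character as a separate extended ASCII character. A minimal sketch for Stata 14 that reproduces the effect on purpose; the encoding name "windows-1252" is an assumption about the machine running Stata 13, not something stated in this thread:

```stata
* Sketch only, for Stata 14 or later.
* "windows-1252" is an assumed encoding for the Stata 13 machine.
local utf8 "æøåÆØÅ"

* Deliberately misread the UTF-8 bytes as windows-1252, one byte per character,
* and compare the result with the variable label in the Stata 13 listing above.
display ustrfrom("`utf8'", "windows-1252", 1)
```

If the output differs from the listing, the Stata 13 machine is probably using another extended ASCII encoding.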


Alan Riley (StataCorp)

StataCorp Employee

Join Date: Mar 2014

Posts: 170
#6

14 Apr 2015, 14:34

Originally posted by Svend Juul

Stata 14 is a big step forward. But there is a problem with the switch to Unicode for languages that so far used extended ASCII for some characters (German, French, Spanish, Portuguese, Danish, Norwegian, Swedish, Icelandic, Polish, Turkish, etc.). With the improved saveold command, Stata 14 can generate a dataset which can be opened by Stata 11 to 13, but legibility is poor. This is described in the Unicode help. unicode translate lets you translate from extended ASCII to Unicode, but the reverse is not possible.

At least, this is how I understand the possibilities. The problem will arise when we cooperate with or teach persons who don't have Stata 14. Does anybody have a solution?

Actually, Stat/Transfer 13 does the trick; it can translate a Stata 14 dataset to Stata 13, including translation to extended ASCII. To my mind this means that it is hardly impossible to expand the capability of unicode translate or saveold to make some "reverse translation".

It is certainly possible to write a command similar to unicode translate which would convert all strings/labels/names in a Stata 14 dataset back to some extended ASCII encoding. To do it "right", however, as we feel we did with unicode translate for the extended-ASCII-to-Unicode conversion, is a bit tricky. For example, a dataset containing Unicode in variable names might use characters which aren't certain to appear in the desired target extended ASCII encoding. When those characters are then dropped or substituted with a replacement character, two or more variable names that were previously distinct could end up becoming duplicates, which is not allowed.

Because there isn't an official solution at this moment, let me share a little bit of code that you may find useful. First though, be sure to make a copy of any dataset you intend to use this on! In particular, you don't want to accidentally save over your nice, new, Unicode Stata 14 dataset with a dataset that has been back-converted to extended ASCII. So, I recommend starting with something like

Code:

copy myfile.dta myfile_ext.dta
use myfile_ext.dta

so that you are working on a copy of your original dataset.

The first thing you need to do is determine the target extended ASCII encoding. help encodings can help with that. I don't know what target encoding you need, but let's use "ISO-8859-10" for this example. Let's store the encoding in a global macro just so we can easily use it later (and easily change it if it turns out not to be the right encoding):

Code:

global ENCODING "ISO-8859-10"
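Before converting anything, it may help to check that the chosen encoding can represent the characters you care about. The following is a minimal sketch, not part of any official workflow: the sample string is an assumption taken from the Danish example earlier in the thread, and the round trip simply converts it to the target encoding and back.

```stata
* Sketch only: check a candidate target encoding before converting a dataset.
global ENCODING "ISO-8859-10"

* List the encodings whose names match a pattern
unicode encoding list *8859*

* Round trip: UTF-8 -> target encoding -> UTF-8.
* Characters missing from the target encoding are substituted (mode 1),
* so the string no longer matches the original after the round trip.
local sample "æøåÆØÅ"
local back = ustrfrom(ustrto("`sample'", "$ENCODING", 1), "$ENCODING", 1)
display cond("`back'" == "`sample'", "round trip OK", "some characters were substituted")
```

Replace the sample string with characters that actually occur in your own data; a clean round trip on a short sample does not guarantee that every string in the dataset converts without substitution.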

Let's start by converting something simple -- variable labels. We can use the ustrto() function to convert from Unicode (UTF-8) to our desired encoding. Other than the encoding to use, we must also decide how we want to deal with characters which for some reason can't be converted to the destination extended ASCII encoding. I'll specify "1" as the third argument to ustrto which means that invalid sequences will use a substitution character defined by the particular encoding we are translating to. I loop over all variables, grabbing the variable label of each one, converting that variable label, and reassigning it:

Code:

global ENCODING "ISO-8859-10"
foreach var of varlist _all {
    local thelab : variable label `var'
    local thelab = ustrto(`"`thelab'"', "$ENCODING", 1)
    label variable `var' `"`thelab'"'
}

Next, let's worry about string variables in your data:

Code:

foreach var of varlist _all {
    capture confirm string variable `var'
    if _rc==0 {
        replace `var' = ustrto(`var', "$ENCODING", 1)
    }
}

There's one potential problem with the above loop. You might have strL variables, and if you do, some of their values might be binary. If, say, you read a PDF file into one observation of a strL variable, you wouldn't want to run it through ustrto() as that would corrupt it. Stata has an undocumented function _strisbinary() which can be used to detect and skip strL values which have been marked as binary. Let's incorporate it into the above loop:

Code:

foreach var of varlist _all {
    capture confirm string variable `var'
    if _rc==0 {
        replace `var' = ustrto(`var', "$ENCODING", 1) if !_strisbinary(`var')
    }
}

Finally, here's a loop to convert the variable names:

Code:

foreach var of varlist _all {
    * use the global so that a change of encoding also applies to variable names
    local newname = ustrto("`var'", "$ENCODING", 1)
    rename `var' `newname'
}

I shouldn't have said "finally". I didn't provide code for things like characteristic contents, characteristic names, the dataset label, or value label values or names. The last one is the trickiest thing to handle because if a value label name changes, you not only have to modify the value label, you also have to find every variable to which it is attached and re-attach the new name. These are all things a hypothetical official command to translate from UTF-8 to extended ASCII would need to deal with.
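Two of the remaining items, the dataset label and the text of value labels, can be sketched along the same lines. This is a hedged illustration rather than an official solution: it assumes integer-valued labels, skips labels defined only for extended missing values, leaves value label names and characteristics untouched, and should be run on a copy of the dataset.

```stata
* Sketch only: run on a copy of the dataset.
global ENCODING "ISO-8859-10"

* Dataset label
local dlab : data label
local dlab = ustrto(`"`dlab'"', "$ENCODING", 1)
label data `"`dlab'"'

* Value label contents (value label names are left unchanged)
quietly label dir
local lblnames `r(names)'
foreach lbl of local lblnames {
    quietly label list `lbl'
    * skip labels that have no nonmissing values
    if missing(r(min)) continue
    local min = r(min)
    local max = r(max)
    forvalues v = `min'/`max' {
        * with strict, the result is empty when value `v' has no label
        local txt : label `lbl' `v', strict
        if `"`txt'"' != "" {
            local txt = ustrto(`"`txt'"', "$ENCODING", 1)
            label define `lbl' `v' `"`txt'"', modify
        }
    }
}
```

Because the label names are not renamed, variables stay attached to their value labels, which sidesteps the re-attachment problem described above at the cost of leaving any Unicode in the label names themselves.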
Svend Juul

Join Date: Apr 2014

Posts: 515
#7

16 Apr 2015, 02:09

Thank you, Alan, for a precise answer. I understand that converting value labels can be tricky. Nevertheless, I hope - I really do - that StataCorp will commit itself to solving the problem.

Chen Samulsion

Join Date: Jan 2018
Posts: 952

25 Nov 2018, 08:45

I found Alan Riley (StataCorp)'s code very helpful, although he didn't provide code for translating value labels. I have tried to write code for this purpose; it uses the -labelsof- command written by Ben Jann (SSC). My code is suitable for cases where the value label name coincides with the variable name. Further refinement is welcome.

Code:

clear
input a b xø str5 string
1 1 1 "mænd"
2 2 2 "møer"
end
label variable a "Danish characters æøåÆØÅ"
label define b 1 "MÆND" 2 "MØER"
label values b b
numlabel , add

global ENCODING "iso-8859_10-1998" /*Encoding for Danish*/

quietly label dir
local varname : value label `r(names)'
display "`varname'"
labelsof `varname' /*ssc install labelsof*/
display `"`r(labels)'"'
local word : word count `r(labels)'
forvalue n = 1/`word' {
   local labvalue : word `n' of `r(labels)'
   local newlabvalue = ustrto("`labvalue'", "$ENCODING", 1) 
   label define `varname' `n' "`newlabvalue'", modify
}
label values `varname' `"`varname'"'
label list _all


Chen Samulsion

Join Date: Jan 2018

Posts: 952
#9

25 Nov 2018, 15:18

Reinventing the wheel! People who are interested in these matters can find Svend Juul and Morten Frydenberg's -unicode2ascii- helpful. https://www.statalist.org/forums/for...ation-and-more

Originally posted by Svend Juul

This replaces a previous post about the non-functioning trans_unicode package.

Thanks to Kit Baum, the unicode2ascii package has been installed at SSC. It includes three commands that analyze or translate single files or groups of files in the current directory:

whichencoding examines the occurrence of Unicode and extended ASCII characters in Stata datasets and text files like do-files, ado-files, help files and log files. This is useful to determine the need for translation when sharing Stata files between users or computers with different versions of Stata installed. The official unicode analyze command serves the same purpose, but the output from whichencoding is more compact and transparent.

ascii2unicode translates datasets and text files with extended ASCII characters to Unicode encoding. Destination files take the names of the source files, and a suffix is added to the source file names. The official unicode translate command serves the same purpose, but the output from ascii2unicode is more compact and transparent, and you have access both to Unicode and ASCII versions of datasets and text files at the same time.

unicode2ascii translates datasets and text files with Unicode characters to ASCII encoding and saves datasets in Stata 13 or 12 format. Variable names, label names and contents (including labels in different languages), string variable contents, and notes are translated. The source files keep their names, and a suffix is added to the destination file names. Currently (September 2015), no official Stata command serves the same purpose.

Recently, Daniel Bela published two related commands at SSC: saveascii, which in Stata 14 translates the dataset in memory to ASCII encoding and saves it in Stata 13 or 12 format, and useold, which translates an ASCII encoded dataset to Unicode before opening it in Stata 14.

Svend Juul and Morten Frydenberg
Chen Samulsion

Join Date: Jan 2018

Posts: 952
#10

17 Sep 2019, 08:31

I just found that, when handling Chinese characters, -unicode2ascii- runs successfully in Stata 15 but fails in Stata 16.