Unconvertable characters when translating into Unicode

Federico Tedeschi

Join Date: Mar 2015

Posts: 137
#1

23 May 2016, 05:59

Dear Statalisters,

Today I was trying to solve a problem I have had since installing Stata 14 (I have Stata version 14.1, running on Windows 8): letters with acute and grave accents are usually displayed as little squares.

I tried "unicode encoding set UTF-8" followed by "unicode translate", and got the following error message:

"File not translated because it contains unconvertable characters;
you might need to specify a different encoding, but more likely you need to run unicode translate with the invalid option

File #filename.dta still needs translation

File summary:
all files not translated because they contain unconvertable characters;
you might need to specify a different encoding, but more likely you need to run unicode translate with the invalid option."

I then reran the last command with the suggested option ("unicode translate #filename.dta, invalid"), and what I got at the end was:

assertion is false
9
-------------------------------------------------------------------------------------------------------------------------------------
File successfully translated

File summary:
all files successfully translated

I then took a look at my dataset and saw the accented letters were translated as "%XE0".

Running "unicode translate" tells me which labels and variables contain such unconvertable characters, but how do I proceed from there? How can I make them convertible?
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#2

23 May 2016, 09:26

Your original dataset is not in UTF-8 encoding, hence the error message when you run unicode encoding set UTF-8 and then unicode translate.
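As a rough illustration outside Stata (a Python sketch, not Stata code), here is why the same bytes fail under one encoding and succeed under another. The sample word and the helper name try_decode are hypothetical; the byte 0xE0 is the one that shows up as "%XE0" in the forced translation.

```python
# Hypothetical sample: the word "città" stored as ISO-8859-1 (Latin-1) bytes.
# In Latin-1 the single byte 0xE0 is "à".
raw = b"citt\xe0"


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Read as UTF-8, 0xE0 announces a three-byte sequence that never arrives,
# so the bytes are invalid ("unconvertable").
as_utf8 = try_decode(raw, "utf-8")

# Read as ISO-8859-1, every byte maps to exactly one character.
as_latin1 = try_decode(raw, "iso-8859-1")

print(as_utf8)    # prints None
print(as_latin1)  # prints città
```

The point is only that the source encoding has to be named correctly before translating. Which encoding a given dataset actually uses still has to be checked by inspecting the translated result.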

Since you have already forced a translation with the "invalid" option, you have already lost information during translation. What you want to do first is restore the dataset to its original state using

Code:

unicode restore filename.dta

(assuming the file you translated is filename.dta).

After that, set the encoding to the one the original dataset is actually in. In your case it is most likely a Latin encoding, so try Latin-1 (ISO-8859-1) first.

Code:

unicode encoding set ISO-8859-1
unicode translate filename.dta

Then review the resulting dataset to see if it is correctly translated. If not, see -help encodings- for a list of possible encodings. Run -unicode restore- before trying a different encoding.

Also, read -help unicode_translate- carefully for an overview of the translation process.

Last edited by Hua Peng (StataCorp); 23 May 2016, 09:29.
Federico Tedeschi

Join Date: Mar 2015

Posts: 137
#3

24 May 2016, 02:26

Originally posted by Hua Peng (StataCorp)

What you want to do first is restore the dataset to its original state using

Code:

unicode restore filename.dta

(assuming the file you translated is filename.dta).

After that, set the encoding to the one the original dataset is actually in. In your case it is most likely a Latin encoding, so try Latin-1 (ISO-8859-1) first.

Code:

unicode encoding set ISO-8859-1
unicode translate filename.dta

Then review the resulting dataset to see if it is correctly translated.

Thank you very much: it worked!