  • Unconvertable characters when translating into Unicode

    Dear Statalisters,

    Today I tried to solve a problem I have had since installing Stata 14 (I have Stata version 14.1, running on Windows 8): letters with acute and grave accents are usually displayed as little squares.

    I've tried with "unicode encoding set UTF-8" followed by "unicode translate", and I get the following error message:

    "File not translated because it contains unconvertable characters;
    you might need to specify a different encoding, but more likely you need to run unicode translate with the invalid option

    File #filename.dta still needs translation

    File summary:
    all files not translated because they contain unconvertable characters;
    you might need to specify a different encoding, but more likely you need to run unicode translate with the invalid option."


    I then reran the last command as suggested ("unicode translate #filename.dta, invalid"), and what I got at the end was:


    assertion is false
    9
    -------------------------------------------------------------------------------------------------------------------------------------
    File successfully translated

    File summary:
    all files successfully translated


    I then took a look at my dataset and saw the accented letters were translated as "%XE0".

    When I run "unicode translate", Stata tells me which labels and variables contain such unconvertable characters; however, how do I proceed from there? How can I make them convertible?




  • #2
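    Your original dataset is not in UTF-8 encoding, hence the error message when you run unicode encoding set UTF-8 followed by unicode translate. (unicode encoding set names the encoding the file is currently in, not the encoding it will be converted to.)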

    Because you already forced a translation with the "invalid" option, you have already lost information during translation. What you want to do first is restore the dataset to its original state using

    Code:
    unicode restore filename.dta
    (suppose the file you translated is filename.dta).

    After that, set the encoding to the one the original dataset is actually in. In your case it is most likely a Latin encoding, so try Latin 1 first

    Code:
    unicode encoding set ISO-8859-1
    unicode translate filename.dta
    Then review the resulting dataset to see whether it was translated correctly. If not, see -help encodings- for a list of possible encodings. Run -unicode restore- before trying a different encoding.
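
    The restore, set, translate, review cycle described above could be sketched roughly as follows. This is only an illustration under assumptions: filename.dta is the placeholder name used in this thread, the initial clear assumes nothing unsaved is in memory, and windows-1252 is just one plausible second candidate for a file created on Windows, not a confirmed diagnosis.

    ```stata
    * Sketch only: filename.dta is a placeholder, as elsewhere in this thread.
    * Start with no data in memory (this drops any unsaved data).
    clear

    * Undo the earlier forced translation so the original bytes are back.
    unicode restore filename.dta

    * Optional: report whether the file needs translation at all.
    unicode analyze filename.dta

    * Declare the encoding the file is currently in, then translate.
    unicode encoding set ISO-8859-1
    unicode translate filename.dta

    * Review the result: check variable labels, value labels, and strings.
    use filename.dta, clear
    describe
    label list

    * If accented letters still look wrong, restore and try another
    * encoding (windows-1252 is an assumed candidate; see -help encodings-).
    clear
    unicode restore filename.dta
    unicode encoding set windows-1252
    unicode translate filename.dta
    ```

    The "%XE0" seen after the forced translation fits this picture: byte E0 is "à" in both ISO-8859-1 and Windows-1252, but on its own it is not a valid UTF-8 sequence.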

    Also, read -help unicode_translate- carefully for an overview of the translation process.
    Last edited by Hua Peng (StataCorp); 23 May 2016, 09:29.



    • #3
      Originally posted by Hua Peng (StataCorp) View Post
      What you want to do first is restore the dataset to its original state using

      Code:
      unicode restore filename.dta
      (suppose the file you translated is filename.dta).

      After that, set the encoding to the one the original dataset is actually in. In your case it is most likely a Latin encoding, so try Latin 1 first

      Code:
      unicode encoding set ISO-8859-1
      unicode translate filename.dta
      Then review the resulting dataset to see if it is correctly translated.
      Thank you very much: it worked!
