  • Help with Unicode -- Vietnamese Characters

    Hello,

    I am working in Stata 15 with a data set containing Vietnamese characters, which appear as empty boxes in the Data Editor and as diamonds with question marks in the variable name/label pane. Long story short, I tried to translate my data file to Unicode, but what I did ended up replacing the missing Vietnamese characters with lots of random characters (for instance, the Greek mu symbol, the copyright symbol, etc.), and I have no idea why. I tried the following approach:

    Code:
    unicode analyze Household_12.dta, redo
    unicode encoding set ibm-1258_P100-1997
    unicode translate Household_12.dta, ignore(mark)
    I then load and view the Household_12.dta file and see that the missing Vietnamese characters have been replaced with random characters. I delete the data set and re-download it so that I have the original version again. However, when I type

    Code:
    unicode analyze Household_12.dta
    I get the following message:

    File summary (before starting):
    1 file(s) specified
    1 file(s) already translated in previous runs
    0 file(s) to be examined ...
    (nothing to do)
    Does anyone have any insights on how to fix these problems? Thank you so much!
    Last edited by Saba Khan; 10 Dec 2019, 10:07.
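
    As an aside, the particular substitutions described above (Greek mu, copyright sign, etc.) are typical of bytes being decoded under the wrong code page. A minimal sketch in Python rather than Stata, purely to illustrate the mechanism:

```python
# Legacy code pages map the byte range 0x80-0xFF differently.
# The single byte 0xF0 is the Vietnamese letter "đ" in Windows-1258
# (cp1258) but "ð" in Windows-1252, so translating with the wrong
# encoding silently substitutes plausible-looking wrong characters.
raw = b"\xf0"

print(raw.decode("cp1258"))  # đ  (correct for Vietnamese data)
print(raw.decode("cp1252"))  # ð  (mojibake under the wrong code page)

# Conversely, if the file is already UTF-8 and is "translated" as if it
# were a legacy code page, each multi-byte character becomes two junk
# characters ("đ" is the UTF-8 byte pair 0xC4 0x91):
print("đ".encode("utf-8").decode("cp1252"))  # Ä‘
```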

  • #2
    Short answer: there will be a folder named bak.stunicode in your working directory. Delete this folder; it preserves the history (and original data) of -unicode- commands.
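
    In case scripting this is useful, deleting the backup folder can also be done from outside Stata. A hedged sketch in Python (not Stata); the folder name bak.stunicode is the one mentioned above:

```python
import os
import shutil

# Remove Stata's unicode-translate backup folder so that
# -unicode analyze- treats the .dta files as untranslated again.
backup_dir = "bak.stunicode"
if os.path.isdir(backup_dir):
    shutil.rmtree(backup_dir)  # deletes the folder and the saved originals

print(os.path.isdir(backup_dir))  # False once the backups are gone
```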



    • #3
      JeongHoon Min Thank you very much for the advice about deleting the bak.stunicode folder.

      However, my translations are still gibberish. Does anyone know what is going wrong? I tried both of the encodings for Vietnamese listed in -help encodings-.
      (Also, there is a typo in my original post -- I meant invalid rather than ignore in the first code section.)

      Thank you!



      • #4
        Saba Khan To figure out a Unicode problem, I think we need an example dataset. But my guess is that you set the wrong encoding; try cp1258 or one of its aliases:
        Code:
        unicode encoding set cp1258
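
        If an example dataset cannot be shared, one quick sanity check outside Stata is to decode a few raw bytes from the strings under each candidate code page and see which yields readable Vietnamese. A sketch in Python (not Stata); the two sample bytes here are made up for illustration:

```python
# Decode the same raw bytes under each candidate code page; the right
# encoding should yield readable Vietnamese, the wrong ones mojibake.
# 0xD0 0xF0 is "Đđ" in cp1258 but "Ðð" in cp1252 and Latin-1.
sample = b"\xd0\xf0"  # hypothetical bytes standing in for file content

for enc in ("cp1258", "cp1252", "latin-1"):
    try:
        print(f"{enc:8} -> {sample.decode(enc)}")
    except UnicodeDecodeError:
        print(f"{enc:8} -> not decodable under this code page")
```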



        • #5
          JeongHoon Min Thank you again for the suggestion. Unfortunately, it led to the same incorrect results as the previous encodings I tried.



          • #6
            JeongHoon Min Actually, I think the gibberish letters are not a mistake on my part but simply how the data are encoded. I was looking at a PDF manual for a different data set from the same organization, and the names of the provinces in that manual show exactly the same errors as the data I translated. Thank you again for all your help; I very much appreciate it.
