  • Help with Unicode -- Vietnamese Characters

    Hello,

    I am working in Stata 15 with a data set containing Vietnamese characters, which appear as empty boxes in the Data Editor and as diamonds with question marks in the variable name/label pane. Long story short, I tried to translate my data file to Unicode, but what I did ended up replacing the missing Vietnamese characters with lots of random characters (for instance, the Greek mu symbol, the copyright symbol, etc.), and I have no idea why. I tried the following approach:

    Code:
    unicode analyze Household_12.dta, redo
    unicode encoding set ibm-1258_P100-1997
    unicode translate Household_12.dta, ignore(mark)
    I then load and view the Household_12.dta file and see that the missing Vietnamese characters have been replaced with random characters. I delete the data set and re-download it so that I have the original version again. However, when I type

    Code:
    unicode analyze Household_12.dta
    I get the following message:

    File summary (before starting):
    1 file(s) specified
    1 file(s) already translated in previous runs
    0 file(s) to be examined ...
    (nothing to do)
    Does anyone have any insights on how to fix these problems? Thank you so much!
    Last edited by Saba Khan; 10 Dec 2019, 10:07.
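
    As an aside, the particular substitutions described above (Greek mu, copyright sign, etc.) are typical of bytes being decoded under the wrong code page. A minimal sketch in Python rather than Stata, purely to illustrate the mechanism:

```python
# Legacy code pages map the byte range 0x80-0xFF differently.
# The single byte 0xF0 is the Vietnamese letter "đ" in Windows-1258
# (cp1258) but "ð" in Windows-1252, so translating with the wrong
# encoding silently substitutes plausible-looking wrong characters.
raw = b"\xf0"

print(raw.decode("cp1258"))  # đ  (correct for Vietnamese data)
print(raw.decode("cp1252"))  # ð  (mojibake under the wrong code page)

# Conversely, if the file is already UTF-8 and is "translated" as if it
# were a legacy code page, each multi-byte character becomes two junk
# characters ("đ" is the UTF-8 byte pair 0xC4 0x91):
print("đ".encode("utf-8").decode("cp1252"))  # Ä‘
```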

  • #2
    Short answer: there will be a folder named bak.stunicode in your working directory. Delete this folder; it preserves the history (and original data) of -unicode- commands.
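
    In case scripting this is useful, deleting the backup folder can also be done from outside Stata. A hedged sketch in Python (not Stata); the folder name bak.stunicode is the one mentioned above:

```python
import os
import shutil

# Remove Stata's unicode-translate backup folder so that
# -unicode analyze- treats the .dta files as untranslated again.
backup_dir = "bak.stunicode"
if os.path.isdir(backup_dir):
    shutil.rmtree(backup_dir)  # deletes the folder and the saved originals

print(os.path.isdir(backup_dir))  # False once the backups are gone
```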



    • #3
      JeongHoon Min Thank you very much for the advice about deleting the bak.stunicode folder.

      However, my translations are still gibberish. Does anyone know what is going wrong? I tried both of the encodings for Vietnamese listed in -help encodings-.
      (Also, there is a typo in my original post -- I meant invalid rather than ignore in the first code section.)

      Thank you!



      • #4
        Saba Khan To figure out a Unicode problem, I think we need an example dataset. But my guess is that you set the wrong encoding; try cp1258 or one of its aliases:
        Code:
        unicode encoding set cp1258
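
        If an example dataset cannot be shared, one quick sanity check outside Stata is to decode a few raw bytes from the strings under each candidate code page and see which yields readable Vietnamese. A sketch in Python (not Stata); the two sample bytes here are made up for illustration:

```python
# Decode the same raw bytes under each candidate code page; the right
# encoding should yield readable Vietnamese, the wrong ones mojibake.
# 0xD0 0xF0 is "Đđ" in cp1258 but "Ðð" in cp1252 and Latin-1.
sample = b"\xd0\xf0"  # hypothetical bytes standing in for file content

for enc in ("cp1258", "cp1252", "latin-1"):
    try:
        print(f"{enc:8} -> {sample.decode(enc)}")
    except UnicodeDecodeError:
        print(f"{enc:8} -> not decodable under this code page")
```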



        • #5
          JeongHoon Min Thank you again for the suggestion. Unfortunately, it led to the same incorrect results as the previous encodings I tried.



          • #6
            JeongHoon Min Actually, I think the gibberish letters are not a mistake on my part but simply how the data are encoded. I was looking at a PDF manual for a different data set from the same organization, and the names of the provinces in that manual show exactly the same errors as the data I translated. Thank you again for all your help; I very much appreciate it.
