  • Failed to unicode translate a Chinese character string to Stata 14 format

    My original dataset contains a Chinese character string variable in which some "exotic characters" exist, which means that some of the blank space around the strings cannot be trimmed away. Following http://www.stata.com/statalist/archi.../msg00891.html on Statalist, I have managed to identify and remove those invisible exotic characters (though I do not quite understand the underlying mechanism).


    HTML Code:
    . charlist city
    
     &'().01?ABCDEGHIJKLMNPQSTUWXYZabcdeghijklnopqrstuwxyz��������������������������
    > ���������������������������������������������������������������������
    
    . ret li
    
    macros:
                  r(chars) : "
     &'().01?ABCDEGHIJKLMNPQSTUWXYZabcdeghijklnopqrs.."
               r(sepchars) : "
         & ' ( ) . 0 1 ? A B C D E G H I J K L M N P .."
                  r(ascii) : "10 13 32 38 39 40 41 46 48 49 63 65 66 67 68 69 71.."
    and
    HTML Code:
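    * remove line feeds (ASCII 10)
    replace city = subinstr(city, "`=char(10)'", "",.)
    * remove ordinary spaces (ASCII 32), including any inside the string
    replace city = subinstr(city, "`=char(32)'", "",.)
    * remove the byte pair 161 161 (0xA1 0xA1), the full-width space in GB2312/GBK
    replace city = subinstr(city, "`=char(161)'`=char(161)'", "",.)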
    However, when I convert this dataset (in Stata 13 format) to Stata 14 format using the unicode command, the contents of the string variable are replaced by little squares. Even if I keep the original variable without removing the exotic characters, I still end up with the same result, so I'm not 100% sure whether this is due to an encoding problem or to the exotic characters. For a sample of the dataset, see the attachment (in Stata 13 and below format).
    Thank you.

    The unicode translate is performed like this:
    HTML Code:
    cd E:\Land_Supply\Data\土地交易微观数据
    clear
    *unicode encoding set gb18030 // city names are in chinese
    unicode analyze trans_citypanel2013.dta
    unicode translate trans_citypanel2013.dta,invalid
    u trans_citypanel2013,clear

  • #2
    Not 100% sure about the mechanics at work here, but I can manage to read in your file with a bit of trickery:

    Code:
    export delimited using "XYZ\test.csv", replace
    clear
    import delimited XYZ\test.csv, encoding(GBK)
    So it's GBK encoding you want. I don't know whether there are other shortcuts available, other than exporting to CSV and importing again.

    Also, some characters still appear funny. Maybe the file you attached already had some characters removed or encoded? If so, try again without those steps first.


    Note: you can also select the text encoding in the import dialog when clicking File > Import > Text data (csv). This is not an option when importing from xls; I have no idea why. Perhaps someone else with a better understanding of encoding will be able to help clarify.
    Last edited by Jorrit Gosens; 16 Aug 2017, 05:39.



    • #3
      I recommend that you go ahead and do the translation according to your do-file excerpt without commenting-out the unicode encoding set line. I don't get any square-like exotic characters when I convert without commenting out the Unicode encoding setting.
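
      As a rough sketch (not the attached do-file itself), the sequence with the encoding line active might look like the following. It assumes that the working directory already contains trans_citypanel2013.dta in its original Stata 13 form and that gb18030 is the right source encoding, as in the excerpt above.

      Code:
      * assumption: current directory holds the untranslated Stata 13 file
      clear
      * tell Stata which legacy encoding the strings are stored in
      unicode encoding set gb18030
      * report what needs translating, then translate to UTF-8
      unicode analyze trans_citypanel2013.dta
      unicode translate trans_citypanel2013.dta, invalid
      use trans_citypanel2013, clear
      Both unicode analyze and unicode translate need an empty dataset in memory, hence the clear at the top.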

      There is a Unicode white-space character in the data, which your ANSI-based subinstr() could not catch, but which you can remove after translation using ustrtrim(). Maybe that's what's causing your little square-like exotic characters?
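
      A minimal sketch of that clean-up, to be run on the translated data in Stata 14. ustrtrim() only strips leading and trailing Unicode whitespace; the last line is an optional extra that assumes the character in question is the ideographic space U+3000 (which is what the byte pair 161 161 maps to in GBK) and removes it everywhere in the string, mirroring the original subinstr() call.

      Code:
      * strip leading and trailing Unicode whitespace from the translated variable
      replace city = ustrtrim(city)
      * assumption: the remaining whitespace is U+3000 (decimal 12288); remove all occurrences
      replace city = usubinstr(city, uchar(12288), "", .)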

      I've attached to this post the do-file (based upon your do-file excerpt) that I used and its SMCL log file, along with what I believe is the successfully converted Stata dataset. For comparison, I also converted the Stata 13 dataset to a Unicode (UTF-8) Stata 14 dataset using Stat/Transfer 13. (The automatically generated .stcmd file is attached for reference as to the encodings used.) Both methods give the same result (see the log file: the Stat/Transfer dataset is identified there as citynames_st.dta).

      Take a look at the attached converted Stata dataset and see whether it's what you were hoping to arrive at.

      If I'm misunderstanding what your problem is, then post back.

      Note 1: it seems that the forum software does not allow me to attach the Stata datasets (.dta) or the Stat/Transfer command file (.stcmd), and so I zipped everything into a .zip file. The forum software also won't allow attaching a .zip file, and so I renamed it with a .txt file extension. After downloading, remove the appended .txt file extension to unzip it.

      Note 2: there is an "assertion is false" message among the unicode translate command's output. Such a response is not mentioned in the command's online help file, and I do not know what it means.