Failed to unicode translate the chinese character string to stata14 format

Zhang_Lu

Join Date: Oct 2014

Posts: 155
#1

Failed to unicode translate the chinese character string to stata14 format

15 Aug 2017, 17:53

My original dataset contains a Chinese-character string variable in which some "exotic characters" exist, which means that some of the blank space around the strings cannot be trimmed away. Following http://www.stata.com/statalist/archi.../msg00891.html on Statalist, I have managed to identify and remove those invisible exotic characters (though I do not quite understand the underlying mechanism).

HTML Code:

. charlist city
 &'().01?ABCDEGHIJKLMNPQSTUWXYZabcdeghijklnopqrstuwxyz��
> ��
. ret li
macros:
    r(chars) : " &'().01?ABCDEGHIJKLMNPQSTUWXYZabcdeghijklnopqrs.."
    r(sepchars) : " & ' ( ) . 0 1 ? A B C D E G H I J K L M N P .."
    r(ascii) : "10 13 32 38 39 40 41 46 48 49 63 65 66 67 68 69 71.."

and

HTML Code:

replace city = subinstr(city, "`=char(10)'", "",.)
replace city = subinstr(city, "`=char(32)'", "",.)
replace city = subinstr(city, "`=char(161)'`=char(161)'", "",.)

However, when I try to convert this dataset (in Stata 13 format) to Stata 14 format using the unicode command, the contents of the string variable are replaced by little squares. Even if I keep the original variable without removing the exotic characters, I still end up with the same result, so I'm not 100% sure whether the problem is due to the encoding or to the exotic characters. For a sample of the dataset, see the attachment (in Stata 13 and earlier format).
Thank you.

The unicode translate is performed like this:

HTML Code:

cd E:\Land_Supply\Data\土地交易微观数据
clear
*unicode encoding set gb18030 // city names are in chinese
unicode analyze trans_citypanel2013.dta
unicode translate trans_citypanel2013.dta,invalid
u trans_citypanel2013,clear

Attached Files

citynames.dta (3.8 KB, 1 view)
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#2

16 Aug 2017, 05:36

Not 100% sure about the mechanics at work here, but I can manage to read in your file with a bit of trickery:

Code:

export delimited using "XYZ\test.csv", replace
clear
import delimited XYZ\test.csv, encoding(GBK)

So it's GBK encoding you want. I don't know whether there are other shortcuts available, other than exporting to CSV and importing again.

Also, some characters still appear funny. Maybe the file you attached already had some characters removed or encoded? If so, try again without those steps first.

Note: you can also select the text encoding in the import window when clicking File > Import > Text data (csv). That is not an option when importing from xls; I've no idea why. Perhaps someone else with a better understanding of encoding will be able to help clarify.

Last edited by Jorrit Gosens; 16 Aug 2017, 05:39.
Joseph Coveney

Join Date: Apr 2014

Posts: 4380
#3

16 Aug 2017, 07:46

I recommend that you go ahead and do the translation according to your do-file excerpt without commenting-out the unicode encoding set line. I don't get any square-like exotic characters when I convert without commenting out the Unicode encoding setting.
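
As a minimal sketch of what I mean, here is your excerpt with the encoding line active. It assumes the working directory is already the one that holds the file (your cd line is left out), and it is meant to be run from a do-file in Stata 14 or later:

```stata
* Start with no data in memory, as in the original excerpt
clear

* Declare the encoding of the Stata 13 file; this is the line that was commented out
unicode encoding set gb18030

* Report whether the file needs translation, then translate it
unicode analyze trans_citypanel2013.dta
unicode translate trans_citypanel2013.dta, invalid

* Load the translated dataset
use trans_citypanel2013, clear
```

The only change from your excerpt is that unicode encoding set is no longer commented out, so unicode translate is told which encoding to convert from.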

There is a Unicode white-space character in the data, which your ANSI-based subinstr() could not catch, but which you can remove after translation using ustrtrim(). Maybe that's what's causing your little square-like exotic characters?
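
A rough sketch of that clean-up step follows. It assumes that the translated dataset is already in memory and that the variable is named city, as in your charlist output. As far as I can tell, the byte pair 161 161 that you removed in Stata 13 is the GB-encoded full-width (ideographic) space, U+3000; the last line is one way to check that reading rather than something I am asserting about your data.

```stata
* Assumes the translated (UTF-8) dataset is in memory and the variable is named city
* ustrtrim() removes leading and trailing Unicode whitespace as well as ordinary blanks
replace city = ustrtrim(city)

* Optional check: show the escaped code points of the first observation
display ustrtohex(city[1])

* Optional check: decode the two bytes 161 161 as gb18030 and show the code point
display ustrtohex(ustrfrom(char(161) + char(161), "gb18030", 1))
```

Unlike the subinstr() calls on char(32), this leaves blanks inside the city names alone and touches only the ends of each string.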

I've attached the do-file (based upon your do-file excerpt) that I used, and its SMCL log file, along with what I believe is the successfully converted Stata dataset to this post. For comparison, I also converted the Stata 13 dataset to a Unicode (UTF-8) Stata 14 dataset using Stat/Transfer 13. (The automatically generated .stcmd file attached for reference as to the encodings used.) Both methods give the same result (see log file: the Stat/Transfer dataset is identified there as citynames_st.dta).

Take a look at the attached converted Stata dataset and see whether it's what you were hoping to arrive at.

If I'm misunderstanding what your problem is, then post back.

Note 1: it seems that the forum software forbids me to attach the Stata datasets (.dta) and Stat/Transfer command file (.stcmd) and so I zipped everything into a .zip file. The forum software also won't allow attaching a .zip file, and so I renamed it with a .txt file extension. After downloading, remove the appended .txt file extension to unzip it.

Note 2: there is an "assertion is false" message among the unicode translate command's output. Such a response is not mentioned in the command's online help file, and I do not know what it means.
Attached Files

citynames.zip.txt (3.8 KB, 1 view)