same Chinese characters?

River Huang

Join Date: Mar 2016

Posts: 1908
#1

same Chinese characters?

12 Sep 2021, 18:55

Dear All, I have this data.

Code:

clear
input byte A str9 B
1 "里"
2 "里"
end

The two Chinese characters are supposed to be the same. However, when I list as follows:

Code:

. list if B=="里"

     +--------+
     | A    B |
     |--------|
  2. | 2   里 |
     +--------+

I see only the second observation, not the first. When I copy the Chinese character from the first observation and list

Code:

. list if B=="里"

     +--------+
     | A    B |
     |--------|
  1. | 1   里 |
     +--------+

I obtain only the first observation. So they appear to be two different characters, even though they look the same. Does anyone have any ideas? Thanks.
Note that, in the data example produced by dataex above, the two characters do not look quite the same: one is smaller and the other is bigger.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 344
#2

12 Sep 2021, 20:09

If you look at the hex code using

Code:

gen hex = ustrtohex(B)
list hex

     +--------+
     |    hex |
     |--------|
  1. | \uf9e9 |
  2. | \u91cc |
     +--------+

The first character is U+F9E9 (https://unicode.scarfboy.com/?s=U%2BF9E9); the second is U+91CC (https://unicode.scarfboy.com/?s=U%2B91cc).

They are different Unicode codepoints representing the same Chinese ideograph.
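
A minimal sketch of the same point, assuming a Unicode-aware Stata (14 or later). It builds the two code points explicitly with ustrunescape() rather than pasting the characters, so the result does not depend on the font or the clipboard. The variable names follow the example in #1; `same` is a name introduced here for illustration.

```stata
clear
set obs 2
gen byte A = _n

* Observation 1: U+F9E9, from the CJK Compatibility Ideographs block
gen str9 B = ustrunescape("\uf9e9") in 1
* Observation 2: U+91CC, the unified ideograph
replace B = ustrunescape("\u91cc") in 2

gen hex = ustrtohex(B)        // escaped code point(s) stored in each value
gen byte same = B == B[2]     // compares each value with observation 2
list A hex same
```

The two values render alike but compare as unequal, which is why `list if B=="..."` returns only the observation whose code point matches the pasted character.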
2 likes
Comment
River Huang

Join Date: Mar 2016

Posts: 1908
#3

12 Sep 2021, 20:35

Dear Hua, Thanks for the reply.

The problem is that I have a big data set (almost 1,000,000 observations) of addresses (for car accidents). For unknown reasons, addresses that are supposedly the same in Excel appear to be different in Stata (as you mentioned above).

I need to make sure they are identical so that I can do -collapse- or -merge- with another file (which probably has the same address problem). How can I do this? Checking every character of every address one by one is not feasible.

Do you have any feasible suggestions on this difficult problem? Thanks.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 344
#4

12 Sep 2021, 20:43

I do not believe they are the same in Excel to begin with. Do you only need to make a few known characters the same, or is it more of a needle-in-a-haystack situation?
Comment
River Huang

Join Date: Mar 2016

Posts: 1908
#5

12 Sep 2021, 21:17

Dear Hua, Well, in fact, the data are from my son (for his master's thesis). Unfortunately, I think it is more likely the latter case, i.e., more of a needle-in-a-haystack situation. I wonder if there is a method to make them the same in Stata, wherever they differ initially.

Last edited by River Huang; 12 Sep 2021, 21:25.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
River Huang

Join Date: Mar 2016

Posts: 1908
#6

12 Sep 2021, 22:12

Another example is

Code:

clear
input byte A str15 B
1 "里"
2 "里"
1 "無道路名稱"
2 "無道路名稱"
end

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4382
#7

12 Sep 2021, 23:19

Originally posted by River Huang:

I wonder if there is a method to make them the same in Stata, wherever they differ initially.

Can you get a list from somewhere on the Internet of Unicode codepoints that share the same Chinese ideograph? If so, then you can use it as a "cross-walk" table to harmonize the different values to a single chosen Unicode codepoint for each set that shares an ideograph. To do this, you'd use the various Stata string functions, -merge- and regular expression functions, perhaps in combination.
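
One possible shape for such a cross-walk, as a rough sketch only. The local macro `crosswalk` is a hypothetical stand-in for a real table: it holds pairs of hex code points ("from to"), and the single pair shown is the one identified in #2. A real list would have to come from a vetted source, and the names `crosswalk` and `B_fixed` are assumptions made for this illustration.

```stata
clear
set obs 2
gen byte A = _n
gen str15 B = ustrunescape("\uf9e9") in 1
replace B = ustrunescape("\u91cc") in 2

* Hypothetical cross-walk: pairs of hex code points, "from to".
* Only this one pair is taken from the thread; extend as needed.
local crosswalk "f9e9 91cc"

gen B_fixed = B
tokenize `crosswalk'
while "`1'" != "" {
    * Replace every occurrence of the "from" character with the "to" character
    replace B_fixed = usubinstr(B_fixed, ustrunescape("\u`1'"), ustrunescape("\u`2'"), .)
    macro shift 2
}

* After harmonizing, the two observations should hold identical strings
assert B_fixed[1] == B_fixed[2]
```

Looping over pairs with usubinstr() is fine for a short list; for a long table, a character-level -merge- as suggested above may scale better.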
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783
#8

13 Sep 2021, 02:20

Maybe ustrnormalize() can help:

Code:

clear
input byte A str15 B
1 "里"
2 "里"
1 "無道路名稱"
2 "無道路名稱"
end

gen C =  ustrnormalize(B,"nfc")

tab C

Code:

              C |      Freq.     Percent        Cum.
----------------+-----------------------------------
     無道路名稱   |          2       50.00       50.00
             里  |          2       50.00      100.00
----------------+-----------------------------------
          Total |          4      100.00
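
For the -merge- use case raised in #3, the same call could be applied to the key variable in both files before merging. The following is a sketch under assumptions: the variable names `address`, `district`, and `accidents` are hypothetical, and tiny in-memory data sets stand in for the real files.

```stata
* "Using" file: address stored as U+91CC
clear
set obs 1
gen str15 address = ustrunescape("\u91cc")
gen byte district = 1
replace address = ustrnormalize(address, "nfc")
tempfile lookup
save `lookup'

* "Master" file: same-looking address stored as U+F9E9
clear
set obs 1
gen str15 address = ustrunescape("\uf9e9")
gen byte accidents = 3
replace address = ustrnormalize(address, "nfc")

* With both keys normalized, the observations should match
merge 1:1 address using `lookup'
assert _merge == 3
```

Normalizing in only one of the two files would leave the keys unmatched whenever the other file carries the compatibility form, so the step is applied on both sides here.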

Comment

Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 344
#9

13 Sep 2021, 05:33

I agree with Bjarte: ustrnormalize() should be able to handle the CJK Compatibility Ideographs block.
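
Before overwriting a large variable, it may be worth checking how many values normalization would actually change. A small sketch, assuming the string variable is called B as in the examples above; `B_nfc`, `changed`, and `hex` are names introduced here.

```stata
clear
set obs 3
gen str15 B = ustrunescape("\uf9e9") in 1
replace B = ustrunescape("\u91cc") in 2
replace B = "abc" in 3

gen B_nfc = ustrnormalize(B, "nfc")
gen byte changed = B != B_nfc     // 1 where normalization altered the value

count if changed                  // how many observations are affected
gen hex = ustrtohex(B) if changed // code points of the affected values only
list B hex if changed
```

Listing the escaped code points of the affected values shows which characters were stored in a compatibility form, without inspecting addresses one by one.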
Comment
River Huang

Join Date: Mar 2016

Posts: 1908
#10

13 Sep 2021, 17:27

Dear Joseph, Thanks for this suggestion. I'll see what I can do.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
River Huang

Join Date: Mar 2016

Posts: 1908
#11

13 Sep 2021, 17:28

Dear Bjarte, I will definitely try this out. Thanks.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
River Huang

Join Date: Mar 2016

Posts: 1908
#12

15 Sep 2021, 20:45

Dear Bjarte, We tried your suggestion and it solved our problem perfectly. Thanks a lot.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
