Count the number of Chinese characters

River Huang

Join Date: Mar 2016
Posts: 1908

Count the number of Chinese characters

02 Jul 2022, 18:36

Dear All, I was asked to count the number of Chinese characters (and only those) in each observation of the following data:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str6 股票代码 strL 提问内容
"000001" "新开业的武汉分行,资金有多少?规模有多大?"                                                
"000001" "股指期货是不是有利于银行股低估的修正"                                                   
"000001" "请问深发展出让持有000693的股份，为什么没有发公告？"                                 
"000001" "深发展A（000001）七年才分了一次红,是否不能申请增发融资？"                         
"000001" "您好！请问：平安收购深发展股权后，深发展的管理层会发生大的变动吗？谢谢"
"000001" "平安既要控股，就应早出方案；说要增发增持，已几个月了，是否缺钱？"         
"000001" "贵公司受放款额度所限是否会影响2-4季度的业绩？"                                       
"000001" "请直接回答投资者的提问，不要忽悠，是否对2-4季度的业绩产生影响？"            
"000001" "贵公司向平安增发股价还是18.26元吗？是否有变动？"                                     
"000001" "什么时间复牌？"                                                                                    
end

Any suggestions? Thanks.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)


William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

02 Jul 2022, 19:20

This may start you in a useful direction. The regular expression below deletes all characters that are not part of the Han script, so the length of what is left is what you seek. These could be combined into a single command, but I wanted to display the intermediate results for checking.

Code:

. generate onlyHan = ustrregexra(提问内容,"[^\p{Han}]","")

. generate nc = ustrlen(onlyHan)

. list onlyHan nc, clean

       onlyHan   nc
  1.   新开业的武汉分行资金有多少规模有多大   18
  2.   股指期货是不是有利于银行股低估的修正   18
  3.   请问深发展出让持有的股份为什么没有发公告   20
  4.   深发展七年才分了一次红是否不能申请增发融资   21
  5.   您好请问平安收购深发展股权后深发展的管理层会发生大的变动吗谢谢   31
  6.   平安既要控股就应早出方案说要增发增持已几个月了是否缺钱   27
  7.   贵公司受放款额度所限是否会影响季度的业绩   20
  8.   请直接回答投资者的提问不要忽悠是否对季度的业绩产生影响   27
  9.   贵公司向平安增发股价还是元吗是否有变动   19
 10.   什么时间复牌   6

Added in edit: this is one of those times when a screenshot may be more helpful than a code block. This is the sort of thing, by the way, that the Unicode regular expression engine shines at: the ability to do a "wild card" match to any character in a particular Unicode script (here, Han). That is why you got simultaneous copies of the same answer.

Further edit:

Code:

generate onlyHan = ustrregexra(提问内容,"[^\p{Han}]|一","")

will remove the character in line 4 that doesn't look like the rest of them, as an example of how to exclude a few characters that are in the Han script. That's a vertical bar "or sign" followed by the character copied-and-pasted from the input text.
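The same alternation extends to several excluded characters by listing them in a character class. What follows is only a minimal sketch: the two strings are retyped from the example data so the block runs on its own, and the choice of 了 as a second excluded character and the variable name nc_excl are assumptions made for illustration.

```stata
* Sketch only: two strings retyped from the example data
clear
input strL 提问内容
"深发展A（000001）七年才分了一次红,是否不能申请增发融资？"
"什么时间复牌？"
end

* Delete every non-Han character, plus the Han characters 一 and 了,
* then count what is left
generate nc_excl = ustrlen(ustrregexra(提问内容, "[^\p{Han}]|[一了]", ""))
list nc_excl, clean
```

The first string has 21 Han characters, of which one is 一 and one is 了, so its count drops to 19. The second string contains neither character and stays at 6.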

One final edit: Hua Peng (StataCorp) demonstrates the advantages of a deep knowledge of regular expression syntax; his solution generalizes the one in my previous edit by requiring retained characters to both (a) be alphabetic and (b) be in the Han script.

Last edited by William Lisowski; 02 Jul 2022, 19:46.
2 likes
Comment

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2400
#3

02 Jul 2022, 19:21

Here's a shot at this to at least get you started in a useful direction. Note that I do not have knowledge of Chinese typography or languages.

Code:

clear
input str6 a strL b
"000001" "新开业的武汉分行,资金有多少?规模有多大?"
"000001" "股指期货是不是有利于银行股低估的修正"
"000001" "请问深发展出让持有000693的股份，为什么没有发公告？"
"000001" "深发展A（000001）七年才分了一次红,是否不能申请增发融资？"
"000001" "您好！请问：平安收购深发展股权后，深发展的管理层会发生大的变动吗？谢谢"
"000001" "平安既要控股，就应早出方案；说要增发增持，已几个月了，是否缺钱？"
"000001" "贵公司受放款额度所限是否会影响2-4季度的业绩？"
"000001" "请直接回答投资者的提问，不要忽悠，是否对2-4季度的业绩产生影响？"
"000001" "贵公司向平安增发股价还是18.26元吗？是否有变动？"
"000001" "什么时间复牌？"
end

gen strL b1 = ustrregexra(b, "[^\p{Han}]", "", .)
gen int count = ustrlen(b1)

Result

Code:

     +----------------------------------------------------------------------------------------------------------------------------------------------------------+
     |      a                                                                        b                                                               b1   count |
     |----------------------------------------------------------------------------------------------------------------------------------------------------------|
  1. | 000001                                  新开业的武汉分行,资金有多少?规模有多大?                             新开业的武汉分行资金有多少规模有多大      18 |
  2. | 000001                                     股指期货是不是有利于银行股低估的修正                             股指期货是不是有利于银行股低估的修正      18 |
  3. | 000001                       请问深发展出让持有000693的股份，为什么没有发公告？                         请问深发展出让持有的股份为什么没有发公告      20 |
  4. | 000001                 深发展A（000001）七年才分了一次红,是否不能申请增发融资？                       深发展七年才分了一次红是否不能申请增发融资      21 |
  5. | 000001   您好！请问：平安收购深发展股权后，深发展的管理层会发生大的变动吗？谢谢   您好请问平安收购深发展股权后深发展的管理层会发生大的变动吗谢谢      31 |
     |----------------------------------------------------------------------------------------------------------------------------------------------------------|
  6. | 000001         平安既要控股，就应早出方案；说要增发增持，已几个月了，是否缺钱？           平安既要控股就应早出方案说要增发增持已几个月了是否缺钱      27 |
  7. | 000001                            贵公司受放款额度所限是否会影响2-4季度的业绩？                         贵公司受放款额度所限是否会影响季度的业绩      20 |
  8. | 000001          请直接回答投资者的提问，不要忽悠，是否对2-4季度的业绩产生影响？           请直接回答投资者的提问不要忽悠是否对季度的业绩产生影响      27 |
  9. | 000001                          贵公司向平安增发股价还是18.26元吗？是否有变动？                           贵公司向平安增发股价还是元吗是否有变动      19 |
 10. | 000001                                                           什么时间复牌？                                                     什么时间复牌       6 |
     +----------------------------------------------------------------------------------------------------------------------------------------------------------+

Edit: apparently William Lisowski came to the same solution as I did.

Comment

River Huang

Join Date: Mar 2016

Posts: 1908
#4

02 Jul 2022, 19:33

Dear @William Lisowski and @Leonardo Guizzetti, Thanks for the helpful suggestions.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#5

02 Jul 2022, 19:38

Try:

Code:

gen len = ustrlen(ustrregexra(提问内容, "[^\p{Alphabetic}&\p{Script=Han}]+", ""))
4 likes
Comment
River Huang

Join Date: Mar 2016

Posts: 1908
#6

02 Jul 2022, 21:41

Dear Hua, Thanks a lot for the helpful suggestion.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
River Huang

Join Date: Mar 2016

Posts: 1908
#7

04 Jul 2022, 02:10

Dear Hua, Another quick question. In addition to Chinese characters, suppose that I'd like to count the digits (say, "000001" in observation 4) and the letters (say, "A" in observation 4) as well. Any suggestions? Thanks.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
Fei Wang

Join Date: Oct 2021

Posts: 726
#8

04 Jul 2022, 02:43

River, if you'd like to count Chinese characters, digits and letters together, then try the code below.

Code:

gen len = ustrlen(ustrregexra(提问内容, "\W", ""))
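If separate counts are wanted rather than one combined total, the same delete-and-measure idea can be applied once per character class. This is a minimal sketch under stated assumptions: "numbers" is taken to mean the ASCII digits 0-9, "alphabets" the ASCII letters A-Z and a-z, and the variable names n_han, n_digit and n_latin are hypothetical. Full-width digits or letters would need additional ranges.

```stata
* Sketch only: two strings retyped from the example data
clear
input strL 提问内容
"深发展A（000001）七年才分了一次红,是否不能申请增发融资？"
"贵公司向平安增发股价还是18.26元吗？是否有变动？"
end

* Each count deletes everything outside one character class
* and measures the length of what remains
generate n_han   = ustrlen(ustrregexra(提问内容, "[^\p{Han}]", ""))
generate n_digit = ustrlen(ustrregexra(提问内容, "[^0-9]", ""))
generate n_latin = ustrlen(ustrregexra(提问内容, "[^A-Za-z]", ""))
list n_han n_digit n_latin, clean
```

For the first string this gives 21 Han characters, 6 digits and 1 letter. For the second it gives 19 Han characters, 4 digits (the "." in "18.26" is not counted) and no letters.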
2 likes
Comment
River Huang

Join Date: Mar 2016

Posts: 1908
#9

04 Jul 2022, 06:08

Dear Fei, Many thanks for this helpful suggestion.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#10

04 Jul 2022, 08:01

River Huang, do you want "18.26" counted as 5 (with ".") or 4 (without ".")? Same question for "2-4": should it be counted as 2 or 3?
2 likes
Comment
River Huang

Join Date: Mar 2016

Posts: 1908
#11

05 Jul 2022, 06:33

Dear Hua, I am not quite sure because the question was originally raised by someone else. Could you provide advice on both cases? Thanks.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
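Both cases raised in #10 can be sketched with the functions already used in this thread. This is not an answer from any poster, only an illustration under stated assumptions: digits are the ASCII digits 0-9, a separator is a "." or "-" that sits directly between two digits, and the names tmp, n_digit and n_digit_sep are hypothetical. The second case relies on lookbehind and lookahead assertions, which the ICU engine behind ustrregexra() supports.

```stata
* Sketch only: two strings retyped from the example data
clear
input strL 提问内容
"贵公司受放款额度所限是否会影响2-4季度的业绩？"
"贵公司向平安增发股价还是18.26元吗？是否有变动？"
end

* Case 1: digits only, so "18.26" counts as 4 and "2-4" as 2
generate n_digit = ustrlen(ustrregexra(提问内容, "[^0-9]", ""))

* Case 2: also count a "." or "-" directly between two digits.
* Each such separator is first replaced by a placeholder digit,
* then the digits are counted as in case 1.
generate strL tmp = ustrregexra(提问内容, "(?<=[0-9])[.\-](?=[0-9])", "0")
generate n_digit_sep = ustrlen(ustrregexra(tmp, "[^0-9]", ""))
drop tmp
list n_digit n_digit_sep, clean
```

With these two strings, "2-4" is counted as 2 and then 3, and "18.26" as 4 and then 5. A full stop or hyphen that is not flanked by digits on both sides is ignored in either case.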
Comment
