  • Count the number of Chinese characters

    Dear All, I was asked to count only the Chinese characters in the following data.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str6 股票代码 strL 提问内容
    "000001" "新开业的武汉分行,资金有多少?规模有多大?"                                                
    "000001" "股指期货是不是有利于银行股低估的修正"                                                   
    "000001" "请问深发展出让持有000693的股份,为什么没有发公告?"                                 
    "000001" "深发展A(000001)七年才分了一次红,是否不能申请增发融资?"                         
    "000001" "您好!请问:平安收购深发展股权后,深发展的管理层会发生大的变动吗?谢谢"
    "000001" "平安既要控股,就应早出方案;说要增发增持,已几个月了,是否缺钱?"         
    "000001" "贵公司受放款额度所限是否会影响2-4季度的业绩?"                                       
    "000001" "请直接回答投资者的提问,不要忽悠,是否对2-4季度的业绩产生影响?"            
    "000001" "贵公司向平安增发股价还是18.26元吗?是否有变动?"                                     
    "000001" "什么时间复牌?"                                                                                    
    end
    Any suggestions? Thanks.
    Ho-Chuan (River) Huang
    Stata 19.0, MP(4)

  • #2
    This may start you in a useful direction. The regular expression below deletes all characters that are not part of the Han script, so the length of what is left is what you seek. The two commands could be combined into a single command, but I wanted to display the intermediate results for checking.
    Code:
    . generate onlyHan = ustrregexra(提问内容,"[^\p{Han}]","")
    
    . generate nc = ustrlen(onlyHan)
    
    . list onlyHan nc, clean
    
                                                                  onlyHan   nc  
      1.                             新开业的武汉分行资金有多少规模有多大   18  
      2.                             股指期货是不是有利于银行股低估的修正   18  
      3.                         请问深发展出让持有的股份为什么没有发公告   20  
      4.                       深发展七年才分了一次红是否不能申请增发融资   21  
      5.   您好请问平安收购深发展股权后深发展的管理层会发生大的变动吗谢谢   31  
      6.           平安既要控股就应早出方案说要增发增持已几个月了是否缺钱   27  
      7.                         贵公司受放款额度所限是否会影响季度的业绩   20  
      8.           请直接回答投资者的提问不要忽悠是否对季度的业绩产生影响   27  
      9.                           贵公司向平安增发股价还是元吗是否有变动   19  
     10.                                                     什么时间复牌    6
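As a sketch of the single-command form mentioned above, the nested call below skips the intermediate variable. It rebuilds two rows of the example data so that it runs on its own; `nc_direct` is a hypothetical variable name, not part of the original answer.

```stata
* Two rows copied from the example data (observations 10 and 9)
clear
input strL 提问内容
"什么时间复牌?"
"贵公司向平安增发股价还是18.26元吗?是否有变动?"
end

* Delete every non-Han character and take the length of what remains
generate nc_direct = ustrlen(ustrregexra(提问内容, "[^\p{Han}]", ""))
list nc_direct, clean
```

The counts should agree with the nc column in the listing above for the same two observations (6 and 19).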
    Added in edit: this is one of those times when a screenshot may be more helpful than a code block. This is the sort of thing, by the way, that the Unicode regular expression engine shines at: the ability to do a "wild card" match to any character in a particular Unicode script (here, Han). That is why you got simultaneous copies of the same answer.
    [Screenshot: image_27864.png]

    Further edit:
    Code:
    generate onlyHan = ustrregexra(提问内容,"[^\p{Han}]|一","")
    will also remove the character 一 in observation 4 (the one that doesn't look like the rest of them), as an example of how to exclude a few characters that are in the Han script. The pattern adds a vertical bar (the "or" operator) followed by the character, copied and pasted from the input text.
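To exclude more than one Han character, the single character after the bar can be replaced by a bracketed class. This is a minimal sketch under that assumption, using one row of the example data; the names `onlyHan2` and `nc2` and the choice of 一 and 七 are purely illustrative.

```stata
* Observation 4 of the example data
clear
input strL 提问内容
"深发展A(000001)七年才分了一次红,是否不能申请增发融资?"
end

* First alternative: any non-Han character.
* Second alternative: the Han characters listed in the class (illustrative choice).
generate onlyHan2 = ustrregexra(提问内容, "[^\p{Han}]|[一七]", "")
generate nc2 = ustrlen(onlyHan2)

* The row has 21 Han characters; 一 and 七 each occur once, so nc2 is 19
list onlyHan2 nc2, clean
```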

    One final edit: Hua Peng (StataCorp) demonstrates the advantages of a deep knowledge of regular expression syntax; his solution generalizes the one in my previous edit by requiring retained characters to both (a) be alphabetic and (b) be in the Han script.
    Last edited by William Lisowski; 02 Jul 2022, 19:46.


    • #3
      Here's a shot at this to at least get you started in a useful direction. Note that I do not have knowledge of Chinese typography or languages.

      Code:
      clear
      input str6 a strL b
      "000001" "新开业的武汉分行,资金有多少?规模有多大?"
      "000001" "股指期货是不是有利于银行股低估的修正"
      "000001" "请问深发展出让持有000693的股份,为什么没有发公告?"
      "000001" "深发展A(000001)七年才分了一次红,是否不能申请增发融资?"
      "000001" "您好!请问:平安收购深发展股权后,深发展的管理层会发生大的变动吗?谢谢"
      "000001" "平安既要控股,就应早出方案;说要增发增持,已几个月了,是否缺钱?"
      "000001" "贵公司受放款额度所限是否会影响2-4季度的业绩?"
      "000001" "请直接回答投资者的提问,不要忽悠,是否对2-4季度的业绩产生影响?"
      "000001" "贵公司向平安增发股价还是18.26元吗?是否有变动?"
      "000001" "什么时间复牌?"
      end
      
      gen strL b1 = ustrregexra(b, "[^\p{Han}]", "", .)
      gen int count = ustrlen(b1)
      Result

      Code:
           +----------------------------------------------------------------------------------------------------------------------------------------------------------+
           |      a                                                                        b                                                               b1   count |
           |----------------------------------------------------------------------------------------------------------------------------------------------------------|
        1. | 000001                                  新开业的武汉分行,资金有多少?规模有多大?                             新开业的武汉分行资金有多少规模有多大      18 |
        2. | 000001                                     股指期货是不是有利于银行股低估的修正                             股指期货是不是有利于银行股低估的修正      18 |
        3. | 000001                       请问深发展出让持有000693的股份,为什么没有发公告?                         请问深发展出让持有的股份为什么没有发公告      20 |
        4. | 000001                 深发展A(000001)七年才分了一次红,是否不能申请增发融资?                       深发展七年才分了一次红是否不能申请增发融资      21 |
        5. | 000001   您好!请问:平安收购深发展股权后,深发展的管理层会发生大的变动吗?谢谢   您好请问平安收购深发展股权后深发展的管理层会发生大的变动吗谢谢      31 |
           |----------------------------------------------------------------------------------------------------------------------------------------------------------|
        6. | 000001         平安既要控股,就应早出方案;说要增发增持,已几个月了,是否缺钱?           平安既要控股就应早出方案说要增发增持已几个月了是否缺钱      27 |
        7. | 000001                            贵公司受放款额度所限是否会影响2-4季度的业绩?                         贵公司受放款额度所限是否会影响季度的业绩      20 |
        8. | 000001          请直接回答投资者的提问,不要忽悠,是否对2-4季度的业绩产生影响?           请直接回答投资者的提问不要忽悠是否对季度的业绩产生影响      27 |
        9. | 000001                          贵公司向平安增发股价还是18.26元吗?是否有变动?                           贵公司向平安增发股价还是元吗是否有变动      19 |
       10. | 000001                                                           什么时间复牌?                                                     什么时间复牌       6 |
           +----------------------------------------------------------------------------------------------------------------------------------------------------------+
      Edit: apparently William Lisowski came to the same solution as I did.


      • #4
        Dear @William Lisowski and @Leonardo Guizzetti, Thanks for the helpful suggestions.
        Ho-Chuan (River) Huang
        Stata 19.0, MP(4)


        • #5
          Try:

          Code:
          gen len = ustrlen(ustrregexra(提问内容, "[^\p{Alphabetic}&\p{Script=Han}]+", ""))


          • #6
            Dear Hua, Thanks a lot for the helpful suggestion.
            Ho-Chuan (River) Huang
            Stata 19.0, MP(4)


            • #7
              Dear Hua, Another quick question. In addition to Chinese characters, suppose that I'd like to count the digits (say, 000001 in observation 4) and the letters (say, A in observation 4) as well. Any suggestions? Thanks.
              Ho-Chuan (River) Huang
              Stata 19.0, MP(4)


              • #8
                River, if you'd like to count Chinese characters, digits, and letters together, then try the code below.

                Code:
                gen len = ustrlen(ustrregexra(提问内容, "\W", ""))
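If the three kinds of characters need to be counted separately rather than together, one possible variation is a separate negated class per kind. This is only a sketch: it assumes the digits and letters are plain ASCII (full-width forms would not match `[0-9]` or `[A-Za-z]`), and the variable names `n_han`, `n_digit`, and `n_letter` are hypothetical.

```stata
* Observations 4 and 9 of the example data
clear
input strL 提问内容
"深发展A(000001)七年才分了一次红,是否不能申请增发融资?"
"贵公司向平安增发股价还是18.26元吗?是否有变动?"
end

* Each count deletes everything outside one class and measures what is left
generate n_han    = ustrlen(ustrregexra(提问内容, "[^\p{Han}]", ""))
generate n_digit  = ustrlen(ustrregexra(提问内容, "[^0-9]", ""))
generate n_letter = ustrlen(ustrregexra(提问内容, "[^A-Za-z]", ""))

list n_han n_digit n_letter, clean
```

As far as I know, `\W` also leaves underscores in place, since they count as word characters, so the per-class version may be preferable when the text can contain them.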


                • #9
                  Dear Fei, Many thanks for this helpful suggestion.
                  Ho-Chuan (River) Huang
                  Stata 19.0, MP(4)


                  • #10
                    River Huang, do you want "18.26" counted as 5 (with ".") or 4 (without ".")? Same question for "2-4": should it be counted as 2 or 3?


                    • #11
                      Dear Hua, I am not quite sure because the question was originally raised by someone else. Could you provide advice on both cases? Thanks.
                      Ho-Chuan (River) Huang
                      Stata 19.0, MP(4)
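Both cases can be sketched as follows. This is not a reply from the thread, only an illustration: it assumes ASCII digits, treats only "." and "-" as separators, counts a separator only when it sits directly between two digits, and relies on lookaround assertions being accepted by Stata's Unicode regular expression functions. The variable names are hypothetical.

```stata
* Observations 7 and 9 of the example data
clear
input strL 提问内容
"贵公司受放款额度所限是否会影响2-4季度的业绩?"
"贵公司向平安增发股价还是18.26元吗?是否有变动?"
end

* Case 1: digits only, so "2-4" counts as 2 and "18.26" as 4
generate n_digit = ustrlen(ustrregexra(提问内容, "[^0-9]", ""))

* Number of "." or "-" characters with a digit on both sides:
* the length lost when exactly those characters are deleted
generate n_sep = ustrlen(提问内容) - ///
    ustrlen(ustrregexra(提问内容, "(?<=[0-9])[.\-](?=[0-9])", ""))

* Case 2: digits plus separators, so "2-4" counts as 3 and "18.26" as 5
generate n_digit_sep = n_digit + n_sep

list n_digit n_sep n_digit_sep, clean
```

The lookbehind and lookahead keep the digits themselves out of the match, so a run such as "1.2.3" would still have both dots counted.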
