
  • same Chinese characters?

    Dear All, I have this data.
    Code:
    clear
    input byte A str9 B
    1 "里"
    2 "里"
    end
    These are two (supposedly) identical Chinese characters. However, when I list them as follows:
    Code:
    . list if B=="里"
    
         +--------+
         | A    B |
         |--------|
      2. | 2   里 |
         +--------+
    I see only the second observation, not the first. When I copy the Chinese character from the first observation and list with it:
    Code:
    . list if B=="里"
    
         +--------+
         | A    B |
         |--------|
      1. | 1   里 |
         +--------+
    I obtain only the first observation. So these appear to be two different characters, even though they look the same. Does anyone have any ideas? Thanks.
    Note that in the data example above (produced with dataex), the two characters do not look identical: one appears smaller and the other bigger.
    Ho-Chuan (River) Huang
    Stata 19.0, MP(4)

  • #2
    If you look at the hex code using
    Code:
    gen hex = ustrtohex(B)
    list hex
    
         +--------+
         |    hex |
         |--------|
      1. | \uf9e9 |
      2. | \u91cc |
         +--------+
    The first character is https://unicode.scarfboy.com/?s=U%2BF9E9, the second is https://unicode.scarfboy.com/?s=U%2B91cc

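    They are different Unicode code points that render as the same Chinese ideograph: U+F9E9 belongs to the CJK Compatibility Ideographs block, whereas U+91CC is the CJK Unified Ideograph.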


    • #3
      Dear Hua, Thanks for the reply.

      The problem is that I have a big data set (almost 1,000,000 observations) of addresses (for car accidents). For unknown reasons, addresses that are (supposedly) the same in Excel appear to be different in Stata (as you mentioned above).

      I need to make sure they are identical so that I can -collapse- or -merge- with another file (which probably has the same address problem). How can I do this? Checking every address character by character is not feasible.

      Do you have any feasible suggestions on this difficult problem? Thanks.
      Ho-Chuan (River) Huang
      Stata 19.0, MP(4)


      • #4
        I do not believe they are the same in Excel to begin with. Do you only need to harmonize a few known characters, or is it more of a needle-in-a-haystack situation?


        • #5
          Dear Hua, in fact the data are from my son (for his master's thesis). Unfortunately, I think it is more likely the latter case, i.e., a needle-in-a-haystack situation. I wonder if there is a method to make the characters identical in Stata wherever they initially differ.
          Last edited by River Huang; 12 Sep 2021, 21:25.
          Ho-Chuan (River) Huang
          Stata 19.0, MP(4)


          • #6
            Another example is
            Code:
            clear
            input byte A str15 B
            1 "里"
            2 "里"
            1 "無道路名稱"
            2 "無道路名稱"
            end
            Ho-Chuan (River) Huang
            Stata 19.0, MP(4)


            • #7
              Originally posted by River Huang
              I wonder if there is a method to make them (whenever are different initially) the same in Stata finally.
              Can you get a list off the Internet somewhere of Unicode code points that share the same Chinese ideograph? If so, you can use it as a "crosswalk" table to harmonize the different values to a single chosen Unicode code point for each set that shares an ideograph. To do this, you'd use Stata's string functions, -merge- and the regular-expression functions, perhaps in combination.
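              As a rough sketch of that idea for a single known pair: the pair U+F9E9 and U+91CC is taken from #2, while a full crosswalk table is an assumption and would have to come from elsewhere. Something like this might work for one pair:
              Code:
              * Sketch only: map one known compatibility code point to its unified form
              * U+F9E9 and U+91CC are the two code points identified in #2
              * A missing value (.) as the last argument replaces all occurrences
              replace B = usubinstr(B, ustrunescape("\uf9e9"), ustrunescape("\u91cc"), .)
              With many pairs, the same replacement would presumably be looped over the rows of the crosswalk table.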


              • #8
                Maybe ustrnormalize() can help:
                Code:
                clear
                input byte A str15 B
                1 "里"
                2 "里"
                1 "無道路名稱"
                2 "無道路名稱"
                end
                
                gen C =  ustrnormalize(B,"nfc")
                
                tab C
                Code:
                              C |      Freq.     Percent        Cum.
                ----------------+-----------------------------------
                     無道路名稱   |          2       50.00       50.00
                             里  |          2       50.00      100.00
                ----------------+-----------------------------------
                          Total |          4      100.00
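                For a large file, a sketch along these lines might show how many values are affected before anything is overwritten. The variable name B follows the example above; changed is a hypothetical name chosen for illustration.
                Code:
                * Sketch only: flag values that NFC normalization would alter
                gen byte changed = (B != ustrnormalize(B, "nfc"))
                count if changed
                * Overwrite the affected values with their normalized form
                replace B = ustrnormalize(B, "nfc") if changed
                drop changed
                The same replacement would presumably have to be run on the key variable in the other file before any -merge-, so that both sides use the same code points.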


                • #9
                  I agree with Bjarte; ustrnormalize() should be able to handle characters from the CJK Compatibility Ideographs block.
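                  One way to check this on the example data might be to compare the hex codes before and after normalization. This is only a sketch; hex_raw and hex_nfc are hypothetical variable names.
                  Code:
                  * Sketch only: code points before and after NFC normalization
                  gen hex_raw = ustrtohex(B)
                  gen hex_nfc = ustrtohex(ustrnormalize(B, "nfc"))
                  list A hex_raw hex_nfc
                  If normalization behaves as expected, U+F9E9 should be replaced by its canonical equivalent U+91CC, so the hex_nfc values for the two 里 observations would match even though the hex_raw values differ.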


                  • #10
                    Dear Joseph, Thanks for this suggestion. I'll see what I can do.
                    Ho-Chuan (River) Huang
                    Stata 19.0, MP(4)


                    • #11
                      Dear Bjarte, I will definitely try this out. Thanks.
                      Ho-Chuan (River) Huang
                      Stata 19.0, MP(4)


                      • #12
                        Dear Bjarte, I have tried your suggestion, and it solved our problem perfectly. Thanks a lot.
                        Ho-Chuan (River) Huang
                        Stata 19.0, MP(4)
