  • New -chartab- package on SSC to tabulate character frequency counts

    Thanks to Kit Baum, the chartab package is now available on SSC. To install, type in Stata's Command window:
    Code:
    ssc install chartab
    This installs two commands that tabulate character frequency counts. The chartab command tabulates Unicode characters (requires Stata 14 or higher) and the chartabb command tabulates byte codes (requires Stata 10 or higher).

    In older versions of Stata (version 13 or earlier), each character is encoded using a single byte, which allows for 256 distinct values. char(0) to char(127) are ASCII codes, but there is no single standard for what char(128) to char(255) represent; the meaning depends on the encoding in use on each platform.

    If you are using Stata 14 or higher, each character is encoded in UTF-8. This is a storage-efficient Unicode encoding where the 128 ASCII characters are encoded using a single byte (using the same ASCII byte code). All other Unicode characters are encoded using a multi-byte sequence (from two to four bytes, with each byte code >= 128). So by design, UTF-8 is backwards compatible with ASCII.
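
    To make the byte counts concrete, here is a minimal sketch using Stata's built-in strlen() (length in bytes) and ustrlen() (length in Unicode characters) functions. It is only an illustration, not part of the package. It requires Stata 14 or higher and assumes the accented letters are stored as single precomposed characters.

    ```stata
    * Run from a do-file in Stata 14 or higher.
    * An ASCII character takes one byte in UTF-8.
    display strlen("e")
    * An accented letter such as é (U+00E9) takes two bytes but is one character.
    display strlen("é")
    display ustrlen("é")
    * Assuming precomposed accents: 13 ASCII characters plus 4 two-byte characters,
    * so 21 bytes but 17 Unicode characters.
    display strlen("j'ai hâte à l'été")
    display ustrlen("j'ai hâte à l'été")
    ```

    The gap between the two lengths is the number of extra bytes used by multi-byte characters.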

    Both chartab and chartabb can process text from any combination of string variables, files, string scalars, and string literals in a single run. Here's an example with a string literal:

    Code:
    . chartab , literal("j'ai hâte à l'été")
    
       decimal  hexadecimal   character |     frequency    unique name
    ------------------------------------+------------------------------------------------------
            32       \u0020             |             3    SPACE
            39       \u0027       '     |             2    APOSTROPHE
            97       \u0061       a     |             1    LATIN SMALL LETTER A
           101       \u0065       e     |             1    LATIN SMALL LETTER E
           104       \u0068       h     |             1    LATIN SMALL LETTER H
           105       \u0069       i     |             1    LATIN SMALL LETTER I
           106       \u006a       j     |             1    LATIN SMALL LETTER J
           108       \u006c       l     |             1    LATIN SMALL LETTER L
           116       \u0074       t     |             2    LATIN SMALL LETTER T
           224       \u00e0       à     |             1    LATIN SMALL LETTER A WITH GRAVE
           226       \u00e2       â     |             1    LATIN SMALL LETTER A WITH CIRCUMFLEX
           233       \u00e9       é     |             2    LATIN SMALL LETTER E WITH ACUTE
    ------------------------------------+------------------------------------------------------
    
                                        freq. count   distinct
    ASCII characters              =              13          9
    Multibyte UTF-8 characters    =               4          3
    Unicode replacement character =               0          0
    Total Unicode characters      =              17         12
    
    
    .
    I can do the same in Stata 10 using chartabb. Because this is an older version of Stata, each character is encoded using a single byte code. I'm on a Mac, so characters are encoded using the Mac OS Roman encoding.

    Code:
    . chartabb , literal("j'ai hâte à l'été")
    
       decimal  hexadecimal   character |     frequency
    ------------------------------------+--------------------------------------------------------------------
            32           20             |             3
            39           27       '     |             2
            97           61       a     |             1
           101           65       e     |             1
           104           68       h     |             1
           105           69       i     |             1
           106           6A       j     |             1
           108           6C       l     |             1
           116           74       t     |             2
           136           88       à     |             1
           137           89       â     |             1
           142           8E       é     |             2
    ------------------------------------+--------------------------------------------------------------------
    ASCII control characters     =               0
    ASCII printable characters   =              13
    Extended characters          =               4
    Total characters (bytes)     =              17
    
    
    .
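
    The two tables report different codes for the same accented letters because the encodings differ. As a rough sketch of how the single-byte codes map to Unicode, the built-in ustrfrom() function (Stata 14 or higher) converts a byte string from a named encoding to UTF-8. The encoding name "macintosh" is an assumption on my part; unicode encoding list shows the names your copy of Stata accepts.

    ```stata
    * Run from a do-file in Stata 14 or higher.
    * 136, 137 and 142 are the byte codes chartabb reported for the three
    * accented letters under Mac OS Roman.
    * Assumption: "macintosh" is an accepted name for the Mac OS Roman encoding.
    * Mode 1 substitutes the Unicode replacement character for invalid bytes.
    local bytes = char(136) + char(137) + char(142)
    display ustrfrom("`bytes'", "macintosh", 1)
    ```

    Running chartab on the converted text is one way to check that the conversion produced the characters you expected rather than Unicode replacement characters.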



  • #2
    Robert, looks like another really useful program. Thank you!



    • #3
      Dear Robert, Many thanks for this interesting package.
      Ho-Chuan (River) Huang
      Stata 17.0, MP(4)



      • #4
        Another excellent command! Let me underline the application to a common problem: destring, replace refuses to convert a variable, but you don't know precisely why.

        In this silly example, the problem is clear from the construction, but the question is how one would detect it:

        Code:
        . clear 
        . set obs 1000 
        . gen frog = cond(_n == 42, "frog", string(_n))
        
        . destring frog, replace
        frog: contains nonnumeric characters; no replace
        
        . chartab frog if missing(real(frog))
        
           decimal  hexadecimal   character |     frequency    unique name
        ------------------------------------+--------------------------------------
               102       \u0066       f     |             1    LATIN SMALL LETTER F
               103       \u0067       g     |             1    LATIN SMALL LETTER G
               111       \u006f       o     |             1    LATIN SMALL LETTER O
               114       \u0072       r     |             1    LATIN SMALL LETTER R
        ------------------------------------+--------------------------------------
        
                                            freq. count   distinct
        ASCII characters              =               4          4
        Multibyte UTF-8 characters    =               0          0
        Unicode replacement character =               0          0
        Total Unicode characters      =               4          4
        Naturally, other solutions are possible, e.g. using one or more of list, browse, edit, or tabulate, but in some cases the output from those could be very lengthy, whereas chartab gives a concise report on problem characters, e.g. the letter l typed for 1 or the letter O typed for 0.
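
        A minimal sketch of that last case, with made-up data (the variable x and its values are hypothetical): once chartab has shown which letters are present, subinstr() can swap them for the intended digits before destring is run again. This assumes the variable should be entirely numeric, so every O and l really is a typing error.

        ```stata
        * Run from a do-file; chartab requires Stata 14 or higher.
        clear
        input str4 x
        "10"
        "2O"
        "l5"
        end

        * Report only the characters in values that cannot be read as numbers.
        chartab x if missing(real(x))

        * Replace the look-alike letters with the digits they stand for.
        replace x = subinstr(x, "O", "0", .)
        replace x = subinstr(x, "l", "1", .)

        * With the letters gone, the conversion goes through.
        destring x, replace
        ```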



        • #5
          Another common problem is detecting characters that look like standard ASCII characters but are not (a likely issue if the string data passed through Microsoft Word or Excel):

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str79 organization
          "Egalax—Empia Technology Inc."                                         
          "Taiguen Technology (Shen—Zhen) Co., Ltd."                             
          "Commerce and Industry Company “Faberge Art's Applied Craft” Limited"
          "SANOFI—AVENTIS DEUTSCHLAND GMBH"                                      
          "Danziger ‘DAN’ Flower Farm"                                         
          "Danziger “Dan” Flower Farm"                                         
          "EGALAX₋EMPIA TECHNOLOGY INC."                                         
          "8×8, Inc."                                                             
          "Good Humor−Breyers Ice Cream, Division of Conopco, Inc."              
          end
          
          chartab organization, noascii
          and the results:
          Code:
          . chartab organization, noascii
          
             decimal  hexadecimal   character |     frequency    unique name
          ------------------------------------+---------------------------------------------
                 215       \u00d7       ×     |             1    MULTIPLICATION SIGN
               8,212       \u2014       —     |             3    EM DASH
               8,216       \u2018       ‘     |             1    LEFT SINGLE QUOTATION MARK
               8,217       \u2019       ’     |             1    RIGHT SINGLE QUOTATION MARK
               8,220       \u201c       “     |             2    LEFT DOUBLE QUOTATION MARK
               8,221       \u201d       ”     |             2    RIGHT DOUBLE QUOTATION MARK
               8,331       \u208b       ₋     |             1    SUBSCRIPT MINUS
               8,722       \u2212       −     |             1    MINUS SIGN
          ------------------------------------+---------------------------------------------
          
                                              freq. count   distinct
          ASCII characters              =               0          0
          Multibyte UTF-8 characters    =              12          8
          Unicode replacement character =               0          0
          Total Unicode characters      =              12          8
          
          
          .
          If you are trying to group or merge data based on string values, you should probably perform some Unicode due diligence. chartab can help by identifying which of the more than 100,000 characters in the Unicode Character Database are in your string data.
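
          As one possible follow-up, here is a sketch that maps the look-alikes found above to ASCII characters in a new variable before grouping or merging. The variable name clean and the choice of replacements are assumptions for illustration; whether a given dash or quotation mark should really be folded into its ASCII cousin is a decision about your data. It uses the built-in uchar() and subinstr() functions and assumes the organization variable from the example is in memory.

          ```stata
          * Run from a do-file in Stata 14 or higher.
          generate clean = organization

          * Dash look-alikes: EM DASH, SUBSCRIPT MINUS, MINUS SIGN.
          foreach code in 8212 8331 8722 {
              replace clean = subinstr(clean, uchar(`code'), "-", .)
          }

          * Left and right single quotation marks; char(39) is the ASCII apostrophe.
          foreach code in 8216 8217 {
              replace clean = subinstr(clean, uchar(`code'), char(39), .)
          }

          * Left and right double quotation marks; char(34) is the ASCII double quote.
          foreach code in 8220 8221 {
              replace clean = subinstr(clean, uchar(`code'), char(34), .)
          }

          * MULTIPLICATION SIGN becomes a lowercase x.
          replace clean = subinstr(clean, uchar(215), "x", .)

          * Check what non-ASCII characters, if any, remain.
          chartab clean, noascii
          ```

          Keeping the original variable untouched makes it easy to compare the two and to undo a replacement that turns out to be wrong.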



          • #6
            This command is amazing
