New -chartab- package on SSC to tabulate character frequency counts

Robert Picard

Join Date: Mar 2014
Posts: 1536

18 Feb 2019, 14:57

Thanks to Kit Baum, the chartab package is now available on SSC. To install, type in Stata's Command window:

Code:

ssc install chartab

This installs two commands that tabulate character frequency counts. The chartab command tabulates Unicode characters (requires Stata 14 or higher) and the chartabb command tabulates byte codes (requires Stata 10 or higher).

In older versions of Stata (version 13 or earlier), each character is encoded using a single byte, which allows for 256 distinct values. char(0) to char(127) are ASCII codes, but there is no single standard for what char(128) to char(255) represent: that depends on the encoding in use on your system.

If you are using Stata 14 or higher, each character is encoded in UTF-8. This is a storage-efficient Unicode encoding where the 128 ASCII characters are encoded using a single byte (using the same ASCII byte code). All other Unicode characters are encoded using a multi-byte sequence (from two to four bytes, with each byte code >= 128). So by design, UTF-8 is backwards compatible with ASCII.
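To see the difference between bytes and characters, compare strlen(), which counts bytes, with ustrlen(), which counts Unicode characters. This is only a small sketch using built-in functions in Stata 14 or higher and is not part of chartab. uchar(233) is used to build é from its code point so that the example does not depend on how the character was typed.

```stata
* strlen() counts bytes; ustrlen() counts Unicode characters (Stata 14+)
display strlen("e")
* 1: an ASCII character occupies a single byte

display strlen(uchar(233))
* 2: é (U+00E9, decimal 233) is a two-byte sequence in UTF-8

display ustrlen(uchar(233))
* 1: it is still a single character

* Build the word été from code points and compare both counts
local word = uchar(233) + "t" + uchar(233)
display strlen("`word'")
* 5 bytes: 2 + 1 + 2
display ustrlen("`word'")
* 3 characters
```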

Both chartab and chartabb can process text from any combination of string variables, files, string scalars, and string literals in a single run. Here's an example with a string literal:

Code:

. chartab , literal("j'ai hâte à l'été")

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+------------------------------------------------------
        32       \u0020             |             3    SPACE
        39       \u0027       '     |             2    APOSTROPHE
        97       \u0061       a     |             1    LATIN SMALL LETTER A
       101       \u0065       e     |             1    LATIN SMALL LETTER E
       104       \u0068       h     |             1    LATIN SMALL LETTER H
       105       \u0069       i     |             1    LATIN SMALL LETTER I
       106       \u006a       j     |             1    LATIN SMALL LETTER J
       108       \u006c       l     |             1    LATIN SMALL LETTER L
       116       \u0074       t     |             2    LATIN SMALL LETTER T
       224       \u00e0       à     |             1    LATIN SMALL LETTER A WITH GRAVE
       226       \u00e2       â     |             1    LATIN SMALL LETTER A WITH CIRCUMFLEX
       233       \u00e9       é     |             2    LATIN SMALL LETTER E WITH ACUTE
------------------------------------+------------------------------------------------------

                                    freq. count   distinct
ASCII characters              =              13          9
Multibyte UTF-8 characters    =               4          3
Unicode replacement character =               0          0
Total Unicode characters      =              17         12


.

I can do the same in Stata 10 using chartabb. But since this is an older version of Stata, each character is encoded using a single byte code. I'm on a Mac, so characters are encoded using the Mac OS Roman encoding.

Code:

. chartabb , literal("j'ai hâte à l'été")

   decimal  hexadecimal   character |     frequency
------------------------------------+--------------------------------------------------------------------
        32           20             |             3
        39           27       '     |             2
        97           61       a     |             1
       101           65       e     |             1
       104           68       h     |             1
       105           69       i     |             1
       106           6A       j     |             1
       108           6C       l     |             1
       116           74       t     |             2
       136           88       à     |             1
       137           89       â     |             1
       142           8E       é     |             2
------------------------------------+--------------------------------------------------------------------
ASCII control characters     =               0
ASCII printable characters   =              13
Extended characters          =               4
Total characters (bytes)     =              17


.

Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#2

18 Feb 2019, 15:05

Robert, looks like another really useful program. Thank you!

River Huang

Join Date: Mar 2016

Posts: 1908
#3

18 Feb 2019, 18:31

Dear Robert, Many thanks for this interesting package.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)

Nick Cox

Join Date: Mar 2014
Posts: 36058
#4

19 Feb 2019, 05:14

Another excellent command! Let me underline the application to a common problem: destring, replace refuses to convert a variable, but you don't know precisely why.

In this silly example, the problem is clear from the construction, but the question is how one would detect it:

Code:

. clear 
. set obs 1000 
. gen frog = cond(_n == 42, "frog", string(_n))

. destring frog, replace
frog: contains nonnumeric characters; no replace

. chartab frog if missing(real(frog))

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+--------------------------------------
       102       \u0066       f     |             1    LATIN SMALL LETTER F
       103       \u0067       g     |             1    LATIN SMALL LETTER G
       111       \u006f       o     |             1    LATIN SMALL LETTER O
       114       \u0072       r     |             1    LATIN SMALL LETTER R
------------------------------------+--------------------------------------

                                    freq. count   distinct
ASCII characters              =               4          4
Multibyte UTF-8 characters    =               0          0
Unicode replacement character =               0          0
Total Unicode characters      =               4          4

Naturally, other solutions are possible, e.g. using one or more of list, browse, edit, and tabulate, but in some cases the output from those could be very lengthy, whereas chartab gives a concise report on problem characters, e.g. the letter l for 1 or the letter O for 0.
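Once the problem characters are known, the fix is often a couple of replace statements before trying destring again. Here is a minimal sketch under made-up assumptions: the variable amount and its values are hypothetical, and it assumes that the letters really were typed in place of digits, which should be checked against the source of the data.

```stata
* Hypothetical data: the letter O typed for 0 and the letter l typed for 1
clear
input str6 amount
"1O5"
"2l"
"300"
end

* Swap the look-alike letters for the digits they were meant to be
replace amount = subinstr(amount, "O", "0", .)
replace amount = subinstr(amount, "l", "1", .)

* With only digits left, destring can now convert the variable
destring amount, replace
list
```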

Robert Picard

Join Date: Mar 2014
Posts: 1536
#5

19 Feb 2019, 09:02

Another common problem is detecting characters that look like standard ASCII characters but are not (a likely issue if the string data passed through Microsoft Word or Excel):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str79 organization
"Egalax—Empia Technology Inc."                                         
"Taiguen Technology (Shen—Zhen) Co., Ltd."                             
"Commerce and Industry Company “Faberge Art's Applied Craft” Limited"
"SANOFI—AVENTIS DEUTSCHLAND GMBH"                                      
"Danziger ‘DAN’ Flower Farm"                                         
"Danziger “Dan” Flower Farm"                                         
"EGALAX₋EMPIA TECHNOLOGY INC."                                         
"8×8, Inc."                                                             
"Good Humor−Breyers Ice Cream, Division of Conopco, Inc."              
end

chartab organization, noascii

and the results:

Code:

. chartab organization, noascii

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+---------------------------------------------
       215       \u00d7       ×     |             1    MULTIPLICATION SIGN
     8,212       \u2014       —     |             3    EM DASH
     8,216       \u2018       ‘     |             1    LEFT SINGLE QUOTATION MARK
     8,217       \u2019       ’     |             1    RIGHT SINGLE QUOTATION MARK
     8,220       \u201c       “     |             2    LEFT DOUBLE QUOTATION MARK
     8,221       \u201d       ”     |             2    RIGHT DOUBLE QUOTATION MARK
     8,331       \u208b       ₋     |             1    SUBSCRIPT MINUS
     8,722       \u2212       −     |             1    MINUS SIGN
------------------------------------+---------------------------------------------

                                    freq. count   distinct
ASCII characters              =               0          0
Multibyte UTF-8 characters    =              12          8
Unicode replacement character =               0          0
Total Unicode characters      =              12          8


.

If you are trying to group or merge data based on string values, you should probably perform some Unicode due diligence. chartab can help by identifying which of the more than 100,000 characters in the Unicode Character Database are in your string data.
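Once chartab has reported the look-alike characters, one possible next step is to map them to ASCII equivalents before grouping or merging. The sketch below is only an illustration and not part of chartab: it builds a single made-up observation with uchar(), and whether a given dash or quotation mark should become a plain hyphen or straight quote is a judgment call that depends on your data. The code points are the decimal values from the table above.

```stata
* Self-contained sketch (Stata 14+): one name containing an em dash (U+2014)
clear
set obs 1
generate str79 organization = "Egalax" + uchar(8212) + "Empia Technology Inc."

* Map the em dash, subscript minus, and minus sign to the ASCII hyphen
foreach cp of numlist 8212 8331 8722 {
    replace organization = usubinstr(organization, uchar(`cp'), "-", .)
}

* Map curly quotation marks to straight ones; char(39) is ' and char(34) is "
replace organization = usubinstr(organization, uchar(8216), char(39), .)
replace organization = usubinstr(organization, uchar(8217), char(39), .)
replace organization = usubinstr(organization, uchar(8220), char(34), .)
replace organization = usubinstr(organization, uchar(8221), char(34), .)

display organization[1]
```

Running chartab organization, noascii again afterwards shows what is left. On the full example data that would still include the multiplication sign, which this sketch deliberately leaves alone.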

Jesse Wursten

Join Date: Jan 2016

Posts: 915
#6

26 Jan 2021, 02:25

This command is amazing