-charlist- displays "boxes" when -tab- displays character

Eivind Olsen

Join Date: Aug 2016

Posts: 2
#1

27 Oct 2016, 04:28

I am using Stata 14.2 to -unicode translate- files to UTF-8. To tease out potential translation problems (e.g. because I selected the wrong original encoding), I use Nick Cox's -charlist- from SSC.

I expected -charlist- to display correctly translated Unicode characters in the viewer window, but it shows "boxes" instead. Meanwhile, -tabulate- on the same variable displays the characters properly in the viewer. In other words, -charlist- does not distinguish between characters that show as "boxes" in other windows within Stata (e.g. the data browser) and correctly translated non-standard characters that display in their proper form elsewhere in Stata.

Although I am able to identify the characters behind the "boxes" with -di "`r(ascii)'"-, I am curious why -charlist- doesn't display those characters properly. I figured it can't be a font problem, since -tabulate- uses the same font (and I've tried all the built-in fonts).

Any hints?

Nick Cox

Join Date: Mar 2014

Posts: 35500
#2

27 Oct 2016, 06:33

charlist was written for Stata 9, as is documented in ssc desc charlist. Correspondingly it has no knowledge of Unicode. So, the answer is, regrettably, that I

1. Did not know in 2002 what would be introduced in 2015.

2. Have not updated the program for Unicode.

Updating is not on any to-do list of mine either. I don't work in practice with anything but characters of the kind that Stata could handle 25 years ago. User-written programs are occasionally written for fun as an answer to someone's question, but more often written because the user-programmer wants some facility for their own work (and I do use charlist myself from time to time).

I would be very happy if someone wanted to write a new program that does what you want -- indeed it may already exist. But please don't call it charlist.
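
To see why a byte-oriented scan can produce "boxes", it may help to compare the byte view and the character view of one accented letter. The following is a minimal sketch, assuming Stata 14 or later (where strings are stored as UTF-8); the local name s is purely illustrative.

```stata
* assumes Stata 14 or later; s is an illustrative local name
local s = uchar(233)                     // e with acute accent, code point U+00E9
display strlen("`s'")                    // length in bytes
display ustrlen("`s'")                   // length in Unicode characters
display tobytes("`s'")                   // decimal bytes, the values a loop over char() can match
* the two bytes together form the character ...
assert "`s'" == char(195) + char(169)
* ... but char(233) is a single byte and is not the same string
assert "`s'" != char(233)
```

Under these assumptions, a loop over char(1) to char(255) would flag bytes 195 and 169 separately rather than the single character, and each byte on its own is not a displayable UTF-8 character.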

Julian Duggan

Join Date: Jul 2016

Posts: 63
#3

14 Jan 2019, 10:02

Does anyone know if the Unicode equivalent of charlist that Nick described has been written? I have looked around but haven't found one.

Nick Cox

Join Date: Mar 2014
Posts: 35500

14 Jan 2019, 10:11

Not me, but a few more comments may help a little. First, charlist is fairly trivial:

Code:

program def charlist, rclass
*! NJC 1.1.0 17 Dec 2002 
    version 7 
    syntax varname(string) [if] [in] 
    marksample touse, novarlist
    
    * not 0: see [P] file formats .dta 
    forval i = 1/255 { 
        capture assert index(`varlist', char(`i')) == 0 if `touse' 
        if _rc {
            if char(`i') == " " { 
                local c " " 
            } 
            else local c = char(`i') 
            local chars `"`chars'`c'"' 
            local sepchars `"`sepchars'`c' "'
            local ascii "`ascii'`i' " 
        }
    } 
    
    di as text `"`chars'"' 
    return local ascii "`ascii'" 
    return local sepchars `"`sepchars'"' 
    return local chars `"`chars'"'  
end

There is an easy tweak to ucharlist which calls up uchar() not char().

What's not so obvious is what to do about strLs and what sort of performance might be expected with large datasets dominated by strLs. As I don't have experience with such data, and already have a long to-do list, I remain happy to delegate this to the rest of the community, or StataCorp.


William Lisowski

Join Date: Dec 2014
Posts: 10150

14 Jan 2019, 10:28

Perhaps you can achieve your goal using Stata's Unicode string functions directly. Here is some code that demonstrates functions that might be of use to you.

Code:

. local s "médiane"

. local l = ustrlen(`"`s'"')

. forvalues i = 1/`l' {
  2. local c = usubstr(`"`s'"',`i',1)
  3. local b = tobytes(`"`c'"',0)
  4. local x = tobytes(`"`c'"',1)
  5. display `" `c' - `b' - `x' "' 
  6. }
 m - \d109 - \x6d 
 é - \d195\d169 - \xc3\xa9 
 d - \d100 - \x64 
 i - \d105 - \x69 
 a - \d097 - \x61 
 n - \d110 - \x6e 
 e - \d101 - \x65
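
Where code points rather than bytes are wanted, ustrtohex() may be a useful companion to tobytes(). The following is a minimal sketch, assuming Stata 14 or later; the local names are illustrative and the loop simply mirrors the one above.

```stata
* assumes Stata 14 or later; local names are illustrative
local s = "m" + uchar(233) + "diane"
local l = ustrlen(`"`s'"')
forvalues i = 1/`l' {
    local c = usubstr(`"`s'"', `i', 1)
    * ustrtohex() returns the escaped hex code point of the character
    local h = ustrtohex(`"`c'"')
    display `" `c' - `h' "'
}
```

The escaped form can be turned back into a character with ustrunescape(), which may help when checking whether a translated file contains the code points one expects.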


Julian Duggan

Join Date: Jul 2016
Posts: 63

14 Jan 2019, 10:37

Hi Nick,

Thanks a million for posting that. I made a few minor modifications that don't address your concern about performance, but make the program functional (at least for my needs-- I don't expect to encounter any characters beyond decimal value 1000).

Code:

program def ucharlist, rclass
* https://www.statalist.org/forums/forum/general-stata-discussion/general/1361945-charlist-displays-boxes-when-tab-displays-character
*! Very minor modification of charlist by NJC 1.1.0 17 Dec 2002
    syntax varname(string) [if] [in] 
    marksample touse, novarlist
    
    * not 0: see [P] file formats .dta 
    forval i = 32/1500 { 
        capture assert index(`varlist', uchar(`i')) == 0 if `touse' 
        if _rc {
            if uchar(`i') == " " { 
                local c " " 
            } 
            else local c = uchar(`i') 
            local uchars `"`uchars'`c'"' 
            local sepuchars `"`sepuchars'`c' "'
            local unicode "`unicode'`i' " 
        }
    } 
    
    di as text `"`uchars'"' 
    return local unicode "`unicode'" 
    return local sepuchars `"`sepuchars'"' 
    return local uchars `"`uchars'"'  
end

Cheers,

Julian


Julian Duggan

Join Date: Jul 2016

Posts: 63
#7

14 Jan 2019, 10:42

Hi William,

That's nifty as well, thanks!

Cheers,

Julian

Julian Duggan

Join Date: Jul 2016
Posts: 63

14 Jan 2019, 11:56

Apologies for spamming, but for the sake of whoever looks later, I found a bug in the program above. uchar(96) (i.e. "`") cannot be stored in a local so I have to exclude it. Again, this is fine for my purposes but might not work for others:

Code:

    ** define function that produces list of all characters contained in string variable. 
    cap prog drop ucharlist 
    program def ucharlist, rclass
    * see https://www.statalist.org/forums/forum/general-stata-discussion/general/1361945-charlist-displays-boxes-when-tab-displays-character
    *! Very minor modification of NJC 1.1.0 17 Dec 2002
        syntax varname(string) [if] [in] 
        marksample touse, novarlist
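        * scan code points 32 to 1500 only; characters outside this range are not reported
        forval i = 32/1500 { 
            * the assert fails (_rc != 0) when some observation contains uchar(i)
            capture assert index(`varlist', uchar(`i')) == 0 if `touse' 
            if _rc {
                * code point 96 is the left single quote used in macro references; store a space in its place
                if uchar(`i') == " " | `i' == 96 { 
                    local c " "
                } 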
                else local c = uchar(`i') 
                local uchars `"`uchars'`c'"' 
                local sepuchars `"`sepuchars'`c' "'
                local unicode "`unicode'`i' " 
            }
        } 
        di as text `"`uchars'"' 
        return local unicode "`unicode'" 
        return local sepuchars `"`sepuchars'"' 
        return local uchars `"`uchars'"'  
    end


William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

14 Jan 2019, 12:52

This is so not spam. Statalist is a forum of users discussing Stata techniques. By taking the time to post this, you have "paid it forward" to someone who finds this topic at a later date and needs to adapt its code.

This prompted me to experiment. Technically, the problem is not that the decimal 96 character (which is the Stata "left single quote" character ` used to dereference a local macro) cannot be stored in a local macro; the code below demonstrates that it can.

Code:

. local c = char(96)

. local x | below the line is the character in the local macro c

. macro list _x _c
_x:             | below the line is the character in the local macro c
_c:             `

The problems seem to arise when you try to dereference the local macro containing the 96 character, because it creates a left single quote that Stata then rescans as the start of a local macro. Sometimes the macval() expansion operator can help deal with this.

Code:

. display `">>>`c'<<<"'
>>>

. display `">>>`macval(c)'<<<"'
>>>`<<<

daniel klein

Join Date: Mar 2014

Posts: 3834
#10

14 Jan 2019, 13:18

Note that Nick fixed the problem with character 96 a long time ago, as explained in more recent versions of charlist. I downloaded the program from SSC and

Code:

which charlist

displays

Code:

*! NJC 1.3.0 28 Feb 2014

I would use that code as a basis, not the outdated code from 2002.

Best
Daniel

Julian Duggan

Join Date: Jul 2016

Posts: 63
#11

14 Jan 2019, 14:17

Hi William,

Nice catch! That makes a lot of sense. And Daniel, I will look into that should I need to update the program in the future. Thanks for letting us know.

Julian

Robert Picard

Join Date: Mar 2014

Posts: 1536
#12

15 Jan 2019, 07:39

I've been working on and off on a new program that I call chartab that tabulates character frequencies. It supports the full range of Unicode characters (137,439 code points in the current Unicode Character Database) and it can process character strings from variables and files (including URLs). I'm in the process of adding string scalars and literals (for completeness) and byte code support. It can efficiently deal with long strLs, very large datasets, and external files while maintaining a small RAM footprint.

I'll try to find time this week to finish it and I'll make an announcement on this list when it becomes available on SSC.

Nick Cox

Join Date: Mar 2014

Posts: 35500
#13

15 Jan 2019, 07:46

daniel klein #10 That's an excellent catch. Rummaging around shows that (unsurprisingly) I did have a charlist from 2014 on my machine but it wasn't visible in the Stata I was working on. Too many directories from too many previous computers, but entirely my fault.

Meanwhile Robert Picard's project is precisely what I hoped would appear from the community. What I recall from relay races when I ran them (last done ~1969 in my case) was that the main point was to pass the baton on to someone else!