  • -charlist- displays "boxes" when -tab- displays character

    I am using Stata 14.2 to -unicode translate- files to UTF-8. To tease out potential translation problems (e.g. because I selected the wrong original encoding), I use Nick Cox's -charlist- from SSC.

    I was expecting -charlist- to display correctly translated Unicode characters properly in the viewer window, but it produces "boxes". Meanwhile, -tabulate- on the same variable displays the characters properly in the viewer. In other words, -charlist- doesn't distinguish between characters that show as "boxes" in other windows within Stata (e.g. the data browser) and correctly translated non-standard characters that show in their proper form elsewhere in Stata.

    Although I am able to identify the characters behind the "boxes" by -di "`r(ascii)'"-, I was curious as to why -charlist- doesn't display those characters properly. I figured it can't be a font problem, since -tabulate- uses the same font (and I've tried all the built-in fonts).

    Any hints?

  • #2
    charlist was written for Stata 9, as is documented in ssc desc charlist. Correspondingly it has no knowledge of Unicode. So, the answer is, regrettably, that I

    1. Did not know in 2002 what would be introduced in 2015.

    2. Have not updated the program for Unicode.

    Updating is not on any to-do list of mine either. I don't work in practice with anything but characters of the kind that Stata could handle 25 years ago. User-written programs are occasionally written for fun as an answer to someone's question, but more often written because the user-programmer wants some facility for their own work (and I do use charlist myself from time to time).

    I would be very happy if someone wanted to write a new program that does what you want -- indeed it may already exist. But please don't call it charlist.
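
    For later readers wondering where the "boxes" come from, here is a minimal sketch, assuming Stata 14 or later (where strings are stored as UTF-8) and run from a do-file. The character "é" is only an illustrative choice. charlist tests the single bytes 1 to 255 with char(), but a non-ASCII character occupies two or more bytes in UTF-8, so each byte it finds is only a fragment of a character and cannot be displayed on its own.

    ```stata
    * Illustrative only: "é" is an arbitrary example character
    display strlen("é")        // 2: length in bytes
    display ustrlen("é")       // 1: length in Unicode characters
    display tobytes("é", 0)    // \d195\d169: the two UTF-8 bytes, in decimal
    display uchar(233)         // é: built from its Unicode code point
    ```

    With this character in a variable, r(ascii) would be expected to hold 169 and 195, the two byte values, rather than the code point 233.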


    • #3
      Does anyone know if the Unicode equivalent of charlist that Nick described has been written? I have looked around but haven't found one.


      • #4
        Not me, but a few more comments may help a little. First, charlist is fairly trivial:


        Code:
        program def charlist, rclass
        *! NJC 1.1.0 17 Dec 2002 
            version 7 
            syntax varname(string) [if] [in] 
            marksample touse, novarlist
            
            * not 0: see [P] file formats .dta 
            forval i = 1/255 { 
                capture assert index(`varlist', char(`i')) == 0 if `touse' 
                if _rc {
                    if char(`i') == " " { 
                        local c " " 
                    } 
                    else local c = char(`i') 
                    local chars `"`chars'`c'"' 
                    local sepchars `"`sepchars'`c' "'
                    local ascii "`ascii'`i' " 
                }
            } 
            
            di as text `"`chars'"' 
            return local ascii "`ascii'" 
            return local sepchars `"`sepchars'"' 
            return local chars `"`chars'"'  
        end
        There is an easy tweak to ucharlist which calls up uchar() not char().

        What's not so obvious is what to do about strLs and what sort of performance might be expected with large datasets dominated by strLs. As I don't have experience with such data, and already have a long to-do list, I remain happy to delegate this to the rest of the community, or StataCorp.


        • #5
          Perhaps you can achieve your goal using Stata's Unicode string functions directly. Here is some code that demonstrates functions that might be of use to you.
          Code:
          . local s "médiane"
          
          . local l = ustrlen(`"`s'"')
          
          . forvalues i = 1/`l' {
            2. local c = usubstr(`"`s'"',`i',1)
            3. local b = tobytes(`"`c'"',0)
            4. local x = tobytes(`"`c'"',1)
            5. display `" `c' - `b' - `x' "' 
            6. }
           m - \d109 - \x6d 
           é - \d195\d169 - \xc3\xa9 
           d - \d100 - \x64 
           i - \d105 - \x69 
           a - \d097 - \x61 
           n - \d110 - \x6e 
           e - \d101 - \x65


          • #6
            Hi Nick,

            Thanks a million for posting that. I made a few minor modifications that don't address your concern about performance, but make the program functional (at least for my needs -- I don't expect to encounter any characters beyond decimal value 1000).

            Code:
            program def ucharlist, rclass
            * https://www.statalist.org/forums/forum/general-stata-discussion/general/1361945-charlist-displays-boxes-when-tab-displays-character
            *! Very minor modification of charlist by NJC 1.1.0 17 Dec 2002
                syntax varname(string) [if] [in] 
                marksample touse, novarlist
                
                * start at 32 (space): code points below that (control characters) are skipped
                forval i = 32/1500 { 
                    capture assert index(`varlist', uchar(`i')) == 0 if `touse' 
                    if _rc {
                        if uchar(`i') == " " { 
                            local c " " 
                        } 
                        else local c = uchar(`i') 
                        local uchars `"`uchars'`c'"' 
                        local sepuchars `"`sepuchars'`c' "'
                        local unicode "`unicode'`i' " 
                    }
                } 
                
                di as text `"`uchars'"' 
                return local unicode "`unicode'" 
                return local sepuchars `"`sepuchars'"' 
                return local uchars `"`uchars'"'  
            end
            Cheers,

            Julian


            • #7
              Hi William,

              That's nifty as well, thanks!

              Cheers,

              Julian


              • #8
                Apologies for spamming, but for the sake of whoever looks later, I found a typo in the program above. uchar(96) (i.e. "`") cannot be stored in a local so I have to exclude it. Again, this is fine for my purposes but might not work for others:

                Code:
                    ** define function that produces list of all characters contained in string variable. 
                    cap prog drop ucharlist 
                    program def ucharlist, rclass
                    * see https://www.statalist.org/forums/forum/general-stata-discussion/general/1361945-charlist-displays-boxes-when-tab-displays-character
                    *! Very minor modification of NJC 1.1.0 17 Dec 2002
                        syntax varname(string) [if] [in] 
                        marksample touse, novarlist
                        * start at 32 (space): code points below that (control characters) are skipped
                        forval i = 32/1500 { 
                            capture assert index(`varlist', uchar(`i')) == 0 if `touse' 
                            if _rc {
                                if uchar(`i') == " " | `i' == 96 { 
                                    local c " "
                                } 
                                else local c = uchar(`i') 
                                local uchars `"`uchars'`c'"' 
                                local sepuchars `"`sepuchars'`c' "'
                                local unicode "`unicode'`i' " 
                            }
                        } 
                        di as text `"`uchars'"' 
                        return local unicode "`unicode'" 
                        return local sepuchars `"`sepuchars'"' 
                        return local uchars `"`uchars'"'  
                    end
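
                A short usage sketch may help anyone adapting this. It assumes Stata 14 or later, that ucharlist has been defined as above, and that it is run from a do-file. The toy data and the variable name word are hypothetical.

                ```stata
                * Hypothetical toy data, for illustration only
                clear
                input str10 word
                "médiane"
                "naïve"
                end

                ucharlist word            // displays the distinct characters found
                display "`r(unicode)'"    // their decimal Unicode code points
                ```

                Because the loop stops at 1500, any character with a higher code point is silently left out of the list, so the upper limit may need raising for other data.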


                • #9
                  This is so not spam. Statalist is a forum of users discussing Stata techniques. By taking the time to post this, you have "paid it forward" to someone who finds this topic at a later date and needs to adapt its code.

                  This prompted me to experiment. Technically, the problem is not that the decimal 96 character (the Stata "left single quote" character ` used to dereference a local macro) cannot be stored in a local macro; the code below demonstrates that it can.
                  Code:
                  . local c = char(96)
                  
                  . local x | below the line is the character in the local macro c
                  
                  . macro list _x _c
                  _x:             | below the line is the character in the local macro c
                  _c:             `
                  The problems seem to arise when you try to dereference the local macro containing the 96 character, because it creates a left single quote that Stata then rescans as the start of a local macro. Sometimes the macval() expansion operator can help deal with this.
                  Code:
                  . display `">>>`c'<<<"'
                  >>>
                  
                  . display `">>>`macval(c)'<<<"'
                  >>>`<<<


                  • #10
                    Note that Nick fixed the problem with character 96 a long time ago, as explained in more recent versions of charlist. I downloaded the program from SSC and

                    Code:
                    which charlist
                    displays

                    Code:
                    *! NJC 1.3.0 28 Feb 2014
                    I would use that code as a basis, not the outdated code from 2002.
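
                    For anyone whose installed copy still reports the 2002 version, a minimal sketch of refreshing and checking it follows. It assumes internet access to SSC and is meant to be run from a do-file.

                    ```stata
                    * Assumes internet access to SSC
                    ssc install charlist, replace   // fetch the current version from SSC
                    which charlist, all             // list every copy found along the adopath
                    ```

                    The all option is useful when older copies in other directories might be shadowing the newer one.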

                    Best
                    Daniel


                    • #11
                      Hi William,

                      Nice catch! That makes a lot of sense. And Daniel, I will look into that should I need to update the program in the future. Thanks for letting us know.

                      Julian


                      • #12
                        I've been working on and off on a new program that I call chartab that tabulates character frequencies. It supports the full range of Unicode characters (137,439 code points in the current Unicode Character Database) and it can process character strings from variables and files (including URLs). I'm in the process of adding string scalars and literals (for completeness) and byte code support. It can efficiently deal with long strLs, very large datasets, and external files while maintaining a small RAM footprint.

                        I'll try to find time this week to finish it and I'll make an announcement on this list when it becomes available on SSC.


                        • #13
                          daniel klein #10 That's an excellent catch. Rummaging around shows that (unsurprisingly) I did have a charlist from 2014 on my machine but it wasn't visible in the Stata I was working on. Too many directories from too many previous computers, but entirely my fault.

                          Meanwhile Robert Picard's project is precisely what I hoped would appear from the community. What I recall from relay races when I ran them (last done ~1969 in my case) was that the main point was to pass the baton on to someone else!
