
  • Stata 14, Unicode, and extended ASCII.

    Stata 14 is a big step forward. But there is a problem with the switch to Unicode for languages that so far used extended ASCII for some characters (German, French, Spanish, Portuguese, Danish, Norwegian, Swedish, Icelandic, Polish, Turkish, etc.). With the improved saveold command, Stata 14 can generate a dataset which can be opened by Stata 11 to 13, but legibility is poor. This is described in the Unicode help. unicode translate lets you translate from extended ASCII to Unicode, but the reverse is not possible.

    At least, this is how I understand the possibilities. The problem will arise when we cooperate with or teach persons who don't have Stata 14. Does anybody have a solution?

    Actually, Stat/Transfer 13 does the trick; it can translate a Stata 14 dataset to Stata 13, including translation to extended ASCII. To my mind this means that it can hardly be impossible to extend unicode translate or saveold with some kind of "reverse translation".

  • #2
    I recall reading somewhere that Microsoft Excel stores all of its string data in Unicode, even when the text is plain ASCII or ANSI. If my recollection isn't off-base, you might be able to use export excel in Stata 14 and then import excel in Stata 13 (or odbc for the two earlier Stata releases) to get the text back as extended ASCII (ANSI).

    Even if it does work, I realize that this isn't exactly what you're looking for, but it would offer a Stata-only solution to the problem. For production use, of course, I'd stick with Stat/Transfer 13.
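
    A rough sketch of that round trip, untested on my side. The tiny dataset and the file name x14.xlsx are made up for illustration, and whether Stata 13 actually turns the Unicode text into readable extended ASCII on import is exactly the open question:

    ```stata
    * Stata 14: write a small Unicode dataset to an Excel workbook.
    clear
    input str5 string
    "mænd"
    "møer"
    end
    export excel using "x14.xlsx", firstrow(variables) replace

    * Stata 13 (run separately): read the workbook back in.
    * import excel using "x14.xlsx", firstrow clear
    ```

    Variable labels and notes are not carried through the workbook this way, so they would have to be re-created afterwards.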



    • #3
      Dear Svend, could you please share an example dataset, which illustrates the problem? Thank you, Sergiy



      • #4

        Dear Sergiy,

        In Stata 14 I generated a small dataset with these commands:
        Code:
        clear
        input a b xø str5 string
        1 1 1 "mænd"
        2 2 2 "møer"
        end
        label variable a "Danish characters æøåÆØÅ"
        label define b 1 "MÆND" 2 "MØER"
        label values b b
        numlabel , add
        notes: Saved by Stata 14
        save x14.dta , replace
        In Stata 14 it displays all right:
        Code:
        . codebook , compact
        Variable   Obs Unique  Mean  Min  Max  Label
        -----------------------------------------------------------------------------------------
        a            2      2   1.5    1    2  Danish characters æøåÆØÅ     
        b            2      2   1.5    1    2 
        xø           2      2   1.5    1    2 
        string       2      2     .    .    . 
        -----------------------------------------------------------------------------------------
        . list , clean
               a         b   xø   string 
          1.   1   1. MÆND    1     mænd 
          2.   2   2. MØER    2     møer
        But when I try to open the dataset in Stata 13, it is not possible:
        Code:
        . use "x14.dta", clear
        .dta too modern
            File C:\abc\x14.dta is from a more recent version of Stata.  Type update query to determine whether a free
            update of Stata is available, and browse http://www.stata.com/ to determine if a new version is available.
        What I miss is a way to translate back from Unicode to extended ASCII. It ought to be possible. It must be possible.

        Svend



        • #5

          The above is incomplete. In Stata 14 I also used the saveold command to generate a version readable by Stata 13:
          Code:
          . saveold x14a.dta , version(13)
          (saving in Stata 13 format)
            note: variable name "xø" contains unicode and thus may not display well in Stata 13.
            note: variable label "Danish characters æøåÆØÅ" contains unicode and thus may not
                  display well in Stata 13.
          file x14a.dta saved
          Opening it with Stata 13 gives this unsatisfactory result:
          Code:
          . use x14a.dta
          . codebook, compact
          Variable   Obs Unique  Mean  Min  Max  Label
          -------------------------------------------------------------------------------------------------------------------
          a            2      2   1.5    1    2  Danish characters æøåÆØÅ
          b            2      2   1.5    1    2 
          xø          2      2   1.5    1    2 
          string       2      2     .    .    . 
          -------------------------------------------------------------------------------------------------------------------
          . list , clean
                 a          b   xø   string 
            1.   1   1. MÆND     1    mænd 
            2.   2   2. MØER     2    møer



          • #6
            It is certainly possible to write a command similar to unicode translate which would convert all strings/labels/names in a Stata 14 dataset back to some extended ASCII encoding. To do it "right", however, as we feel we did with unicode translate for the extended-ASCII-to-Unicode conversion, is a bit tricky. For example, a dataset containing Unicode in variable names might use characters which aren't certain to appear in the desired target extended ASCII encoding. When those characters are then dropped or substituted with a replacement character, two or more variable names that were previously distinct could end up becoming duplicates, which is not allowed.
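
            To make the collision problem concrete, here is a rough sketch that only reports clashes and changes nothing in the data. The local macro enc is a placeholder for whatever target encoding you settle on; "ISO-8859-10" is just an example:

            ```stata
            * Sketch: list variable names that would become duplicates after conversion.
            * enc is a placeholder target encoding, not a recommendation.
            local enc "ISO-8859-10"
            local newnames
            foreach var of varlist _all {
               * mode 1: unsupported characters become the encoding's substitution character
               local newname = ustrto("`var'", "`enc'", 1)
               local newnames `newnames' `newname'
            }
            * collect the converted names that appear more than once
            local dups : list dups newnames
            if "`dups'" != "" {
               display as error "these converted names occur more than once: `dups'"
            }
            else {
               display as text "no name collisions for encoding `enc'"
            }
            ```

            If the check reports duplicates, those variables would have to be renamed by hand before any automatic conversion of names.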

            Because there isn't an official solution at this moment, let me share a little bit of code that you may find useful. First though, be sure to make a copy of any dataset you intend to use this on! In particular, you don't want to accidentally save over your nice, new, Unicode Stata 14 dataset with a dataset that has been back-converted to extended ASCII. So, I recommend starting with something like

            Code:
            copy myfile.dta myfile_ext.dta
            use myfile_ext.dta
            so that you are working on a copy of your original dataset.

            The first thing you need to do is determine the target extended ASCII encoding. help encodings can help with that. I don't know what target encoding you need, but let's use "ISO-8859-10" for this example. Let's store the encoding in a global macro just so we can easily use it later (and easily change it if it turns out not to be the right encoding):

            Code:
            global ENCODING "ISO-8859-10"
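
            If you are unsure of the exact encoding name, Stata 14 can list the encodings it knows. A small sketch; the search pattern is only an example, and it is worth checking help unicode encoding for the exact syntax on your installation:

            ```stata
            * List known encodings whose names contain "8859".
            unicode encoding list *8859*

            * Show the alternative names (aliases) accepted for one encoding.
            unicode encoding alias ISO-8859-10
            ```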

            Let's start by converting something simple -- variable labels. We can use the ustrto() function to convert from Unicode (UTF-8) to our desired encoding. Other than the encoding to use, we must also decide how we want to deal with characters which for some reason can't be converted to the destination extended ASCII encoding. I'll specify "1" as the third argument to ustrto which means that invalid sequences will use a substitution character defined by the particular encoding we are translating to. I loop over all variables, grabbing the variable label of each one, converting that variable label, and reassigning it:

            Code:
            global ENCODING "ISO-8859-10"
            
            foreach var of varlist _all {
               local thelab : variable label `var'
               local thelab = ustrto(`"`thelab'"', "$ENCODING", 1) 
               label variable `var' `"`thelab'"'
            }
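
            Before converting for real, it can be useful to know which labels would lose characters. One way to check, sketched here under the assumption that a lossless conversion survives a round trip, is to convert to the target encoding and back with ustrfrom() (the counterpart of ustrto()) and compare with the original:

            ```stata
            global ENCODING "ISO-8859-10"

            * Sketch: report variable labels that cannot be represented in $ENCODING.
            * A label counts as lossy if UTF-8 -> $ENCODING -> UTF-8 does not
            * reproduce the original text.
            foreach var of varlist _all {
               local thelab : variable label `var'
               local back = ustrfrom(ustrto(`"`thelab'"', "$ENCODING", 1), "$ENCODING", 1)
               if `"`back'"' != `"`thelab'"' {
                  display as text "label of `var' would lose characters in $ENCODING"
               }
            }
            ```

            Nothing is modified by this loop; it only prints the names of the affected variables.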
            Next, let's worry about string variables in your data:

            Code:
            foreach var of varlist _all {
               capture confirm string variable `var'
               if _rc==0 {
                   replace `var' = ustrto(`var', "$ENCODING", 1)
               }
            }
            There's one potential problem with the above loop. You might have strL variables, and if you do, some of their values might be binary. If, say, you read a PDF file into one observation of a strL variable, you wouldn't want to run it through ustrto() as that would corrupt it. Stata has an undocumented function _strisbinary() which can be used to detect and skip strL values which have been marked as binary. Let's incorporate it into the above loop:

            Code:
            foreach var of varlist _all {
               capture confirm string variable `var'
               if _rc==0 {
                   replace `var' = ustrto(`var', "$ENCODING", 1) if !_strisbinary(`var')
               }
            }
            Finally, here's a loop to convert the variable names:

            Code:
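            foreach var of varlist _all {
               * use the same target encoding as in the loops above
               local newname = ustrto("`var'", "$ENCODING", 1) 
               * leave names that need no conversion untouched
               if "`newname'" != "`var'" {
                  rename `var' `newname'
               }
            }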
            I shouldn't have said "finally". I didn't provide code for things like characteristic contents, characteristic names, the dataset label, or value label values or names. The last one is the trickiest thing to handle because if a value label name changes, you not only have to modify the value label, you also have to find every variable to which it is attached and re-attach the new name. These are all things a hypothetical official command to translate from UTF-8 to extended ASCII would need to deal with.
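
            For what it's worth, a few of those remaining pieces can be sketched in the same style. This is only a sketch: it converts the dataset label and the contents (not the names) of all characteristics, which is also where notes are stored, and then saves in Stata 13 format. It assumes the target encoding and the working copy myfile_ext.dta used earlier in this post, and it leaves value labels alone:

            ```stata
            global ENCODING "ISO-8859-10"

            * Dataset label
            local dlab : data label
            local dlab = ustrto(`"`dlab'"', "$ENCODING", 1)
            label data `"`dlab'"'

            * Characteristic contents for the dataset (_dta) and every variable;
            * notes are stored as characteristics, so they are converted as well
            unab allvars : _all
            foreach holder in _dta `allvars' {
               local charnames : char `holder'[]
               foreach c of local charnames {
                  local val : char `holder'[`c']
                  local val = ustrto(`"`val'"', "$ENCODING", 1)
                  char `holder'[`c'] `"`val'"'
               }
            }

            * Write the converted copy in a format Stata 13 can open
            saveold myfile_ext.dta, version(13) replace
            ```

            Because the last line overwrites myfile_ext.dta, it matters that this file is the copy and not your original Unicode dataset.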



            • #7
              Thank you, Alan, for a precise answer. I understand that converting value labels can be tricky. Nevertheless, I hope - I really do - that StataCorp will commit itself to solving the problem.



              • #8
                I found Alan Riley (StataCorp)'s code very helpful, although he didn't provide code for translating value labels. Here is my attempt at that; it uses the -labelsof- command written by Ben Jann (SSC). My code only handles the case where the value label name coincides with the variable name and the labelled values are numbered 1, 2, ..., so further refinement is welcome.

                Code:
                clear
                input a b xø str5 string
                1 1 1 "mænd"
                2 2 2 "møer"
                end
                label variable a "Danish characters æøåÆØÅ"
                label define b 1 "MÆND" 2 "MØER"
                label values b b
                numlabel , add
                
                global ENCODING "iso-8859_10-1998" /*Encoding for Danish*/
                
                quietly label dir
                local varname : value label `r(names)'
                display "`varname'"
                labelsof `varname' /*ssc install labelsof*/
                display `"`r(labels)'"'
                local word : word count `r(labels)'
                forvalue n = 1/`word' {
                   local labvalue : word `n' of `r(labels)'
                   local newlabvalue = ustrto("`labvalue'", "$ENCODING", 1) 
                   label define `varname' `n' "`newlabvalue'", modify
                }
                label values `varname' `"`varname'"'
                label list _all
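
                One possible refinement, offered as an untested sketch rather than a finished solution: Mata's st_vlload() and st_vlmodify() can read and rewrite the text of every value label without relying on labelsof, and without assuming that the label name matches a variable name or that the values run from 1 upward. The helper name convert_vl is made up for this example, and the value label names themselves are left unchanged:

                ```stata
                global ENCODING "iso-8859_10-1998"

                capture mata: mata drop convert_vl()
                mata:
                // Convert the text of one value label from UTF-8 to the encoding enc.
                void convert_vl(string scalar lbl, string scalar enc)
                {
                    real colvector   values
                    string colvector text

                    // load all values and their label text for this value label
                    st_vlload(lbl, values, text)
                    if (rows(values) > 0) {
                        // mode 1: unsupported characters get the substitution character
                        st_vlmodify(lbl, values, ustrto(text, enc, 1))
                    }
                }
                end

                * Apply the helper to every value label defined in the dataset.
                quietly label dir
                local lblnames `r(names)'
                foreach lbl of local lblnames {
                    mata: convert_vl("`lbl'", "$ENCODING")
                }
                label list
                ```

                Since the label names do not change, the labels stay attached to their variables and nothing has to be re-attached.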



                • #9
                  Reinventing the wheel! People who are interested in these matters can find Svend Juul and Morten Frydenberg's -unicode2ascii- helpful. https://www.statalist.org/forums/for...ation-and-more
                  Originally posted by Svend Juul:
                  This replaces a previous post about the non-functioning trans_unicode package.

                  Thanks to Kit Baum, the unicode2ascii package has been installed at SSC. It includes three commands that analyze or translate single files or groups of files in the current directory:

                  whichencoding examines the occurrence of Unicode and extended ASCII characters in Stata datasets and text files like do-files, ado-files, help files and log files. This is useful to determine the need for translation when sharing Stata files between users or computers with different versions of Stata installed. The official unicode analyze command serves the same purpose, but the output from whichencoding is more compact and transparent.

                  ascii2unicode translates datasets and text files with extended ASCII characters to Unicode encoding. Destination files take the names of the source files, and a suffix is added to the source file names. The official unicode translate command serves the same purpose, but the output from ascii2unicode is more compact and transparent, and you have access both to Unicode and ASCII versions of datasets and text files at the same time.

                  unicode2ascii translates datasets and text files with Unicode characters to ASCII encoding and saves datasets in Stata 13 or 12 format. Variable names, label names and contents (including labels in different languages), string variable contents, and notes are translated. The source files keep their names, and a suffix is added to the destination file names. Currently (September 2015), no official Stata command serves the same purpose.

                  Recently, Daniel Bela published two related commands at SSC: saveascii, which in Stata 14 translates the dataset in memory to ASCII encoding and saves it in Stata 13 or 12 format, and useold, which translates an ASCII encoded dataset to Unicode before opening it in Stata 14.

                  Svend Juul and Morten Frydenberg
