  • Stata 14, Unicode, and extended ASCII.

    Stata 14 is a big step forward. But there is a problem with the switch to Unicode for languages that so far used extended ASCII for some characters (German, French, Spanish, Portuguese, Danish, Norwegian, Swedish, Icelandic, Polish, Turkish, etc.). With the improved saveold command, Stata 14 can generate a dataset which can be opened by Stata 11 to 13, but legibility is poor. This is described in the Unicode help. unicode translate lets you translate from extended ASCII to Unicode, but the reverse is not possible.

    At least, this is how I understand the possibilities. The problem will arise when we cooperate with or teach persons who don't have Stata 14. Does anybody have a solution?

    Actually, Stat/Transfer 13 does the trick; it can translate a Stata 14 dataset to Stata 13, including translation to extended ASCII. To my mind this means that it is hardly impossible to expand the capability of unicode translate or saveold to make some "reverse translation".

  • #2
    I recall reading somewhere that Microsoft Excel stores all of its string data in Unicode, even when it's ASCII or ANSI. If my recollection isn't off-base, then you might be able to try export excel from Stata 14 of the extended ASCII (ANSI) text, and then import excel in Stata 13, or odbc for the two earlier Stata releases.

    Even if it does work, I realize that this isn't exactly what you're looking for, but it would offer a Stata-only solution to the problem. For production use, of course, I'd stick with Stat/Transfer 13.
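
    A minimal sketch of that round trip, untested and assuming the Excel route really does preserve the characters. The file names mydata.dta and mydata.xlsx are hypothetical, and note that variable and value labels do not travel through an Excel file this way:

    Code:
    * In Stata 14: write the data, with variable names in the first row
    use mydata.dta, clear
    export excel using mydata.xlsx, firstrow(variables) replace

    * In Stata 13: read the worksheet back, taking names from the first row
    import excel using mydata.xlsx, firstrow clear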


    • #3
      Dear Svend, could you please share an example dataset, which illustrates the problem? Thank you, Sergiy


      • #4

        Dear Sergiy,

        In Stata 14 I generated a small dataset with these commands:
        Code:
        clear
        input a b xø str5 string
        1 1 1 "mænd"
        2 2 2 "møer"
        end
        label variable a "Danish characters æøåÆØÅ"
        label define b 1 "MÆND" 2 "MØER"
        label values b b
        numlabel , add
        notes: Saved by Stata 14
        save x14.dta , replace
        In Stata 14 it displays all right:
        Code:
        . codebook , compact
        Variable   Obs Unique  Mean  Min  Max  Label
        -----------------------------------------------------------------------------------------
        a            2      2   1.5    1    2  Danish characters æøåÆØÅ     
        b            2      2   1.5    1    2 
        xø           2      2   1.5    1    2 
        string       2      2     .    .    . 
        -----------------------------------------------------------------------------------------
        . list , clean
               a         b   xø   string 
          1.   1   1. MÆND    1     mænd 
          2.   2   2. MØER    2     møer
        But when I try to open the dataset in Stata 13, it fails:
        Code:
        . use "x14.dta", clear
        .dta too modern
            File C:\abc\x14.dta is from a more recent version of Stata.  Type update query to determine whether a free
            update of Stata is available, and browse http://www.stata.com/ to determine if a new version is available.
        What I miss is the opportunity to translate back from Unicode to extended ASCII. It ought to be possible. It must be possible.

        Svend


        • #5

          The above is incomplete. In Stata 14 I also used the saveold command to generate a version readable by Stata 13:
          Code:
          . saveold x14a.dta , version(13)
          (saving in Stata 13 format)
            note: variable name "xø" contains unicode and thus may not display well in Stata 13.
            note: variable label "Danish characters æøåÆØÅ" contains unicode and thus may not
                  display well in Stata 13.
          file x14a.dta saved
          Opening it with Stata 13 gives this unsatisfactory result:
          Code:
          . use x14a.dta
          . codebook, compact
          Variable   Obs Unique  Mean  Min  Max  Label
          -------------------------------------------------------------------------------------------------------------------
          a            2      2   1.5    1    2  Danish characters æøåÆØÅ
          b            2      2   1.5    1    2 
          xø          2      2   1.5    1    2 
          string       2      2     .    .    . 
          -------------------------------------------------------------------------------------------------------------------
          . list , clean
                 a          b   xø   string 
            1.   1   1. MÆND     1    mænd 
            2.   2   2. MØER     2    møer


          • #6
            Originally posted by Svend Juul
            Stata 14 is a big step forward. But there is a problem with the switch to Unicode for languages that so far used extended ASCII for some characters (German, French, Spanish, Portuguese, Danish, Norwegian, Swedish, Icelandic, Polish, Turkish, etc.). With the improved saveold command, Stata 14 can generate a dataset which can be opened by Stata 11 to 13, but legibility is poor. This is described in the Unicode help. unicode translate lets you translate from extended ASCII to Unicode, but the reverse is not possible.

            At least, this is how I understand the possibilities. The problem will arise when we cooperate with or teach persons who don't have Stata 14. Does anybody have a solution?

            Actually, Stat/Transfer 13 does the trick; it can translate a Stata 14 dataset to Stata 13, including translation to extended ASCII. To my mind this means that it is hardly impossible to expand the capability of unicode translate or saveold to make some "reverse translation".

            It is certainly possible to write a command similar to unicode translate which would convert all strings/labels/names in a Stata 14 dataset back to some extended ASCII encoding. To do it "right", however, as we feel we did with unicode translate for the extended-ASCII-to-Unicode conversion, is a bit tricky. For example, a dataset containing Unicode in variable names might use characters which aren't certain to appear in the desired target extended ASCII encoding. When those characters are then dropped or substituted with a replacement character, two or more variable names that were previously distinct could end up becoming duplicates, which is not allowed.

            Because there isn't an official solution at this moment, let me share a little bit of code that you may find useful. First though, be sure to make a copy of any dataset you intend to use this on! In particular, you don't want to accidentally save over your nice, new, Unicode Stata 14 dataset with a dataset that has been back-converted to extended ASCII. So, I recommend starting with something like

            Code:
            copy myfile.dta myfile_ext.dta
            use myfile_ext.dta
            so that you are working on a copy of your original dataset.

            The first thing you need to do is determine the target extended ASCII encoding. help encodings can help with that. I don't know what target encoding you need, but let's use "ISO-8859-10" for this example. Let's store the encoding in a global macro just so we can easily use it later (and easily change it if it turns out not to be the right encoding):

            Code:
            global ENCODING "ISO-8859-10"

            Let's start by converting something simple -- variable labels. We can use the ustrto() function to convert from Unicode (UTF-8) to our desired encoding. Other than the encoding to use, we must also decide how we want to deal with characters which for some reason can't be converted to the destination extended ASCII encoding. I'll specify "1" as the third argument to ustrto which means that invalid sequences will use a substitution character defined by the particular encoding we are translating to. I loop over all variables, grabbing the variable label of each one, converting that variable label, and reassigning it:

            Code:
            global ENCODING "ISO-8859-10"
            
            foreach var of varlist _all {
               local thelab : variable label `var'
               local thelab = ustrto(`"`thelab'"', "$ENCODING", 1) 
               label variable `var' `"`thelab'"'
            }
            Next, let's worry about string variables in your data:

            Code:
            foreach var of varlist _all {
               capture confirm string variable `var'
               if _rc==0 {
                   replace `var' = ustrto(`var', "$ENCODING", 1)
               }
            }
            There's one potential problem with the above loop. You might have strL variables, and if you do, some of their values might be binary. If, say, you read a PDF file into one observation of a strL variable, you wouldn't want to run it through ustrto() as that would corrupt it. Stata has an undocumented function _strisbinary() which can be used to detect and skip strL values which have been marked as binary. Let's incorporate it into the above loop:

            Code:
            foreach var of varlist _all {
               capture confirm string variable `var'
               if _rc==0 {
                   replace `var' = ustrto(`var', "$ENCODING", 1) if !_strisbinary(`var')
               }
            }
            Finally, here's a loop to convert the variable names:

            Code:
            * Use the same target encoding stored in $ENCODING for the names
            foreach var of varlist _all {
               local newname = ustrto("`var'", "$ENCODING", 1) 
               rename `var' `newname'
            }
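            Given the duplicate-name risk described earlier, it may be worth a dry run before the rename loop. This is an untested sketch, assuming $ENCODING is set as above; the macro names seen, newname, and dup are only illustrative. It renames nothing and merely reports names that would collide after conversion:

            Code:
            local seen
            foreach var of varlist _all {
               local newname = ustrto("`var'", "$ENCODING", 1)
               * dup is 1 if the converted name was already produced by an earlier variable
               local dup : list newname in seen
               if `dup' {
                  display as error "`var' would duplicate another converted name"
               }
               local seen `seen' `newname'
            }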
            I shouldn't have said "finally". I didn't provide code for things like characteristic contents, characteristic names, the dataset label, or value label values or names. The last one is the trickiest thing to handle because if a value label name changes, you not only have to modify the value label, you also have to find every variable to which it is attached and re-attach the new name. These are all things a hypothetical official command to translate from UTF-8 to extended ASCII would need to deal with.
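
            For the dataset label and the contents of dataset-level characteristics (which is also where notes attached to the dataset are stored), the same pattern should carry over. The following is an untested sketch in the style of the loops above, not an official solution. The macro names dlab, chars, and val are illustrative, and it leaves characteristic names, variable-level characteristics, and value labels alone:

            Code:
            * Dataset label
            local dlab : data label
            local dlab = ustrto(`"`dlab'"', "$ENCODING", 1)
            label data `"`dlab'"'

            * Contents of dataset-level characteristics, including notes
            local chars : char _dta[]
            foreach c of local chars {
               local val : char _dta[`c']
               local val = ustrto(`"`val'"', "$ENCODING", 1)
               char _dta[`c'] `"`val'"'
            }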


            • #7
              Thank you, Alan, for a precise answer. I understand that converting value labels can be tricky. Nevertheless, I hope - I really do - that StataCorp will commit itself to solving the problem.
