  • Unicode decode error while using Data.getAsDict in sfi.Data

    Dear Stata Forum,

    While trying to import data from a Stata dataset into a dictionary with the Data.getAsDict method in Python, the following error was raised:

    Code:
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 1: invalid continuation byte
    The code I am currently using is:

    Code:
    python
    
    from sfi import Data
    import numpy as np
    import pandas as pd
    
    dataraw = Data.getAsDict(None, valuelabel=False, missingval=np.nan)
    
    end
    I am not very familiar with this topic, but I think it may be related to the characters in a string variable in the Stata dataset. In fact, running this code on the auto dataset works fine, but it triggers the error with my dataset. A sample of the dataset is shown below:
    Variable 1 [long]   Variable 2 [double]   Variable 3 [str12]
    20211231            45411111              NIF / NIPC
    20211231            45411112              NIF / NIPC
    20211231            45411113              NIF / NIPC
    Please note that if I limit the dataset to Variable 1 and Variable 2, no error is triggered.
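    If it helps with diagnosis, below is a minimal sketch for narrowing down which variable raises the error by reading one variable at a time. It assumes that getAsDict accepts a single variable name as its first argument, as in the call above, and that Data.getVarCount() and Data.getVarName() behave as described in the sfi documentation (0-based indices); please treat it as a sketch rather than tested code:

    Code:
    python
    
    from sfi import Data
    import numpy as np
    
    # try each variable on its own to see which one raises the UnicodeDecodeError
    for i in range(Data.getVarCount()):
        name = Data.getVarName(i)
        try:
            Data.getAsDict(name, valuelabel=False, missingval=np.nan)
        except UnicodeDecodeError as err:
            print(name, "->", err)
    
    end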

    I am currently using Stata 17 on Windows.

    How may I solve this issue?

    Please let me know if there is any other information I can provide that might help.

    Thanks in advance,
    Francisco
    Last edited by Francisco Leal Augusto; 27 Jul 2023, 04:33. Reason: Added tags

  • #2
    Code:
    help unicode_translate##analyze
    help unicode translate



    • #3
      Originally posted by Bjarte Aagnes
      Code:
      help unicode_translate##analyze
      help unicode translate
      Hi Bjarte,

      Thanks for your hint. Unfortunately, I could not manage to come up with a solution from it.

      I used
      Code:
      unicode analyze vShort.dta
      on a file that reproduces the example I provided above, and the outcome was:
      Code:
      unicode analyze vShort.dta
        (Directory ./bak.stunicode created; please do not delete)
      
        File summary (before starting):
              1  file(s) specified
              1  file(s) to be examined ...
      
        File vShort.dta (Stata dataset)
      
         File does not need translation
      Could you please provide some more information on how this approach could help?

      Thanks in advance,
      Francisco



      • #4
        Dear Statalist,

        An update: while trying to find a solution following the previous post, I identified what in my data was generating the error described in the first post:

        I found that some observations of Variable_3 contained the character �. This is the relevant example:
        Variable_1   Variable_2   Variable_3
        20251231     4541111      NIF / NIPC
        20251231     4541112      NIF / NIPC
        20251231     4541113      NIF / NIPC
        20251231     4541114      C�digo Fonte
        That bad encoding was triggering the error. Once it was removed, the import worked.

        By the way, I replaced those values with the following code:

        Code:
        * flag observations of Variable_3 ending in "Fonte" and overwrite them with the correctly encoded text
        gen help_string = regexm(Variable_3, "Fonte$")
        replace Variable_3 = "Código Fonte" if help_string==1
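        For anyone curious about the error message itself, here is a small byte-level sketch in plain Python (the byte string is my assumption about what the raw data looked like, based on the 0xf3 in the error): the byte 0xf3 is "ó" in Windows-1252, but it is not valid UTF-8 in this position.

        Code:
        python
        
        # 0xf3 is "ó" in Windows-1252/Latin-1; in UTF-8 it would have to start a multi-byte sequence
        raw = b"C\xf3digo Fonte"
        
        print(raw.decode("cp1252"))            # Código Fonte
        print(raw.decode("utf-8", "replace"))  # C�digo Fonte, which is what showed up in the data
        
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            print(err)                         # invalid continuation byte, as in the original error
        
        end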

        Thanks for the help,
        Francisco
        Last edited by Francisco Leal Augusto; 01 Aug 2023, 10:38.



        • #5
          The byte 0xf3 is not valid UTF-8; it is probably Windows-1252 or ISO-8859-1 encoded text in a file that is supposed to be UTF-8.

          -unicode translate- should warn you if a file is a mix of UTF-8 and extended ASCII like 0xf3.

          To avoid similar problems, you might investigate the data flow from the (external) source to Stata, focusing on the encoding at each step (a rough Python-side check of a source file is sketched after the example below).

          -help unicode- provides good advice. Below is a small example that shows some uses of -unicode translate- and of some string functions:
          Code:
          clear all
          version 14
          
          *make data with illegal UTF-8 sequence
          set obs 1
          gen v1 = "Código Fonte" 
          replace v1 = subinstr(v1, "ó", char(243), 1 )
          save mixed_enc,  replace
          
          clear 
          unicode analyze mixed_enc.dta     
          
          if ( r(N_needed) ) {
          
              use mixed_enc
              
              foreach v of varlist * {
                      
                  capt assert ustrinvalidcnt(`v') == 0
                  
                  if ( _rc == 9 ) {
                  
                      tab `v'  
                  }
              }
          }
          
          copy mixed_enc.dta mixed_enc_org.dta, replace  
          
          clear 
          unicode encoding set "windows-1252"
          unicode translate mixed_enc.dta   
          
          * Alternative: fix variable(s) using ustrfrom()  
          
          use mixed_enc_org.dta, clear  
          
          capt noi assert ustrinvalidcnt(v1) == 0
          gen v1_copy = ustrfrom(v1, "utf-8", 4) 
          gen v1_fix = ustrfrom(v1, "windows-1252", 4)    
          assert ustrinvalidcnt(v1_fix) == 0
          
          list 
          tab v1_fix
          drop v1 
          save enc2 , replace 
          clear
          unicode analyze enc2.dta
          Code:
               +-----------------------------------------------+
               |           v1           v1_copy         v1_fix |
               |-----------------------------------------------|
            1. | C�digo Fonte   C%XF3digo Fonte   Código Fonte |
               +-----------------------------------------------+
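          As an aside, if the data originally comes from a text file (for example a CSV exported from another system), a rough Python-side check of the raw bytes could look like the sketch below; "source.csv" is just a hypothetical file name:

          Code:
          python
          
          # report lines of a raw text file that are not valid UTF-8
          with open("source.csv", "rb") as f:
              for lineno, line in enumerate(f, start=1):
                  try:
                      line.decode("utf-8")
                  except UnicodeDecodeError as err:
                      # show the offending line decoded leniently, plus the error
                      print(lineno, line.decode("utf-8", "replace").rstrip(), "|", err)
          
          end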



          • #6
            Thank you very much for the last post, in particular for the function ustrfrom(), of which I was unaware. It solved the issue!
            Last edited by Francisco Leal Augusto; 24 Aug 2023, 06:52.
