  • Unicode decode error while using Data.getAsDict in sfi.Data

    Dear Stata Forum,

    While trying to import data from a Stata dataset into a dictionary with the Data.getAsDict method in Python, the following error was raised:

    Code:
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 1: invalid continuation byte
    The code I am currently using is:

    Code:
    python
    
    from sfi import Data
    import numpy as np
    import pandas as pd
    
    dataraw = Data.getAsDict(None, valuelabel=False, missingval=np.nan)
    
    end
    I am not very familiar with this topic, but I think it may be related to the characters in a string variable in the Stata dataset. In fact, running this code on the auto dataset works fine, but it triggers the error with my dataset. A sample of the dataset is shown below:
    Variable 1 [long]   Variable 2 [double]   Variable 3 [str12]
    20211231            45411111              NIF / NIPC
    20211231            45411112              NIF / NIPC
    20211231            45411113              NIF / NIPC
    Please note that if I limit the dataset to Variable 1 and Variable 2, no error is triggered.
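    If it helps with diagnosis, below is a minimal sketch for narrowing down which variable raises the error by reading one variable at a time. It assumes that getAsDict accepts a single variable name as its first argument, as in the call above, and that Data.getVarCount() and Data.getVarName() behave as described in the sfi documentation (0-based indices); please treat it as a sketch rather than tested code:

    Code:
    python
    
    from sfi import Data
    import numpy as np
    
    # try each variable on its own to see which one raises the UnicodeDecodeError
    for i in range(Data.getVarCount()):
        name = Data.getVarName(i)
        try:
            Data.getAsDict(name, valuelabel=False, missingval=np.nan)
        except UnicodeDecodeError as err:
            print(name, "->", err)
    
    end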

    I am currently using Stata 17 on Windows.

    How may I solve this issue?

    Please let me know if there is any other information I can provide that might help.

    Thanks in advance,
    Francisco
    Last edited by Francisco Leal Augusto; 27 Jul 2023, 04:33. Reason: Added tags

  • #2
    Code:
    help unicode_translate##analyze
    help unicode translate



    • #3
      Originally posted by Bjarte Aagnes
      Code:
      help unicode_translate##analyze
      help unicode translate
      Hi Bjarte,

      Thanks for your hint. Unfortunately, I could not manage to come up with a solution from it.

      I used
      Code:
      unicode analyze vShort.dta
      on a file that reproduces the example I provided above, and the outcome was:
      Code:
      unicode analyze vShort.dta
        (Directory ./bak.stunicode created; please do not delete)
      
        File summary (before starting):
              1  file(s) specified
              1  file(s) to be examined ...
      
        File vShort.dta (Stata dataset)
      
         File does not need translation
      Could you please provide some more information on how this approach could help?

      Thanks in advance,
      Francisco



      • #4
        Dear Statalist,

        An update: while trying to find a solution following the previous post, I identified what in my data was generating the error described in the first post:

        I found that some observations of Variable_3 contained the character �. This is the relevant example:
        Variable_1   Variable_2   Variable_3
        20251231     4541111      NIF / NIPC
        20251231     4541112      NIF / NIPC
        20251231     4541113      NIF / NIPC
        20251231     4541114      C�digo Fonte
        That bad encoding was triggering the error. Once it was removed, the import worked.

        By the way, I replaced those values with the following code:

        Code:
        * flag observations of Variable_3 ending in "Fonte" and overwrite them with the correctly encoded text
        gen help_string = regexm(Variable_3, "Fonte$")
        replace Variable_3 = "Código Fonte" if help_string==1
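        For anyone curious about the error message itself, here is a small byte-level sketch in plain Python (the byte string is my assumption about what the raw data looked like, based on the 0xf3 in the error): the byte 0xf3 is "ó" in Windows-1252, but it is not valid UTF-8 in this position.

        Code:
        python
        
        # 0xf3 is "ó" in Windows-1252/Latin-1; in UTF-8 it would have to start a multi-byte sequence
        raw = b"C\xf3digo Fonte"
        
        print(raw.decode("cp1252"))            # Código Fonte
        print(raw.decode("utf-8", "replace"))  # C�digo Fonte, which is what showed up in the data
        
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            print(err)                         # invalid continuation byte, as in the original error
        
        end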

        Thanks for the help,
        Francisco
        Last edited by Francisco Leal Augusto; 01 Aug 2023, 10:38.



        • #5
          The byte 0xf3 is not valid UTF-8; it is probably Windows-1252 or ISO-8859-1 encoded text in a file that is supposed to be UTF-8.

          -unicode translate- should warn you if a file is a mix of UTF-8 and extended ASCII like 0xf3.

          To avoid similar problems, you might investigate the data flow from the (external) source to Stata, focusing on the encoding at each step (a rough Python-side check of a source file is sketched after the example below).

          -help unicode- provides good advice. Below is a small example that shows some uses of -unicode translate- and of some string functions:
          Code:
          clear all
          version 14
          
          *make data with illegal UTF-8 sequence
          set obs 1
          gen v1 = "Código Fonte" 
          replace v1 = subinstr(v1, "ó", char(243), 1 )
          save mixed_enc,  replace
          
          clear 
          unicode analyze mixed_enc.dta     
          
          if ( r(N_needed) ) {
          
              use mixed_enc
              
              foreach v of varlist * {
                      
                  capt assert ustrinvalidcnt(`v') == 0
                  
                  if ( _rc == 9 ) {
                  
                      tab `v'  
                  }
              }
          }
          
          copy mixed_enc.dta mixed_enc_org.dta, replace  
          
          clear 
          unicode encoding set "windows-1252"
          unicode translate mixed_enc.dta   
          
          * Alternative: fix variable(s) using ustrfrom()  
          
          use mixed_enc_org.dta, clear  
          
          capt noi assert ustrinvalidcnt(v1) == 0
          gen v1_copy = ustrfrom(v1, "utf-8", 4) 
          gen v1_fix = ustrfrom(v1, "windows-1252", 4)    
          assert ustrinvalidcnt(v1_fix) == 0
          
          list 
          tab v1_fix
          drop v1 
          save enc2 , replace 
          clear
          unicode analyze enc2.dta
          Code:
               +-----------------------------------------------+
               |           v1           v1_copy         v1_fix |
               |-----------------------------------------------|
            1. | C�digo Fonte   C%XF3digo Fonte   Código Fonte |
               +-----------------------------------------------+
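          As an aside, if the data originally comes from a text file (for example a CSV exported from another system), a rough Python-side check of the raw bytes could look like the sketch below; "source.csv" is just a hypothetical file name:

          Code:
          python
          
          # report lines of a raw text file that are not valid UTF-8
          with open("source.csv", "rb") as f:
              for lineno, line in enumerate(f, start=1):
                  try:
                      line.decode("utf-8")
                  except UnicodeDecodeError as err:
                      # show the offending line decoded leniently, plus the error
                      print(lineno, line.decode("utf-8", "replace").rstrip(), "|", err)
          
          end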



          • #6
            Thank you very much for the last post, in particular for the function ustrfrom(), of which I was unaware. It solved the issue!
            Last edited by Francisco Leal Augusto; 24 Aug 2023, 06:52.
