Problem with accents

Ylenia Curci

Join Date: Sep 2017

Posts: 72
#1

23 Jun 2021, 09:46

Hello, I am using the following code to import the contents of several txt files, but Stata does not recognise the accents and some other characters (the text is in French). Can I fix this while importing the files, or should I fix the txt files directly before importing them? Thanks.

local filenames: dir "." files "*.txt"

tempfile building

save `building', emptyok
foreach f of local filenames {
    clear
    set obs 1
    gen filename = "`f'"
    gen strL contents = fileread("`f'")
    append using `building'
    save `"`building'"', replace
}

use `building', clear
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#2

23 Jun 2021, 11:06

What version of Stata do you have? Versions 14+ support Unicode, see

Code:

help unicode_advice

If these are text files, try importing first

Code:

help import delimited

Last edited by Andrew Musau; 23 Jun 2021, 11:09.
Ylenia Curci

Join Date: Sep 2017

Posts: 72
#3

23 Jun 2021, 15:14

Thank you for your answer!
Yes, I am using Stata 15, but I am importing txt files: each file is an observation and the content of the file is a variable. So I don't see how I can use import delimited, as my observations are separate files and not separate lines in the same file.
I should specify that I want to generate a Unicode long string (strL), but apparently that is not allowed, is it?
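One possible way to get a Unicode strL at import time, without modifying the files on disk, is to pass the bytes returned by fileread() through ustrfrom(), which converts a string from a named encoding to UTF-8. The sketch below is not from the thread's authors. It assumes the files are encoded in ISO-8859-15 (the encoding that turned out to work in #7), and the local macro srcenc is a name introduced here only for illustration.

```stata
* Sketch: convert each file's bytes to UTF-8 in memory while importing.
* Assumption: the .txt files are encoded in ISO-8859-15; adjust srcenc if not.
clear
local srcenc "ISO-8859-15"
local filenames: dir "." files "*.txt"
tempfile building
save `building', emptyok
foreach f of local filenames {
    clear
    set obs 1
    gen filename = "`f'"
    * fileread() returns the raw bytes; ustrfrom() re-encodes them as UTF-8.
    * Mode 1 replaces byte sequences that are invalid in srcenc with U+FFFD.
    gen strL contents = ustrfrom(fileread("`f'"), "`srcenc'", 1)
    append using `building'
    save `"`building'"', replace
}
use `building', clear
```

If the accents still look wrong after this, the assumed source encoding is probably not the right one for these files.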

Andrew Musau

Join Date: Oct 2014
Posts: 10190
#4

23 Jun 2021, 16:06

Try and see if this works:

Code:

unicode encoding set utf8
local filenames: dir "." files "*.txt"
tempfile building
save `building', emptyok
foreach f of local filenames {
    unicode translate `f'
    clear
    set obs 1
    gen filename = "`f'"
    gen strL contents = fileread("`f'")
    append using `building'
    save `"`building'"', replace
}
use `building', clear


Ylenia Curci

Join Date: Sep 2017

Posts: 72
#5

24 Jun 2021, 00:11

No, it does not. It does not find the files.
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#6

24 Jun 2021, 00:53

I do not understand what this means. Can you take one of the files, run the code below, copying and pasting the exact output? For example, if the file is named "myfile.txt", run

Code:

unicode encoding set utf8
local filenames: dir "." files "myfile.txt"
tempfile building
save `building', emptyok
foreach f of local filenames {
    unicode translate `f'
    clear
    set obs 1
    gen filename = "`f'"
    gen strL contents = fileread("`f'")
    append using `building'
    save `"`building'"', replace
}
use `building', clear
dataex

Then post the output of the entire code, including the dataex. Be sure to change "myfile.txt" to one of the text files in your current directory.
Ylenia Curci

Join Date: Sep 2017

Posts: 72
#7

24 Jun 2021, 09:35

I have managed to deal with the problem with the following code:

local filenames: dir "." files "*.txt"
unicode analyze *

unicode encoding set ISO-8859-15
*unicode translate *, invalid(mark) transutf8
unicode translate *

local filenames: dir "." files "*.txt"

tempfile building

save `building', emptyok
foreach f of local filenames {
    clear
    set obs 1
    gen filename = "`f'"
    gen strL contents = fileread("`f'")
    append using `building'
    save `"`building'"', replace
}

use `building', clear

Thank you for the input Andrew!
Marco Biagetti INAPP

Join Date: Jun 2022

Posts: 1
#8

30 Jun 2022, 07:31

Hi Andrew, ciao Ylenia.
I've had the same problem with accents and apostrophes in Italian, but it seems I managed to overcome it by using unicode2ascii.
daniel klein

Join Date: Mar 2014

Posts: 3847
#9

30 Jun 2022, 07:47

The best way to go is Stata's unicode routines.

I have noticed a common misconception, which is also present in #4. Note that

Code:

unicode encoding set

does not specify the target/wanted encoding; it specifies the encoding of the non-Unicode source file. That is, you specify the encoding that you want to translate from, into Unicode. It is very unlikely that you want utf-8. In Europe, you probably want some variation of ISO-8859, as suggested in #7. In Germany, if you are using Windows, the most likely encoding, other than Unicode, is windows-1252. It is similar but not quite identical to ISO-8859-1.

If you encounter problems with the encoding, take the time to read the documentation on unicode carefully. It is worth your time, believe me.
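The point about source versus target encoding can be sketched as a short on-disk workflow. This is an illustrative outline rather than a tested recipe: ISO-8859-15 and windows-1252 are assumed candidate encodings taken from this thread, and the commands are meant to be run in the directory that holds the text files.

```stata
* Sketch of the on-disk workflow. Assumption: the files are ISO-8859-15.
clear
* Report which files contain non-UTF-8 bytes and would need translation.
unicode analyze *.txt
* Declare the encoding the files are in NOW (the source, not the target).
unicode encoding set ISO-8859-15
* Translate to UTF-8; bytes invalid in the source encoding become U+FFFD.
unicode translate *.txt, invalid(mark)
* If the accents still look wrong, retry from the backups with another guess:
* unicode encoding set windows-1252
* unicode retranslate *.txt
```

As far as I know, unicode translate keeps backups of the original files (in a bak.stunicode subdirectory), which is what makes the retranslate step possible; unicode restore brings the originals back.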

Last edited by daniel klein; 30 Jun 2022, 07:49.
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#10

30 Jun 2022, 16:59

Originally posted by daniel klein:

"unicode encoding set does not specify the target/wanted encoding; it specifies the encoding of the non-Unicode source file."

Indeed.

"It is very unlikely that you want utf-8."

I do not necessarily agree; UTF-8 has very broad coverage, see https://en.wikipedia.org/wiki/UTF-8.
daniel klein

Join Date: Mar 2014

Posts: 3847
#11

30 Jun 2022, 23:57

Originally posted by Andrew Musau:

"I do not necessarily agree; UTF-8 has very broad coverage, see https://en.wikipedia.org/wiki/UTF-8."

I did not make myself clear, sorry. I wanted to say that if your source file is already UTF-8 encoded, then there is no need to translate it. Thus, you always want your (Stata) files in UTF-8, but you almost never want to type

Code:

unicode encoding set utf-8

In fact, I wonder whether you ever want to type that.
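A quick way to check whether a given file already looks like UTF-8, and so needs no translation, might be to count invalid UTF-8 sequences with ustrinvalidcnt(). This is only a sketch: "myfile.txt" is the placeholder name from #6, and the variable names raw and n_invalid are made up for illustration.

```stata
* Sketch: count invalid UTF-8 byte sequences in one file.
* "myfile.txt" is a placeholder; replace it with one of your own files.
clear
set obs 1
gen strL raw = fileread("myfile.txt")
gen long n_invalid = ustrinvalidcnt(raw)
list n_invalid
```

A count of 0 means the bytes are plain ASCII or already valid UTF-8; a positive count suggests the file is in some other encoding that should be named in unicode encoding set before translating.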
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#12

01 Jul 2022, 03:35

Originally posted by daniel klein:

"In fact, I wonder whether you ever want to type that."

As the default in versions of Stata that support Unicode is utf-8, the code line

unicode encoding set utf8

changes nothing, assuming that a different encoding was not set previously. In that case, adding the line does no harm; if a non-UTF-8 encoding was set previously, however, the line overwrites it.

Last edited by Andrew Musau; 01 Jul 2022, 03:38.