I have the following problem. I have a dataset with French characters; for example, one name is "Jean Léon". I want to export it to csv so that I can open the dataset in Python and search for these names in the search engine of a website similar to Wikipedia.
When I use:
Code:
export delimited using "my_dataset", replace
I get my csv file. I can see that the encoding is "utf-8", since after trying:
Code:
import delimited using "my_dataset", clear varnames(1) encoding("utf-8")
import delimited using "my_dataset", clear varnames(1) encoding("latin1")
I get that "Jean Léon" is "Jean Léon" after the first line, but it is "Jean LÃ©on" after the second one.
When I open the csv file in Excel I see "Jean LÃ©on" instead of "Jean Léon". However, I was not worried about this, since I thought that Excel was misinterpreting the encoding (as latin1 or windows-1252) and that the encoding of my file was correct.
However, when I try to import the file in Python using pandas, it says that utf-8 cannot decode some bytes. Therefore, I think this means that the encoding of my csv file is not utf-8. Moreover, I can read the csv file as latin-1 or windows-1252 without problems.
Note that this problem does not happen with export excel. After:
Code:
export excel using "my_dataset", replace firstrow(variables)
If I open the file in Excel, I read "Jean Léon", and Python/pandas can import it without problems.
My question is the following: Is this a problem with export delimited? Why does exporting to an Excel file work, while the encoding has problems when the file is exported as a csv? Is there a solution to this using export delimited?
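The two symptoms described above (an accented character turning into two characters when the file is read as latin1, and utf-8 failing to decode some bytes) can be reproduced with a minimal Python sketch. It uses only the standard library, and the helper name looks_like_utf8 is hypothetical, introduced here just for illustration:

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


name = "Jean Léon"
utf8_bytes = name.encode("utf-8")      # "é" is stored as two bytes: 0xC3 0xA9
latin1_bytes = name.encode("latin-1")  # "é" is stored as one byte: 0xE9

# UTF-8 bytes misread as latin-1: no error, but "é" becomes two characters.
mojibake = utf8_bytes.decode("latin-1")

# latin-1 bytes misread as UTF-8: 0xE9 opens a multi-byte sequence that
# never completes, so decoding fails instead of producing wrong text.
print(looks_like_utf8(utf8_bytes))    # True
print(looks_like_utf8(latin1_bytes))  # False
print(len(name), len(mojibake))       # 9 10
```

The same check could be run on the exported file by reading its raw bytes, for example open("my_dataset.csv", "rb").read(), assuming the file was written with the default .csv extension. If that returns False, the file is not valid utf-8, and pandas would need an explicit encoding, as in pd.read_csv("my_dataset.csv", encoding="latin-1").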