  • encoding used by export delimited

    I have the following problem. I have a dataset with French characters; for example, one name is "Jean Léon". I want to export it to csv so that I can open it in Python and search for these names in the search engine of a website similar to Wikipedia.

    When I use:
    Code:
    export delimited using "my_dataset", replace
    I get my csv file. I can see that the encoding is "utf-8", since after trying:
    Code:
    import delimited using "my_dataset", clear varnames(1) encoding("utf-8")
    import delimited using "my_dataset", clear varnames(1) encoding("latin1")
    I get "Jean Léon" after the first line, but "Jean LÃ©on" after the second one.

    When I open the csv file in Excel I see "Jean LÃ©on" instead of "Jean Léon". However, I was not worried about this, since I thought that Excel was interpreting the encoding incorrectly (as latin1 or windows-1252) and that the encoding of my file was correct.
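    The garbled name is what you would expect when UTF-8 bytes are read as latin1 (windows-1252 gives the same result for these two bytes). A minimal Python sketch of that effect, using only the example name from this post:

```python
# Illustrative only: the same bytes read under two different encodings.
name = "Jean Léon"

# In UTF-8 the letter é is stored as two bytes, 0xC3 0xA9.
utf8_bytes = name.encode("utf-8")

# Decoding with the right encoding recovers the name.
as_utf8 = utf8_bytes.decode("utf-8")

# Decoding as latin-1 turns each of the two bytes into its own character.
as_latin1 = utf8_bytes.decode("latin-1")

print(as_utf8)
print(as_latin1)
```

    Seeing "Ã©" in place of "é" therefore suggests that this particular value is stored as UTF-8 and that the reader guessed a single-byte encoding.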

    However, when I try to import the file in Python using pandas, I get an error saying that utf-8 cannot decode some bytes. I take this to mean that the encoding of my csv file is not utf-8. Moreover, I can read the csv file as latin-1 or windows-1252 without problems.
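    Reading the file successfully as latin-1 does not show that the file is latin-1: in Python, latin-1 assigns a character to every possible byte, so it never raises a decode error. UTF-8, by contrast, rejects byte sequences that are not well formed. A small sketch of both facts (the string "Léon" is only an example value):

```python
# latin-1 can decode any byte value, so it never fails.
all_bytes = bytes(range(256))
decoded_all = all_bytes.decode("latin-1")

# A latin-1 encoded é is the single byte 0xE9, which is not valid UTF-8 here.
latin1_bytes = "Léon".encode("latin-1")   # b'L\xe9on'

try:
    latin1_bytes.decode("utf-8")
    utf8_ok = True
    bad_offset = None
except UnicodeDecodeError as err:
    utf8_ok = False
    bad_offset = err.start   # position of the first offending byte

print(len(decoded_all), utf8_ok, bad_offset)
```

    So a utf-8 decode error only says that at least some bytes in the file are not valid UTF-8; the rest of the file may still be UTF-8.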

    Note that this problem does not happen with export excel. After:
    Code:
    export excel using "my_dataset", replace firstrow(variables)
    If I open the file in Excel, I read "Jean Léon" and Python/pandas can import it without problems.

    My question is the following: Is this a problem with export delimited? Why does exporting to an Excel file work, while the encoding has problems when the file is exported as a csv? Is there a solution to this using export delimited?






  • #2
    I found the problem. In my original dta file, I have another string variable whose encoding is not utf-8. Stata doesn't have problems exporting and importing it, but Python/pandas assumes that all variables are encoded as utf-8. Dropping this variable solves the issue. Therefore, it is not a problem with export delimited, though it may be useful to have this information in the docs or in the help file (or maybe it is there and I missed it!).
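    For anyone who wants to find such a variable from the Python side, here is a rough sketch using only the standard library. The function name non_utf8_columns and the sample data are made up for illustration; the approach assumes a comma-delimited file with a header row. It decodes the raw bytes as latin-1 (which never fails and can be reversed exactly), parses the csv, and then checks each cell's original bytes against utf-8.

```python
import csv
import io


def non_utf8_columns(raw: bytes) -> dict:
    """Count, per column, the cells whose bytes are not valid UTF-8."""
    # latin-1 maps each byte to exactly one character, so this decode never
    # fails and cell.encode("latin-1") gives back the original bytes.
    text = raw.decode("latin-1")
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    counts = {}
    for row in reader:
        for col, cell in zip(header, row):
            try:
                cell.encode("latin-1").decode("utf-8")
            except UnicodeDecodeError:
                counts[col] = counts.get(col, 0) + 1
    return counts


# Hypothetical data: "name" is stored as UTF-8, "city" as latin-1.
sample = (
    b"name,city\n"
    + "Jean Léon".encode("utf-8") + b"," + "Orléans".encode("latin-1") + b"\n"
)
print(non_utf8_columns(sample))
```

    On a real export, the bytes could come from something like open(path, "rb").read(), where path is wherever the csv was written. Any column that shows up in the result is a candidate for the variable with the odd encoding.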
    Last edited by Belisario Deler; 10 Jul 2019, 16:34.
