  • Problem with accents

    Hello, I am using the following code to import the contents of several txt files, but Stata does not recognise the accents and some other characters (the text is in French). Can I fix this while importing the files, or should I fix the txt files themselves before importing them? Thanks.


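    Code:
    * list all .txt files in the current directory
    local filenames: dir "." files "*.txt"

    * start with an empty dataset to append to
    tempfile building
    save `building', emptyok

    foreach f of local filenames {
        clear
        set obs 1
        gen filename = "`f'"
        * read the whole file into one long string
        gen strL contents = fileread("`f'")
        append using `building'
        save `"`building'"', replace
    }

    use `building', clear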

  • #2
    What version of Stata do you have? Versions 14 and later support Unicode; see

    Code:
    help unicode_advice
    If these are text files, try importing them first; see

    Code:
    help import delimited
    Last edited by Andrew Musau; 23 Jun 2021, 11:09.


    • #3
      Thank you for your answer!
      Yes, I am using Stata 15. But I am importing txt files where each file is one observation and the content of the file is a variable, so I do not see how I can use import delimited: my observations are separate files, not separate lines in the same file.
      I should add that I want to generate a Unicode long string (strL), but apparently that is not allowed. Is it?


      • #4
        Try and see if this works:

        Code:
        unicode encoding set utf8
        local filenames: dir "." files "*.txt"
        tempfile building
        save `building', emptyok
        foreach f of local filenames {
            * quote the name so that filenames containing spaces are handled
            unicode translate "`f'"
            clear
            set obs 1
            gen filename = "`f'"
            gen strL contents = fileread("`f'")
            append using `building'
            save `"`building'"', replace
        }
        use `building', clear


        • #5
          No, it does not. It does not find the files.


          • #6
            I do not understand what this means. Can you take one of the files, run the code below, and copy and paste the exact output? For example, if the file is named "myfile.txt", run

            Code:
            unicode encoding set utf8
            local filenames: dir "." files "myfile.txt"
            tempfile building
            save `building', emptyok
            foreach f of local filenames {
                * quote the name so that filenames containing spaces are handled
                unicode translate "`f'"
                clear
                set obs 1
                gen filename = "`f'"
                gen strL contents = fileread("`f'")
                append using `building'
                save `"`building'"', replace
            }
            use `building', clear
            dataex
            Then post the output of the entire code, including the dataex. Be sure to change "myfile.txt" to one of the text files in your current directory.


            • #7
              I have managed to deal with the problem with the following code:

              Code:
              * check which files in the current directory need translation
              unicode analyze *

              * declare the encoding of the source files, then translate them to UTF-8
              unicode encoding set ISO-8859-15
              *unicode encoding set ISO-8859-15,invalid(mark) transutf8
              unicode translate *

              * read each translated file into one observation
              local filenames: dir "." files "*.txt"

              tempfile building
              save `building', emptyok

              foreach f of local filenames {
                  clear
                  set obs 1
                  gen filename = "`f'"
                  gen strL contents = fileread("`f'")
                  append using `building'
                  save `"`building'"', replace
              }

              use `building', clear


              Thank you for the input Andrew!
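
              For reference, the same conversion can be sketched in memory, leaving the files on disk untouched. This is a minimal sketch, not the method used above: it swaps unicode translate for the string function ustrfrom(), assumes the source bytes are ISO-8859-15 as in the code above, and assumes ustrfrom() accepts strL values. The bytes are built with char() as a stand-in for the fileread() call so that the example runs on its own.

              ```stata
              clear
              set obs 1
              * stand-in for the fileread() call: the bytes of "élève" in ISO-8859-15 (assumed encoding)
              gen strL contents = char(233) + "l" + char(232) + "ve"
              * convert from the source encoding to UTF-8; mode 1 substitutes the
              * Unicode replacement character for any invalid byte sequence
              replace contents = ustrfrom(contents, "ISO-8859-15", 1)
              list contents
              ```

              In the loop above, the corresponding change would be to wrap the fileread() call, as in gen strL contents = ustrfrom(fileread("`f'"), "ISO-8859-15", 1), and to drop the unicode translate step.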


              • #8
                Hi Andrew, ciao Ylenia.
                I've had the same problem with accents and apostrophes in Italian, but it seems I managed to overcome it by using unicode2ascii.


                • #9
                  The best way to go is to use Stata's unicode routines.

                  I have noticed a common misconception, which is also present in #4. Note that

                  Code:
                  unicode encoding set
                  does not specify the target/wanted encoding; it specifies the encoding of the non-Unicode source file. That is, you specify the encoding that you want to translate from; the target is always Unicode. It is very unlikely that you want utf-8.

                  In Europe, you probably want some variation of ISO-8859, as suggested in #7. In Germany, if you are using Windows, the most likely encoding, other than Unicode, is windows-1252. It is similar but not quite identical to ISO-8859-1.

                  If you encounter problems with the encoding, take the time to read the documentation on unicode carefully. It is worth your time, believe me.
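
                  To illustrate the point about source encodings, here is a small sketch using the string function ustrfrom(), which, like unicode encoding set, takes the encoding to convert from; the result is always UTF-8. The byte values are chosen for illustration: 128 is the euro sign in windows-1252 but a control character in ISO-8859-1, while ISO-8859-15 stores the euro sign at 164. Accented letters such as é (233) sit at the same position in all three.

                  ```stata
                  * each line displays 1 if the comparison holds and 0 otherwise
                  * byte 128 read as windows-1252, compared with the euro sign U+20AC (decimal 8364)
                  display ustrfrom(char(128), "windows-1252", 1) == uchar(8364)
                  * byte 128 read as ISO-8859-1, compared with the euro sign
                  display ustrfrom(char(128), "ISO-8859-1", 1) == uchar(8364)
                  * byte 164 read as ISO-8859-15, compared with the euro sign
                  display ustrfrom(char(164), "ISO-8859-15", 1) == uchar(8364)
                  * byte 233 read as ISO-8859-1, compared with the same byte read as windows-1252
                  display ustrfrom(char(233), "ISO-8859-1", 1) == ustrfrom(char(233), "windows-1252", 1)
                  ```

                  Where the first two comparisons differ, a file written under windows-1252 but translated as ISO-8859-1 would keep its accents and lose its euro signs, which is why a wrong guess at the source encoding can go unnoticed at first.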
                  Last edited by daniel klein; 30 Jun 2022, 07:49.


                  • #10
                    Originally posted by daniel klein:

                    Code:
                    unicode encoding set
                    "does not specify the target/wanted encoding; it specifies the encoding of the non-Unicode source file."
                    Indeed.

                    "It is very unlikely that you want utf-8."
                    I do not necessarily agree, utf-8 has very broad coverage, see https://en.wikipedia.org/wiki/UTF-8.


                    • #11
                      Originally posted by Andrew Musau:
                      "I do not necessarily agree, utf-8 has very broad coverage, see https://en.wikipedia.org/wiki/UTF-8."
                      I did not make myself clear, sorry. I wanted to say that if your source file is already UTF-8 encoded, then there is no need to translate it. Thus, you always want your (Stata) files in UTF-8, but you almost never want to type

                      Code:
                      unicode encoding set utf-8
                      In fact, I wonder whether you ever want to type that.
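
                      One way to check whether text is already valid UTF-8, before deciding to translate it, is to count invalid byte sequences with ustrinvalidcnt(). This is a minimal sketch, assuming the function accepts strL values; the two rows are built with uchar() and char() as stand-ins for file contents read with fileread(), and the variable names kind and n_invalid are made up for the example.

                      ```stata
                      clear
                      set obs 2
                      gen str10 kind = ""
                      gen strL contents = ""
                      * row 1: "café" in UTF-8; uchar(233) is U+00E9 (é), stored as two bytes
                      replace kind = "utf-8" in 1
                      replace contents = "caf" + uchar(233) in 1
                      * row 2: "café" in Latin-1; char(233) is the single byte 0xE9
                      replace kind = "latin-1" in 2
                      replace contents = "caf" + char(233) in 2
                      * count byte sequences that are not valid UTF-8
                      gen long n_invalid = ustrinvalidcnt(contents)
                      list kind n_invalid
                      ```

                      A positive count indicates that the text is not UTF-8 and needs translating from its source encoding; text that is already UTF-8 should show a count of zero and can be left alone.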


                      • #12
                        Originally posted by daniel klein:
                        "In fact, I wonder whether you ever want to type that."
                        As the default in versions of Stata that support Unicode is utf-8, the code line

                        Code:
                        unicode encoding set utf8
                        changes nothing, assuming that a different encoding was not set previously. Adding the line therefore does no harm, and it also overwrites any previously set non-utf-8 encoding.
                        Last edited by Andrew Musau; 01 Jul 2022, 03:38.
