  • Remove accent marks from a string variable in Stata

    Dear all,

    I would like to know if anyone knows of Stata code that I can use to remove accent marks from a string variable.
    My strings with accents look like Huíla, Bié, François, etc.
    Thanks a lot for your help.

    Eric

  • #2
    Hi Eric,
    You have to find out the decimal ANSI codes of the letters you want to replace.
    Then the syntax is as follows (with é and à as examples):

    Code:
    capture drop test
    gen str8 test = "éleàtr"
    local eacc = char(233)
    local aacc = char(224)
    replace test = subinstr(test, "`eacc'", "e",.)
    replace test = subinstr(test, "`aacc'", "a",.)
    tab test
    See help string functions for an explanation of how subinstr() works.

    Hope that helps,
    Klaudia

    • #3
      Discussed in an older thread already:
      http://hsphsun3.harvard.edu/cgi-bin/...ticle-213.html

      Note that Stata 14 may have introduced built-in functions for this. Ask those who already have the new version. You will be looking for something in the vicinity of a UTF-8 to ASCII string transformation (ASCII doesn't support diacritics; ANSI does).
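
      A minimal sketch of such a transformation, assuming Stata 14 or newer and its Unicode string functions ustrnormalize() and ustrto(). The variable names name and name_ascii are made up for illustration. The idea is to decompose each accented letter into a base letter plus a combining mark (NFD), then convert to ASCII and skip whatever ASCII cannot represent:

      ```stata
      * Sketch only; requires Stata 14 or newer (Unicode string functions).
      clear
      set obs 3
      generate str20 name = ""
      * Build the example strings from Unicode code points so the result
      * does not depend on the encoding of the do-file.
      replace name = "Hu" + uchar(237) + "la" in 1
      replace name = "Bi" + uchar(233) in 2
      replace name = "Fran" + uchar(231) + "ois" in 3
      * NFD splits an accented letter into base letter + combining accent;
      * ustrto(..., "ascii", 2) then skips the non-ASCII combining accents.
      generate name_ascii = ustrto(ustrnormalize(name, "nfd"), "ascii", 2)
      list
      ```

      Letters that have no base-letter decomposition (for example ø or ß) would be dropped entirely by this approach rather than transliterated, so check the result against the original.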

      Best, Sergiy Radyakin

      • #4
        Hi Klaudia and Sergiy,

        Thank you very much for the code and the link. I will read them carefully and give it a try.

        Thanks,

        Best,

        Eric

        • #5
          A supplement:

          http://ascii-table.com/ansi-codes.php

          Greetings, Klaudia

          • #6
            Note that Klaudia's solution is essentially for Stata before version 14.

            • #7
              Also note that ANSI is a standard that is not universally adopted.

              Code:
              . gen str8 test = "éleàtr"
              
              . local eacc = char(233)
              
              . local aacc = char(224)
              
              . replace test = subinstr(test, "`eacc'", "e",.)
              (0 real changes made)
              
              . replace test = subinstr(test, "`aacc'", "a",.)
              (0 real changes made)
              
              . tab test
              no observations

              • #8
                Robert Picard,
                you are right that the ANSI code pages may differ between users, but presumably each user knows the code page used for a particular dataset and can decide whether or not to run this program. But if you are referring to the "no observations" message from Stata in your example, that is because Klaudia Erhardt's example should have started by setting the number of observations to something more than 0. Then the replace commands do work.

                Code:
                clear
                set obs 1
                generate str8 test = "éleàtr"
                local eacc = char(233)
                local aacc = char(224)
                replace test = subinstr(test, "`eacc'", "e",.)
                replace test = subinstr(test, "`aacc'", "a",.)
                tabulate test
                Note also that Klaudia uses str8 for a 6-letter constant, probably because she is running Stata 14, where this text stored in UTF-8 occupies 8 bytes (four 1-byte letters plus two 2-byte accented letters: 4×1 + 2×2 = 8). In all previous versions 6 bytes would be sufficient. One can omit this type specification and let Stata decide as necessary.
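
                The byte count can be checked directly. This is a small sketch assuming Stata 14 or newer, where strlen() counts bytes and ustrlen() counts Unicode characters; the local macro name s is arbitrary:

                ```stata
                * Sketch only; requires Stata 14 or newer.
                * Build "éleàtr" from code points, independent of the do-file encoding.
                local s = uchar(233) + "le" + uchar(224) + "tr"
                * Number of bytes in the UTF-8 representation
                display strlen("`s'")
                * Number of Unicode characters
                display ustrlen("`s'")
                ```

                The first display should report 8 and the second 6, which is the difference between the str8 storage type and the 6 visible letters.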

                Best, Sergiy Radyakin

                • #9
                  You are right, I missed the lack-of-observations issue. It still does not work on my computer because it's not a PC and ANSI is not the character encoding used.

                  Code:
                  . clear
                  
                  . set obs 1
                  obs was 0, now 1
                  
                  . generate str8 test = "éleàtr"
                  
                  . local eacc = char(233)
                  
                  . local aacc = char(224)
                  
                  . replace test = subinstr(test, "`eacc'", "e",.)
                  (0 real changes made)
                  
                  . replace test = subinstr(test, "`aacc'", "a",.)
                  (0 real changes made)
                  
                  . tabulate test
                  
                         test |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                       éleàtr |          1      100.00      100.00
                  ------------+-----------------------------------
                        Total |          1      100.00

                  • #10


                    I used Stata's char() function, which is based on the ANSI (extended ASCII) codes.

                    It would be interesting to know whether Stata's char() function works the same way on a system where "é" is, say, code 242.
                    Could you please test it, Robert, as your system seems to have a different encoding scheme?

                    To get a list of the codes on your system, this should work:

                    Code:
                    set more off
                    forvalues i=1(1)255 {
                        local a = char(`i')
                        display " char(`i') represents `a' "
                    }
                    By the way: sorry for causing confusion by generating a str8 variable for a string that is only 6 characters long, and for not saying that there should be an active working file with at least one observation.

                    The code in this post works without an active working file.

                    Klaudia

                    • #11
                      Klaudia:

                      Please note post #6.

                      Please confirm what version of Stata you are using. I suspect you are not using Stata 14. You're asked to declare what version you are using if it's not the latest (see Section 11 of the FAQ Advice, http://www.statalist.org/forums/help).

                      As elsewhere reported, the introduction of Unicode in Stata 14 breaks references to characters higher than 127.
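
                      For Stata 14 and up, here is a hedged sketch of the same replacement written with uchar(), which takes a Unicode code point and returns the UTF-8 encoded character. In Stata 14, char(233) returns the single byte 233, which is not how "é" is stored in a UTF-8 string, so subinstr() finds nothing to replace. The example assumes Stata 14 or newer:

                      ```stata
                      * Sketch only; requires Stata 14 or newer.
                      clear
                      set obs 1
                      * Build "éleàtr" from code points, independent of the do-file encoding.
                      generate test = uchar(233) + "le" + uchar(224) + "tr"
                      * uchar(233) is é and uchar(224) is à, both as UTF-8 strings
                      replace test = subinstr(test, uchar(233), "e", .)
                      replace test = subinstr(test, uchar(224), "a", .)
                      tabulate test
                      ```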

                      • #12
                        Yes, the version I'm using is Stata 13.1.

                        Nevertheless, it would be interesting to know whether the code I posted in #10 can be used in any (newer) Stata version to obtain the code of specific characters, no matter which code page is active.
                        The problem Eric described in #1 is not at all rare when working with older and/or international data.
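
                        A possible Stata 14 counterpart of the loop in #10, offered as a sketch and assuming the uchar() function available from Stata 14 on. It lists Unicode code points 192 to 255 (the accented Latin letters, plus the signs × at 215 and ÷ at 247). Because these are Unicode code points rather than code-page positions, the numbers should be the same on every system:

                        ```stata
                        * Sketch only; requires Stata 14 or newer.
                        set more off
                        forvalues i = 192/255 {
                            * uchar(`i') returns the UTF-8 character for code point `i'
                            display "uchar(`i') represents " uchar(`i')
                        }
                        ```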


                        • #13
                          Klaudia: Thanks for the confirmation. As said, references to char(128) and up don't work in Stata 14.

                          http://www.stata.com/help.cgi?char()

                          is accessible to all members and explains the situation in Stata 14 (and up).

                          • #14
                            Hi Nick, my institution has not yet updated to Stata 14.

                            The link in your post #13 goes to a help file on
                            [P] char -- Characteristics
                            From the somewhat cryptic description and examples, I don't have the impression that this is about string functions, whereas help char() in Stata 13.1 goes to
                            String function char(n)
                              Domain: integers 0 to 255
                              Range: ASCII characters
                              Description: returns the character corresponding to ASCII code n; returns "" if n is not in the domain.
                            Do you mean to say that in Stata 14 there are no such string functions any more?

                            • #15
                              No; I don't mean that at all. I think you are being bitten by what looks like a small bug in the forum software. The link is to

                              http://www.stata.com/help.cgi?char()

                              and both the parentheses at the end () must be included in your browser call.

                              If you click on that here you may need to add the final ) yourself inside your browser.
