
  • remove strange symbol from split string variable

    I tried to split a string variable, and one of the responses includes "�". How do I get rid of this weird symbol? The ultimate goal is to generate a new 0-1 variable for each response.

    . tab AttentionSeeker1

                       AttentionSeeker1 |      Freq.     Percent        Cum.
    ------------------------------------+-----------------------------------
               Causes class disruptions |      1,443       43.57       43.57
                                  Other |         75        2.26       45.83
           Talks at inappropriate times |        856       25.85       71.68
    Wants teacher�s undivided attention |        938       28.32      100.00
    ------------------------------------+-----------------------------------
                                  Total |      3,312      100.00

  • #2
    So the first task is to find out what "�" actually is.

    Code:
    charlist AttentionSeeker1
    return list
    In the output line that begins with r(ascii) you will see a bunch of numbers corresponding to all of the characters (shown as r(sepchars) immediately above them) and you should be able to pick out the numerical code that represents that character. Then you can use subinstr() (or its unicode equivalent if need be) to replace it by a blank, or nothing, or from the looks of it, an apostrophe. My guess is that it's some kind of variant quote character from a word processing program.

    Note: -charlist- is not official Stata; it was written by Nick Cox and can be obtained from SSC. -ssc install charlist-
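
    The replacement step described above might look like the sketch below. The byte code 146 is only an assumption for illustration (it is the curly apostrophe in the Windows-1252 encoding); use whatever code your own r(ascii) output shows.

    ```stata
    * Assumption: charlist reported byte code 146 for the odd character.
    * Check r(ascii) in your own data before running this.
    replace AttentionSeeker1 = subinstr(AttentionSeeker1, char(146), "'", .)

    * Re-tabulate to confirm the stray symbol is gone
    tabulate AttentionSeeker1
    ```

    The final argument `.` tells subinstr() to replace every occurrence, not just the first.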



    • #3
      I suppose you are using Stata 14. The symbol appears because the string contains invalid UTF-8 sequences. To fix it, you can use the -ustrfix()- function, which replaces each invalid UTF-8 sequence with a Unicode character. If you want to get rid of the symbol entirely, run the following:

      Code:
      replace AttentionSeeker1 = ustrfix(AttentionSeeker1, "")
      See -help f_ustrfix- for details of the function.

      As a side note, the � symbol you see is the Unicode replacement character (U+FFFD), which is displayed on Mac and Linux when an invalid Unicode character is encountered. On Windows, you would see an empty square instead.
      Last edited by Hua Peng (StataCorp); 23 Mar 2016, 12:38.
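
      As a variation on the command above, and as a sketch of the original poster's stated goal of one 0-1 variable per response, something like the following may work. Substituting a plain apostrophe is an assumption about what the invalid bytes were meant to be, and the stub name attn is hypothetical.

      ```stata
      * Replace each invalid UTF-8 sequence with a plain apostrophe
      * (assumption: the bad bytes were originally a curly apostrophe)
      replace AttentionSeeker1 = ustrfix(AttentionSeeker1, "'")

      * Create one 0-1 indicator variable per distinct response;
      * "attn" is a hypothetical stub, giving attn1, attn2, ...
      tabulate AttentionSeeker1, generate(attn)
      describe attn*
      ```

      Keeping an apostrophe rather than deleting the bytes leaves the label readable as "teacher's", but either choice yields the same set of indicator variables.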



      • #4
        Thanks Hua! ustrfix() works perfectly!!

        Thanks Clyde! Learned something new from you!!
