Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to identify observations with special characters, numbers and whitespaces (expect leading and trailing spaces) - string var

    Dear all,

    I am working with string data and I would like to identify observations that have special characters ("/" "?" "*" "(" "." ")" and others), numbers and whitespace (except leading and trailing spaces) in a string variable.

    The dataset is like
    string_var
    E.V.A
    F7010P
    175/70R14
    C B U Q
    *E.D.T.A
    (S6X60D)
    COLA

    In particular, I would like the dataset to be as the following
    string_var has_special_char
    E.V.A 1
    F7010P 1
    175/70R14 1
    C B U Q 1
    *E.D.T.A 1
    (S6X60D) 1
    COLA 0
    where 'has_special_char' is a dummy for whether the observation has special characters, numbers or whitespace (except leading and trailing spaces) in the variable 'string_var'.

    Could you help me to find a solution for that?

    Thank you very much!

    Below I provide the code for importing the example dataset into Stata:

    clear
    input str10 string_var
    "E.V.A"
    "F7010P"
    "175/70R14"
    "C B U Q"
    "*E.D.T.A"
    "(S6X60D)"
    "COLA"
    end

  • #2
    Code:
    gen byte wanted = regexm(string_var, "[^a-zA-Z0-9]")
    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Comment


    • #3
      Clyde's code overlooked your desire to accept leading and trailing spaces. My approach would be to trim leading and trailing spaces from your variable before doing anything else: I can't see a good reason to keep them, and they can cause no end of confusion.
      Code:
      replace string_var = trim(string_var)
      But if you're unwilling to go that far,
      Code:
      gen byte wanted = regexm(trim(string_var), "[^a-zA-Z0-9]")
      will enhance Clyde's code to ignore leading and trailing spaces.

      Comment


      • #4
        Thank you Clyde Schechter and William Lisowski !

        Yes, Clyde, dataex is much better! I will use it in the future!

        Just one remaining question: I ran the code that you suggested, i.e.,


        replace string_var = trim(string_var)
        gen byte wanted = regexm(trim(string_var), "[^a-zA-Z0-9]")


        and I noted that the value of the dummy variable for the second observation (which is such that string_var == "F7010P") is zero and not one as it was meant to be.

        Can you help me to figure out what is the point of that?

        Thank you!

        Comment


        • #5
          Oh, sorry, I misread your original request. I didn't think numbers were also to be marked as 1 along with special characters. So just remove the 0-9 from inside the brackets in the code:

          Code:
          replace string_var = trim(string_var)
          gen byte wanted = regexm(trim(string_var), "[^a-zA-Z]")
          That will flag any observation that contains anything other than letters of the alphabet.

          Comment


          • #6
            Perfect! Now it works smoothly! Thank you very much, Clyde Schechter !

            Comment


            • #7
              Code:
              assert regexm("Conceição", "[^a-zA-Z]") != ustrregexm("Conceição", "\P{L}")
              https://www.regular-expressions.info/unicode.html

              Comment

              Working...
              X