  • Keep only string variables with character . in the string

    Hey Statalist Community!

    This is my first time posting, so if I commit any faux pas, please let me know.

    Anyway, I'm working with a dataset that includes (among many other things) company names. Many of these names have non-alphanumeric characters (e.g. - / .). Below is code that I have been using to isolate those names that include non-alphanumeric characters:

    BEGIN CODE
    gen slash = regexm(cname, "/")
    keep if slash == 1
    drop slash

    duplicates drop
    export excel using location/excelfile.xlsx, sheet("Slash") sheetmodify cell(B2)
    END CODE

    Where "cname" refers to the company name, and "location/excelfile.xlsx" is some arbitrary excel file that I've been exporting to.

    This code has worked well for all of the characters with the exception of the period. Whenever I use gen period = regexm(cname, "."), every entry is tagged, not just those that have a period in the name. I presume this occurs because of Stata's default interpretation of ".", but I'm not sure what to do next.

    Any suggestions would be welcome.

    Thank you,

    Andrew

  • #2
    Advice depends on your ultimate goal here, which is not clear to me.

    I would go with simple string functions (help string functions) instead of regexm here. Probably something like

    Code:
    keep if strpos(cname, "/")
    However, depending on what you want to do, there might be better ways.
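
    Because strpos() treats its second argument as literal text, the same approach also works for the period. A minimal sketch on toy data (the company names and the has_period/has_slash variable names are made up for illustration):

    ```stata
    * Toy data; these company names are hypothetical
    clear
    input str20 cname
    "Smith & Co."
    "Acme/Widgets"
    "Plain Name"
    end

    * strpos() returns the position of the first literal match, or 0 if there is none
    gen byte has_period = strpos(cname, ".") > 0
    gen byte has_slash  = strpos(cname, "/") > 0

    list cname has_period has_slash
    ```

    Only "Smith & Co." should be flagged for the period and only "Acme/Widgets" for the slash; keep if has_period would then retain just the names containing a period.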

    Best
    Daniel


    • #3
      I agree with Daniel: if you can use a simple string function to accomplish your ultimate objective, it's going to be a lot clearer.

      For those who, like me, follow along on these forums to learn from others, let me add the answer to Andrew's specific question of why searching for the period returns every entry. Regular expressions support complex pattern matching, and that means some characters have a specific meaning in constructing a pattern to match. A single period matches any single character, so as long as the entry is at least one character long, it will match.
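
      To see the difference concretely, here is a small sketch on made-up data (the names and the any_char/literal variables are hypothetical). It compares the unescaped period with a backslash-escaped one, which should match only a literal period:

      ```stata
      * Toy data, including one empty name; all values are hypothetical
      clear
      input str20 cname
      "Smith & Co."
      "Plain Name"
      ""
      end

      * Unescaped "." matches any single character, so every non-empty name is tagged
      gen byte any_char = regexm(cname, ".")

      * Escaped "\." matches only a literal period
      gen byte literal = regexm(cname, "\.")

      * Check the claims: any_char flags non-empty strings, literal agrees with strpos()
      assert any_char == (cname != "")
      assert literal == (strpos(cname, ".") > 0)

      list cname any_char literal
      ```

      The empty string is the only observation the unescaped pattern leaves untagged, which is why every entry in a typical dataset appears to match.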

      I did a -search regular expression- and one of the results it returned was this Stata FAQ on regular expressions. I wasn't able to find anything in the help files or documentation PDFs.






      • #4
        You can escape the period with a backslash if you insist on regexm.

        Code:
        gen period = regexm(varname, "\.")
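
        As a possible alternative to the backslash, a period inside a bracketed character class is also taken literally, and a negated class can flag every non-alphanumeric character in one pass. This is a sketch on hypothetical data, assuming that letters, digits, and spaces are the only characters you consider "ordinary":

        ```stata
        * Toy data; names are hypothetical
        clear
        input str20 cname
        "Smith & Co."
        "Acme/Widgets"
        "Hyphen-Ated"
        "Plain Name 2"
        end

        * [.] inside a character class is a literal period, equivalent to "\."
        gen byte period_class = regexm(cname, "[.]")
        assert period_class == regexm(cname, "\.")

        * Flag names containing anything other than a letter, digit, or space
        gen byte nonalnum = regexm(cname, "[^a-zA-Z0-9 ]")

        list cname period_class nonalnum
        ```

        Here only "Plain Name 2" should come out with nonalnum equal to 0; adjust the class if other characters (accented letters, for example) should count as ordinary.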
        Best
        Daniel


        • #5
          Thank you Daniel and William!

          Daniel: Both codes that you provided accomplish exactly what I wanted. After looking into strpos() I must agree that it is a much simpler, more direct mechanism to accomplish my goals for this project.

          William: Thank you for the explanation as to why my previous code matched all entries. I knew that Stata used "." to represent missing values, but I didn't realize that it's also the ... universal character (for lack of a better descriptor). I've bookmarked that link to reference for future complex matching patterns.

          Thank you again,

          Andrew


          • #6
            Above I wrote:

            "I did a -search regular expression- and one of the results it returned was this Stata FAQ on regular expressions. I wasn't able to find anything in the help files or documentation PDFs."

            I later learned that this FAQ is out of date and does not reflect the subsequently enhanced capabilities of Stata regular expressions. See here for more on documentation of regular expressions as implemented in Stata.
            Last edited by William Lisowski; 07 Mar 2015, 12:02.
