  • use or in regular expression

    Hello,

    I am trying to identify, for the string variable "agency", the observations that contain either "state" or "district". Based on what I read, I used:

    gen new_variable = regexs(0) if(regexm(agency, "(state | district)"))

    But I think it only identifies the observations that contain "state". I tried a few variations, but they did not work. Can you advise on the right code for this? Thank you very much.

  • #2
    I don't fully follow your code but, judging by your first sentence, the problem should be easily solved as follows:
    Code:
    gen wanted = .
    replace wanted = 1 if agency == "state" | agency == "district"


    • #3
      My guess is -strpos- is more appropriate for this:

      Code:
      gen new_variable = strpos(agency, "state") | strpos(agency,"district")
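
      A minimal sketch of why this works, using three made-up agency values (they are not from the poster's data). -strpos- returns the position of the first occurrence of the substring, or 0 if it is absent, and the | operator treats any nonzero value as true, so the combined expression is a 0/1 indicator:

      ```stata
      clear
      * Hypothetical example values, for illustration only
      input str40 agency
      "state police"
      "school district"
      "city council"
      end

      * Position of each substring (0 means not found)
      gen pos_state    = strpos(agency, "state")
      gen pos_district = strpos(agency, "district")

      * Nonzero positions count as true, so this is 1, 1, 0
      gen byte flag = strpos(agency, "state") | strpos(agency, "district")

      list
      ```

      Here pos_state is 1, 0, 0 and pos_district is 0, 8, 0. Note that the comparison is case-sensitive, so "State" would not be found.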


      • #4
        Your regular expression is slightly incorrect. Spaces are meaningful characters to match, so you are searching for "state " and for " district" - the latter of which will not be matched if it is at the beginning of the agency with no space character in front of the "d". Change your command to
        Code:
        gen new_variable = regexs(0) if(regexm(agency, "(state|district)"))
        and I expect it will do what you want, or at least will do what you want somewhat more often than it does now.
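
        To see the effect of the spaces, here is a small sketch on three made-up agency values (hypothetical, not the poster's data) that runs both patterns side by side:

        ```stata
        clear
        * Hypothetical example values, for illustration only
        input str40 agency
        "state police"
        "district court"
        "school district"
        end

        * Pattern with spaces: looks for "state " or " district"
        gen byte with_spaces = regexm(agency, "(state | district)")

        * Pattern without spaces: looks for "state" or "district"
        gen byte no_spaces = regexm(agency, "(state|district)")

        list
        ```

        With the spaces, "district court" is not flagged because nothing precedes the "d", so with_spaces is 1, 0, 1, while no_spaces is 1 for all three rows.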


        • #5
          Originally posted by Marco Errico View Post
          Thanks, Marco. I am trying to select observations where agency contains certain words, rather than exactly equals them. Otherwise I would do as you suggested.
          Last edited by Sheran Deng; 12 May 2021, 13:41.


          • #6
            Originally posted by Ali Atia View Post
            Thanks, Ali. I will check it out.
            Last edited by Sheran Deng; 12 May 2021, 13:40.


            • #7
              Originally posted by William Lisowski View Post
              Thanks, William. I tried what you suggested, but it does not identify any observations. I attached the agency column here. I had imagined that there would be a straightforward way to do OR in a regular expression.


              • #8
                Of course it didn't work.

                The data in your spreadsheet is capitalized, for example, "U.S. Attorney-Eastern District of Pennsylvania". "district" is never going to match "District". In post #1 you told us the regular expression
                Code:
                "(state | district)"
                matched observations that contained "state" and my code would match observations that contain "state" or "district". But not "State" or "District".

                The technique from post #3 similarly will not match "State" or "District".

                You are fortunate that I was able to preview your data without opening it in Excel. Like many members here, I decline to open datasets that can contain malicious code. The Statalist FAQ provides advice on effectively posing your questions, posting data, and sharing Stata output. Please take a few moments to review the FAQ to improve your future posts.
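
                For the capitalization issue, a minimal sketch of three ways to ignore case. The first agency value is the one quoted above; the other two are made up for illustration. The -ustrregexm- option assumes Stata 14 or later, where a nonzero third argument requests case-insensitive matching:

                ```stata
                clear
                * First value is from the posted data; the others are hypothetical
                input str60 agency
                "U.S. Attorney-Eastern District of Pennsylvania"
                "State Police"
                "city council"
                end

                * Option 1: lowercase the variable, then match lowercase words
                gen byte match_lower = regexm(lower(agency), "state|district")

                * Option 2: Unicode regex function with the ignore-case argument (Stata 14+)
                gen byte match_icase = ustrregexm(agency, "state|district", 1)

                * Option 3: the strpos approach from post #3, applied to the lowercased text
                gen byte match_strpos = strpos(lower(agency), "state") | strpos(lower(agency), "district")

                list
                ```

                All three indicators should come out as 1, 1, 0 here. Be aware that any of these will also flag words that merely contain the search text, such as "estate" or "statement".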


                • #9
                  Originally posted by William Lisowski View Post
                  Cool, Thanks!
