  • Using regexr() to look for a combination of keywords

    Hi statalist,

    I have a question that I am struggling to solve despite looking for help online and experimenting with regexr().

    To illustrate my problem, suppose I have variable1 (all in lowercase), which contains a free text response. I want to create a binary variable (variable2) that equals one whenever a combination of keywords is mentioned in variable1.

    Suppose I want to make the condition strict so that variable2 is only equal to one when the keyword person is mentioned along with keywords female OR male. This is my current code:

    gen variable2 = regexm(variable1, "person* & (female | male)*")

    I know this is wrong, but I am struggling to figure out the right way to specify what I want.

    I would additionally be grateful if you could help me specify the above expression so that it picks up female, male, and person within words like persons, females, males.

    Thanks in advance.

    Lili.

  • #2
    Hello,
    Based on your explanation of what you want to accomplish, it seems you want to use regexm() not regexr(). regexm(variable, "string") returns a 1 ("true") for the instances where "string" is contained in variable and 0 ("false") for the instances where "string" is not contained in variable. If I understand correctly, this is what you are going for.

    Code:
    gen variable2 = 0
    replace variable2 = 1 if regexm(variable1, "person") & regexm(variable1, "male")
    If you want to write this in one line
    Code:
    gen variable2 = (regexm(variable1, "person") & regexm(variable1, "male"))
    Here is a detailed link on regular expressions:
    http://www.stata.com/support/faqs/da...r-expressions/

    Remember that the help command is your friend!
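
    To make the behaviour concrete, here is a small sketch on made-up data (the five example strings are invented for illustration, not taken from the original question). It shows that the two-call version already covers "female", because "male" is a substring of "female", and that plurals such as "persons" and "males" are matched for the same reason. The single-pattern version is one possible way to write the same condition in one regular expression; treat it as an assumption to check on your own data rather than the definitive form.

    ```stata
    clear
    input str40 variable1
    "a female person"
    "male persons here"
    "persons only"
    "females only"
    "the person was male"
    end

    * Two matches combined with &; "male" also matches inside "female"
    gen byte variable2 = regexm(variable1, "person") & regexm(variable1, "male")

    * One pattern: the keywords may appear in either order
    gen byte variable2b = regexm(variable1, "(person.*(female|male))|((female|male).*person)")

    list, clean
    ```

    Note that substring matching cuts both ways: "male" would also match inside an unrelated word such as "malevolent", so it is worth listing a few matched observations to check.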

    Best,

    Leo



    • #3
      Leonel gives very good advice and regular expressions are great.

      I just want to flag an alternative often overlooked by regex rappers.

      Code:
      gen OK = strpos(myvar, "person") & (strpos(myvar, "male") | strpos(myvar, "female"))
      strpos() finds the position in the first string specified, which could be a string variable as here, of the second string, which could be a literal string as here. (Other combinations are allowed, but not what we want here.)

      If the second string is found at all, the result will be a positive number. In this case we don't care exactly which positive number or that the positive number may vary from observation to observation. If it is not found, the result will be zero.

      Positive or zero corresponds exactly to Stata's rules of evaluating logical expressions: non-zero input counts as true and zero input counts as false.

      So, the code above is literal code for the pseudocode

      variable contains "person"

      AND

      (variable contains "male" OR variable contains "female").
      and the results in the new variable OK will be 1 or 0.

      More at http://www.stata.com/support/faqs/da...rue-and-false/
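
      A small sketch of what strpos() returns, using three made-up strings (invented for illustration). The helper variables pos_person and pos_male are hypothetical names added only to show the raw positions before they are combined logically.

      ```stata
      clear
      input str40 myvar
      "a female person"
      "persons only"
      "males only"
      end

      * Raw positions: 0 means "not found", any positive value means "found"
      gen pos_person = strpos(myvar, "person")
      gen pos_male   = strpos(myvar, "male")

      * Logical combination: the result is 1 or 0
      gen byte OK = strpos(myvar, "person") & (strpos(myvar, "male") | strpos(myvar, "female"))

      list, clean
      ```

      In the first observation "person" starts at position 10 and "male" at position 5 (inside "female"), so OK is 1. The other two observations each lack one of the keywords, so OK is 0.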



      • #4
        In #3 "exactly" is an overstatement because negative also means non-zero. In this problem strpos() doesn't produce negative values, but in other problems negative values do arise and would be treated as true in logical comparisons.
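
        A quick way to see this for yourself, as a minimal sketch using literal numbers rather than any real data:

        ```stata
        * Negative values are non-zero, so they count as true
        display (-1) & 1
        display cond(-2, "true", "false")

        * Zero is the only value that counts as false
        display 0 & 1
        ```

        The first line displays 1, the second displays true, and the third displays 0. If you want the intent to be explicit, writing strpos(myvar, "person") > 0 states the condition directly.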



        • #5
          Hi Leo and Nick,

          Thank you so much for your thorough help and explanations--this is incredibly helpful. And sorry for mixing up regexr and regexm in the title of my question.

          Best

          Lili.
