  • generate dummy variable using strpos

    Hi everyone!

    I have searched this forum and read the official Stata help, but I still can't do what I want, so I hope someone here can help me.
    My dataset contains several variables recording people's occupations over their professional careers, measured at three points in time (one variable per point in time).
    Each variable can hold several occupations at once (entrepreneur, serial entrepreneur, investor, etc.).

    An example occupation for one point in time might be "Entrepreneur" while for another observation it is "Investor, Serial Entrepreneur".
    I want to create a dummy variable "SubsequentEntrepreneur" that equals 1 if any of these three strings contains the word "Entrepreneur" (including as part of "Serial Entrepreneur") and 0 otherwise.

    I think I need to use "strpos", but I have difficulty combining the three variables (points in time). Does anyone have a suggestion?
    I would be very grateful for any help!

    Thank you very much,
    Chris

  • #2
    Since you have only three variables to deal with here, and only one search term, "Entrepreneur", this can be handled fairly straightforwardly:

    Code:
    gen byte new_variable = strpos(var1, "Entrepreneur") | strpos(var2, "Entrepreneur") | strpos(var3, "Entrepreneur")
    Note: like all Stata string functions I can think of, this is case sensitive. So before using this code, make sure that the use of capitalization in this data set has been completely consistent. Also, spelling errors can trip this up. Text data is treacherous!
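
    If capitalization turns out to be inconsistent, one possible workaround is to lowercase each string before searching. This is only a sketch: it keeps the placeholder names var1 to var3 from the code above, new_variable_ci is a made-up name, and it uses lower(), which newer Stata versions also offer under the name strlower().

    Code:
    * Lowercase each variable, then search for the lowercase term
    gen byte new_variable_ci = strpos(lower(var1), "entrepreneur") | strpos(lower(var2), "entrepreneur") | strpos(lower(var3), "entrepreneur")
    This does nothing about spelling errors, so it is still worth tabulating the variables and checking them by eye.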


    • #3
      Thank you very much! Just one question:
      If one of the three variables contains the string "EntrepreneurSerial", will it also be counted as "Entrepreneur" if I use this code?


      • #4
        Nothing stops you from finding this out for yourself.

        Code:
        . di strpos("EntrepreneurSerial", "Entrepreneur")
        1
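
        A note on reading the listing above, in case it helps: the leading period is Stata's command prompt as it appears in the Results window, not part of the command, and the 1 is the output. strpos() returns the position at which the second string first occurs in the first, or 0 if it does not occur. As a small sketch ("Investor" is just an illustrative value), type the commands without the period:

        Code:
        display strpos("EntrepreneurSerial", "Entrepreneur")
        display strpos("Investor", "Entrepreneur")
        The first line shows 1 and the second shows 0, so "EntrepreneurSerial" does count as a match.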


        • #5
          "Nothing stops you from finding this out for yourself."
          Believe me, I'm trying.

          I'm afraid something might be wrong with your code. Stata tells me that "di" (or "display") is not recognized.
          Does anyone know what might be wrong with the following code?

          generate byte SubsequentEntrepreneurGeneral = strpos(Time1, "Entrepreneur1", "Entrepreneur2", "Entrepreneur3") | strpos("Entrepreneur1", "Entrepreneur2", "Entrepreneur3") | strpos(Time3, "Entrepreneur1", "Entrepreneur2", "Entrepreneur3").

          Stata tells me "Invalid Syntax", but I've also tried leaving out the commas or using "|" instead of commas.


          • #6
            For specific advice, give a data example. Copy and paste the output of

            Code:
            ssc install dataex
            dataex in 1/5


            • #7
              I would not use strpos and instead use something like strmatch or regexm. For example
              Code:
              gen byte new_variable = strmatch(Time1, "*Entrepreneur*") | strmatch(Time2, "*Entrepreneur*") | strmatch(Time3, "*Entrepreneur*")
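
              If regexm is preferred, or if more variables are added later, a loop keeps the command short. This is only a sketch, assuming the variables are named Time1, Time2 and Time3 as above; new_variable2 is a made-up name.

              Code:
              * Start at 0, then flag any observation where a variable matches
              gen byte new_variable2 = 0
              foreach v of varlist Time1 Time2 Time3 {
                  replace new_variable2 = 1 if regexm(`v', "Entrepreneur")
              }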
