  • Extracting words from a string variable using a local list

    Hello,

    I want to generate a new variable that displays the words found in a string variable that match my local list of words.

    The context is that I want to create a profanity filter. However, not all words are profane in every context,
    so I want to see which words my profanity filter is classifying as profane.

    The approach so far:

    gen profanitydummy = 0
    gen profanitycount = 0

    local badwords "badword1 badword2 badword3 badword4"

    foreach b in `badwords' {
    replace profanitydummy = 1 if strpos(varstring, " `b' ") != 0
    replace profanitycount = profanitycount + 1 if strpos(varstring, " `b' ") != 0
    }

    This sets the dummy to 1 if a word in varstring matches a word in the local badwords.
    In addition, it counts the number of unique bad words used in the string.

    The local badwords list has approx. 1100 words, which I took from a researcher who gathered "offensive" words.


    I now want to know which words cause the profanity dummy to indicate that there is a bad word in varstring.

    My approach:


    gen badwordinstring = ""
    foreach b in `badwords'{
    replace badwordinstring = " `b' " if strpos(varstring), " `b' ")
    }


    However, I get the error message "invalid syntax" and can't figure out where the problem is.

    My desired result would be: badwordinstring: "badword5 badword7"


    In addition, right now my profanitycount only counts the unique bad words used in the string.
    Do you have a hint on how to change it to the absolute number of bad words in the string?

    For example, if badword1 is used 2 times and badword2 is used 5 times, the variable should indicate 7. Right now I am only able to get the number of unique bad words.


    Thank you in advance.

  • #2
    My guess is that you are invoking a local macro where it is invisible. See e.g. the thread https://www.statalist.org/forums/for...creating-local



    • #3
      Hello Nick,

      thanks for the quick response.

      I don't think the local macro is the problem, as my code to create the profanity dummy runs without a problem. I only encounter the problem described when I try to run the code to see which word was used.

      Any other ideas?

      Is it the right approach or is there an easier way?



      • #4
        My other idea is that your rude words may in some cases be quoted phrases in which you may need to use compound double quotes.
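
        As a rough illustration of that suggestion, here is a minimal sketch. The toy data, the phrase "bad phrase", and the variable name flag are made up for the example; only varstring and badwords come from the thread. Compound double quotes (`" "') let the list hold a quoted multi-word phrase and keep the strpos() argument intact even if an element itself contains a double quote.

        ```stata
        * Toy data, purely hypothetical
        clear
        input str40 varstring
        "he said bad phrase twice"
        "all clean here"
        end

        * Hypothetical list mixing single words and one quoted two-word phrase
        local badwords `" badword1 "bad phrase" badword3 "'

        gen byte flag = 0
        foreach b of local badwords {
            * compound double quotes around the search string
            replace flag = 1 if strpos(varstring, `" `b' "') > 0
        }
        list
        ```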



        • #5
          These changes worked, thank you Nick.


          foreach b in `badwords'{
          replace badwordinstring = " `b' " if strpos(varstring, " `b' ")
          }

          Right now I only get one bad word in badwordinstring. Do you know how I can get all the words that are in both varstring and the local list into my variable badwordinstring?



          • #6
            If you want to accumulate values, then don't overwrite them.

            Code:
            replace badwordinstring =  badwordinstring + " `b' " if strpos(varstring, " `b' ")
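
            Put together, the accumulate-don't-overwrite idea might look like the sketch below, which also attempts the total (non-unique) count asked about in #1. The toy data and the names padded and profanitytotal are made up for the example. It assumes words are separated by plain spaces; punctuation next to a word is not handled. Note that strpos() takes both of its arguments inside a single pair of parentheses.

            ```stata
            * Toy data, purely hypothetical
            clear
            input str60 varstring
            "this has badword1 and badword2 and badword1 again"
            "nothing to see here"
            end

            local badwords "badword1 badword2 badword3 badword4"

            gen badwordinstring = ""
            gen profanitytotal = 0

            * Double every space and pad both ends, so each word has its own
            * leading and trailing space (first, last and adjacent words match too)
            gen padded = " " + subinstr(varstring, " ", "  ", .) + " "

            foreach b of local badwords {
                * append to what is already there instead of overwriting it
                replace badwordinstring = badwordinstring + " `b'" if strpos(padded, " `b' ") > 0
                * occurrences = characters removed when the word is deleted / word length
                replace profanitytotal = profanitytotal + ///
                    (strlen(padded) - strlen(subinstr(padded, " `b' ", "", .))) / strlen(" `b' ")
            }
            replace badwordinstring = strtrim(badwordinstring)
            list varstring badwordinstring profanitytotal
            ```

            With this toy data the first observation should end up with "badword1 badword2" and a total of 3. The /// line continuation works in a do-file, not when typed interactively, and in older Stata versions strtrim() is called trim().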



            • #7
              Cross-posted at https://www.reddit.com/r/stata/comme...iable_using_a/

              Please note our policy on cross-posting, which is that you are asked to tell us about it.

              The Reddit folks can shift for themselves, but telling them might be considered courteous.
