
  • Counting Number of Occurrences of specific words within one String Variable Stata 14.0

    Hello together,


    I am using Stata 14.0 and would like to count the occurrences of a list of about 25 words in my dataset, specifically in one string variable.
    I am quite unfamiliar with Stata, so please excuse me if this is a really dumb question.

    I installed the egenmore package and tried these functions, quoted from its help:

    noccur(strvar) , string(substr) creates a variable containing the number of occurrences of the string substr in string variable strvar. Note that occurrences must be disjoint (non-overlapping): thus there are two occurrences of "aa" within "aaaaa". (Stata 7 required.)

    nss(strvar) , find(substr) [ insensitive ] returns the number of occurrences of substr within the string variable strvar. insensitive makes counting case-insensitive. (Stata 6 required.)

    However, when I type everything in and run it, it does not work and I get the message "command noccur is unrecognized", even though I installed the package. I believe this is due to my Stata version.
    Is there a package or other trick that would let me use this kind of command?

    To illustrate: I have a video transcript with lots of motives in text format, and I would like to find out how often, for example, the word "logo" is used in my variable "video".

    I would be very grateful for any help.

  • #2
    We would need to see the precise content of the command you typed in order to understand why you are having difficulties. I suspect the problem is not with your installation of -egenmore-, but with the syntax by which you attempted to use -noccur-. Your command should look something like this:
    Code:
    egen countlogo = noccur(video), string("logo")
    Before posting again, I'd encourage you to read the FAQ that new participants on StataList are asked to read. It offers good advice on how to post a question so as to have a better chance of getting an answer.
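Since the question mentions a list of about 25 words, the single command above can be wrapped in a loop. This is only a sketch: it assumes -egenmore- is installed (for example via ssc install egenmore), that the string variable is called video as in the question, and that every word is also a legal Stata variable-name suffix (no hyphens or spaces). The word list itself is hypothetical.

```stata
* Hypothetical word list; replace with your own words
local words logo brand slogan

* One new variable per word: count_logo, count_brand, count_slogan
foreach w of local words {
    egen count_`w' = noccur(video), string("`w'")
}
```

Like any noccur() call, this counts substrings, so "logo" would also be counted inside a longer word such as "logos".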



    • #3
      An alternative is explained at https://www.stata-journal.com/articl...article=dm0056

      The risk here is of false positives. For exact occurrences of the word frog, search for " frog " within " " + variable + " " after zapping other punctuation.

      There is more on the second problem in a piece that will appear in Stata Journal 22(4) in about 3 or 4 weeks' time.
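The whole-word search described in #3 might be sketched as follows, using only built-in string functions. Everything here is an assumption for illustration: the toy data, the variable names clean and n_frog, and the short punctuation list, which is not exhaustive. Spaces are doubled before counting because two adjacent matches would otherwise compete for the single space between them. The counting step (length before minus length after removal, divided by the length of the search string) is a common idiom and is not necessarily the exact method of the linked article.

```stata
clear
input str60 video
"The frog saw a bullfrog. Frog, frog!"
end

* Lower-case and pad, so the first and last words are also bounded by spaces
generate clean = " " + strlower(video) + " "

* Replace some common punctuation with spaces (illustrative list only)
foreach p in "." "," "!" "?" ":" ";" {
    replace clean = subinstr(clean, "`p'", " ", .)
}

* Double every space so neighbouring words do not share a boundary space
replace clean = subinstr(clean, " ", "  ", .)

* Each whole-word match removes strlen(" frog ") characters
generate n_frog = (strlen(clean) - strlen(subinstr(clean, " frog ", "", .))) / strlen(" frog ")

* "bullfrog" is not counted; the three standalone occurrences are
assert n_frog == 3
```

The same lines could be applied to the real video variable by dropping the clear and input block.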



      • #4
        Thank you very much for your help!
        The code you recommended worked. I have just read through the FAQ; thank you for pointing it out.

        I did realize, however, that I get different results for "Logo" and "logo" because Stata is case-sensitive. Is there an easy way to count all forms of a word, regardless of whether it is capitalized or part of a longer word?

        e.g.

        Code:
        egen count_environment = noccur(video), string("environmental")
        egen count_environment = noccur(video), string("environmentally")
        egen count_environment = noccur(video), string("environment")


        Thank you very much for helping me out!



        • #5
          I think the help makes it quite clear that if something is part of a longer word, it will still be counted by the noccur function. If you want capitalisation to be irrelevant, an easy way is to first change your string variable to lower case using the strlower() function.

          See
          Code:
          help f_strupper
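A minimal sketch of the advice in #5, assuming -egenmore- is installed and the variable is called video as in the earlier posts. The name video_lc is made up for this example.

```stata
* Lower-case copy, so "Logo", "LOGO" and "logo" are all counted together
generate video_lc = strlower(video)

* noccur() counts substrings, so searching for "environment" also picks up
* "environmental" and "environmentally"; one call covers all three
egen count_environment = noccur(video_lc), string("environment")
egen count_logo = noccur(video_lc), string("logo")
```

Because the shortest form is a substring of the longer ones, separate counts for "environmental" and "environmentally" are not needed unless you want them reported individually.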
