Counting Number of Occurrences of specific words within one String Variable Stata 14.0

Antonia Borg

Join Date: Dec 2022

Posts: 3
#1

08 Dec 2022, 08:15

Hello everyone,

I am using Stata 14.0, and I would like to count the occurrences of a list of about 25 words in my dataset, specifically in one string variable.
I am quite unfamiliar with Stata, so please excuse me if this is a really dumb question.

I installed the egenmore package and tried these functions, which its help file describes as follows:

noccur(strvar) , string(substr) creates a variable containing the number of occurrences of the string substr in string variable strvar. Note that occurrences must be disjoint (non-overlapping): thus there are two occurrences of "aa" within "aaaaa". (Stata 7 required.)

nss(strvar) , find(substr) [ insensitive ] returns the number of occurrences of substr within the string variable strvar. insensitive makes counting case-insensitive. (Stata 6 required.)

However, once I type everything in and try to run it, it does not work and gives me the message "command noccur is unrecognized", even though I installed the package. I believe this is due to my Stata version.
Is there any type of package or other trick that allows me to also include this type of command?

To illustrate: I have a video transcript with lots of motives in text format. I would like to find out how often, for example, the word "logo" is used in my variable "video".

I would be very grateful for any help.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

08 Dec 2022, 08:31

We would need to see the precise content of the command you typed in order to understand why you are having difficulties. I suspect the problem is not with your installation of -egenmore-, but with the syntax by which you attempted to use -noccur-. Your command should look something like this:

Code:

egen countlogo = noccur(video), string("logo")

Before posting again, I'd encourage you to read the FAQ that new participants on Statalist are asked to read. It offers good advice on how to post a question so as to have a better chance of getting an answer.
Nick Cox

Join Date: Mar 2014

Posts: 35694
#3

08 Dec 2022, 08:54

An alternative is explained at https://www.stata-journal.com/articl...article=dm0056

The risk here is of false positives. For exact occurrences of the word frog, search for " frog " within " " + variable + " " after zapping other punctuation.

There is more on the second problem in a piece that will appear in Stata Journal 22(4) in about 3 or 4 weeks' time.
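
A minimal sketch of the whole-word search described above, using only built-in functions available in Stata 14. It is an illustration under stated assumptions, not necessarily the method of the cited article: the toy data and the names clean, padded and n_logo are invented, the text is assumed to be plain ASCII, and the count is obtained from the length removed by subinstr(). Every space is doubled so that two adjacent matches (as in "logo logo") each keep their own surrounding spaces.

```stata
* Toy data: the variable name video follows the thread; the text is invented
clear
input str80 video
"The logo appears. Logo again, then a logotype and the LOGO!"
"Nothing relevant here"
end

* Lowercase, then turn every run of non-alphanumeric characters into one space
* (assumes ASCII text; accented letters would be treated as punctuation)
generate clean = ustrregexra(strlower(video), "[^a-z0-9]+", " ")

* Pad both ends with a space and double every space
generate padded = subinstr(" " + clean + " ", " ", "  ", .)

* Count whole-word matches: characters removed divided by the search length
local word "logo"
generate n_logo = (strlen(padded) - strlen(subinstr(padded, " `word' ", "", .))) / strlen(" `word' ")

list video n_logo
```

Because the search string carries a space on each side, "logotype" is not matched, while "Logo" and "LOGO!" are matched after lowercasing and punctuation removal.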
Antonia Borg

Join Date: Dec 2022

Posts: 3
#4

08 Dec 2022, 09:24

Thank you very much for your help!
Using the code you recommended worked. I just read through the FAQ; thank you for letting me know.

I did realize, however, that I get different results when plugging in "Logo" and "logo" because Stata is case-sensitive. Is there an easy way to count all forms of a word, irrespective of whether it is capitalized or part of a longer word?

e.g.

Code:
* each -egen- call creates a new variable, so the names must differ
egen count_environmental = noccur(video), string("environmental")
egen count_environmentally = noccur(video), string("environmentally")
egen count_environment = noccur(video), string("environment")

Thank you very much for helping me out!
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1396
#5

08 Dec 2022, 09:39

I think the help makes it quite clear that if something is part of a longer word, it will still be counted by the noccur function. If you want capitalisation to be irrelevant, an easy way is to first change your string variable to lower case using the strlower() function.

See

Code:

help f_strupper
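
As a rough sketch of that suggestion, the loop below lowercases a copy of the variable and then counts each search term with noccur(). It assumes egenmore is installed (ssc install egenmore); the toy data, the word list and the names video_lower and count_* are hypothetical and stand in for the real transcript and the 25 or so terms.

```stata
* Toy data: the variable name video follows the thread; the text is invented
clear
input str80 video
"Our Logo is environmentally friendly"
"The environment and the logo"
end

* Hypothetical word list, all in lower case; replace with the real search terms
local words logo environment

* Work on a lowercased copy so that "Logo" and "logo" are counted together
generate video_lower = strlower(video)

foreach w of local words {
    * noccur() counts substrings, so "environment" also picks up
    * "environmental" and "environmentally"
    egen count_`w' = noccur(video_lower), string("`w'")
}

list video count_*
```

Since the shorter stem already matches the longer forms, separate counts for "environmental" and "environmentally" are only needed if those forms should be reported individually.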