counting substrings in string

Franziska Hittmair

Join Date: Feb 2021
Posts: 12

29 Jan 2025, 13:10

Hello,
I am doing some text analysis and have large chunks of text stored as strings in my data.
I would like to count how often a specific word occurs in each string.
I have managed to identify IF a specific word occurs by using regexm, but not how often it occurs.
For instance, the code below only counts how many of the keywords appear at least once, but I am also interested in cases where, for instance, the word "fraud" appears several times in body.
Thank you in advance for your help!

Code:

gen negative_count = 0  

local keywords "fraud scam misconduct corruption manipulation deception falsification misrepresentation overstatement greenwashing illegal trading non-compliance double counting price manipulation offset fraud unverified credits low-quality offsets worthless credits overestimated reductions questionable projects lack of additionality poor verification lack of transparency flawed methodology unverified claims inflated impact carbon leakage temporary storage loopholes fake reductions non-permanent offsets market failure lack of regulation lack of oversight inconsistent standards conflict of interest weak governance speculation unfair distribution profit-driven market opaque transactions middlemen issues poor enforcement exploitation of communities lack of trust industry capture"

foreach word in `keywords' {
    replace negative_count = negative_count + regexm(lower(body), "`word'")
}

Clyde Schechter

Join Date: Apr 2014
Posts: 30124

29 Jan 2025, 15:06

Code:

local keywords "fraud scam misconduct corruption manipulation deception falsification misrepresentation overstatement greenwashing illegal trading non-compliance double counting price manipulation offset fraud unverified credits low-quality offsets worthless credits overestimated reductions questionable projects lack of additionality poor verification lack of transparency flawed methodology unverified claims inflated impact carbon leakage temporary storage loopholes fake reductions non-permanent offsets market failure lack of regulation lack of oversight inconsistent standards conflict of interest weak governance speculation unfair distribution profit-driven market opaque transactions middlemen issues poor enforcement exploitation of communities lack of trust industry capture"

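* Work on a lowercase copy so that matching is case-insensitive
gen body2 = lower(body)
* Number of space-separated words before any keyword is removed
gen starting_count = wordcount(body2)
* foreach ... in splits the list on spaces, so each pass handles a single word
foreach word in `keywords' {
    replace body2 = subinstr(body2, "`word'", "", .)
}
* Words that vanished = keyword hits; a keyword inside a longer word or next to punctuation leaves a fragment and is not counted
gen negative_count = starting_count - wordcount(body2)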

Added: No sample data was provided, so the code is untested. I believe it is correct, but...
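
Since no sample data was posted, here is a minimal sketch that runs the word-count-difference idea on two made-up observations. The toy strings and the shortened keyword list are assumptions for illustration only; just the variable name body comes from the thread.

```stata
* Toy data (hypothetical), not taken from the original post
clear
input str60 body
"fraud and more fraud then a scam"
"fraudulent accounts"
end

* Shortened keyword list, for illustration only
local keywords "fraud scam"

gen body2 = lower(body)
gen starting_count = wordcount(body2)
foreach word in `keywords' {
    * remove every occurrence of the keyword from the working copy
    replace body2 = subinstr(body2, "`word'", "", .)
}
* keyword hits = number of words that disappeared
gen negative_count = starting_count - wordcount(body2)

list body negative_count
```

In the first observation every keyword stands alone as a word, so each removal lowers the word count by one. In the second, "fraudulent" becomes "ulent", which is still one word, so that occurrence is not counted. Whether that is wanted depends on whether partial matches should count.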

Nick Cox

Join Date: Mar 2014
Posts: 35734

29 Jan 2025, 17:30

Code:

. search substring, sj

Search of official help files, FAQs, Examples, and Stata Journals

SJ-11-2 dm0056  . . . . . . . Stata tip 98: Counting substrings within strings
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q2/11   SJ 11(2):318--320                                (no commands)
        tip on counting substrings within strings
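
For readers without the journal to hand, a common idiom for counting substrings is to compare string lengths before and after removing the substring. The sketch below is only an illustration of that idiom under stated assumptions: the toy data and the short quoted keyword list are made up, and the variable names body_lc and negative_count are choices for this example. Quoting each keyword keeps multi-word phrases such as "double counting" together.

```stata
* Toy data (hypothetical); only the variable name body follows the thread
clear
input str60 body
"Fraud, more fraud, and a scam"
"fraudulent double counting"
"nothing to report"
end

* Quote each keyword so that multi-word phrases stay intact (illustrative list)
local keywords `" "fraud" "scam" "double counting" "'

gen body_lc = lower(body)
gen negative_count = 0

foreach kw of local keywords {
    * occurrences = characters removed / length of the keyword
    replace negative_count = negative_count + ///
        (strlen(body_lc) - strlen(subinstr(body_lc, "`kw'", "", .))) / strlen("`kw'")
}

list body negative_count
```

Note that this counts substrings, so "fraud" inside "fraudulent" is counted too, and a phrase such as "offset fraud" would be counted once as a phrase and again under "fraud" if both are in the list.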
