
  • counting substrings in string

    Hello,
    I am doing some text analysis, and my data contain larger chunks of text stored as strings.
    I would like to count how often a specific word occurs in each string.
    I have managed to identify whether a specific word occurs by using regexm(), but not how often it occurs.
    For instance, the code below only adds up how many of the keywords appear at least once, but I am also interested in cases where, for instance, the word "fraud" appears several times in body.
    Thank you in advance for your help!

    Code:
    gen negative_count = 0  
    
    local keywords "fraud scam misconduct corruption manipulation deception falsification misrepresentation overstatement greenwashing illegal trading non-compliance double counting price manipulation offset fraud unverified credits low-quality offsets worthless credits overestimated reductions questionable projects lack of additionality poor verification lack of transparency flawed methodology unverified claims inflated impact carbon leakage temporary storage loopholes fake reductions non-permanent offsets market failure lack of regulation lack of oversight inconsistent standards conflict of interest weak governance speculation unfair distribution profit-driven market opaque transactions middlemen issues poor enforcement exploitation of communities lack of trust industry capture"
    
    foreach word in `keywords' {
        replace negative_count = negative_count + regexm(lower(body), "`word'")
    }

  • #2
    Code:
    local keywords "fraud scam misconduct corruption manipulation deception falsification misrepresentation overstatement greenwashing illegal trading non-compliance double counting price manipulation offset fraud unverified credits low-quality offsets worthless credits overestimated reductions questionable projects lack of additionality poor verification lack of transparency flawed methodology unverified claims inflated impact carbon leakage temporary storage loopholes fake reductions non-permanent offsets market failure lack of regulation lack of oversight inconsistent standards conflict of interest weak governance speculation unfair distribution profit-driven market opaque transactions middlemen issues poor enforcement exploitation of communities lack of trust industry capture"
    
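    * Work on a lowercase copy so that matching is case-insensitive
    gen body2 = lower(body)
    * Number of space-separated words before any keyword is removed
    gen starting_count = wordcount(body2)
    foreach word in `keywords' {
        replace body2 = subinstr(body2, "`word'", "", .)
    }
    * The drop in the word count is taken as the number of keyword occurrences
    gen negative_count = starting_count - wordcount(body2)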

    Added: No sample data was provided, so the code is untested. I believe it is correct, but...
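
    One thing to watch in both loops above: foreach ... in `keywords' splits the list on spaces, so a phrase such as "offset fraud" is processed as the two separate words "offset" and "fraud". A minimal sketch of one way to keep phrases intact is to quote each element and wrap the whole list in compound double quotes. The three phrases here are only a shortened, illustrative subset of the original list.

```stata
* Sketch: quote each list element so multi-word phrases stay together.
* The list below is an illustrative subset, not the full keyword list.
local keywords `" "fraud" "offset fraud" "lack of transparency" "'

* foreach ... of local respects the inner quotes,
* so each phrase arrives as a single element.
foreach phrase of local keywords {
    display "`phrase'"
}
```

    With the list built this way, the body of either loop can stay as it is, and `phrase' then holds the full phrase rather than a single word.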



    • #3
      Code:
      . search substring, sj
      
      Search of official help files, FAQs, Examples, and Stata Journals
      
      SJ-11-2 dm0056  . . . . . . . Stata tip 98: Counting substrings within strings
              . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
              Q2/11   SJ 11(2):318--320                                (no commands)
              tip on counting substrings within strings
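
      For readers without the journal at hand, a widely used way to count substring occurrences compares the string length before and after deleting the substring. The sketch below is my own illustration of that idea and not a quotation from the tip. The two observations are made-up sample data, and the variable names follow post #1.

```stata
* Sketch with made-up sample data; body and negative_count follow post #1.
clear
input str60 body
"Fraud here, fraud there, and a scam"
"nothing to see"
end

gen body2 = lower(body)
gen negative_count = 0

* Illustrative subset of the keyword list, each element quoted
local keywords `" "fraud" "scam" "'

foreach phrase of local keywords {
    * Characters removed divided by phrase length = number of occurrences
    local len = strlen("`phrase'")
    replace negative_count = negative_count + (strlen(body2) - strlen(subinstr(body2, "`phrase'", "", .))) / `len'
}

list body negative_count
```

      Note that this counts substrings, so "fraud" would also be counted inside "fraudulent", and a text containing "offset fraud" adds to the count for both "fraud" and "offset fraud" if both are in the list. For text with non-ASCII characters, the Unicode versions of these functions (ustrlen() and usubinstr()) may be the safer choice.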
