  • Identifying common sub-strings in long multi-word strings (Stata Word Cloud)

    Hi all,

    I have a set of 300,000 company names that I would like to standardize to their most common forms. I know some of the most common company names and have been able to code a new standardized name variable for them (using regexm), but I am not sure which other names are common in the set of 300,000.

    Is there a command (user written or not) that examines strings for common sub-strings, similar to what a Word Cloud would do?

    Thanks, in advance,

    Ben Hoen
    LBNL

  • #2
    -tabsplit- from -tab_chi- (SSC) may help here. Otherwise, -split- and -reshape- may be good starting points for examining substring frequency. Note that with this many names, the number of distinct values may exceed the limits of a tabulation.
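
    A minimal sketch of the -split- and -reshape- route, for illustration only. It assumes the names are held in a string variable called company (a hypothetical name, not from the original post) and that words are separated by spaces. It uses -contract- rather than -tabulate- to count words, which sidesteps the table limit mentioned above.

```stata
* Sketch only: "company", "id", "word", "position" and "n" are assumed names
preserve
keep company
gen long id = _n

* One new variable per space-delimited word: word1, word2, ...
split company, generate(word)

* One observation per (name, word position)
reshape long word, i(id) j(position)
drop if missing(word)

* Ignore case so that "Energy" and "ENERGY" count as the same word
replace word = lower(word)

* Collapse to one observation per distinct word with its frequency
contract word, freq(n)
gsort -n
list in 1/`=min(20, _N)'
restore
```

    Punctuation is left attached to words here, so "inc" and "inc." would be counted separately unless the names are cleaned first.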



    • #3
      Thanks Nick. This did indeed work!
