Identifying common sub-strings in long multi-word strings (Stata Word Cloud)

Ben Hoen

Join Date: May 2014

Posts: 85
#1

31 Jul 2014, 14:35

Hi all,

I have a set of 300,000 company names that I would like to standardize to their most common forms. I already know some of the most common company names and have coded the new standardized name variable for them (using regexm), but I am not sure which other names are common in the set of 300,000.

Is there a command (user written or not) that examines strings for common sub-strings, similar to what a Word Cloud would do?

Thanks in advance,

Ben Hoen
LBNL
Nick Cox

Join Date: Mar 2014

Posts: 35672
#2

31 Jul 2014, 20:55

-tabsplit- from -tab_chi- (SSC) may help here. Otherwise, -split- and -reshape- may be good starts for examining substring frequency. Note that the number of distinct values may exceed table limits.
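
A minimal sketch of the -split- and -reshape- route follows. The variable name company and the three example names are made up for illustration, and the cleaning step is an assumption about what the data might need rather than part of the suggestion above. -tabsplit- can be installed with -ssc install tab_chi- if a quick tabulation is enough.

```stata
* Toy data; "company" is a hypothetical variable name
clear
input str40 company
"Acme Solar LLC"
"Acme Wind Inc"
"Sunrise Solar LLC"
end

* Assumed cleaning step: uniform case, no commas or periods
generate name_clean = upper(trim(company))
replace name_clean = subinstr(name_clean, ",", "", .)
replace name_clean = subinstr(name_clean, ".", "", .)

* One new variable per word: word1, word2, ...
split name_clean, generate(word)

* Reshape to long: one observation per company and word position
generate long id = _n
reshape long word, i(id) j(position)
drop if word == ""

* Collapse to one observation per distinct word with its frequency
contract word, freq(n)
gsort -n
list in 1/3
```

Because -contract- keeps the counts in a dataset rather than a table, the number of distinct words is not bound by table limits, and the most frequent words can be browsed or listed directly.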
Ben Hoen

Join Date: May 2014

Posts: 85
#3

01 Aug 2014, 12:30

Thanks Nick. This did indeed work!