Most common word from string variable

William Blackmon

Join Date: Oct 2016

Posts: 40
#1

16 Jan 2017, 13:37

Hi all -- probably an easy question, but I can't seem to figure it out. I have one string variable composed of a number of names and would like to generate a new variable containing the most common name from the other variable.

So if var1 = "John John James John William Sam Sarah John"

I would like var2 to contain "John"

Any way to take the word mode from a single variable?

Nick Cox

Join Date: Mar 2014
Posts: 35700
#2

16 Jan 2017, 14:09

Some technique:

Code:

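clear
set obs 2

* example data: one string of names per observation
gen var1 = "John John James John William Sam Sarah John" in 1
replace var1 = "Michelle Michelle Martin" in 2

gen long id = _n
* split on spaces into var11, var12, ...
split var1
drop var1
* one observation per word within each id
reshape long var1, i(id) j(which)
drop if missing(var1)
* freq = number of times each word occurs within id
bysort id var1 : gen freq = _N
* the most frequent word sorts last within id
bysort id (freq) : gen mode = var1[_N]
drop freq
* back to one observation per id
reshape wide var1, i(id) j(which)

list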

William Blackmon

Join Date: Oct 2016

Posts: 40
#3

16 Jan 2017, 14:56

Beautiful, thanks!

Clyde Schechter

Join Date: Apr 2014

Posts: 30103
#4

16 Jan 2017, 15:06

Just one caution before you implement Nick's advice. In situations like this it commonly happens that in some observations more than one word is tied for most common. For example, if you had an observation with var1 = "John John James John James James William Sam Sarah", John and James would both have "the" most occurrences. Nick's code will select one of those arbitrarily, because sorting on freq alone does not fix the order of tied words, and the choice may not be reproducible when you rerun the code. If that is fine with you, then proceed.

If not, you may need to come up with a rule for breaking this kind of tie (e.g., the alphabetically first, or last, or some other scheme). In that case, the code needs to be modified accordingly. You can post back for help if that's needed and you don't see how to do it yourself.
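
To make the tie-breaking idea concrete, here is a minimal sketch of one way it could be done, assuming an alphabetical rule is acceptable. It adapts Nick's code rather than replacing it. The names word, mode_last, mode_first and negfreq are made up for illustration and do not appear elsewhere in the thread.

```stata
clear
set obs 2
gen var1 = "John John James John James James William Sam Sarah" in 1
replace var1 = "Michelle Michelle Martin" in 2

gen long id = _n
* keep var1 and put the separate words in word1, word2, ...
split var1, gen(word)
reshape long word, i(id) j(which)
drop if missing(word)

* freq = number of times each word occurs within id
bysort id word : gen freq = _N

* rule A: among tied words, take the alphabetically last
* (sorting on word as well as freq makes the order of ties deterministic)
bysort id (freq word) : gen mode_last = word[_N]

* rule B: among tied words, take the alphabetically first
gen negfreq = -freq
bysort id (negfreq word) : gen mode_first = word[1]

* back to one observation per id; all kept variables are constant within id
bysort id : keep if _n == 1
keep id var1 mode_last mode_first
list
```

In the first observation John and James each occur three times, so mode_last is John and mode_first is James. In the second observation both rules give Michelle, because there is no tie.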

William Blackmon

Join Date: Oct 2016

Posts: 40
#5

16 Jan 2017, 15:10

Ah thanks, good point. I think in my situation there should always be a clear winner (the original variable has first and last names for a family, so the last name consistently emerges with Nick's code).

Julio Raffo

Join Date: May 2014

Posts: 132
#6

16 Jan 2017, 15:22

I have never used it seriously, but I think txttool may help you with this.

Julio Raffo

Join Date: May 2014
Posts: 132
#7

17 Jan 2017, 00:48

A simple example using txttool follows. I guess this might become a problem if your dataset has many different names, because the bagwords option creates one w_* count variable for each distinct word.

Code:

clear
set obs 2
gen var1 = "John John James John William Sam Sarah John" in 1
replace var1 = "Michelle Michelle Martin Martin William" in 2
txttool var1 , replace bagwords
egen rowmax=rowmax(w_*)
gen var2=""
gen var3=""
foreach x of var w_* {
 replace var2=substr("`x'",3,.) if `x'==rowmax // var2 similar to Nick's code (although keeping only the last most frequent name found)
 replace var3=var3+" "+substr("`x'",3,.) if `x'==rowmax // var3 keeps all tied most frequent names found.
}
list var2 var3 w*, table

Last edited by Julio Raffo; 17 Jan 2017, 00:56. Reason: misspelling

Nick Cox

Join Date: Mar 2014

Posts: 35700
#8

17 Jan 2017, 02:14

Julio: Please give the provenance of user-written programs you cite.

SJ-14-4 dm0077 . . . . . . . . txttool: Utilities for text analysis in Stata
(help txttool if installed) . . . . . . U. Williams and S. P. Williams
Q4/14 SJ 14(4):817--829
provides tools for managing free-form text