Splitting string into column vector

Johan Karlsson

Join Date: Jan 2020

Posts: 25
#1

09 Jan 2020, 09:27

Dear Statalist,

I am currently working with a dataset consisting of research articles, where a given article text is saved in one cell. I would like to deconstruct this string variable into a column vector, consisting of all individual words within that string. I have worked out a solution to it, but it feels inefficient as it involves a large number of loops. The process that I am currently working with is described below.

For this description, I will be working with three different types of variables:
Stringvar - The variable containing the text. For the sake of this example, it will contain the text "A small cat"

Word_var - A column vector in which to compile all individual words of "Stringvar"

Var`i' - A set of placeholders that contain one word from "Stringvar"

Starting out, each dataset contains only "Stringvar" and 1 row.

The way that I have been going forward with this up until now is described in code below:

----------------------------------------------------------------------------------------------------------------------------------

clear all
set obs 1
gen Stringvar="A small cat"

// First I need to generate the same number of rows as number of words, which is done using
// the wordcount() function and macros

gen wordcount=wordcount(Stringvar)
egen s=max(wordcount)
replace wordcount=s
drop s
local wordcount=wordcount
global wordcount=wordcount
set obs $wordcount

// Next, I create the placeholder, "Word_var"

gen Word_var=.
tostring Word_var, replace

// Next, I loop over all words of Stringvar and create Var`i' that contains each word:
// In this example, this means that I will create variables Var1, Var2 and Var3.

forvalues i=1/`wordcount' {
gen Var`i'=word(Stringvar,`i')

// Next, I extend them so that each row of Var`i' contains the same word:

replace Var`i'=Var`i'[_n-1] if Var`i'==""

// Lastly, I compile them into "Word_var":

replace Word_var=Var`i' if `i'==_n
}
// (In reality, I also delete Var1, Var2 and Var3 after each loop to reduce number of variables)

----------------------------------------------------------------------------------------------------------------------------------
I have then, in the end, created three variables (one for each word) using loops and compiled their values into "Word_var".

Stringvar     Word_var   Var1   Var2    Var3
A small cat   A          A      small   cat
              small      A      small   cat
              cat        A      small   cat

The problem with this is that a normal document with 10,000 words requires 10,000 iterations of the loop to go through the text.

My question is now:

Is there a smarter way of doing this?

Sincerely
Johan Karlsson

Nick Cox

Join Date: Mar 2014
Posts: 35652

09 Jan 2020, 09:44

Your approach is mine, but the code can be simplified.

Code:

clear all
set obs 1
gen stringvar = "A small cat"
gen wordcount = wordcount(stringvar)
split stringvar
gen long id = _n
expand wordcount
bysort id: gen word = word(stringvar, _n)

list

     +----------------------------------------------------------------------+
     |   stringvar   wordco~t   string~1   string~2   string~3   id    word |
     |----------------------------------------------------------------------|
  1. | A small cat          3          A      small        cat    1       A |
  2. | A small cat          3          A      small        cat    1   small |
  3. | A small cat          3          A      small        cat    1     cat |
     +----------------------------------------------------------------------+

Although you don't need it, summarize foo, meanonly leaves the mean behind in r(mean); using egen to hold a single mean is overkill.

Although you don't need it either, generate bar = "" initializes a string variable as missing.
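
A minimal sketch of those two side remarks, applied to the original example. The local macro name nwords is made up for illustration, and r(max) is used rather than r(mean) because the original code wanted the maximum; summarize, meanonly leaves both behind.

```stata
clear
set obs 1
gen Stringvar = "A small cat"
gen wordcount = wordcount(Stringvar)

// summarize, meanonly stores r(max) (and r(mean), r(min), ...) without
// creating a new variable, so no egen is needed
summarize wordcount, meanonly
local nwords = r(max)    // nwords is a hypothetical name
set obs `nwords'

// an empty string variable can be created directly,
// with no need for gen ... = . followed by tostring
gen Word_var = ""
```

This only replaces the set-up steps of the first post; it does nothing to remove the loop itself.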

Most crucially, there are no loops here except insofar as by: controls a loop over groups of observations.

The point that is most surprising to people whose programming has been largely in mainstream languages is that

Code:

gen wordcount = wordcount(stringvar)
split stringvar
gen long id = _n
expand wordcount
bysort id: gen word = word(stringvar, _n)

applies equally to 10000 or 10 million observations, although speed will naturally vary.
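
To illustrate that point, here is a minimal sketch of the same expand approach applied to more than one text at once. The example data, the str60 width and the variable names text and wordno are assumptions made for illustration. The split step is left out on the assumption that the separate stringvar1, stringvar2, ... variables are not wanted, since word() works directly on the original string.

```stata
clear
input long id str60 text
1 "A small cat"
2 "This is the first article"
end

// one copy of each observation per word in its text
gen wordcount = wordcount(text)
expand wordcount

// within each text, the k-th copy receives the k-th word
bysort id: gen wordno = _n
gen word = word(text, wordno)

keep id wordno word
list, sepby(id)
```

The result is in long layout, one row per word, with id and wordno recording which text each word came from and its position there.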

Last edited by Nick Cox; 09 Jan 2020, 09:47.

Johan Karlsson

Join Date: Jan 2020
Posts: 25

09 Jan 2020, 09:47

Thank you so much Nick, this is perfect!

Mike Lacy

Join Date: Apr 2014
Posts: 2413

09 Jan 2020, 11:40

Here's an approach that uses a bit of Mata. I presumed you had many observations, with each one having a variable with an article's text.

Code:

// Example data
clear
input articleid strL text
111 "This is the first article"
220 "A second article appears here, with a longer set of words"
99 "This last article has trivial content."
end
// 
//  Do it.
putmata s = text    
forval i = 1/`=_N' {
   mata: ss = (tokens(s[`i',1]))'
   local id  = articleid[`i']  // article ids become part of varname
   getmata s`id' = ss, force
}
drop articleid text // no longer relevant or correct
