  • Splitting string into column vector

    Dear Statalist,

    I am currently working with a dataset consisting of research articles, where a given article text is saved in one cell. I would like to deconstruct this string variable into a column vector, consisting of all individual words within that string. I have worked out a solution to it, but it feels inefficient as it involves a large number of loops. The process that I am currently working with is described below.

    For this description, I will be working with three different types of variables:
    • Stringvar - The variable containing the text. For the sake of this example, it will contain the text "A small cat"
    • Word_var - A column vector in which to compile all individual words of "Stringvar"
    • Var`i' - A set of placeholders that contain one word from "Stringvar"
    Starting out, each dataset contains only "Stringvar" and 1 row.

    My current approach is shown in the code below:

    ----------------------------------------------------------------------------------------------------------------------------------

    clear all
    set obs 1
    gen Stringvar="A small cat"

    // First I need to generate the same number of rows as there are words, which is done using
    // the wordcount() function and macros

    gen wordcount=wordcount(Stringvar)
    egen s=max(wordcount)
    replace wordcount=s
    drop s
    local wordcount=wordcount
    global wordcount=wordcount
    set obs $wordcount

    // Next, I create the placeholder, "Word_var"

    gen Word_var=.
    tostring Word_var, replace

    // Next, I loop over all words of Stringvar and create Var`i' that contains each word:
    // In this example, this means that I will create variables Var1, Var2 and Var3.

    forvalues i=1/`wordcount' {
    gen Var`i'=word(Stringvar,`i')

    // Next, I extend them so that each row of Var`i' contains the same word:

    replace Var`i'=Var`i'[_n-1] if Var`i'==""

    // Lastly, I compile them into "Word_var":

    replace Word_var=Var`i' if `i'==_n
    }
    // (In reality, I also delete Var1, Var2 and Var3 after each loop to reduce number of variables)

    ----------------------------------------------------------------------------------------------------------------------------------
    I have then, in the end, created three variables (one for each word) using loops and compiled their values into "Word_var":

    Stringvar     Word_var   Var1   Var2    Var3
    A small cat   A          A      small   cat
                  small      A      small   cat
                  cat        A      small   cat





    The problem with this is that a typical document of 10,000 words requires 10,000 iterations of the loop to go through the text.

    My question is now:

    Is there a smarter way of doing this?

    Sincerely
    Johan Karlsson

  • #2
    Your approach is mine, but the code can be simplified.

    Code:
    clear all
    set obs 1
    gen stringvar = "A small cat"
    gen wordcount = wordcount(stringvar)
    split stringvar
    gen long id = _n
    expand wordcount
    bysort id: gen word = word(stringvar, _n)
    
    list
    
         +----------------------------------------------------------------------+
         |   stringvar   wordco~t   string~1   string~2   string~3   id    word |
         |----------------------------------------------------------------------|
      1. | A small cat          3          A      small        cat    1       A |
      2. | A small cat          3          A      small        cat    1   small |
      3. | A small cat          3          A      small        cat    1     cat |
         +----------------------------------------------------------------------+
    Although you don't need it here, note that summarize foo, meanonly leaves the mean behind in r(mean): using egen to hold a single mean is overkill.

    Although you don't need it either, generate bar = "" initializes a string variable as missing.
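
    As a minimal sketch of these two asides, applied to the setup from the original question: summarize with the meanonly option also leaves the maximum behind in r(max), so no helper variable is needed to find the largest word count, and an empty string literal creates a string placeholder directly. The names nwords and word_var are illustrative only and do not come from the posts above.

    ```stata
    clear
    set obs 1
    generate stringvar = "A small cat"
    generate wordcount = wordcount(stringvar)

    * summarize, meanonly stores r(mean), r(min) and r(max), among others,
    * without creating an extra variable
    summarize wordcount, meanonly
    local nwords = r(max)          // illustrative macro name
    set obs `nwords'

    * an empty string literal yields a string variable that is missing in every row
    generate word_var = ""         // illustrative variable name
    ```

    This replaces the egen, replace, drop, local and global steps, as well as the gen followed by tostring, with one command each.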

    Most crucially, there are no loops here except insofar as by: controls a loop over groups of observations.

    The point that is most surprising to people whose programming has been largely in mainstream languages is that

    Code:
    gen wordcount = wordcount(stringvar)
    split stringvar
    gen long id = _n
    expand wordcount
    bysort id: gen word = word(stringvar, _n)
    applies equally to 10000 or 10 million observations, although speed will naturally vary.
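
    A minimal sketch of that point, assuming a dataset with one article per observation: the three-article example data and the names articleid and text are illustrative, mirroring the example data used later in this thread. The split step is left out because the word variable does not depend on it.

    ```stata
    clear
    input long articleid str80 text
    111 "This is the first article"
    220 "A second article appears here, with a longer set of words"
    99  "This last article has trivial content."
    end

    * one copy of each observation per word in its text
    generate wordcount = wordcount(text)
    expand wordcount

    * within each article, observation _n picks up the _n-th word
    bysort articleid: generate word = word(text, _n)

    list articleid word, sepby(articleid)
    ```

    The same commands run unchanged however many articles there are. The only loop is the implicit one that by: performs over the groups defined by articleid.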
    Last edited by Nick Cox; 09 Jan 2020, 09:47.



    • #3
      Thank you so much Nick, this is perfect!



      • #4
        Here's an approach that uses a bit of Mata. I presumed you had many observations, with each one having a variable with an article's text.

        Code:
        // Example data
        clear
        input articleid strL text
        111 "This is the first article"
        220 "A second article appears here, with a longer set of words"
        99 "This last article has trivial content."
        end
        // 
        //  Do it.
        putmata s = text    
        forval i = 1/`=_N' {
           mata: ss = (tokens(s[`i',1]))'
           local id  = articleid[`i']  // article ids become part of varname
           getmata s`id' = ss, force
        }
        drop articleid text // no longer relevant or correct
