Fix no room to add more variables because of width issue when splitting a long string

Tasneem Mohammed

Join Date: Jul 2021

Posts: 9
#1

16 Aug 2022, 19:35

Hi all,

I have a single string that is extremely long. It has a length of about 600,000 characters and 90,000 words separated by single spaces.

I want each word of this string to become its own observation, so the result should be a dataset of 90,000 observations, one per word of the initial long string.

What would be the most efficient way to achieve this?

I tried using the split command with a variety of separators in the parse option. The idea is to split the string by spaces or some other separator and then reshape it from wide to long. Two examples I tried include:

Code:

clear
set maxvar 32767
split text, parse(" and")
split text, parse(" ")

Naturally, no matter what separator I use to split the string, Stata returns a "no room to add more variables because of width" error. I understand that this is happening because my string is so long that Stata is reaching the maximum number of variables allowable per observation.

Is there a workaround to this issue to get to my final objective of converting the single long string with 90,000 words into a dataset with 90,000 observations/words?

FYI, I am attaching an example string that I tried splitting. I did not include the string in the code above due to its immense size.

Regards,
Tasneem
Attached Files

long_string.dta (581.4 KB, 1 view)
Tasneem Mohammed

Join Date: Jul 2021

Posts: 9
#2

16 Aug 2022, 20:03

Just an update: I figured out some code that gives me what I want, i.e. the 90,000 words contained in one string converted to 90,000 observations. However, it is excruciatingly slow since it creates and appends 90,000 temporary files. I am sure there is a better way of going about this.

Code:

use long_string, clear
gen word_count = wordcount(text)
* Word count gives 92,900 words so use this in the loop below
forvalues i = 1/92900 {
    use long_string, clear
    gen word`i' = word(text,`i')
    keep word`i'
    rename word`i' text
    tempfile temp`i'
    quietly save "`temp`i''", replace
}
clear
forvalues i = 1/92900{
    append using "`temp`i''"
}

If the above code is the only way forward, I would appreciate it if someone tells me of a way through which I can feed the word count of the string into the loop without manually specifying 92,900 at the start of the loop (I'm thinking macros but can't seem to get it right).

Thanks,
Tasneem
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#3

16 Aug 2022, 20:42

If you are running version 16 or 17, you can do this with frames:

Code:

clear*
use long_string
gen long wc = wordcount(text)
local wc = wc[1]
frame create long_words str2045 word
gen temp = ""
forvalues i = 1/`wc' {
    quietly replace temp = word(text, `i')
    frame post long_words (temp[1])
}
frame change long_words
compress
des

At the end of this code, the data set in frame long_words will contain a single variable, called word, that contains 92,900 observations, with one word from the long string in each.

On my setup this ran in 66 seconds.
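
The `local wc = wc[1]` line in the code above is also what addresses the question in #2 about not hard-coding 92900. As a minimal sketch of just that step, assuming a toy one-observation dataset built here for illustration (not the attached long_string.dta), the count can be copied straight into a local macro and used as the loop bound:

```stata
* toy data: one observation holding a short string (illustrative assumption)
clear
set obs 1
gen strL text = "one two three four five"

* copy the word count of the first observation into a local macro
local wc = wordcount(text[1])

* the macro now sets the loop bound, so nothing is typed in by hand
forvalues i = 1/`wc' {
    display word(text[1], `i')
}
```

This variant skips the helper variable wc and evaluates wordcount() directly on text[1]; either form should give the same macro.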
2 likes
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2457
#4

16 Aug 2022, 20:44

I initially started to write some code that would bite off one word at a time, but then I had a better idea. You can "recast" your problem as one of data export and re-import. The approach requires that you can identify word boundaries and insert your own word delimiters. Clearly that works here since -word()- uses spaces as word boundaries. Then you can export the word list to a text file and import it back as a CSV file. Here's a proof of concept.

Code:

input strL(words)
"one two three four five"
end

* create your own delimiter between words.
replace words = ustrregexra(words, " ", ",", .)

* note: must use a different delimiter than the comma used above or Stata will wrap output string in quotes.
tempfile mywords
export delimited words using `"`mywords'"', novarnames replace delim("$")
import delimited using `"`mywords'"', clear varnames(nonames) delim(",")
gen `c(obs_t)' rownum = _n
reshape long v, i(rownum) j(wordnum)
rename v word
drop rownum
list

Result

Code:

. list

     +-----------------+
     | wordnum    word |
     |-----------------|
  1. |       1     one |
  2. |       2     two |
  3. |       3   three |
  4. |       4    four |
  5. |       5    five |
     +-----------------+

Edit: Clyde's method is superior to mine and should be used. It's simple and efficient. My method won't work because -reshape- doesn't work with -strL-s and, despite only having ~93k words, I also get an error about too many variables for my version of Stata. What is puzzling is that the error appears even though I have Stata MP, which has a maximum number of variables of 120k.

Last edited by Leonardo Guizzetti; 16 Aug 2022, 21:07.

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2457
#5

16 Aug 2022, 21:21

I still liked the idea of exporting and importing, so I tried it again for fun. This way tweaks my earlier attempt with none of the drawbacks. The delimiter is a new line, so then on import each word is already on a new line.

Code:

use long_string, clear

gen words = ustrtrim(text)
di wordcount(words)

* change the delimiter to a new line character
replace words = ustrregexra(words, " ", "`=char(13)'", .)

tempname fh
tempfile mywords
file open `fh' using "`mywords'", write text replace
file write `fh' (words[1])
file close `fh'
type `"`mywords'"', lines(10)

import delimited word using "`mywords'", clear varnames(nonames) delim("!") stringcols(_all)
list in 1/10

This took <1 second on my system using your data as input.


Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#6

16 Aug 2022, 21:25

Re #5: Wow! Yes, it took only 0.56 seconds on my setup as well. Very nice!
1 like
Tasneem Mohammed

Join Date: Jul 2021

Posts: 9
#7

16 Aug 2022, 21:44

Thanks a lot for your inputs, Clyde and Leonardo.

I tried all the suggestions above and indeed, Leonardo's final piece of code appears to be the most efficient; I got what I wanted in less than a second too.

Cheers,
Tasneem

Last edited by Tasneem Mohammed; 16 Aug 2022, 21:57.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

17 Aug 2022, 08:36

Let me improve slightly on Clyde Schechter's solution from post #3, while first acknowledging the elegance of Leonardo Guizzetti's solution from post #5. The latter wins the medal for thinking so far outside the box that you're in a different time zone than the box.

For a situation where for some reason a looping solution is required, the code in post #3 suffers from having to dig progressively deeper into the text string to find successive words. By spending a little time to remove the first word from text after it has been extracted and posted to the long_words frame, we restrict the code to always finding the first word of what remains of the text string. Technically, this reduces the process from having a time roughly proportional to the square of the number of words to one having time roughly linear in the number of words. For a string as long as the one in question, the difference is substantial.

Code:

timer clear

clear*
use "~/Downloads/long_string"
timer on 1
gen long wc = wordcount(text)
local wc = wc[1]
frame create long_words str2045 word
gen temp = ""
forvalues i = 1/`wc' {
    quietly replace temp = word(text, `i')
    frame post long_words (temp[1])
}
frame change long_words
compress
des
timer off 1

frame change default
frame drop long_words

clear
use "~/Downloads/long_string"
timer on 2
replace text = trim(text)
frame create long_words str2045 word
gen temp = ""
while text!="" {
    quietly replace temp = word(text, 1)
    quietly replace text = substr(text,length(temp)+2,.)
    frame post long_words (temp[1])
}
frame change long_words
compress
des
timer off 2

timer list

Code:

. timer list
   1:     39.15 /        1 =      39.1520
   2:      5.94 /        1 =       5.9430

The initial 40% reduction in time compared to post #3 is perhaps due to running on this year's MacBook Air with the M2 version of Apple Silicon.
2 likes
Mike Lacy

Join Date: Apr 2014

Posts: 2449
#9

17 Aug 2022, 11:01

I'd offer one small change to Leonardo Guizzetti's nice solution, intended for the benefit of those of us who can never recall how to use -file write, file open- etc. correctly. <grin> That is to use -filewrite()- to save the file containing the words in one-per-line layout. That's a function I can usually almost remember how to use. I'm presuming here that the file with long_string can have just one observation.

Code:

gen b = filewrite("`mywords'", words)
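
For anyone wanting to try that line in context, here is a minimal sketch of how it might slot into the approach from #5. It assumes a toy one-observation dataset created on the spot rather than the attached file, uses -subinstr()- in place of the regex call, and keeps char(13) and the delim("!") trick from #5 on the assumption that -import delimited- treats them the same way when the file comes from -filewrite()-. The variable name bytes is just illustrative.

```stata
* toy data: one observation holding a short string (illustrative assumption)
clear
set obs 1
gen strL text = "  one two three four five "

* trim, then replace every space with the line-break character used in #5
gen strL words = ustrtrim(text)
replace words = subinstr(words, " ", char(13), .)

* filewrite() writes the string to the file and returns the bytes written
tempfile mywords
gen long bytes = filewrite("`mywords'", words)

* "!" is assumed not to occur in the text, so each line arrives as one column
import delimited word using "`mywords'", clear varnames(nonames) delim("!") stringcols(_all)
list
```

Note that this, like the code in #5, relies on words being separated by single spaces; runs of spaces would turn into empty lines in the file.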
1 like
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2457
#10

17 Aug 2022, 16:44

That's a nice convenience function, Mike Lacy. I had forgotten that it exists, thanks.