  • Read in text data, character per character

    Dear Statalisters,

    For further manipulations, I'm looking for a native Stata solution to read in text data so that every character (including spaces) is read in as a separate observation (one character per line).

    What I would basically need, then, is to insert spaces between the characters of my input file.
    With a stream editor (e.g. GNU sed), this can easily be done by replacing every character with itself followed by a space.
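
    For illustration, that sed substitution could even be run from within Stata via the shell escape (just a sketch, assuming sed is available on the system; test_spaced.raw is a hypothetical output name):

    Code:
    !sed 's/./& /g' test.raw > test_spaced.raw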

    The following solution using the filefilter command works, but it is very crude; I am looking for something more elegant and, especially, faster (when it comes to bigger text files).

    ********************* START *********************
    /* generate some test data */
    clear
    set obs 1
    gen str word="Hello world. This is a test! `"
    outfile using test.raw, noquote replace

    /* filefilter the test data:
       replace every character with itself preceded by a space
       [backslash and left quote are handled in an additional step] */
    qui {
        forvalues i=33(1)255 {
            if !(`i'==92|`i'==96) {
                noisily di "`=char(`i')'"
                filefilter test.raw test_1.raw, ///
                    from(`"`=char(`i')'"') to(`" `=char(`i')'"') replace
                erase test.raw
                copy test_1.raw test.raw, replace
            }
        }
        filefilter test.raw test_1.raw, ///
            from(\BS) to(`" \BS"') replace
        erase test.raw
        copy test_1.raw test.raw, replace

        filefilter test.raw test_1.raw, ///
            from(\LQ) to(`" \LQ"') replace
        erase test.raw
        copy test_1.raw test.raw, replace

        /* replace spaces by a placeholder */
        filefilter test.raw test_1.raw, ///
            from(`" "') to(`" *SPACE* "') replace
        erase test.raw
        copy test_1.raw test.raw, replace
    }

    /* infile the data */
    infile str10 char using test, clear

    /* replace the placeholder */
    replace char=" " if char=="*SPACE*"
    compress

    /* erase the test data */
    erase test.raw
    erase test_1.raw
    ********************* END *********************

    So if anyone has any ideas (e.g. using the file command or regular expressions), please let me know; I would be very grateful.
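
    To sketch what I mean by the -file- route, a byte-by-byte read might look something like this (untested; as with the filefilter version above, the quote characters would still need special handling, because they break the macro quoting):

    Code:
    tempname in ph
    tempfile chars
    postfile `ph' str1 char using `chars'
    file open `in' using test.raw, read binary
    file read `in' %1s c
    while r(eof)==0 {
        post `ph' (`"`c'"')    // one observation per byte read
        file read `in' %1s c
    }
    file close `in'
    postclose `ph'
    use `chars', clear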

    Many thanks

    Ali

    P.S.: I'm using Stata 12.1

  • #2
    How about this: using -infix-, you can tell it to read one character at a time. With anything reasonably wide, that could get really tedious, so we use Stata to create the .do file for us! If you're going wider than 300 characters, you might need to modify it in various ways, but I hope this basic approach is attractive and simpler than what you were attempting.

    Code:
    cd c:\data\text
    
    clear
    set obs 100
    *====creating do file
    gen strL var1="cd c:\data\text" if _n==1
    replace var1="clear" if _n==2
    replace var1="infix " if _n==3
    
    *===meat of it.  Assumes 300 characters per line
    forvalues i=1/300 {
        replace var1=var1+" str1 var" + "`i'" + " " + "`i'" + "-" + "`i'" + " " if _n==3
    }
    replace var1=var1+ " using C:\data\text\rawtext.txt" if _n==3
    outfile using "c:\data\text\readraw.do", replace noq
    
    *=====now do the do-file we created!
    clear
    do readraw.do
    gen obs=_n
    reshape long var, i(obs) j(j)
    Last edited by ben earnhart; 09 Feb 2015, 16:25.

    • #3
      BTW -- I was expecting to bump up against a limit on the length of a command, but at least in Stata 13.1/IC, I can go up to the 2,047-variable limit in a single command. 2,047 characters is a pretty good-sized chunk of text, so you should probably be good to go; and if you have SE or MP, I guess you can go beyond that, though I dunno how far.

      • #4
        Duh. You have Stata 12.1, so no long strings, gsem, or unicorns for you. Here's a version that will work with Stata 12. Its 244-character string limit is moot, since we're only reading one character at a time.

        Code:
        cd c:\data\text
        set more off
        clear
        set obs 2000
        *====creating do file
        gen str50 var1="cd c:\data\text" if _n==1
        replace var1="clear" if _n==2
        replace var1="#delim ;" if _n==3
        replace var1="infix " if _n==4
        *===meat of it.  Assumes 300 characters per line
        forvalues i=1/300 {
            replace var1=var1+" str1 var" + "`i'" + " " + "`i'" + "-" + "`i'" + " " if _n==`i' +4
            replace var1=var1+ " using C:\data\text\rawtext.txt;" if `i'==300 & _n==305
        }
        
        outfile using "c:\data\text\readraw.do", replace noq
        
        *=====now do the do-file we created!
        clear
        do readraw.do
        gen obs=_n
        reshape long var, i(obs) j(j)
        Last edited by ben earnhart; 09 Feb 2015, 22:34.

        • #5
          Thank you Ben,

          This was just the elegant solution I was looking for. It works fine AND it is very quick!

          Best

          Ali

          PS: yeah, at the moment I "only" have Stata 12 MP with plenty of available RAM.
          I am thinking of waiting to buy a new version until Stata finally supports Unicode (which is crucial for corpus linguistics).

          • #6
            Hi again,

            I slightly modified Ben's approach: instead of assuming 300 characters per line, the length of the longest word in the text document is calculated first. This value is then used to automatically create Ben's readraw do-file; a sketch of that first step follows.
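
            Something along these lines (my reconstruction rather than the exact code; rawtext.txt as in Ben's example, and "longest line" equals "longest word" if the file holds one word per line):

            Code:
            tempname in
            local maxlen = 0
            file open `in' using rawtext.txt, read text
            file read `in' line
            while r(eof)==0 {
                * the extended macro function avoids the 244-character expression limit
                local maxlen = max(`maxlen', `: length local line')
                file read `in' line
            }
            file close `in'
            di "longest line: `maxlen' characters"
            * ... then use `maxlen' in place of the hard-coded 300: forvalues i=1/`maxlen' { ... }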

            Just compared this approach and my original filefilter version with a bigger data set (around 1 million words).

            Surprisingly, the filefilter version is much faster. Using the timer command, Ben's version took 5.11 min, while the filefilter version only needed 0.67 min to finish.
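
            (For reference, a -timer- skeleton for such a comparison; the comments are placeholders for the two approaches:)

            Code:
            timer clear
            timer on 1
            * ... Ben's infix/reshape approach ...
            timer off 1
            timer on 2
            * ... the filefilter approach ...
            timer off 2
            timer list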

            It seems that the reshape part of Ben's approach is the most time-consuming, I would say more than 95% of it. So instead of using reshape, one could first outfile the raw data, filefilter the superfluous spaces, and then infile it again (see the sketch below). However, at around one minute, this still takes longer than the filefilter approach (and results in some errors).
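
            A rough sketch of that reshape-free variant (untested; wide.raw and long.raw are hypothetical file names, and var* refers to the str1 variables created by readraw.do). Note that real space characters become indistinguishable from -outfile-'s column padding, which is presumably one source of the errors just mentioned:

            Code:
            * after running the generated readraw.do (data in memory: var1, var2, ...)
            outfile var* using wide.raw, noquote replace
            * collapse the runs of spaces left by -outfile-'s column padding
            forvalues k=1/5 {
                filefilter wide.raw long.raw, from("  ") to(" ") replace
                copy long.raw wide.raw, replace
            }
            * with a single free-format variable, -infile- reads one token per observation
            infile str1 char using wide.raw, clear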

            Ali

            • #7
              Originally posted by Alexander Koplenig
              Just compared this approach and my original filefilter version with a bigger data set (around 1 million words).

              Surprisingly, the filefilter version is much faster. Using the timer command, Ben's version took 5.11 min, while the filefilter version only needed 0.67 min to finish.

              Ali
              It's a good thing you tested. From -help filefilter-:

              Because of the buffering design of filefilter, arbitrarily large files can be converted quickly.
