Command infile with string data which includes blanks

Carlo Wix

Join Date: Apr 2014

Posts: 8
#1


24 Apr 2014, 05:19

Hi all,

I am trying to read data into Stata 13 from a text file whose format is shown in the attached file (test.txt). Essentially, the file is one long string in which each observation is 125 characters, sometimes including blank spaces. After 125 characters a new observation starts, with no delimiter in between.

I tried reading the data using:

infile str125 v1 using "test.txt"

However, this gives me the following .dta file (attachment: Test.png). Apparently the infile command can’t handle blanks within the 125-character string and starts a new observation instead. Is there any way to tell infile that it should include blanks in the string variable?

Best regards

Carlo
Attached Files

test.txt (1,008 Bytes, 1 view)

Last edited by Carlo Wix; 24 Apr 2014, 05:26.
Carlo Wix

Join Date: Apr 2014

Posts: 8
#2

24 Apr 2014, 05:28

Some additional info:

I have also tried reading the data using:

infix str v1 1-125 using "test.txt"

But Stata then tells me: "(0 observations read)".

However, if I manually reformat the data so that it looks like test2.txt, I get the following .dta file (attachment: Test2.png), which is exactly what I am looking for. But since the actual .txt file is quite large, I cannot reformat all the data by hand.
Attached Files

test2.txt (1,022 Bytes, 1 view)
Nick Cox

Join Date: Mar 2014

Posts: 35698
#3

24 Apr 2014, 05:36

Given your first structure, with blocks of 4 lines, you can go something like

Code:

egen block = seq(), block(4)
sort block, stable
by block : gen everything = v1 if _n == 1
by block : replace everything = everything[_n-1] + " " + v1 if _n > 1
by block : keep if _n == 4

You still have to parse the data.
Brendan Halpin

Join Date: Mar 2014

Posts: 152
#4

24 Apr 2014, 06:04

I was going to suggest that for your basic problem -- that your source file is a single line -- you would need to use something other than Stata, such as a text processing utility (awk, sed) or a general programming language like Python, to pre-process it into something that Stata can read.

Then I remembered that Mata is quite a general programming language, and has the raw input/output facilities needed:

Code:

mata:
file = fopen("test.txt", "r")
outfile = fopen("testout.dat","w")
while ((line=fread(file,126))!=J(0,0,"")) {
    fwrite(outfile, line)
    fwrite(outfile,char(10))
}
fclose(file)
fclose(outfile)
end

This reads 126 bytes at a time from your source file and writes each chunk out to another file (which must not already exist), adding a line-break character. On Windows you may need a carriage-return/line-feed combination: put "fwrite(outfile,char(13))" before the "char(10)" line in that case.

Note that the data you provided are in 126-byte, not 125-byte, chunks.
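
For comparison, the pre-processing route mentioned at the top of this post could look something like the Python sketch below. It is only an illustration: the function names iter_records and add_line_breaks are made up for this example, and the default record length of 126 bytes is simply the chunk size noted above for the sample file.

```python
import io
from typing import BinaryIO, Iterator


def iter_records(stream: BinaryIO, record_len: int = 126) -> Iterator[bytes]:
    """Yield consecutive chunks of record_len bytes from a binary stream.

    The final chunk may be shorter if the data length is not an exact
    multiple of record_len.
    """
    while True:
        chunk = stream.read(record_len)
        if not chunk:
            break
        yield chunk


def add_line_breaks(src_path: str, dst_path: str, record_len: int = 126) -> int:
    """Copy src_path to dst_path with a line feed after every record.

    Returns the number of records written. The file is streamed, so a
    large source file is never held in memory all at once.
    """
    count = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for record in iter_records(src, record_len):
            dst.write(record + b"\n")
            count += 1
    return count


# Small in-memory demonstration with 2-byte records instead of 126-byte ones.
demo = list(iter_records(io.BytesIO(b"abcdef"), record_len=2))
```

Calling add_line_breaks("test.txt", "testout.dat") would then give one record per line, which should be readable by an ordinary fixed-column import. Unlike the Mata version, opening the output file in "wb" mode overwrites any existing file of that name.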
Joe Canner

Join Date: Mar 2014

Posts: 580
#5

24 Apr 2014, 06:46

Carlo,

Here's yet another solution which doesn't require any trickery. Create a dictionary (say, test.dct) containing the following:

Code:

dictionary using test.txt {
    _lrecl(126)
    str125 v1 %125s
}

and deploy it with infile using test.dct.

Of course, you can also read in individual variables instead of one long string, according to your specifications.
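
To illustrate the idea of pulling individual variables out of each fixed-length record, here is a small Python sketch of column-based slicing. The field names and column positions are hypothetical, invented for this example, and do not describe the real layout of test.txt. Columns are 1-based and inclusive, the same convention as the column range in infix.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (field name, first column, last column).
# These positions are assumptions for illustration only.
LAYOUT: List[Tuple[str, int, int]] = [
    ("id", 1, 8),
    ("name", 9, 40),
    ("rest", 41, 125),
]


def parse_record(record: str, layout: List[Tuple[str, int, int]] = LAYOUT) -> Dict[str, str]:
    """Slice one fixed-width record into named fields.

    Embedded blanks are preserved; only trailing padding is stripped.
    """
    return {name: record[start - 1:end].rstrip() for name, start, end in layout}


# A made-up 125-character record whose name field contains a blank.
sample = "00000001" + "Jane Doe".ljust(32) + "x" * 85
fields = parse_record(sample)
```

Because the slicing goes by position rather than by whitespace, the blank inside the name field does not split the value.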

Regards,
Joe
2 likes
Nick Cox

Join Date: Mar 2014

Posts: 35698
#6

24 Apr 2014, 07:26

I haven't tried it, but Brendan's code

Code:

fwrite(outfile, line)
fwrite(outfile, char(10))

could, I imagine, be tweaked to

Code:

fwrite(outfile, line + char(10))
Carlo Wix

Join Date: Apr 2014

Posts: 8
#7

24 Apr 2014, 07:39

Thanks a lot everyone for your help!

Joe's code solved the problem and was exactly what I was looking for.

@Nick: Your code worked perfectly for my provided example. However, in my real dataset I had the additional issue that the blocks were of varying size depending on the observation.

Best regards

Carlo
Nick Cox

Join Date: Mar 2014

Posts: 35698
#8

24 Apr 2014, 07:46

OK. The technique could easily be extended, e.g. suppose each block has the same starter text.

Code:

gen block = (v1 == "starter text")
replace block = sum(block)
sort block, stable
by block : gen everything = v1 if _n == 1

and so forth, with the last condition being if _n == _N
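
The same grouping logic, sketched in Python for anyone pre-processing outside Stata: a new block begins at every line equal to the starter text, and the lines of each block are joined with single spaces. The function name join_blocks and the sample lines are assumptions for illustration only.

```python
from typing import Iterable, List


def join_blocks(lines: Iterable[str], starter: str) -> List[str]:
    """Group lines into blocks and join each block into one string.

    A new block starts at every line equal to starter. Lines appearing
    before the first starter form a block of their own, which mirrors
    block == 0 in the running-sum approach.
    """
    blocks: List[List[str]] = []
    for line in lines:
        if line == starter or not blocks:
            blocks.append([line])
        else:
            blocks[-1].append(line)
    return [" ".join(block) for block in blocks]


# Blocks of different sizes, each opening with the same starter line.
pieces = ["starter text", "a", "b", "starter text", "c"]
joined = join_blocks(pieces, "starter text")
```

Each element of the result corresponds to one observation, regardless of how many lines its block had.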
Brendan Halpin

Join Date: Mar 2014

Posts: 152
#9

24 Apr 2014, 07:57

Joe's succinct solution has made me realise that infile via dictionary is rather more powerful and general than I thought.
Joe Canner

Join Date: Mar 2014

Posts: 580
#10

24 Apr 2014, 08:42

I don't deserve much credit for this solution. I haven't had to do this before, but the computer scientist in me couldn't believe that Stata didn't have some sort of built-in (albeit little used) solution for this sort of problem. The term lrecl (and its meaning) is a leftover from days gone by when text files were frequently stored without carriage returns.
Brendan Halpin

Join Date: Mar 2014

Posts: 152
#11

24 Apr 2014, 08:46

Yes, something about lrecl made me recall FORTRAN and punchcards. Or VAX/VMS at any rate.
Salma Gallas

Join Date: May 2021

Posts: 9
#12

21 Jun 2021, 15:39

Hi all
I would like to know how to drop observations with missing data. These data do not exist in my database, yet my model treats them as existing observations. Thanks.
