  • Escape compound double quotes that occur in a macro?

    Hello,

    I have some code that uses the file command to read in lines from a text file, perform some cleaning, and write them back out. The problem I am running into is that occasionally the line contains the characters "' , which Stata of course interprets as a closing compound double quote. When I try to output the line with file write `"`line'"' _n, the closing compound quote within the macro prematurely terminates the quoted string and the remainder of the line triggers a syntax error. The following code illustrates the problem:
    Code:
    local line = char(34) + char(39) + char(34)
    macro list _line
    file open test using test.txt, text write replace
    file write test `"`line'"' _n
    file close test

    The file write `"`line'"' _n command in the above code triggers an invalid syntax r(198) error.

    What I would ideally like here is some way of telling Stata to ignore any quotes that happen to occur within the macro when determining where the line ends. (Something kind of like macval() but for preventing interpretation of quotes rather than macros.) Does such a thing exist? Can anyone think of a good workaround for this problem?

    Thanks very much!

  • #2
    Well, that was an amusing problem. The following seems to work. The trick is to write from a string variable rather than a local macro, as the variable is not subject to the command line parser.
    Code:
    clear
    set obs 1
    gen str8 line = char(34) + char(39) + char(34)
    list line, clean noobs
    file open test using test.txt, text write replace
    capture noisily file write test (line) _n
    file close test
    type test.txt


    • #3
      William,

      Thanks very much for your ingenious solution! At the moment, however, I don't see a simple way to apply it to my exact problem. In my example code, I used
      Code:
      local line = char(34) + char(39) + char(34)
      to try to keep things simple. In fact, what I really have is something like
      Code:
      file open input using input.txt, text read
      file read input line
      The syntax of file read requires me to put the value of the line into a macro. I don't have access to Stata at the moment for testing, but it seems to me that if I try to transfer the value from the macro to a string variable using
      Code:
      gen str line = `"`line'"'
      , I'm going to have the same problem. I suppose I could try to temporarily change occurrences of "' in the macro line to some different (and hopefully unique) string using subinstr and then change it back later, but even there I think I can foresee potential problems.

      Thoughts?


      • #4
        Depending on what kind of cleaning you want to do, perhaps -fileread()- and -filewrite()- would be useful. These functions do a binary read or write and don't pay attention to the quoting or line structure of the file. I'm thinking of something like:
        Code:
        set obs 1
        gen strL s = fileread("input.txt")
        * ... string functions used to clean the file contained in s that do
        * not entail paying attention to the line structure of the file ...
        gen long nb = filewrite("output.txt", s)


        • #5
          Are you familiar with the filefilter command and could it possibly accomplish the cleaning you need, bypassing this problem entirely?
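          For concreteness, here is a minimal sketch of the kind of call I have in mind. The file names and the @CQ@ placeholder are hypothetical, and I'm relying on filefilter's escape codes, where, if I recall correctly, \Q stands for the double quote and \RQ for the right single quote (see -help filefilter-):
          Code:
          * sketch: replace each "' sequence with a placeholder string,
          * writing the result to a new file; names here are hypothetical
          filefilter input.txt output.txt, from("\Q\RQ") to("@CQ@") replace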

          My general take on your approach is that the appropriate thing to do is to read your input file into a Stata dataset as a string variable and then process the text in the variable, rather than work with a macro value that is subject to command line interpretation.

          The infix command would seem to offer this capability. However, while the documentation makes no comment about any special handling of quotation marks, infix apparently drops quotation marks that are found as the first nonblank character of an input line, as shown below. If someone doesn't explain this behavior in the next few days, I will forward a link to this post to Stata Technical Services and seek their clarification.

          I hope the same problem isn't apparent in filefilter.

          A real hack would be to use filefilter to add a single nonblank character to the beginning of each line, then read the output of filefilter with infile, removing the added character as part of the cleaning. I'd be embarrassed every time I ran the code.

          I'll continue to think about this.

          Code:
          . clear
          
          . type input_text.dct
          infix dictionary {
          str line 1-100
          }
          
          . type input_text.txt
          Lorem ipsum dolor sit amet, 
          consectetur adipiscing elit, 
          "'"
          here are "normal quotes" in a line
           "'"
          "quotes" at the beginning of a line
          at the end of a line, "quotes"
          sed do eiusmod tempor incididunt 
          ut labore et dolore magna aliqua. 
          
          . infix using input_text.dct, using(input_text.txt) 
          infix dictionary {
          str line 1-100
          }
          (9 observations read)
          
          . list, clean
          
                                               line  
            1.          Lorem ipsum dolor sit amet,  
            2.         consectetur adipiscing elit,  
            3.                                   '"  
            4.   here are "normal quotes" in a line  
            5.                                   '"  
            6.   quotes" at the beginning of a line  
            7.       at the end of a line, "quotes"  
            8.     sed do eiusmod tempor incididunt  
            9.    ut labore et dolore magna aliqua.


          • #6
            On further reflection, I have to come down in favor of Mike's approach. The extended example below shows that, for whatever reason, infix does not reproduce the input file lines with full fidelity in a Stata string variable. At a minimum, leading blanks, tabs, and trailing blanks are trimmed, along with the still-inexplicable trimming of a quotation mark possibly preceded by leading spaces.
            Code:
            . type input_text.txt
            Lorem ipsum dolor sit amet,
            consectetur adipiscing elit,
            here are "normal quotes" in a line
            "quotes" at the beginning of a line
              "quotes" following two blanks at the beginning of a line
                    "quotes" following a tab at the beginning of a line
            at the end of a line, "quotes"
                 five blanks at the beginning of a line
            five blanks you can't see at the end of a line
            "'"
             "'"
            x "'"
            sed do eiusmod tempor incididunt
            ut labore et dolore magna aliqua.
            
            . shell cat -e -t input_text.txt
            
            Lorem ipsum dolor sit amet,$
            consectetur adipiscing elit,$
            here are "normal quotes" in a line$
            "quotes" at the beginning of a line$
              "quotes" following two blanks at the beginning of a line$
            ^I"quotes" following a tab at the beginning of a line$
            at the end of a line, "quotes"$
                 five blanks at the beginning of a line$
            five blanks you can't see at the end of a line$
            "'"$
             "'"$
            x "'"$
            sed do eiusmod tempor incididunt$
            ut labore et dolore magna aliqua.$
            
            . infix using input_text.dct, using(input_text.txt) clear
            infix dictionary {
            str line 1-100
            }
            (14 observations read)
            
            . replace line = ">"+line+"<"
            (14 real changes made)
            
            . list, clean
            
                                                                        line  
              1.                               >Lorem ipsum dolor sit amet,<  
              2.                              >consectetur adipiscing elit,<  
              3.                        >here are "normal quotes" in a line<  
              4.                        >quotes" at the beginning of a line<  
              5.   >quotes" following two blanks at the beginning of a line<  
              6.        >quotes" following a tab at the beginning of a line<  
              7.                            >at the end of a line, "quotes"<  
              8.                    >five blanks at the beginning of a line<  
              9.            >five blanks you can't see at the end of a line<  
             10.                                                        >'"<  
             11.                                                        >'"<  
             12.                                                     >x "'"<  
             13.                          >sed do eiusmod tempor incididunt<  
             14.                         >ut labore et dolore magna aliqua.<


            • #7
              Mike and William,

              Thanks very much for your good ideas. I was familiar with neither fileread() nor filefilter, and both approaches show promise for my problem.

              One issue with using fileread() for my particular problem is that the files I am dealing with are large and will typically exceed the 2 GB strL limit. Some sort of process for splitting the files into pieces would need to be devised, a potentially messy complication.

              The filefilter command, if it doesn't suffer from the initial quotation mark problem, really looks very promising. (If it has the problem exhibited with infix, that's trouble, since the first character in all my files is a double quote.) It looks like it handles only one from-to transformation per call, so I would have to read and write the file multiple times to make all my edits, but that's an inefficiency I can live with.
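              Chaining the passes through a tempfile seems like the natural way to handle the one-transformation-per-call limitation. A sketch, with hypothetical file names and placeholder from()/to() patterns:
              Code:
              * sketch: apply two filefilter transformations in sequence,
              * passing the intermediate result through a tempfile
              tempfile pass1
              filefilter input.txt `pass1', from("foo") to("bar") replace
              filefilter `pass1' output.txt, from("baz") to("qux") replace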

              Thanks again for your suggestions!

              William, thanks for your most recent follow-up post, which I just saw. It looks like infix is out, but perhaps filefilter is not.
              Last edited by West Addison; 14 Jul 2018, 11:10.


              • #8
                The advantage of -fileread- is that you can use the whole variety of string manipulations for cleaning, including regular expressions (if necessary), and it's fast. The disadvantage, as you note, is the 2 GB limit. As for splitting up a text file, you might try -ssc desc chunky-, which I've found functional and easy to use.


                • #9
                  Thanks, Mike! I will have a look at chunky when I have access to Stata.


                  • #10
                    Thanks again, Mike and William, for your input. I ended up using filefilter, since that seemed to require the smallest amount of coding in my particular case, and when I tested it I discovered it did not have the problems of stripping quotes and spaces that you encountered with infix, William. It also ran a lot faster than the file read / file write method that I had been (unsuccessfully) using.

                    An advantage of the fileread() / filewrite() method would have been the ability to use regular expressions, which would have reduced the number of separate transformations that I had to employ without that capability. However, that advantage seemed in my particular case to be outweighed by the need to split the files into pieces and then reassemble them, a job admittedly simplified a great deal by using chunky as you suggested, Mike.


                    • #11
                      To close the loop here, I received the following advice from Stata Technical Services.

                      Although not what one would want given the input file posted on Statalist, this is the intended behavior of -infix-. -infix- attempts to parse the fields it is reading, expecting fields to be delimited by whitespace (hence the stripping of leading and trailing whitespace). It assumes that the incoming file won't have quote-bound strings but instead positional strings that a dictionary tells it to read according to each string's position.

                      For a completely unformatted file, the best way to bring it into a string variable with each line of the file being put into an observation of that string variable is to use -import delimited- with a few options to make sure it doesn't try to split up lines based on tabs or commas, doesn't try to interpret the first line as variable names, doesn't try to bind on quotes or strip quotes, and sees only one string column in the file:
                      Code:
                        import delimited using input_text.txt,    ///
                              varnames(nonames)                   ///
                              bindquotes(nobind)                  ///
                              stripquotes(no)                     ///
                              stringcols(1)                       ///
                              delimiters("ZZZZZ", asstring)
                      For the -delimiters()- option, just choose a sequence of characters you are sure does not appear anywhere in the file. -import delimited- will then think there is a single value per line and not attempt to split the data into two or more columns.

                      If you have multiple variables in the text file, you can first import it as a single variable using the above method, then generate new variables using the -substr()- function with the exact starting column and length. For example, you can use code like
                      Code:
                          generate v2=substr(v1,20,10)
                      Mata is another approach. There are multiple ways you could process a file such as this in Mata, but one simple approach is to bring each line of the file into an element of a string vector in Mata. This then allows any of Mata's string functions to operate on that vector, including the regular expression functions. Or, if it is preferable to have the lines of the file in a string variable in Stata, that can still be accomplished with Mata.
                      Code:
                        /* Example 1 -- bring file into string vector in Mata */
                        mata:
                        lines = cat("text.txt")
                      That's it. Now there is a string colvector in Mata named 'lines' that has the lines of the file in it. Manipulations on these lines could be performed, and Mata's various file I/O functions can be used to write the lines back out into the desired changed file.
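                      A sketch of the round trip using Mata's file I/O functions follows; the file names are hypothetical, and strtrim() merely stands in for whatever cleaning is actually needed:
                      Code:
                        /* Example 2 (sketch) -- clean the lines and write them back out */
                        mata:
                        lines = cat("text.txt")
                        lines = strtrim(lines)           // stand-in for the real cleaning
                        fh = fopen("text_clean.txt", "w")
                        for (i = 1; i <= rows(lines); i++) {
                            fput(fh, lines[i])           // fput() supplies the end-of-line
                        }
                        fclose(fh)
                        end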
