  • Escape compound double quotes that occur in a macro?

    Hello,

    I have some code that uses the file command to read in lines from a text file, perform some cleaning, and write them back out. The problem I am running into is that occasionally the line contains the characters "' , which Stata of course interprets as a closing compound double quote. When I try to output the line with file write `"`line'"' _n, the closing compound quote within the macro prematurely terminates the quoted string and the remainder of the line triggers a syntax error. The following code illustrates the problem:
    Code:
    local line = char(34) + char(39) + char(34)
    macro list _line
    file open test using test.txt, text write replace
    file write test `"`line'"' _n
    file close test

    The file write `"`line'"' _n command in the above code triggers an invalid syntax r(198) error.

    What I would ideally like here is some way of telling Stata to ignore any quotes that happen to occur within the macro when determining where the line ends. (Something kind of like macval() but for preventing interpretation of quotes rather than macros.) Does such a thing exist? Can anyone think of a good workaround for this problem?

    Thanks very much!

  • #2
    Well, that was an amusing problem. The following seems to work. The trick is to write from a string variable rather than a local macro, as the variable is not subject to the command line parser.
    Code:
    clear
    set obs 1
    gen str8 line = char(34) + char(39) + char(34)
    list line, clean noobs
    file open test using test.txt, text write replace
    capture noisily file write test (line) _n
    file close test
    type test.txt



    • #3
      William,

      Thanks very much for your ingenious solution! At the moment, however, I don't see a simple way to apply it to my exact problem. In my example code, I used
      Code:
      local line = char(34) + char(39) + char(34)
      to try to keep things simple. In fact, what I really have is something like
      Code:
      file open input using input.txt, text read
      file read input line
      The syntax of file read requires me to put the value of the line into a macro. I don't have access to Stata at the moment for testing, but it seems to me that if I try to transfer the value from the macro to a string variable using
      Code:
      gen str line = `"`line'"'
      , I'm going to have the same problem. I suppose I could try to temporarily change occurrences of "' in the macro line to some different (and hopefully unique) string using subinstr and then change it back later, but even there I think I can foresee potential problems.
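
      One possible way around the macro-to-variable step, sketched here without testing against the original files, is to let Mata copy the macro's contents into the variable. Mata's st_local() returns the macro's text as a string, so the text never passes through the command line parser. The file names input.txt and output.txt, the variable name line, and the position of the cleaning step are assumptions for illustration only.

      ```stata
      clear
      set obs 1
      gen strL line = ""
      file open input using input.txt, text read
      file open output using output.txt, text write replace
      file read input line
      while r(eof) == 0 {
          * Copy the local macro into the variable without macro expansion
          mata: st_sstore(1, "line", st_local("line"))
          * ... cleaning with string functions on the variable goes here ...
          * Write from the variable, as in post #2, not from the macro
          file write output (line[1]) _n
          file read input line
      }
      file close input
      file close output
      ```

      The loop condition is checked immediately after each file read, so r(eof) reflects the most recent read.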

      Thoughts?



      • #4
        Depending on what kind of cleaning you want to do, perhaps the functions -fileread()- and -filewrite()- would be useful. These functions do a binary read or write and don't notice the quoting or line structure of the file. I'm thinking of something like:
        Code:
        set obs 1
        gen strL s = fileread("input.txt")
        * ... use string functions here to clean the file contents held in s;
        *     they need not pay attention to the line structure of the file
        * filewrite() takes the file name and the string to write
        gen long nb = filewrite("output.txt", s)



        • #5
          Are you familiar with the filefilter command and could it possibly accomplish the cleaning you need, bypassing this problem entirely?

          My general take on your approach is that the appropriate thing to do is to read your input file into a Stata dataset as a string variable and then process the text in the variable, rather than work with a macro value that is subject to command line interpretation.

          The infix command would seem to offer this capability. However, while the documentation makes no comment about any special handling of quotation marks, infix apparently drops quotation marks that are found as the first nonblank character of an input line, as shown below. If someone doesn't explain this behavior in the next few days, I will forward a link to this post to Stata Technical Services and seek their clarification.

          I hope the same problem isn't apparent in filefilter.

          A real hack would be to use filefilter to add a single nonblank character to the beginning of each line, then read the output of filefilter with infile, removing the added character as part of the cleaning. I'd be embarrassed every time I ran the code.

          I'll continue to think about this.

          Code:
          . clear
          
          . type input_text.dct
          infix dictionary {
          str line 1-100
          }
          
          . type input_text.txt
          Lorem ipsum dolor sit amet, 
          consectetur adipiscing elit, 
          "'"
          here are "normal quotes" in a line
           "'"
          "quotes" at the beginning of a line
          at the end of a line, "quotes"
          sed do eiusmod tempor incididunt 
          ut labore et dolore magna aliqua. 
          
          . infix using input_text.dct, using(input_text.txt) 
          infix dictionary {
          str line 1-100
          }
          (9 observations read)
          
          . list, clean
          
                                               line  
            1.          Lorem ipsum dolor sit amet,  
            2.         consectetur adipiscing elit,  
            3.                                   '"  
            4.   here are "normal quotes" in a line  
            5.                                   '"  
            6.   quotes" at the beginning of a line  
            7.       at the end of a line, "quotes"  
            8.     sed do eiusmod tempor incididunt  
            9.    ut labore et dolore magna aliqua.



          • #6
            On further reflection, I have to come down in favor of Mike's approach. The extended example below shows that for whatever reason, infix does not reproduce the input file lines in full fidelity within a Stata string variable. At a minimum, leading blanks and tabs and trailing blanks are trimmed, along with the still-inexplicable trimming of a quotation mark possibly preceded by leading spaces.
            Code:
            . type input_text.txt
            Lorem ipsum dolor sit amet,
            consectetur adipiscing elit,
            here are "normal quotes" in a line
            "quotes" at the beginning of a line
              "quotes" following two blanks at the beginning of a line
                    "quotes" following a tab at the beginning of a line
            at the end of a line, "quotes"
                 five blanks at the beginning of a line
            five blanks you can't see at the end of a line
            "'"
             "'"
            x "'"
            sed do eiusmod tempor incididunt
            ut labore et dolore magna aliqua.
            
            . shell cat -e -t input_text.txt
            
            Lorem ipsum dolor sit amet,$
            consectetur adipiscing elit,$
            here are "normal quotes" in a line$
            "quotes" at the beginning of a line$
              "quotes" following two blanks at the beginning of a line$
            ^I"quotes" following a tab at the beginning of a line$
            at the end of a line, "quotes"$
                 five blanks at the beginning of a line$
            five blanks you can't see at the end of a line$
            "'"$
             "'"$
            x "'"$
            sed do eiusmod tempor incididunt$
            ut labore et dolore magna aliqua.$
            
            . infix using input_text.dct, using(input_text.txt) clear
            infix dictionary {
            str line 1-100
            }
            (14 observations read)
            
            . replace line = ">"+line+"<"
            (14 real changes made)
            
            . list, clean
            
                                                                        line  
              1.                               >Lorem ipsum dolor sit amet,<  
              2.                              >consectetur adipiscing elit,<  
              3.                        >here are "normal quotes" in a line<  
              4.                        >quotes" at the beginning of a line<  
              5.   >quotes" following two blanks at the beginning of a line<  
              6.        >quotes" following a tab at the beginning of a line<  
              7.                            >at the end of a line, "quotes"<  
              8.                    >five blanks at the beginning of a line<  
              9.            >five blanks you can't see at the end of a line<  
             10.                                                        >'"<  
             11.                                                        >'"<  
             12.                                                     >x "'"<  
             13.                          >sed do eiusmod tempor incididunt<  
             14.                         >ut labore et dolore magna aliqua.<



            • #7
              Mike and William,

              Thanks very much for your good ideas. I was familiar with neither fileread() nor filefilter, and both approaches show promise for my problem.

              One issue with using fileread() for my particular problem is that the files I am dealing with are large and will typically exceed the 2 GB strL limit. Some sort of process for splitting the files into pieces would need to be devised, a potentially messy complication.

              The filefilter command, if it doesn't suffer from the initial quotation mark problem, really looks very promising. (If it has the problem exhibited with infix, that's trouble, since the first character in all my files is a double quote.) It looks like it handles only one from-to transformation per call, so I would have to read and write the file multiple times to make all my edits, but that's an inefficiency I can live with.
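
              As a rough sketch of what chaining several filefilter passes might look like, each call below reads one file and writes another, with a temporary file carrying the intermediate result. The file names and both transformations are hypothetical, chosen only to show the pattern. In filefilter's escape notation, \W and \U stand for Windows and Unix line endings, \Q for a double quote, and \RQ for a right single quote.

              ```stata
              * Pass 1: convert Windows line endings to Unix line endings
              tempfile step1
              filefilter input.txt "`step1'", from("\W") to("\U")
              * Pass 2: put a blank between a double quote and the right
              * single quote that immediately follows it
              filefilter "`step1'" output.txt, from("\Q\RQ") to("\Q \RQ") replace
              ```

              Each additional edit would add one more call and one more intermediate file.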

              Thanks again for your suggestions!

              William, thanks for your most recent follow-up post, which I just saw. It looks like infix is out, but perhaps filefilter is not.
              Last edited by West Addison; 14 Jul 2018, 12:10.



              • #8
                The advantage of -fileread- is that you can use the whole variety of string manipulations for cleaning, including regular expressions (if necessary), and it's fast. The disadvantage, as you note, is the 2 GB limit. As for splitting up a text file, you might try -ssc desc chunky-, which I've found functional and easy to use.
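
                For a file small enough to fit in a strL, a regular-expression cleaning step could be sketched as follows. The file names and the pattern are made up for illustration, and ustrregexra() is assumed to be available (Stata 14 or later).

                ```stata
                clear
                set obs 1
                gen strL s = fileread("input.txt")
                * Hypothetical cleaning step: collapse runs of blanks and tabs to one blank
                replace s = ustrregexra(s, "[ \t]+", " ")
                * A third argument of 1 asks filewrite() to replace an existing file
                gen long nb = filewrite("output.txt", s, 1)
                ```

                Because the whole file sits in one string, a single pattern can cover several edits that would otherwise each need their own pass.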



                • #9
                  Thanks, Mike! I will have a look at chunky when I have access to Stata.



                  • #10
                    Thanks again, Mike and William, for your input. I ended up using filefilter, since that seemed to require the smallest amount of coding in my particular case, and when I tested it I discovered it did not have the problems of stripping quotes and spaces that you encountered with infix, William. It also ran a lot faster than the file read / file write method that I had been (unsuccessfully) using.

                    An advantage of the fileread() / filewrite() method would have been the ability to use regular expressions, which would have reduced the number of separate transformations that I had to employ without that capability. However, that advantage seemed in my particular case to be outweighed by the need to split the files into pieces and then reassemble them, a job admittedly simplified a great deal by using chunky as you suggested, Mike.



                    • #11
                      To close the loop here, I received the following advice from Stata Technical Services.

                      Although not what would be desired given the input file on Statalist, this is the intended behavior of -infix-. -infix- is attempting to parse fields that it is reading, expecting fields to be delimited by whitespace (hence the stripping of leading and trailing whitespace). It is assuming that the incoming file won't have bound strings and instead will have positional strings which a dictionary will somehow tell it to read in according to the position of each string.

                      For a completely unformatted file, the best way to bring it into a string variable with each line of the file being put into an observation of that string variable is to use -import delimited- with a few options to make sure it doesn't try to split up lines based on tabs or commas, doesn't try to interpret the first line as variable names, doesn't try to bind on quotes or strip quotes, and sees only one string column in the file:
                      Code:
                        import delimited using input_text.txt,    ///
                              varnames(nonames)                   ///
                              bindquotes(nobind)                  ///
                              stripquotes(no)                     ///
                              stringcols(1)                       ///
                              delimiters("ZZZZZ", asstring)
                      For the -delimiters()- option, you must just choose some sequence of characters you are sure does not appear anywhere in the file. -import delimited- will then think there is a single value per line in the file and not attempt to split the data into 2 or more columns.

                      If you have multiple variables in the text file, you can first import it as a single variable using the above method, then generate new variables using the -substr()- function with the exact starting column and length. For example, you can use code like
                      Code:
                          generate v2=substr(v1,20,10)
                      Mata is another approach. There are multiple ways you could process a file such as this in Mata, but one simple approach is to bring each line of the file into an element of a string vector in Mata. This then allows any of Mata's string functions to operate on that vector, including the regular expression functions. Or, if it is preferable to have the lines of the file in a string variable in Stata, that can still be accomplished with Mata.
                      Code:
                        /* Example 1 -- bring file into string vector in Mata */
                        mata:
                        lines = cat("text.txt")
                        end
                      That's it. Now there is a string colvector in Mata named 'lines' that has the lines of the file in it. Manipulations on these lines could be performed, and Mata's various file I/O functions can be used to write the lines back out into the desired changed file.
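
                      A minimal sketch of the write-back step, under stated assumptions, might look like the following. The output file name text_clean.txt and the cleaning step are hypothetical; only cat() and the input file name come from the advice above.

                      ```stata
                      mata:
                      lines = cat("text.txt")
                      // Hypothetical cleaning step: delete every double quote (char 34)
                      // immediately followed by a right single quote (char 39)
                      lines = subinstr(lines, char(34) + char(39), "")
                      unlink("text_clean.txt")      // fopen() with "w" fails if the file exists
                      fh = fopen("text_clean.txt", "w")
                      for (i = 1; i <= rows(lines); i++) {
                          fput(fh, lines[i])        // fput() adds the line terminator
                      }
                      fclose(fh)
                      end
                      ```

                      Building the two-character sequence from char() avoids typing it literally, which sidesteps the quoting problem that started this thread.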

