
  • Escape compound double quotes that occur in a macro?

    Hello,

    I have some code that uses the file command to read in lines from a text file, perform some cleaning, and write them back out. The problem I am running into is that occasionally the line contains the characters "' , which Stata of course interprets as a closing compound double quote. When I try to output the line with file write `"`line'"' _n, the closing compound quote within the macro prematurely terminates the quoted string and the remainder of the line triggers a syntax error. The following code illustrates the problem:
    Code:
    local line = char(34) + char(39) + char(34)
    macro list _line
    file open test using test.txt, text write replace
    file write test `"`line'"' _n
    file close test

    The file write `"`line'"' _n command in the above code triggers an invalid syntax r(198) error.

    What I would ideally like here is some way of telling Stata to ignore any quotes that happen to occur within the macro when determining where the line ends. (Something kind of like macval() but for preventing interpretation of quotes rather than macros.) Does such a thing exist? Can anyone think of a good workaround for this problem?

    Thanks very much!

  • #2
    Well, that was an amusing problem. The following seems to work. The trick is to write from a string variable rather than a local macro, as the variable is not subject to the command line parser.
    Code:
    clear
    set obs 1
    gen str8 line = char(34) + char(39) + char(34)
    list line, clean noobs
    file open test using test.txt, text write replace
    capture noisily file write test (line) _n
    file close test
    type test.txt
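
    The same principle, keeping the text away from the command-line parser, can also be sketched with Mata's file I/O functions, which read each line into a Mata string and never pass it through a macro. This is only a rough sketch, not something tested against the files discussed in this thread: the file names and the tab-to-space cleaning step are assumptions made for illustration.

    ```stata
    mata:
    // Hypothetical file names. fopen(..., "w") will not overwrite an
    // existing file, so remove any earlier copy of the output first.
    unlink("output.txt")
    fin  = fopen("input.txt", "r")
    fout = fopen("output.txt", "w")
    // fget() returns J(0,0,"") at end of file
    while ((line = fget(fin)) != J(0, 0, "")) {
        // Hypothetical cleaning step: turn each tab into a space
        line = subinstr(line, char(9), " ")
        // fput() writes the line followed by a line terminator
        fput(fout, line)
    }
    fclose(fin)
    fclose(fout)
    end
    ```

    Because the line lives in a Mata variable rather than a local macro, a "' sequence in the text is just two characters and is not treated as a closing compound quote.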



    • #3
      William,

      Thanks very much for your ingenious solution! At the moment, however, I don't see a simple way to apply it to my exact problem. In my example code, I used
      Code:
      local line = char(34) + char(39) + char(34)
      to try to keep things simple. In fact, what I really have is something like
      Code:
      file open input using input.txt, text read
      file read input line
      The syntax of file read requires me to put the value of the line into a macro. I don't have access to Stata at the moment for testing, but it seems to me that I'm going to have the same problem if I try to transfer the value from the macro to a string variable using
      Code:
      gen str line = `"`line'"'
      I suppose I could try to temporarily change occurrences of "' in the macro line to some different (and hopefully unique) string using subinstr and then change it back later, but even there I think I can foresee potential problems.

      Thoughts?



      • #4
        Depending on what kind of cleaning you want to do, perhaps -fileread- and -filewrite- would be useful. These functions do a binary read or write and don't notice the quoting or line structure of the file. I'm thinking of something like:
        Code:
        set obs 1
        gen strL s = fileread("input.txt")
        * ... string functions that clean the file contents held in s, without regard to the file's line structure ...
        gen long nb = filewrite("output.txt", s)



        • #5
          Are you familiar with the filefilter command and could it possibly accomplish the cleaning you need, bypassing this problem entirely?

          My general take on your approach is that the appropriate thing to do is to read your input file into a Stata dataset as a string variable and then process the text in the variable, rather than work with a macro value that is subject to command line interpretation.

          The infix command would seem to offer this capability. However, while the documentation makes no comment about any special handling of quotation marks, infix apparently drops quotation marks that are found as the first nonblank character of an input line, as shown below. If someone doesn't explain this behavior in the next few days, I will forward a link to this post to Stata Technical Services and seek their clarification.

          I hope the same problem isn't apparent in filefilter.

          A real hack would be to use filefilter to add a single nonblank character to the beginning of each line, then read the output of filefilter with infix, removing the added character as part of the cleaning. I'd be embarrassed every time I ran the code.

          I'll continue to think about this.

          Code:
          . clear
          
          . type input_text.dct
          infix dictionary {
          str line 1-100
          }
          
          . type input_text.txt
          Lorem ipsum dolor sit amet, 
          consectetur adipiscing elit, 
          "'"
          here are "normal quotes" in a line
           "'"
          "quotes" at the beginning of a line
          at the end of a line, "quotes"
          sed do eiusmod tempor incididunt 
          ut labore et dolore magna aliqua. 
          
          . infix using input_text.dct, using(input_text.txt) 
          infix dictionary {
          str line 1-100
          }
          (9 observations read)
          
          . list, clean
          
                                               line  
            1.          Lorem ipsum dolor sit amet,  
            2.         consectetur adipiscing elit,  
            3.                                   '"  
            4.   here are "normal quotes" in a line  
            5.                                   '"  
            6.   quotes" at the beginning of a line  
            7.       at the end of a line, "quotes"  
            8.     sed do eiusmod tempor incididunt  
            9.    ut labore et dolore magna aliqua.



          • #6
            On further reflection, I have to come down in favor of Mike's approach. The extended example below shows that, for whatever reason, infix does not reproduce the input file lines with full fidelity within a Stata string variable. At a minimum, leading blanks and tabs and trailing blanks are trimmed, along with the still-inexplicable trimming of a quotation mark possibly preceded by leading spaces.
            Code:
            . type input_text.txt
            Lorem ipsum dolor sit amet,
            consectetur adipiscing elit,
            here are "normal quotes" in a line
            "quotes" at the beginning of a line
              "quotes" following two blanks at the beginning of a line
                    "quotes" following a tab at the beginning of a line
            at the end of a line, "quotes"
                 five blanks at the beginning of a line
            five blanks you can't see at the end of a line
            "'"
             "'"
            x "'"
            sed do eiusmod tempor incididunt
            ut labore et dolore magna aliqua.
            
            . shell cat -e -t input_text.txt
            
            Lorem ipsum dolor sit amet,$
            consectetur adipiscing elit,$
            here are "normal quotes" in a line$
            "quotes" at the beginning of a line$
              "quotes" following two blanks at the beginning of a line$
            ^I"quotes" following a tab at the beginning of a line$
            at the end of a line, "quotes"$
                 five blanks at the beginning of a line$
            five blanks you can't see at the end of a line$
            "'"$
             "'"$
            x "'"$
            sed do eiusmod tempor incididunt$
            ut labore et dolore magna aliqua.$
            
            . infix using input_text.dct, using(input_text.txt) clear
            infix dictionary {
            str line 1-100
            }
            (14 observations read)
            
            . replace line = ">"+line+"<"
            (14 real changes made)
            
            . list, clean
            
                                                                        line  
              1.                               >Lorem ipsum dolor sit amet,<  
              2.                              >consectetur adipiscing elit,<  
              3.                        >here are "normal quotes" in a line<  
              4.                        >quotes" at the beginning of a line<  
              5.   >quotes" following two blanks at the beginning of a line<  
              6.        >quotes" following a tab at the beginning of a line<  
              7.                            >at the end of a line, "quotes"<  
              8.                    >five blanks at the beginning of a line<  
              9.            >five blanks you can't see at the end of a line<  
             10.                                                        >'"<  
             11.                                                        >'"<  
             12.                                                     >x "'"<  
             13.                          >sed do eiusmod tempor incididunt<  
             14.                         >ut labore et dolore magna aliqua.<



            • #7
              Mike and William,

              Thanks very much for your good ideas. I was familiar with neither fileread() nor filefilter, and both approaches show promise for my problem.

              One issue with using fileread() for my particular problem is that the files I am dealing with are large and will typically exceed the 2 GB strL limit. Some sort of process for splitting the files into pieces would need to be devised, a potentially messy complication.

              The filefilter command, if it doesn't suffer from the initial quotation mark problem, really looks very promising. (If it has the problem exhibited with infix, that's trouble, since the first character in all my files is a double quote.) It looks like it handles only one from-to transformation per call, so I would have to read and write the file multiple times to make all my edits, but that's an inefficiency I can live with.

              Thanks again for your suggestions!

              William, thanks for your most recent follow-up post, which I just saw. It looks like infix is out, but perhaps filefilter is not.
              Last edited by West Addison; 14 Jul 2018, 11:10.



              • #8
                The advantage of -fileread- is that you can use the whole variety of string manipulations for cleaning, including regular expressions (if necessary), and it's fast. The disadvantage, as you note, is the 2 GB limit. As for splitting up a text file, you might try -ssc desc chunky-, which I've found functional and easy to use.
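
                As a rough illustration of that kind of cleaning, here is a minimal sketch. It assumes Stata 14 or later (for ustrregexra()) and a file small enough to fit in a strL; the file names and both cleaning rules are hypothetical and are not the edits actually needed in this thread.

                ```stata
                clear
                set obs 1
                gen strL s = fileread("input.txt")
                * fileread() signals failure by returning a string that begins "fileread() error"
                assert substr(s, 1, 16) != "fileread() error"
                * Hypothetical rule 1: turn every tab into a single space
                replace s = subinstr(s, char(9), " ", .)
                * Hypothetical rule 2 (regular expression): collapse runs of spaces into one
                replace s = ustrregexra(s, "[ ]+", " ")
                * A third argument of 1 asks filewrite() to overwrite an existing output file
                gen long nb = filewrite("output.txt", s, 1)
                ```

                Here nb holds the number of bytes written, or a negative value if the write failed, so it is worth inspecting before relying on the output file.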



                • #9
                  Thanks, Mike! I will have a look at chunky when I have access to Stata.



                  • #10
                    Thanks again, Mike and William, for your input. I ended up using filefilter, since that seemed to require the smallest amount of coding in my particular case, and when I tested it I discovered it did not have the problems of stripping quotes and spaces that you encountered with infix, William. It also ran a lot faster than the file read / file write method that I had been (unsuccessfully) using.
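
                    For anyone who finds this thread later, a chain of filefilter calls of this kind might look roughly like the sketch below. The file names and patterns are hypothetical, chosen only for illustration, and are not the transformations actually used here. Each call handles one from-to pair, so a temporary file carries the result from one pass to the next.

                    ```stata
                    * Hypothetical file names and patterns, for illustration only
                    tempfile pass1
                    * Pass 1: \Q is filefilter's code for a double quote and \RQ for a
                    * right single quote, so this puts a space between the two characters
                    filefilter input.txt "`pass1'", from("\Q\RQ") to("\Q \RQ")
                    * Pass 2: \t is filefilter's code for a tab; replace it with a space
                    filefilter "`pass1'" output.txt, from("\t") to(" ") replace
                    ```

                    Since the patterns are given as escape codes, the problematic "' sequence never has to appear literally in the do-file.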

                    An advantage of the fileread() / filewrite() method would have been the ability to use regular expressions, which would have reduced the number of separate transformations that I had to employ without that capability. However, that advantage seemed in my particular case to be outweighed by the need to split the files into pieces and then reassemble them, a job admittedly simplified a great deal by using chunky as you suggested, Mike.
