
  • Crash occurring during repeated regular expression search

    I'm trying to determine what causes Stata to crash in the middle of a repeated regular expression search. I do mean
    "crash," as in it suddenly shuts down and I get an op. system message indicating that this happened, and I have
    to restart Stata. This occurs in the context of what was supposed to be some time trials with -fileread()-,
    strLs, and regular expressions, but that may not be the essence of the problem.

    I'll give a high level algorithm version of what I'm doing, some indications of what I've observed, and some
    actual code that, on my hardware/software, produces the crash. I'm running 64 bit Stata MP2 13.1 on a Windows
    machine. I know there are a lot of details below, but perhaps at least someone might have ideas for what further
    observations might be relevant to diagnosis.

    Code:
    Algorithm:
    Make simulated ascii text containing U.S. phone number strings, and save it to a file.
    forval trials = 1/3 {
       Get this file into a strL with fileread()
       while not all such phone number strings have been found {
           Search for and record any string that matches the pattern of a phone number.
       }
    }
    Comments:
    1. The crash always occurs for me within the while loop that does the "search" step, but not always at the same place, and not at the same value of the loop counter I put in the while loop. This might be wrong, though, since all I have to work with is what gets retained in the log file before the crash. The problem area is only about the last 10 lines; see "crash" below. Everything else, more or less, is of interest only to reproduce the problem in a self-contained chunk of code.

    2. The forvalues loop serves only to keep going until a crash occurs. In the real situation, the "make example ascii data" step was inside the forval loop, and I incremented the number of records at each trial, but that complexity is not necessary to create the crash on my machine.

    3. I thought perhaps there was some sort of hardware or op. system related timing issue, so I have tried putting a
    delay (sleep 100) in the while loop, but this does not prevent the crash.

    4. I don't see anything strange in the ascii data file.

    Regards, Mike
    Code:
    // Self-contained example
    // Make example data with phone numbers
    log using c:/temp/test.txt, text replace
    set seed 774656 // gives a crash with nrec = 1e3
    local nrec = 1e3   // gives about a 60K data file
    clear
    qui set obs `nrec'  
    local D  "strofreal(floor(runiform()*10))"  // syntax for a single random digit
    local text = " The quick brown fox jumped over the lazy dog." + char(13) + char(10) // text, cr/lf
    gen strL s = "(" + `D' + `D' + `D' + ")" + `D' + `D' + `D'  + ///
                 "-" + `D' + `D' + `D' + `D' + "`text'"   
    compress           
    tempfile temp
    gen b = filewrite("`temp'", s,2)  // Each observation gets appended to an ascii file
    keep in 1                         // Don't need the whole file anymore, just one record and a strL
    // `temp' is an ascii file with `nrec' records. Each line is as follows, but with random digits:
    //   (999)999-9999 The quick brown fox jumped over the lazy dog. cr/lf
    //
    // Now, start the processing phase.
    // Bring the ascii file into a strL, look for phone numbers, and keep doing
    // this until Stata crashes.  3 trials almost always does it on my machine.
    gen strL phone = ""              // will hold phone numbers
    gen str mstring = ""             // will hold a string that matches a phone # pattern
    gen byte match = .
    set tr on
    forval trials = 1/3 {
       qui replace s = fileread("`temp'")               // read the file
       local D3 = "[0-9]" * 3                           // 3 digits for a reg exp.
       local D4 = "[0-9]" * 4                           // 4 digits
       replace match = regexm(s, "\(`D3'\)`D3'\-`D4'")  // Any instance of a phone number string
       replace mstring = regexs(0)                                                   
       cap assert match == 0                            // _rc > 0 if any match
       local i 1
       // Crash occurs within this loop.
       while (_rc > 0)  {                               // loop until all phone numbers are found
          qui replace phone = phone + ", " + mstring    // record the phone number just found
          di "i = `i', _rc = " _rc
          qui replace s = subinstr(s, mstring, "", 1)   // remove # so as not to find it again.
          qui replace match = regexm(s, "\(`D3'\)`D3'\-`D4'")  // look for the next #
          cap assert match == 0
          qui if (_rc > 0) replace mstring = regexs(0)  // retain it
          local ++i
       }   
       //
    }
    log close
    //

  • #2
    Well, I don't have any thoughts about what's causing it. But I want to just note that I can replicate Mike's problem on my setup, which is also a Windows 7 platform running Stata 13.1 MP2.

    And as with Mike, the point at which the crash takes place is not consistent with respect to the loop counter `i', nor the last executed line of code within that while loop.



    • #3
      Hi, it seems changing the length of s in line 38 of your example is problematic.
      Code:
            qui replace s = subinstr(s, mstring, "", 1)   // remove # so as not to find it again.
      Your code runs if the replacement string is changed to the same length as mstring:
      Code:
            qui replace s = subinstr(s, mstring, "x"*13 , 1)   // remove # so as not to find it again.
      --
      Bjarte (Stata/MP 13.1 for Windows (64-bit x86-64) Revision 06 Aug 2014)
      Last edited by Bjarte Aagnes; 29 Sep 2014, 03:52.
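
      For reference, a minimal sketch of the search loop from post #1 with that same-length replacement applied. It assumes that s, phone, mstring, match, `D3' and `D4' already exist exactly as created in the original example, and that every match is 13 characters long, as in "(999)999-9999". It only illustrates the workaround and has not been checked against the affected Stata build.

      ```stata
      // Sketch only: assumes s, phone, mstring, match, `D3' and `D4'
      // are defined as in the self-contained example in post #1.
      qui replace match = regexm(s, "\(`D3'\)`D3'\-`D4'")  // first search
      qui replace mstring = regexs(0)
      cap assert match == 0                                // _rc > 0 if any match
      while (_rc > 0) {
         qui replace phone = phone + ", " + mstring        // record the number just found
         // Overwrite the number with 13 x's so that the length of s never changes.
         qui replace s = subinstr(s, mstring, "x"*13, 1)
         qui replace match = regexm(s, "\(`D3'\)`D3'\-`D4'")  // look for the next number
         cap assert match == 0
         qui if (_rc > 0) replace mstring = regexs(0)      // retain it
      }
      ```

      Because the x's cannot match the digit pattern, each number is still found only once, so the loop ends after the same number of passes as the original.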



      • #4
        Thanks to both of you for taking the time to pile through a long post. I'm regarding this as a bug at this point, since I find nothing in the subinstr() documentation about the length of a replacement string. I guess I'll pass it on to StataCorp as a bug. Good catch, Bjarte; I would not have thought of trying that. I think your observation is likely to be the key to diagnosing the problem.

        Regards, Mike



        • #5
          I sent this in to Stata Tech. Support as a potential bug, and they promptly reported back that it is indeed a bug in -regexm()-, and that it will be fixed in a future update of the Stata executable.

