Curious Crash using Replace
Posted: Tue Dec 30, 2014 2:47 pm
I have been working on a Regex testing stack. I have two test text files, 632,368 characters long (henceforth the 600K file) and 7,605,117 characters long (the 7M file). The 600K file is the HTML source of a Web page. The 7M file is just a vocabulary list with word frequencies: each line consists of three tab-delimited entries of ASCII character strings.
I read the files individually into a local variable and then run "replace" on them - a very simple replace - "replace tMatchString with tReplacementString" (NOT using a regular expression, just a straight text replace).
It works (albeit after some time) on the local variable containing the 7M of plain text. But when I try it on the 600K of text read in from the HTML source, LC goes white, shows "Not responding", and then crashes. Note that the "replace" command is inside a "try-catch" block, which reports nothing. I thought the problem was size, so I deleted first the front half, then the second half of the source text, leaving about 300K in the buffer each time. LC crashed when searching the first half but not the second half.
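For reference, the test is roughly the following. This is a sketch rather than my exact script; the handler name testPlainReplace and the parameter names are made up for illustration.

```livecode
on testPlainReplace pFilePath, pMatchString, pReplacementString
   local tText, tError
   -- read the whole file into a local variable
   put URL ("file:" & pFilePath) into tText
   if the result is not empty then
      answer "Could not read file:" && the result
      exit testPlainReplace
   end if
   try
      -- plain text replace, no regular expression involved
      replace pMatchString with pReplacementString in tText
   catch tError
      answer "replace threw:" && tError
   end try
   -- show the resulting length in the message box
   put the number of chars in tText
end testPlainReplace
```

As far as I understand it, try-catch only traps script execution errors, so a hang or hard crash inside the engine would never reach the catch block. That would be consistent with it reporting nothing.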
Hmm, I hummed to myself. Something in the HTML source file is interfering with the replace command. Then I tried each quarter of the text file (about 150K each) to binary-search my way to the problem text. None of the quarters crashed, so I couldn't pinpoint a specific problem character or location in the file.
I then took the 600K file and escaped every character in it with a backslash, making it 1200K long. LC did not crash.
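The escaping step amounts to something like this (again a sketch; escapeEveryChar is a hypothetical name, not a built-in):

```livecode
function escapeEveryChar pText
   local tOut, tChar
   -- prefix each character with a backslash, doubling the length
   repeat for each char tChar in pText
      put "\" & tChar after tOut
   end repeat
   return tOut
end function
```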
I don't plan on individually escaping and testing each character in the file to find the miscreant; that would take me a year.
My question: the Dictionary entry for "replace" gives the syntax "replace oldString with newString in container", yet the entry also says "remove (synonym of replace)". So what does "replace" do under the hood? Nothing indicates that it uses Regex, yet escaping every character with a backslash, which in Regex would turn metacharacters into literals, stops the crash.
As a backup check, I filtered the file for any character outside the range 32 to 126, and the only one I found was 10 (linefeed).
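The check for out-of-range characters can be sketched like this. The function name unexpectedCharCodes is made up, and I am assuming the older charToNum function is adequate here because the text is essentially ASCII.

```livecode
function unexpectedCharCodes pText
   local tCounts, tChar, tCode
   repeat for each char tChar in pText
      put charToNum(tChar) into tCode
      -- tally anything outside printable ASCII (32 to 126)
      if tCode < 32 or tCode > 126 then
         add 1 to tCounts[tCode]
      end if
   end repeat
   -- array keyed by character code, holding the number of occurrences
   return tCounts
end function
```

Calling "put the keys of unexpectedCharCodes(tText)" lists the offending codes, one per line, in the message box.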
Anyone have any clues? Thanks!
LC 7.0 rc 1, Win7
Walt