Another dumb regexp question

Posted: Mon May 25, 2009 5:26 pm
by WaltBrown
Regexp is tough!

Any thoughts on why I can't get this to work? I want to change all characters in field fInField except a-z, A-Z, and 0-9 into spaces.

replaceText(field fInField,"[^0-9A-Za-z]",space)

I've tried all combos of parens, square brackets, quotes, etc. I'm lost.

Thanks,
Walt

Posted: Mon May 25, 2009 7:48 pm
by bn
Walt,
your regex seems OK, but Rev wants this (replaceText is a function, so its result has to be put somewhere):

put replacetext(field fInField,"[^0-9A-Za-z]",space) into field fInField
I find regular expressions difficult. But if they work they are pretty powerful.
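
A minimal sketch of the same idea with a plain variable instead of a field, in case it helps to see it in isolation. The sample string and the names tText and tClean are made up for illustration; only replaceText itself comes from the post above.

```livecode
on mouseUp
  -- Hypothetical sample text; any string or field contents would do.
  put "Hello, World! 123" into tText
  -- replaceText returns the changed copy; tText itself is left untouched.
  put replaceText(tText, "[^0-9A-Za-z]", space) into tClean
  -- Each disallowed character becomes exactly one space; runs are not collapsed.
  put tClean
end mouseUp
```

Calling replaceText on its own line, without a put (or another command that uses the returned value), gives the engine nothing to do with the result, which is why the original line did not work.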
regards
Bernd

btw. did your text marking work with what I proposed on the locked text?

Posted: Mon May 25, 2009 11:46 pm
by WaltBrown
Thanks, Bernd. Yes, I got the text marking going fine. What I am doing is parsing multiple input streams and marking sections differently based on past user habits and selections, i.e., simplistically, grey=ignore, yellow=known interest, red=do you want more of this?, etc.

My fun now is getting the filtering of many MB of real-time text streams, against multiple sets of search terms, down to any kind of useful time frame :-)

Walt

Posted: Tue May 26, 2009 2:33 am
by FourthWorld
FWIW, the brute-force method used in function f2 below benchmarks about 9% faster than the replaceText method when tested on a 21k file consisting of about half ASCII and half binary data.

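on mouseUp
  put 10 into n -- number of timed passes per function
  put fld 1 into s -- text under test
  -- time n passes of the replaceText version
  put the millisecs into t
  repeat n
    put f1(s) into r1
  end repeat
  put the millisecs - t into t1
  -- time n passes of the brute-force version
  put the millisecs into t
  repeat n
    put f2(s) into r2
  end repeat
  put the millisecs - t into t2
  -- show both timings, then whether the two outputs match
  put t1 && t2 &cr& (r1=r2)
end mouseUp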


function f1 s
  return replacetext(s,"[^0-9A-Za-z]",space)
end f1

function f2 s
  put empty into x
  repeat for each char k in s
    if (k is in "abcdefghijklmnopqrstuvwxyz0123456789") then
      put k after x
    else
      put " " after x
    end if
  end repeat
  return x
end f2

Posted: Wed May 27, 2009 1:44 am
by WaltBrown
Hi Richard.

The replace functions are file size sensitive. The smaller the file, the better f2 was, but above around 150k words f1 started improving. Here are my results. Test f3 was your f2 but with the list of caps added to be fair with the regexp expression in f1:

"if (k is in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789") then"

I ran on Vista x64 on a 2G 7350 dual core with 4GB of RAM. Different RAM amounts might give clues as to available buffer sizes affecting this.

walt
----------------

600k chars:
F1:1854 F2:2030 F3:2202
F1:1858 F2:2033 F3:2236
F1:1846 F2:2029 F3:2198
F1:1847 F2:2031 F3:2207
F1:1863 F2:2026 F3:2202

500k chars:
F1:1523 F2:1711 F3:1832
F1:1539 F2:1690 F3:1825
F1:1558 F2:1684 F3:1825
F1:1566 F2:1707 F3:1826
F1:1530 F2:1690 F3:1823

400k chars:
F1:1228 F2:1346 F3:1460
F1:1240 F2:1350 F3:1458
F1:1250 F2:1355 F3:1466
F1:1240 F2:1345 F3:1460
F1:1244 F2:1354 F3:1460

300k chars:
F1:939 F2:1016 F3:1092
F1:954 F2:1016 F3:1089
F1:948 F2:1015 F3:1100
F1:948 F2:1014 F3:1096
F1:928 F2:1010 F3:1093

200k chars:
F1:649 F2:677 F3:734
F1:658 F2:671 F3:730
F1:661 F2:671 F3:729
F1:653 F2:672 F3:722
F1:688 F2:674 F3:727

150k chars:
F1:529 F2:510 F3:549
F1:502 F2:520 F3:543
F1:510 F2:510 F3:552
F1:520 F2:509 F3:547
F1:504 F2:513 F3:553

100k chars:
F1:360 F2:335 F3:361
F1:357 F2:334 F3:363
F1:358 F2:333 F3:362
F1:344 F2:333 F3:360
F1:363 F2:335 F3:362

50k chars:
F1:213 F2:174 F3:183
F1:248 F2:166 F3:184
F1:220 F2:167 F3:189
F1:225 F2:167 F3:183
F1:204 F2:170 F3:182

25k chars:
F1:164 F2:85 F3:93
F1:129 F2:85 F3:91
F1:143 F2:83 F3:92
F1:144 F2:88 F3:95
F1:120 F2:86 F3:91
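
For anyone who wants to repeat this kind of size sweep, here is a rough sketch of one way to drive it. It is not the script used for the numbers above: it assumes the f1 and f2 functions posted earlier in the thread are in the same script, that field 1 holds at least as many characters as the largest size listed, and it times a single pass per size. The names tSource, tSize and tReport are made up for illustration.

```livecode
on mouseUp
  -- Assumption: field 1 holds at least 200000 characters of test text.
  put fld 1 into tSource
  put empty into tReport
  repeat for each item tSize in "25000,50000,100000,150000,200000"
    -- Take the first tSize characters as this round's input.
    put char 1 to tSize of tSource into s
    -- Time the replaceText version.
    put the millisecs into t
    put f1(s) into r1
    put the millisecs - t into t1
    -- Time the brute-force version.
    put the millisecs into t
    put f2(s) into r2
    put the millisecs - t into t2
    -- One report line per size, ending with whether both outputs match.
    put tSize && "chars:" && "F1:" & t1 && "F2:" & t2 && (r1 = r2) & cr after tReport
  end repeat
  put tReport
end mouseUp
```

Single passes are noisy at the smaller sizes, so wrapping each call in a repeat loop, as in the original benchmark handler, should give steadier figures.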

Posted: Wed May 27, 2009 5:07 am
by FourthWorld
WaltBrown wrote: The replace functions are file size sensitive. The smaller the file, the better f2 was, but above around 150k words f1 started improving.
Excellent benchmarking, Walt. I find a lot of these sorts of things are size-dependent, with one method being faster for some cases and another for larger data sets, longer lines, etc.

That you've taken the time to find the trade-off point is impressive. I'll keep your results in mind when optimizing some tasks here. Thanks for putting that together.

I wonder if there's also a consistent difference between data sets with mostly ASCII and those with mostly non-ASCII. Hmm.....

WaltBrown wrote: Test f3 was your f2 but with the list of caps added to be fair with the regexp expression in f1:

"if (k is in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789") then"
Actually, the "is in" operator is case-insensitive by default, so f2 and f3 should be functionally equivalent but with f3 using a larger search space.

I'd added a statement in the results part of my test to compare the output from f1 and f2 to make sure they were returning the same data, so unless I mucked up the replaceText expression in f1 it seems f2 should be a reasonable replacement functionally.
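
A quick way to check the case point, as a small sketch: the variable names tLoose and tStrict are made up, and the caseSensitive property is the standard local property that governs string comparisons.

```livecode
on mouseUp
  -- caseSensitive is false by default, so an uppercase K is found in a lowercase list.
  put ("K" is in "abcdefghijklmnopqrstuvwxyz") into tLoose
  -- With caseSensitive switched on, the same test no longer matches.
  set the caseSensitive to true
  put ("K" is in "abcdefghijklmnopqrstuvwxyz") into tStrict
  -- Shows: true false
  put tLoose && tStrict
end mouseUp
```

The property resets to false when the handler finishes, so f2 as posted always runs with the case-insensitive default unless a handler sets it otherwise.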

Posted: Wed May 27, 2009 5:37 am
by WaltBrown
In a past life I was "Kernel Tuning Boy" for real time comms systems. One of the vagaries of many apps and OSes is what they do when they exceed some limit in a single list of some kind and need to open either a second peer list or a child list (depending on architecture).

This can incur all kinds of new tree structures, buffer swapping, additional layers of redirection, etc. Many (most OSes, I've found) do linear lookups internally, so when a list has to be split into two chunks, say going from 1000 items to 1001 when the list length limit is 1000, that single increment triggers a non-linear drop in performance. Of real interest is that in some OSes, when your list goes back below 1000, they DON'T immediately release the new list structures, leading to performance hysteresis.

I won't dive into it here, but this is one of the reasons we created the IMS Platform Performance Benchmark, as the "normal" SPEC benchmarks did not fully "appreciate" the demands of comms protocols on microcomputer based platforms.