sritcp wrote:FourthWorld wrote:.... I would imagine that if the number of found elements were even close to half of the full data set the overhead of explicitly managing the lines between the found substring and the actual target phone number would be greater than even the setup cost for the array.....
This is the part that interests me the most -- relating tool effectiveness to data characteristics..
That's both the fascination and bane of benchmarking: we often find that the fastest solution for a given problem will depend on the specifics of that problem. As with so many other things in life, there's rarely a single "best" solution. Often I'll just settle for a compromise in performance if the solution will be used in multiple contexts, but if it's a one-off that needs to run fast it can sometimes be worth the extra effort to tailor an algo for the data being acted on.
The filter command is a good example. Suppose we had a multi-column list in which we want to search for matches in a given column, say records where the first name is "Sri". If the name is in the second column we could use something like this:
Code: Select all
filter tList with ("*" & tab & "Sri" & tab & "*")
...and that would be pretty fast. But if the name column were much deeper into each row, say the 10th column, we'd need a much more complex filter:
Code: Select all
filter tList with ("*" & tab & "*" & tab & "*" & tab & "*" & tab & "*" & tab & "*" & tab & "*" & tab & "*" & tab & "*" & tab & "Sri" & tab & "*")
...and all those bits and pieces would require much more processing time, with the cost growing with the number of wildcard criteria used and with the length of each item.
So while filter is often recommended as a quick way to find matching lines, I've seen cases where even a simple "repeat for each" will outperform it, depending on the nature of the data and what I need to do with it.
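For comparison, here's a minimal sketch of the "repeat for each" alternative. The handler name and parameter names are just made up for illustration, and it assumes tab-delimited columns. Whether it actually beats filter will depend on your data, so it's worth timing both:

```livecode
-- Hypothetical helper: returns the lines of pList whose item pColumn equals pValue.
function linesWithItemValue pList, pColumn, pValue
   local tResult
   set the itemDelimiter to tab
   repeat for each line tLine in pList
      -- Only the one column of interest is examined on each line
      if item pColumn of tLine is pValue then
         put tLine & return after tResult
      end if
   end repeat
   delete the last char of tResult -- drop the trailing return
   return tResult
end linesWithItemValue
```

You'd call it with something like "put linesWithItemValue(tList, 10, "Sri") into tMatches". Unlike the wildcard pattern, the comparison here is tied to one specific column, so the script stays the same no matter how deep into the row that column is.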
I wonder how Bernd's "search string as delimiter" method will hold up when we increase the proportion of the "founds". (By the way, I still use LC 6.7; so it came as a -- pleasant -- surprise to learn that LC 7 permits the use of arbitrary strings as delimiters. I can imagine very complex and specific searches using this feature).
As Bernd's results show, the code base refactoring needed for Unicode and other foundational elements benefiting v8 has yielded performance degradation nearly across the board. There are some things which are a bit faster, but many which are slower.
Now that the refactoring is complete, the team has been exploring opportunities for optimization. One of these was particularly clever: avoiding unneeded memcpy calls by employing a copy-on-write policy for data passed to functions and commands. In the past we had to explicitly declare when a variable should be passed by reference or by value. Now all data is passed by reference, so it just copies a 4-byte pointer rather than the entire data being pointed to, and as long as the data is never altered that works great. If the receiving handler alters the data, a copy of the data is created at that point, unless the parameter has been explicitly declared as pass-by-reference. This means optimal performance for argument passing, and requires no changes to our scripts. They're continuing to look for other such opportunities as they go.
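To picture the three cases, here's a rough sketch. The handler names are hypothetical, and the comments only restate the behavior described above rather than anything I've measured in the engine:

```livecode
-- Read-only use: per the description above, no copy of pData should be needed.
function countLines pData
   return the number of lines of pData
end countLines

-- Alters its argument: a private copy would be made at the point of the write,
-- so the caller's variable is left unchanged.
function firstLineUpper pData
   put toUpper(line 1 of pData) into line 1 of pData
   return pData
end firstLineUpper

-- Explicit pass-by-reference (@): no copy, and the caller's variable is changed.
command upperFirstLine @pData
   put toUpper(line 1 of pData) into line 1 of pData
end upperFirstLine
```

From the script's point of view nothing changes: the first two handlers still behave like ordinary pass-by-value, and the copy simply gets deferred until it's actually needed.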