Compare comma-delimited strings

Posted: Wed Sep 28, 2016 11:37 pm
by RossG
Other than comparing item by item, is there a way to find
duplicates?

My prog produces sets of eight numbers from a larger
string and often produces duplicates so I might have

"1,2,3,4,5,6,7,8"
"1,2,3,4,5,6,7,8"

I could delete the commas and use the "=" operator.

Any other ways?

Re: Compare comma-delimited strings

Posted: Wed Sep 28, 2016 11:50 pm
by FourthWorld
You can still use "=" on the whole string, commas and all, as it'll treat it as a string.
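
A minimal sketch of that comparison, assuming the two sets are already in variables (the names tFirst and tSecond and the mouseUp handler are only illustrative):

```livecode
on mouseUp
   local tFirst, tSecond
   put "1,2,3,4,5,6,7,8" into tFirst
   put "1,2,3,4,5,6,7,8" into tSecond
   -- The commas keep either value from being read as a single number,
   -- so "=" compares the two values as text
   if tFirst = tSecond then
      answer "Duplicate"
   else
      answer "Different"
   end if
end mouseUp
```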

If you want to reduce the list to only unique strings this would work:

Code:

repeat for each line tLine in tList
  put tLine into tArray[tLine]
end repeat
put the keys of tArray into tUniqueList
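
The same idea could be wrapped in a reusable function. This is only a sketch: the name uniqueLines and the parameter pList are made up for illustration, and it stores a small placeholder value because only the array keys are needed.

```livecode
function uniqueLines pList
   local tSeen
   repeat for each line tLine in pList
      -- Identical lines map to the same key, so duplicates collapse
      put true into tSeen[tLine]
   end repeat
   -- The keys come back one per line, in no guaranteed order
   return the keys of tSeen
end uniqueLines
```

It could then be called along the lines of `put uniqueLines(field "Sets") into field "Unique"`, where both field names are hypothetical. If the original order matters, the result would need sorting or a different approach.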

Re: Compare comma-delimited strings

Posted: Thu Sep 29, 2016 12:26 am
by RossG
Richard
Thanks for those "magic" words.
I made a test stack and it didn't seem
to like the commas.
After reading your reply I tried it again
and it worked.
Darnedest thing.

Re: Compare comma-delimited strings

Posted: Thu Sep 29, 2016 12:35 am
by FourthWorld
It's like any tech support issue, Ross: the moment you get someone to help you the problem no longer shows itself. :)

Re: Compare comma-delimited strings

Posted: Thu Sep 29, 2016 7:51 pm
by AxWald
Hi,

Fun fact: Above 24 integers in the line you're faster doing an SHA1 hash, and comparing this!

Have fun!

Re: Compare comma-delimited strings

Posted: Thu Sep 29, 2016 8:31 pm
by FourthWorld
Ax, I'm not following: when comparing each line in a collection to look for duplicates, how would adding a call to the computationally-intensive SHA1digest function outperform the simpler "=" operator?

Re: Compare comma-delimited strings

Posted: Fri Sep 30, 2016 9:56 am
by AxWald
Hi Richard,

I made a test:
When hashing the lines, the computation time per line stays roughly the same.
Without hashing, the computation time rises the longer the line is.
At ~24 items in the line the times are equal.
Above 24 items the hashing is faster :)

See the attached quick-and-dirty demo stack: change the script of the "load" btn to get different values for comparison. You may also want to add duplicate lines manually, since the pseudo-random part often works too well ...

Have fun!

PS: Assume the reason is that the hash always has the same length (40 chars) ...

Edit: For LC 8 the "magic number" is about 10 higher (~34), seems text comparison is better there. My stack was initially tested with 6.7.10.
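
For readers without the demo stack, here is a rough sketch of the general idea of hashing each line once and then working with the fixed-length digests. It is not the code from the attached stack, which isn't reproduced in this thread; the function name uniqueLinesByHash and its variables are assumptions. It uses the built-in sha1Digest and binaryDecode functions to get the 40-character hex form mentioned above.

```livecode
function uniqueLinesByHash pList
   local tSeen, tKey, tResult
   repeat for each line tLine in pList
      -- sha1Digest returns 20 bytes of binary data;
      -- binaryDecode with "H*" turns them into 40 hex characters
      get binaryDecode("H*", sha1Digest(tLine), tKey)
      if tSeen[tKey] is empty then
         -- First time this digest has appeared: keep the line
         put true into tSeen[tKey]
         put tLine & return after tResult
      end if
   end repeat
   -- Drop the trailing return left by the loop
   delete the last char of tResult
   return tResult
end uniqueLinesByHash
```

Unlike taking the keys of an array, this keeps the lines in their original order. Whether it is actually faster than plain string comparison will depend on line length, LC version, and machine, as the numbers in this thread suggest.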

Re: Compare comma-delimited strings

Posted: Fri Sep 30, 2016 11:59 pm
by RossG
Attached is my test stack with a file of number sets.

Any better solutions?

Re: Compare comma-delimited strings

Posted: Sat Oct 01, 2016 7:37 am
by richmond62
If Livecode is "feeling funny" about a comma-delimited list you could always replace the commas with something else:
[Attached image: StriptheW.png]
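
The screenshot isn't available here, so the following is only a guess at the kind of thing meant, using LiveCode's replace command. The variable name tSet and the choice of "|" as the substitute character are arbitrary.

```livecode
on mouseUp
   local tSet
   put "1,2,3,4,5,6,7,8" into tSet
   -- Swap every comma for another character before comparing
   replace comma with "|" in tSet
   answer tSet
end mouseUp
```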

Re: Compare comma-delimited strings

Posted: Sat Oct 01, 2016 5:29 pm
by FourthWorld
AxWald wrote:
When hashing the lines the computation time/ line stays roughly the same.
The computation time w/o hash rises the longer the line is.
At ~24 items in the line the times are equal.
Above 24 items the hashing is faster :)

Thanks for that test stack. Makes sense when using strings longer than the ones Ross showed, since SHA1 will reduce the string to a 20-byte value, so once we get past a certain length the overhead of SHA1digest is more than offset by the savings in the shorter comparisons.

There may be some variance due to CPU speed and/or instruction set features, as here I get a slightly slower score for the hash option at 25 items, but bumping that up to 50 shows hashing the clear winner.

While I had your handy test stack in hand I got curious about performance differences across recent LC versions, and found that LC v8 is roughly on par with v6 and much faster than v7. The latter isn't surprising: v8 includes an optimization for lineOffset and some other operations that allows more specialized handling of different delimiter lengths with Unicode than was first implemented when Unicode support premiered in LC v7.

FWIW here are my results, running under Ubuntu 14.04 on a Haswell G3220 @3 GHz:

LC v6.7
-----------------------
25 items - String - 0 hits: 220 ms
25 items - Hash - 0 hits: 236 ms
25 items - String - 26 hits: 231 ms
25 items - Hash - 26 hits: 251 ms
--
50 items - String - 0 hits: 428 ms
50 items - Hash - 0 hits: 235 ms
50 items - String - 7 hits: 426 ms
50 items - Hash - 7 hits: 230 ms

LC v7.0.4
-----------------------
25 items - String - 0 hits: 357 ms
25 items - Hash - 0 hits: 375 ms
25 items - String - 26 hits: 419 ms
25 items - Hash - 26 hits: 427 ms
--
50 items - String - 0 hits: 713 ms
50 items - Hash - 0 hits: 374 ms
50 items - String - 7 hits: 728 ms
50 items - Hash - 7 hits: 383 ms

LC v8.1.1 RC1
-----------------------
25 items - String - 0 hits: 192 ms
25 items - Hash - 0 hits: 208 ms
25 items - String - 26 hits: 243 ms
25 items - Hash - 26 hits: 254 ms
--
50 items - String - 0 hits: 432 ms
50 items - Hash - 0 hits: 203 ms
50 items - String - 7 hits: 451 ms
50 items - Hash - 7 hits: 224 ms

Re: Compare comma-delimited strings

Posted: Sun Oct 02, 2016 9:49 pm
by AxWald
Hi,

FourthWorld wrote:
[...] discovering that LC v8 is roughly on par with v6 [...]

Actually it's a bit faster, even faster than 6.5.1 ...

Btw., playing on another machine I found that the "magic number" actually is machine-dependent. And something strange:

LC 8.02 (stable) & 8.1.1 (rc1) give identical results; LC 8.1 (stable) is identical in string compare too, but significantly slower (+ 20%) while hashing.

Have fun!