Page 1 of 2

Need for Speed

Posted: Sat Sep 10, 2022 10:41 pm
by zink137
I’m generating large sets of integers (millions) which takes just a few seconds and then sorting them numerically (another couple seconds) and then deleting all the multiples which takes hours.

Repeat with zz=the number of lines of BigTemp down to 1 --delete duplicates
if line zz of BigTemp=line zz-1 of BigTemp then delete line zz-1 of BigTemp
end Repeat

How can the deletion process be speeded up? “Lines” deletes about twice as fast as “words”

Re: Need for Speed

Posted: Sat Sep 10, 2022 10:46 pm
by richmond62
Can I ask a few questions first?

1. How are you generating this large list numbers?

2. What is the advantage of generating a list that can contain duplicates so you then have to 'weed the garden'?

Re: Need for Speed

Posted: Sat Sep 10, 2022 11:00 pm
by zink137
on mouseUp
put the milliseconds into timeOn
put empty into fld 1

repeat with xx=2 to 50000
put xx into zz
repeat until zz=1

if zz mod 2 =1 then
put ((zz*3)+1) into zz
put zz &return after BigTemp
else
put zz/2 into zz
put zz &return after BigTemp
end if

end repeat
end repeat

put BigTemp into fld 1
put the number of lines of BigTemp&return&return before fld 1

put (the milliseconds-timeOn)/1000&" seconds"

end mouseUp

This is part of an investigation into the Collatz numbers.

Re: Need for Speed

Posted: Sat Sep 10, 2022 11:13 pm
by richmond62
Tangentially . . .

Why is SORT not very good?
-
SShot 2022-09-11 at 1.10.59.png
-
with this:

Code: Select all

on mouseUp
   sort lines of fld "TANK" descending
end mouseUp
results in this:
-
SShot 2022-09-11 at 1.11.17.png
-
AND even I, with my limited knowledge of Mathematics, know that 989 is bigger than 99.

Re: Need for Speed

Posted: Sat Sep 10, 2022 11:15 pm
by richmond62
https://en.wikipedia.org/wiki/Collatz_conjecture

Well, well, well: one learns something every day. 8)

Re: Need for Speed

Posted: Sat Sep 10, 2022 11:18 pm
by richmond62
OK:

Code: Select all

on mouseUp
   sort lines of fld "TANK" descending numeric
end mouseUp
numeric

Re: Need for Speed

Posted: Sat Sep 10, 2022 11:22 pm
by richmond62
So, over here [MacOS 13, 8 GB RAM] your script took just over 35 seconds to generate your list of numbers.

Re: Need for Speed

Posted: Sat Sep 10, 2022 11:39 pm
by zink137
My iMac usually generates the list in under 10 seconds. A little over 5 million integers, it's kind of astonishing.

But that's all beside the point. How to eliminate duplicates in a similar timely fashion?

Re: Need for Speed

Posted: Sun Sep 11, 2022 12:02 am
by SparkOut
Firstly, the reason for the slow iteration through your loop is that in the "repeat with <index>" version the engine has to keep track of every one of those millions of lines or words and count through the list every time to reach the index.
Generally it is far better speedwise to use the "repeat for each <word/item/line>" version. You can't then directly manipulate the original list, so in the processing part you would normally make your checks and append correct or desired results to a new variable. In this case you might

Code: Select all

put empty into tPrevTemp
put empty into tNewBigTemp  --not necessary if these are really transient variables 
repeat for each line tTemp in BigTemp 
  if tTemp is not tPrevTemp then
  .  put tTemp & cr after tNewBigTemp
  .  put tTemp into tPrevTemp
  end if
end repeat
--use tNewBigTemp in place of BigTemp instead
Personally I find the best way of deduplicating a list is to add 1 to the contents of an array key where the keys are the values being deduplicated (works with numbers or text, whatever).
So when you have your zz variable, instead of appending it to BigTemp you would

Code: Select all

add 1 to BigTempArray[(zz)]
What that means is that the unique elements will have a value of 1, and the non unique ones will have a higher value but still only the one key. The values ate irrelevant, so you can just

Code: Select all

put the keys of BigTempArray into tBigTemp
sort tBigTemp numeric ascending 

Re: Need for Speed

Posted: Sun Sep 11, 2022 12:07 am
by FourthWorld
It would be interesting to see a comparison with the array option that used to be included as an example of finding unique values, e.g.:

Code: Select all

function UniqueVals pList
  repeat for each line tLine in pList
     put tLine into tA[tLine]
  end repeat
  get the keys of tA
  sort lines of it numeric
  return it
end UniqueVals
Edit: Sparkout beat me to it. :) Would be good to know results just the same.

Re: Need for Speed

Posted: Sun Sep 11, 2022 12:17 am
by richmond62
Hmm: got a bit bogged down:

Code: Select all

on mouseup
   put empty into fld "fLINES2"
   put empty into fld "TANK2"
   put fld "TANK" into XYZ
   get DOUBLETROUBLE(XYZ)
   put it after fld "TANK2"
   put ((the number of lines in fld "TANK2") && "lines >>>") into fld "fLINES2"
end mouseup

function DOUBLETROUBLE WXYZ
   split WXYZ by return and comma
   combine WXYZ by return and comma
   sort WXYZ numeric
   return WXYZ
end DOUBLETROUBLE
-
SShot 2022-09-11 at 2.18.40.png

Re: Need for Speed

Posted: Sun Sep 11, 2022 12:25 am
by richmond62
Removed link as improved version below.

BE WARNED: On a Mac, once it has generated the Collatz numbers the stack BLOATS to about 89 MB. :?

Re: Need for Speed

Posted: Sun Sep 11, 2022 12:34 am
by richmond62
The commas are a curse.

This IS better (aesthetically at least, as it bungs a space at the end of each number rather than a comma):
-

Code: Select all

on mouseup
   set the vis of img "COOKING" to true
   put empty into fld "fLINES2"
   put empty into fld "TANK2"
   put fld "TANK" into XYZ
   get DOUBLETROUBLE(XYZ)
   put it after fld "TANK2"
   put ((the number of lines in fld "TANK2") && "lines >>>") into fld "fLINES2"
   set the vis of img "COOKING" to false
end mouseup

function DOUBLETROUBLE WXYZ
   split WXYZ by return and space
   combine WXYZ by return and space
   sort WXYZ numeric
   return WXYZ
end DOUBLETROUBLE
UNFORTUNATELY one cannot do this:

Code: Select all

function DOUBLETROUBLE WXYZ
   split WXYZ by return and space
   combine WXYZ by return
   sort WXYZ numeric
   return WXYZ
end DOUBLETROUBLE

Re: Need for Speed

Posted: Sun Sep 11, 2022 12:42 am
by richmond62
Ah, got it at last . . . 8)

Code: Select all

function DOUBLETROUBLE WXYZ
   split WXYZ by return and space
   combine WXYZ by return and null
   sort WXYZ numeric
   return WXYZ
end DOUBLETROUBLE
-
SShot 2022-09-11 at 2.50.33.png

Re: Need for Speed

Posted: Sun Sep 11, 2022 1:06 am
by FourthWorld
Good call on using both split and combine. I was thinking of the old example for getting frequency counts, but that's not needed here.

The old scripting adage abides:
- Know the engine
- Trust the engine
- Use the engine