Hi,
Jacques remark about speed gave me headaches, so I had to test.
At first I exported a table from a database, using a predefined format in HeidiSQL named "Excel CSV": Semicolon as field delimiter, Return as record delimiter, Quotes around all fields. That's what every spreadsheet can read. The data:
Code: Select all
File size: 4.497.398 Bytes; That's 16.912 Records à 38 Fields
Example (3 records):
"172";"30";"55";"1";"A product name xyz";"";"";"0000055";"6";"3,8500";"";"2014-10-02";"1";"2";"angelegt";"";"";"2014-10-05 19:14:29";"0,00";"0,00";"0,00";"0";"0";"0,00";\N;"0,00";"0,00";"0,00";"0,00";"0,00";"";"0,00";"0";"0";"0";"";"0";"0"
"171";"30";"54";"1";"Another product name abc";"";"";"0000054";"6";"7,6500";"";"2014-10-02";"1";"1";"angelegt";"";"";"2014-10-05 19:14:02";"0,00";"0,00";"0,00";"0";"0";"0,00";\N;"0,00";"0,00";"0,00";"0,00";"0,00";"";"0,00";"0";"0";"0";"";"0";"0"
"166";"28";"254";"1";"Still a product name 123";"";"";"0000254";"144";"1,2500";"";"2014-10-04";"1";"7";"angelegt";"ermaessigt";"";"2014-10-05 18:57:34";"0,00";"0,00";"0,00";"0";"0";"0,00";\N;"0,00";"0,00";"0,00";"0,00";"0,00";"";"0,00";"0";"0";"0";"";"0";"0"
(I added a Return here between the records for better distinguishing. And I don't know why there's 4 Spaces in the name field, but such happens ...)
At first, I try some simple stuff - it's German data, so the decimal divider is Comma. Let's make it a Point. And while we're at it, let's replace Quotes with Single Quotes:
The traditional way using "repeat with i", HC style.
Code: Select all
put fld "csv_fld_org" into MyVar
put 0*1 into MyCounter
put the milliseconds into MyStart
repeat with i = 1 to len(MyVar)
put char i of MyVar into MyChar
if MyChar = comma then
put "." into char i of MyVar
add 1 to MyCounter
else if MyChar = quote then
put "'" into char i of MyVar
add 1 to MyCounter
end if
end repeat
put MyCounter & " replacements / " & the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
Result:
6.7.10: 1425178 replacements / 2960 ms;
8.0: 1425178 replacements / 13400 ms;
(all time values are rounded, the last non-zero digit can vary a bit)
Now "repeat for each":
Code: Select all
put fld "csv_fld_org" into MyVar
put 0*1 into MyCounter
put the milliseconds into MyStart
repeat for each char MyChar in MyVar
if MyChar = comma then
put "." into MyChar
add 1 to MyCounter
else if MyChar = quote then
put "'" into MyChar
add 1 to MyCounter
end if
end repeat
put MyCounter & " replacements / " & the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
Result:
6.7.10: 1425178 replacements / 1970 ms;
8.0: 1425178 replacements / 10100 ms;
Slightly faster. Just - this doesn't change MyVar! It only plays with its own variables ;-) Left as an exercise to the reader to find out why ...
Btw, see the numbers for 8.0! Ouch!
But now I want speed:
Code: Select all
put fld "csv_fld_org" into MyVar
put the milliseconds into MyStart
replace comma with "." in MyVar
replace quote with "'" in MyVar
put the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
Result:
6.7.10: 48 ms;
8.0: 32 ms;
This is the way to go! And here 8.0 looks quite well, too.
-----------------------
Here I decided to do some real world tests, and to try the building of an output variable instead of changing the data in place. For this I reduced the data to the first 5000 records.
To start, a little cleanup:
Code: Select all
put fld "csv_fld_org" into MyVar
put the milliseconds into MyStart
replace "\N" with quote & quote in MyVar -- replace the Null values
replace quote & return & quote with return in MyVar -- strip the data
delete the first char of MyVar
delete the last char of MyVar
replace quote & ";" & quote with numtochar(17) in MyVar -- get nice itemdel
put the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
Result:
6.7.10: 53 ms;
8.0: 42 ms;
Now that we have clean data, let's go:
At first changing in place:
Code: Select all
put fld "csv_fld" into MyVar
put 0*1 into MyCounter
put the milliseconds into MyStart
set the itemdel to numtochar(17)
repeat with i = 1 to the number of lines in MyVar -- go through the records & mod the fields:
replace "," with "." in item 10 of line i of MyVar -- a German currency field 1,00 -> 1.00
put killSpaces(item 5 of line i of MyVar) into item 5 of line i of MyVar -- get rid of unused spaces
put item 12 of line i of MyVar into MyDate -- date conversion: 2014-10-02 -> 10/02/14
set the itemdel to "-"
put item 2 of MyDate & "/" & item 3 of MyDate & "/" & char -2 to -1 of item 1 of myDate into MyDate
set the itemdel to numtochar(17)
put MyDate into item 12 of line i of MyVar
add 1 to MyCounter
end repeat
replace numtochar(17) with ";" in MyVar -- set back, maybe you want to check it in a spreadsheet?
put MyCounter & " replacements / " & the milliseconds - MyStart & " ms;"
beep
put MyVar into fld "csv_fld"
Result:
6.7.10: 5000 replacements / 23500 ms;
8.0: 5000 replacements / 24400 ms;
Btw., the function to kill the spaces:
Code: Select all
function killSpaces MyStr
repeat until offset(" ", MyStr) = 0
replace " " with " " in MyStr
end repeat
return MyStr
end killSpaces
But now with building an output variable:
Code: Select all
put fld "csv_fld" into MyData
put 0*1 into MyCounter
put the milliseconds into MyStart
set the itemdel to numtochar(17)
repeat with i = 1 to the number of lines in MyData -- go through the records & mod the fields:
put line i of MyData into MyVar -- only working on a copy now!
replace "," with "." in item 10 of MyVar -- a German currency field 1,00 -> 1.00
put killSpaces(item 5 of MyVar) into item 5 of MyVar -- get rid of unused spaces
put item 12 of MyVar into MyDate -- date conversion: 2014-10-02 -> 10/02/14
set the itemdel to "-"
put item 2 of MyDate & "/" & item 3 of MyDate & "/" & char -2 to -1 of item 1 of myDate into MyDate
set the itemdel to numtochar(17)
put MyDate into item 12 of MyVar
add 1 to MyCounter
-- put MyVar into line i of MyOutput -- SLOWER!
put MyVar & return after MyOutput
end repeat
replace numtochar(17) with ";" in MyOutput -- set back, maybe you want to check it in a spreadsheet?
put MyCounter & " replacements / " & the milliseconds - MyStart & " ms;"
beep
put MyOutput into fld "csv_fld"
Result:
6.7.10:
put MyVar & return after MyOutput: 5000 replacements / 2960 ms;
put MyVar into line i of MyOutput: 5000 replacements / 10040 ms;
8.0:
put MyVar & return after MyOutput: 5000 replacements / 3980 ms;
put MyVar into line i of MyOutput: 5000 replacements / 10250 ms;
Now this is interesting! Not only that building an output var is so much faster, the difference in how to do it, too!
I learned quite a bit here. Thus I wrote this, maybe it will help others, too.
Seems my old HC habits will have to change, LC is too much different - and should I ever switch to LC 8, I'll have to change again - Arrgghh!
---------------------------
Anyways, re-reading the OP I try now with larger files: 17MB, 4x my initial table.
I add a line "put i" to see the progress, and retry the fastest variable again, with 5000 records first:
6.7.10: 5000 replacements / 9878 ms; 1,9756 ms/record
(3980 ms w/o the "put it")
And now the big file:
Clean-up: 720 ms;
6.7.10: 67648 replacements / 813392 ms; 12,0238 ms/record
Noticeable, indeed. How to circumvent this slowdown?
FourthWorld wrote:I can't help but wonder if there's a way to do replacements across the entire data set without using any loops at all.
Hmmm. Pointer arithmetic? Must make stop. Must think.
Have fun!