....I'm thinking the worst bottleneck, as stated above, is the line-by-line file read.
For those in the know, are bottlenecks cumulative? Or (s)lowest common denominator?
Either way, I'd revamp the whole thing to get as much speed out of it as possible.
As for your question, it depends on how the code executes. With parallel code, it's the slowest of the bunch that determines the time of them all. Your code (and Revolution in general), however, is serial: do A, then do B, then do C. This means your total time will be T(A) + T(B) + T(C).
Here at work I have a wonderful MAP file that's 32 MB in size. Just for fun, I decided to run my own timings (take these w/ a grain of salt given that my machine is different, but ratio-wise, they should be similar).
Read file line by line: 29.802 seconds
Read file in a single read: 0.233 seconds
Read the file 155 characters at a time: 0.525 seconds
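If you want to reproduce the comparison on your own file, a harness along these lines should do it (a rough sketch only; the path is a placeholder and the results just go to the message box):
Code:
local tFile, tStart
put "C:/test/big.map" into tFile   -- placeholder path

-- 1. line by line
put the milliseconds into tStart
open file tFile for read
repeat forever
   read from file tFile for 1 line
   if the result is "eof" then exit repeat
end repeat
close file tFile
put "line by line:" && (the milliseconds - tStart) && "ms" & cr after msg

-- 2. single read
put the milliseconds into tStart
open file tFile for read
read from file tFile until EOF
close file tFile
put "single read:" && (the milliseconds - tStart) && "ms" & cr after msg

-- 3. 155 characters at a time
put the milliseconds into tStart
open file tFile for read
repeat forever
   read from file tFile for 155 characters
   if the result is "eof" then exit repeat
end repeat
close file tFile
put "155 chars at a time:" && (the milliseconds - tStart) && "ms" & cr after msg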
That last test (reading 155 characters at a time) is very telling about the internals of Revolution. (Note: I've been experimenting a lot like this lately, as I'm still evaluating whether or not Rev is right for my team.)
I'd bet money that internally, when reading by lines, Rev is reading character by character until it finds a newline. This is bad times: reading byte-by-byte from a file is incredibly slow. What the Rev team should be doing is caching reads. Meaning that even though you only asked for one line, it reads 4K from the file and returns you just the first line of that 4K. When you ask for the next line, it's already in RAM. Repeat until done. If Rev is already doing that internally, then something is really wrong.
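To make the caching idea concrete, here's roughly what it looks like if you do it yourself in script (a sketch only, with my own handler names, assuming linefeed-terminated lines and a 4K cache size):
Code:
local sFileID, sBuffer

command openCachedFile pPath
   put pPath into sFileID
   put empty into sBuffer
   open file sFileID for text read
end openCachedFile

function readCachedLine
   local tLine
   -- top up the cache until it holds at least one complete line or we hit EOF
   repeat until sBuffer contains linefeed
      read from file sFileID for 4096 characters
      put it after sBuffer
      if the result is "eof" then exit repeat
   end repeat
   if sBuffer is empty then return empty   -- nothing left
   put line 1 of sBuffer into tLine
   delete line 1 of sBuffer
   return tLine
end readCachedLine

command closeCachedFile
   close file sFileID
end closeCachedFile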
Perhaps someone from the Rev team can comment on these findings?
Anyway, Chris (the OP), you'll likely see a massive speedup just reading 155 characters at a time (what you said was the length of each entry). You may need to read 156 or 157, depending on line endings, though.
But, to continue the investigation, let's assume you want to go with the fastest option: reading the entire file in a single read. Now you need to actually "parse" the file. Again, you're left doing this either line by line or by some other means.
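For reference, reading the whole thing in one shot is a one-liner either way (the path is just a placeholder):
Code:
local tFile, tData
put "C:/test/big.map" into tFile   -- placeholder path

-- the URL shortcut...
put URL ("file:" & tFile) into tData

-- ...or the open/read/close equivalent:
-- open file tFile for read
-- read from file tFile until EOF
-- put it into tData
-- close file tFile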
Loading up my 32 MB text file and just doing a simple loop to extract each line basically resulted in Rev locking up. I know it was running, but after several minutes I didn't care any more...
Code:
local tData, tCount

# tData holds the entire contents of the file
put the number of lines in tData into tCount # fast!
repeat with i = 1 to tCount
   get line i of tData # unbelievably slow!
end repeat
Now, there could be several reasons why the loop above is so slow. I'll venture a few guesses. Again, hopefully someone from the Rev team can jump in and comment:
1. The line positions aren't being cached. After you get line 1, Rev should "know" where line 2 begins. If that's not the case, when you ask for line 10284, it's actually parsing the previous 10283 lines before it. Bad times. (There is a script-level workaround for this; see the sketch just below.)
2. String allocations and garbage collection. Memory allocations take time. And freeing them takes time as well. With large data sets, it's possible we're overrunning an internal threshold for when GC should take place. Basically we're doing way more work than is necessary.
There are other possible reasons as well, but those are the ones that stand out. Again, these are just guesses.
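For what it's worth, guess #1 at least has a workaround at the script level: the "repeat for each" form walks the text with its own internal cursor instead of re-scanning from line 1 on every pass, so it stays fast even on very large variables. A minimal sketch:
Code:
local tData, tLine
-- tData holds the entire file contents
repeat for each line tLine in tData
   -- work with tLine here (parse it, insert into the db, etc.)
end repeat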
Anyway, Chris, back to your particular problem at hand...
Code:
local tFile, tEntry

-- tFile holds the path to your data file
open file tFile for read
repeat forever
   -- read each db entry 1 at a time (155 characters per entry)
   read from file tFile for 155 characters
   if the result is "eof" then
      exit repeat # done!
   end if
   put it into tEntry
   # parse tEntry, either with matchText or your old way
   # follow the advice for the committed inserts
end repeat
close file tFile
The code above is going to run orders of magnitude faster for you.
HTH,
Jeff M.