
Test of sqlite speed

Posted: Wed Apr 08, 2009 3:02 am
by chris9610
I have a 3.5 MB file of plain text. This happens to be routing numbers for banks. I wrote an import routine to parse the data and plug it into an SQLite database.

I am running this from the IDE and it takes about 25 minutes to complete the import. This seems slow to me, and I was wondering: once the code is compiled, will the speed pick up?

The text format is one fixed-length record per line, with the 18 fields delimited by position. Each record is 155 characters long.

So the real question is: Will the compiled code run much faster? If so, how much faster should I expect?

Posted: Wed Apr 08, 2009 3:31 am
by FourthWorld
I haven't spent much time with SQLite so I can't answer your question, but it does seem slow.

If the data isn't sensitive, I would be interested in running some benchmarks with it using a different db engine. If you're up for that, feel free to email it to:
ambassador@fourthworld.com

Thanks.

Posted: Wed Apr 08, 2009 4:19 am
by chris9610
For those interested, here is the import code.

Code: Select all

on mouseUp
   global gConID
   answer file "please open csv file"
   if it is empty then exit mouseUp
   put it into theFile
   put url ("file:" & theFile) into tAllRtnums
   local tSQL
   put "INSERT INTO FedAch" & return & \
   "(routenum,officecd,svcfrb,rectypcd, changedt,newroutnum,custname,address,city,state,zip,zip4,areacd,phonepre,phonesuff,instat,datavcd,filler)" & \
   "VALUES(" & return & \
   ":1, :2, :3, :4, :5, :6, :7, :8, :9, :10, :11, :12, :13, :14, :15, :16, :17, :18, :19)" into tSQL
  
   repeat for each line aLine in tAllRtnums
      put aLine into tdata
      add 1 to trecount
      put trecount into tid
      put char 1 to 9 of tdata into trtnum
      put char 10 of tdata into toffcd
      put char 11 to 19 of tdata into tsfrb
      put char 20 of tdata into trtcd
      put char 21 to 26 of tdata into tcdt
      put char 27 to 35 of tdata into tnewrn
      put char 36 to 71 of tdata into tcustname
      put char 72 to 107 of tdata into tadd
      put char 108 to 127 of tdata into tcity
      put char 128 to 129 of tdata into tstate
      put char 130 to 134 of tdata into tzip
      put char 135 to 138 of tdata into tzip4
      put char 139 to 141 of tdata into tacd
      put char 142 to 144 of tdata into ttpre
      put char 145 to 148 of tdata into ttsuff
      put char 149 of tdata into tistat
      put char 150 of tdata into tdatavcd
      put char 151 to 155 of tdata into tfill
      
      -- Insert the data into the table
      local tResult
      revExecuteSQL gConID, tSQL, "tid", "trtnum", "toffcd", "tsfrb", "trtcd", "tcdt", "tnewrn", "tcustname", "tadd", "tcity", "tstate", "tzip", "tzip4", "tacd", "ttpre", "ttsuff", "tistat", "tdatavcd", "tfill"
      put the result into tResult
      add 1 to tcount
      
   end repeat
   answer tcount
end mouseUp
I am new to Revolution so there may be a better way to do the same import.

Posted: Wed Apr 08, 2009 4:53 am
by massung
Chris,

Well, Revolution definitely won't be as fast as C or some other languages. But I'm sure there are things you could do to speed up your code.

First, it's important to find out what's actually slow. You really have three things going on, and discovering where the time sinks are is important:

Disk access.

You are currently reading the 3.5 MB file one line at a time. You already stated that each entry is 155 bytes. That's almost 23,000 lines being read from the file. This is going to be much slower than just reading the entire file in a single shot. To test this, create two buttons: one that reads the file in a single shot, and one that reads it a line at a time. Then compare the two times:

Code: Select all

--read 1 line at a time button
on mouseUp
    local tT0

    put the long seconds into tT0
    open file "my3.5MBFile"
    repeat forever
        read from file "my3.5MBFile" for 1 line
        if the result is "eof" then
            answer the long seconds - tT0 && "seconds to read the file."
            close file "my3.5MBFile"
            exit mouseUp
        end if
    end repeat
end mouseUp

-- read entire file in a single read
on mouseUp
    local tT0

    put the long seconds into tT0
    open file "my3.5MBFile"
    read from file "my3.5MBFile" until eof
    answer the long seconds - tT0 && "seconds to read the file."
    close file "my3.5MBFile"
end mouseUp
Parsing.

You are spending a good hunk of time parsing each line. Currently you are doing this by grabbing characters from within the string. Another possible method would be to use a regular expression. That may be overkill (and it might be slower), but the regular expression goes through PCRE, which is very fast C code. You could test it with the same timing idea used above. Here's a sample match you could finish out...

Code: Select all

matchText(tLine, \
    "(.{9})(.)(.{9})(.)(.{6})(.{9})", \
    trtnum, toffcd, tsfrb, trtcd, tcdt, tnewrn)
Another advantage of using a regular expression is that matchText will return false if something is wrong (letting you know there's an error in the file).
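
For example, inside your import loop you could skip or flag any record that doesn't fit the fixed layout. A sketch only, reusing the same partial pattern as above:

Code: Select all

-- bail out (or just log and skip) on a malformed record
if not matchText(tLine, "(.{9})(.)(.{9})(.)(.{6})(.{9})", \
      trtnum, toffcd, tsfrb, trtcd, tcdt, tnewrn) then
   answer "Malformed record:" && tLine
   exit repeat
end if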

Something else to try, if you can change the original 3.5 MB file, would be to delimit the fields on each line with a character (a comma, "|", or something else) and use Revolution's built-in string parsing chunks:

Code: Select all

put item 1 of tLine into trtnum
put item 2 of tLine into toffcd
...
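If you went with "|" as the delimiter, for instance, the full split would just be something like this (sketch only; the itemDelimiter line is the important part, since item splits on commas by default):

Code: Select all

set the itemDelimiter to "|"
put item 1 of tLine into trtnum
put item 2 of tLine into toffcd
put item 3 of tLine into tsfrb
-- ...and so on through item 18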
The SQL Insert.

Last but not least, there's the SQL insert call. A good test for that would be to create the same query, and run it 23,000 times:

Code: Select all

command testSQLSpeed pConID, pQuery
    local tT0, tTAvg

    put the long seconds into tT0
    repeat 23000 times
        revExecuteSQL pConID, pQuery -- pConID is the ID returned by revOpenDatabase
    end repeat

    get the long seconds - tT0
    put it / 23000 into tTAvg

    answer "Total time:" && it & cr & "Time per INSERT:" && tTAvg
end testSQLSpeed
If you find that most of your bottleneck is in the SQL, there's likely nothing you can do about it (except perhaps bug the Rev team, or switch database apps, which probably isn't a real option for you).

If the bottleneck is in the parsing of each line, then you have options available to you to try. Others here may have other ideas for you to speed it up.

If the time is just in reading the file, then I think you'll be surprised how much faster reading the file in a single shot will be. Sadly, though, once done, you'll have to change your parsing code accordingly.

Well, there's some ideas. Hope they help!

Jeff M.

Posted: Wed Apr 08, 2009 6:09 am
by sturgis
I have another suggestion if it does turn out that the SQLite insertion portion is part of the bottleneck. (I'm guessing it is.)

You might try building up one large transaction (or break it into chunks) and do the insert that way.

If I recall what I read correctly, each individual transaction requires opening, writing to, and closing the journal file. If you use BEGIN TRANSACTION and END TRANSACTION (END TRANSACTION is the same as COMMIT), you only open and close the journal once.

I'm not sure how old this vaguely remembered page is, but if it's still accurate, there should be a large time savings right there.

If I come across the information again, I'll post a link.

Posted: Wed Apr 08, 2009 6:16 am
by sturgis
Found this stuff in the current SQLite docs at http://www.sqlite.org/faq.html#q19.

Don't know if it's possible with the embedded SQLite in Rev, though; wouldn't know where to begin to try it.
(19) INSERT is really slow - I can only do few dozen INSERTs per second

Actually, SQLite will easily do 50,000 or more INSERT statements per second on an average desktop computer. But it will only do a few dozen transactions per second. Transaction speed is limited by the rotational speed of your disk drive. A transaction normally requires two complete rotations of the disk platter, which on a 7200RPM disk drive limits you to about 60 transactions per second.

Transaction speed is limited by disk drive speed because (by default) SQLite actually waits until the data really is safely stored on the disk surface before the transaction is complete. That way, if you suddenly lose power or if your OS crashes, your data is still safe. For details, read about atomic commit in SQLite.

By default, each INSERT statement is its own transaction. But if you surround multiple INSERT statements with BEGIN...COMMIT then all the inserts are grouped into a single transaction. The time needed to commit the transaction is amortized over all the enclosed insert statements and so the time per insert statement is greatly reduced.

Another option is to run PRAGMA synchronous=OFF. This command will cause SQLite to not wait on data to reach the disk surface, which will make write operations appear to be much faster. But if you lose power in the middle of a transaction, your database file might go corrupt.
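
If it is possible, my guess would be that you just send the pragma through revExecuteSQL like any other statement. Completely untested:

Code: Select all

-- untested guess: trade crash safety for speed, per the warning above
revExecuteSQL gConID, "PRAGMA synchronous = OFF;"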

Posted: Wed Apr 08, 2009 8:19 am
by sturgis
It's me again. I did some experimentation to see how much difference the transaction style makes versus the line-by-line inserts, and there is no comparison.

I can shove 60k records into a table in around 3 seconds using the begin/commit method. This is despite my crappy inefficient programming too. To get the approximate time I just grabbed a beginning long time and an ending long time. Not accurate but good enough for this.

When using the alternate method, it takes more than 3 minutes to do the insert. The interesting thing is, if you open the directory that contains your db file, you can actually watch it open and close the journal file. No wonder it's slow.

Here's the code from my two buttons, with some comments.
First, the slow method.

Code: Select all

global gConID
on mouseUp
   --open the db and get its id set
   get revOpenDatabase("sqlite", "testdbsqlite.db", , , , )
   put it into gConID 
   -- drop and recreate the table. 
   revExecuteSQL gConID, "drop table t1;"
   revExecuteSQL gConID, "create table t1 ( t1key INTEGER PRIMARY KEY, data TEXT);"
   -- set first (crude) time check here
   put the long time into startTm
   
   --Same as the other script, but comment out begin and commit
   --revExecuteSQL gConID, "begin;"   
   
   -- loop to create the as yet uncalled transaction entries
   repeat with i = 1 to 60000
      put "Insert into t1 (data) values ('Test Data"  && i & "');" into querString
      revExecuteSQL gConID, querString
   end repeat
   -- This is where the other script actually commits the transaction
   -- revExecuteSQL gConID, "commit;"
   put the long time into endTm
   revCloseDatabase gConID
   answer startTM && endTM
end mouseUp
Next the begin/commit method.

Code: Select all

global gConID
on mouseUp
   --open the db and get its id set
   get revOpenDatabase("sqlite", "testdbsqlite.db", , , , )
   put it into gConID 
   -- drop and recreate the table. 
   revExecuteSQL gConID, "drop table t1;"
   revExecuteSQL gConID, "create table t1 ( t1key INTEGER PRIMARY KEY, data TEXT);"
   -- start the transaction, set first (crude) time check here.
   put the long time into startTm
   revExecuteSQL gConID, "begin;"   
   
   -- loop to create the as yet uncalled transaction entries
   repeat with i = 1 to 60000
      put "Insert into t1 (data) values ('Test Data"  && i & "');" into querString
      revExecuteSQL gConID, querString
   end repeat
   -- actually commit the transaction
   revExecuteSQL gConID, "commit;"
   put the long time into endTm
   revCloseDatabase gConID
   answer startTM && endTM
end mouseUp
Out of curiosity I pushed 400k records in using the faster method. I discovered something interesting. It's a bad idea to do that many at once on a MacBook with low memory. Just thought I'd mention that. *cough*
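
If you do need to push that many at once, a middle ground would be the "break it into chunks" idea from my earlier post: commit every so often so the pending transaction never gets huge. A rough, untested sketch (assumes gConID is already open and table t1 exists, as in the buttons above):

Code: Select all

global gConID
on mouseUp
   -- same insert loop as the fast button, but commit every 50,000 rows
   local querString, tOpen
   put false into tOpen
   repeat with i = 1 to 400000
      if not tOpen then
         revExecuteSQL gConID, "begin;"
         put true into tOpen
      end if
      put "Insert into t1 (data) values ('Test Data" && i & "');" into querString
      revExecuteSQL gConID, querString
      if i mod 50000 = 0 then
         revExecuteSQL gConID, "commit;"
         put false into tOpen
      end if
   end repeat
   -- commit any partial final chunk
   if tOpen then revExecuteSQL gConID, "commit;"
end mouseUp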

Posted: Wed Apr 08, 2009 5:02 pm
by FourthWorld
I just have to say that in both format and content that's an exemplary post, Jeff. We should all aspire to write as clearly.
massung wrote:Well, Revolution definitely won't be as fast as C or some other languages. But I'm sure there are things you could do to speed up your code. ...

Posted: Wed Apr 08, 2009 7:51 pm
by chris9610
Well, thank you for clearing up the issue with SQLite. If I decide to use SQLite for one of my local databases, then I will most definitely use transactions where speed is needed. I believe, as you do, that my bottleneck was SQLite.

Second: Thank you for the excellent examples. This helps me learn more about Revolution and SQL databases. I have limited experience with both, but I am determined to learn all I can about them.

I will rewrite my stack using your suggestions and report the difference.

Have a great day.

Posted: Wed Apr 08, 2009 8:30 pm
by sturgis
I've been re-reading all this, and based on the results of my test with begin/commit, while there is some bottleneck there, I'm thinking the worst bottleneck, as stated above, is the line-by-line file read.

For those in the know, are bottlenecks cumulative? Or (s)lowest common denominator?

Either way, I'd revamp the whole thing to get as much speed out of it as possible.

Let us know how much difference the changes make, if you would? I'd also appreciate a re-posting of your modified code if it's not too much trouble. I'm still trying to learn all this stuff too and it would help a lot.
chris9610 wrote:Well, thank you for clearing up the issue with SQLite. ... I will rewrite my stack using your suggestions and report the difference.

Posted: Wed Apr 08, 2009 11:52 pm
by chris9610
OK, here is the issue: SQLite is the bottleneck.

By commenting out the revExecuteSQL the rest of the routine runs through the 23K lines in under 1 second.

I will now try a transaction to see what happens.

BTW sturgis, I was unable to get the matchText to work for parsing, if that was the purpose of the code you supplied. At this time it seems to me, and correct me if I am wrong, that matchText only returns true or false.

Posted: Wed Apr 08, 2009 11:57 pm
by massung
sturgis wrote:...I'm thinking the worst bottleneck, as stated above, is the line-by-line file read.

For those in the know, are bottlenecks cumulative? Or (s)lowest common denominator?

Either way, I'd revamp the whole thing to get as much speed out of it as possible.
As for your question, it depends on how the code executes. With parallel code, it's the worst of the bunch that determines the time of them all. However, your code (and Revolution in general) is serial: do A, then do B, then do C. This means your total time will be T(A) + T(B) + T(C).

Here at work I have a wonderful MAP file that's 32 MB in size. Just for fun, I decided to run my own timings (take these with a grain of salt given that my machine is different, but ratio-wise, they should be similar).

Read file line by line: 29.802 seconds
Read file in a single read: 0.233 seconds
Read the file 155 characters at a time: 0.525 seconds

That last test is very telling about the internals of Revolution (note: I've been experimenting a lot like this lately, as I'm still evaluating whether or not Rev is right for my team).

I'd bet money that internally, when reading by lines, Rev is reading character by character until it finds a newline. This is bad times. Reading byte-by-byte from a file is incredibly slow. What the Rev team should be doing is caching reads, meaning that even though you only asked for 1 line, it reads 4K from the file and then only returns you the first line of that 4K. When you ask for the next line, it's already in RAM. Repeat until done. If Rev is already doing that internally, then there's something really wrong.

Perhaps someone from the Rev team can comment on these findings?

Anyway, Chris (the OP), you'll likely see a massive speedup just reading 155 characters at a time (what you said was the length of each entry). You may need to read 156 or 157, depending on line endings, though.

But, to continue on with the investigation, let's assume you wanted to go with the fastest option: reading the entire file in a single read. Now you need to actually "parse" the file. Again, you are left with either doing this line by line, or by some other means.

Loading up my 32 MB text file and just doing a simple loop to extract each line basically resulted in Rev locking up. I know it was running, but after several minutes, I didn't care anymore...

Code: Select all

local tCount

put the number of lines in tData into tCount #fast!
repeat with i = 1 to tCount
  get line i of tData # unbelievably slow!
end repeat
Now, there can be several reasons why the loop above is very slow. I'll venture a few guesses, though. Again, hopefully someone from the Rev team can jump in and comment:

1. The line positions aren't being cached. After you get line 1, Rev should "know" where line 2 begins. If that's not the case, when you ask for line 10284, it's actually parsing the previous 10283 lines before it. Bad times.

2. String allocations and garbage collection. Memory allocations take time. And freeing them takes time as well. With large data sets, it's possible we're overrunning an internal threshold for when GC should take place. Basically we're doing way more work than is necessary.

There are other possible reasons as well, but those are the ones that stand out. Again, these are just guesses.
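
For what it's worth, the usual workaround for guess #1 is to avoid asking for lines by number at all and let Rev walk the text once with the repeat for each form (Chris's import loop already does this):

Code: Select all

-- walks tData once instead of re-scanning it for every line index
repeat for each line tLine in tData
   -- work with tLine here
end repeat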

Anyway, Chris, back to your particular problem at hand...

Code: Select all

local tFile, tEntry

open file tFile

-- read each db entry 1 at a time
repeat forever
  read from file tFile for 155 characters  # may need 156 or 157 to swallow the line ending (see above)

  if the result is "eof" then
    exit repeat  # done!
  end if

  put it into tEntry

  # parse tEntry, either with matchText or your old way

  # follow the advice for the committed inserts
end repeat

close file tFile
The code above is going to run orders of magnitude faster for you.

HTH,

Jeff M.

Posted: Thu Apr 09, 2009 12:06 am
by sturgis
It wasn't me who gave you the matchText example, and alas, I've never managed to get the hang of regular expressions.

It should be easy to change your code so that it works the faster way.

Based on the first code you posted, the two following changes will set it up as one transaction rather than individual ones. Look for the comment lines with ### in them to locate the two changed spots.

Let me know how much difference this makes.

Code: Select all

on mouseUp
   global gConID
   answer file "please open csv file"
   if it is empty then exit mouseUp
   put it into theFile
   put url ("file:" & theFile) into tAllRtnums
   local tSQL
   put "INSERT INTO FedAch" & return & \
   "(routenum,officecd,svcfrb,rectypcd, changedt,newroutnum,custname,address,city,state,zip,zip4,areacd,phonepre,phonesuff,instat,datavcd,filler)" & \
   "VALUES(" & return & \
   ":1, :2, :3, :4, :5, :6, :7, :8, :9, :10, :11, :12, :13, :14, :15, :16, :17, :18, :19)" into tSQL
-- ################## Added the line below.
    revExecuteSQL gConID, "begin;"   

   repeat for each line aLine in tAllRtnums
      put aLine into tdata
      add 1 to trecount
      put trecount into tid
      put char 1 to 9 of tdata into trtnum
      put char 10 of tdata into toffcd
      put char 11 to 19 of tdata into tsfrb
      put char 20 of tdata into trtcd
      put char 21 to 26 of tdata into tcdt
      put char 27 to 35 of tdata into tnewrn
      put char 36 to 71 of tdata into tcustname
      put char 72 to 107 of tdata into tadd
      put char 108 to 127 of tdata into tcity
      put char 128 to 129 of tdata into tstate
      put char 130 to 134 of tdata into tzip
      put char 135 to 138 of tdata into tzip4
      put char 139 to 141 of tdata into tacd
      put char 142 to 144 of tdata into ttpre
      put char 145 to 148 of tdata into ttsuff
      put char 149 of tdata into tistat
      put char 150 of tdata into tdatavcd
      put char 151 to 155 of tdata into tfill
     
      -- Insert the data into the table
      local tResult
      revExecuteSQL gConID, tSQL, "tid", "trtnum", "toffcd", "tsfrb", "trtcd", "tcdt", "tnewrn", "tcustname", "tadd", "tcity", "tstate", "tzip", "tzip4", "tacd", "ttpre", "ttsuff", "tistat", "tdatavcd", "tfill"
      put the result into tResult
      add 1 to tcount
     
   end repeat
-- ########## Added the line below to commit
   revExecuteSQL gConID, "commit;"

   answer tcount
end mouseUp 
chris9610 wrote:OK, here is the issue: SQLite is the bottleneck. ... BTW sturgis, I was unable to get the matchText to work for parsing ... it seems to me that matchText only returns true or false.
@massung, thanks for the explanation. If I get the gist of what you said, the answer to "are bottlenecks cumulative or lowest common denominator?" is yes... depending.

In the case of the script Chris is working on, and since, as you stated, Rev is serial, cumulative would be the answer here, right?

Posted: Thu Apr 09, 2009 12:19 am
by massung
sturgis wrote:@massung, ...In the case of the script Chris is working on, and since, as you stated, Rev is serial, cumulative would be the answer here, right?
Correct.

Just for a fun follow-up, I ran one more test on my 32 MB file to test caching myself, to compare against what Rev should be doing under the hood (but isn't): reading 4K at a time, parsing each line from that 4K, then reading 4K more, rinse, repeat...

Total time = 0.72 seconds

As you can see, using Rev to read a (large) file line-by-line is just horrendously slow. You'll be much better off just doing it yourself.
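
For reference, "doing it yourself" just means buffering the reads and peeling lines off the buffer. This isn't the exact code I timed, just a rough sketch of the idea:

Code: Select all

on readLinesBuffered pPath
   local tBuffer, tEOF, tPos, tLine
   open file pPath for read
   put false into tEOF
   repeat until tEOF and tBuffer is empty
      if not tEOF then
         read from file pPath for 4096  -- pull in 4K at a time
         put (the result is "eof") into tEOF
         put it after tBuffer
      end if
      -- peel complete lines off the front of the buffer
      repeat forever
         put offset(return, tBuffer) into tPos
         if tPos = 0 then exit repeat
         put char 1 to tPos - 1 of tBuffer into tLine
         delete char 1 to tPos of tBuffer
         -- process tLine here
      end repeat
      if tEOF and tBuffer is not empty then
         -- last line had no trailing return
         put tBuffer into tLine
         put empty into tBuffer
         -- process tLine here
      end if
   end repeat
   close file pPath
end readLinesBuffered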

That's very sad. :(

Jeff M.

Posted: Thu Apr 09, 2009 12:46 am
by chris9610
OK, the jig is up. sturgis was correct!

I added 2 lines of code.

Code: Select all

revExecuteSQL gConID, "begin;"
before the repeat
and

Code: Select all

revExecuteSQL gConID, "commit;"
after end repeat

This solved the problem and execution time went from 25+ minutes to 12 seconds. No other changes.

This is fast enough for me.