Posted: Thu Sep 17, 2009 12:04 am
by bn
Garrett,
there is a flaw in the script: when reading data in chunks, one never knows where the line ending is. My idea is to read the chunk, then keep reading up to the next line ending, and only then run "repeat for each line" over the result.
The relevant part of the code is:
Code:
repeat until tReachedTheEnd
   read from file varFileToRead for 1000000
   if the result is "eof" then
      put true into tReachedTheEnd
   end if
   put it into tPartText
   -- the read could have ended before a line end, let's go on and read until the line is complete
   -- this way we should always have complete lines
   if not tReachedTheEnd then
      --breakpoint
      read from file varFileToRead for 1 line
      put it after tPartText
   end if
   -- ... process the lines of tPartText here ...
end repeat
I still have not figured out where the counting goes wrong, but with this modification the situation should be easier to understand: no more missing </title>.
regards
Bernd
Posted: Thu Sep 17, 2009 12:53 am
by bn
Garrett,
there seems to be a problem with the line delimiter. When I save my test text file with just a return as the line delimiter, the script works fine. If I save it with carriage return/linefeed as the delimiter, I run into problems.
So maybe you could try to determine which line endings your big file has.
regards
Bernd
Posted: Thu Sep 17, 2009 2:14 am
by Garrett
I'll see if I can extract some of the file without causing any changes to any characters of it, such as line endings, and upload it.
Posted: Thu Sep 17, 2009 10:59 am
by bn
Garrett,
I downloaded 3 articles from Wikipedia (German) in XML format. I tried the script with this XML file and it worked. For debugging, maybe work on small files that you can easily inspect in a text editor.
The version that works for me is this one:
Code:
on mouseUp
   put empty into field "ListLog"
   put "Started: " & the short system date & " - " & the short system time after field "ListLog"
   set the enabled of button "Generate Indexing" to false
   put "...\WikipediaIndex3.dat" into varIndexFile
   put "...\enwiki-20090902-pages-articles.xml" into varFileToRead
   open file varFileToRead for read
   open file varIndexFile for write
   put 0 into varCounter -- chars consumed so far, used to compute each title's position
   put 0 into varTally -- titles found since the last write to the index file
   put false into tReachedTheEnd
   -- repeat 100 times --until tReachedTheEnd
   repeat until tReachedTheEnd
      read from file varFileToRead for 1000000
      if the result is "eof" then
         put true into tReachedTheEnd
      end if
      put it into tPartText
      -- the read could have ended before a line end, let's go on and read until the line is complete
      -- this way we should always have complete lines
      if not tReachedTheEnd then
         --breakpoint
         read from file varFileToRead for 1 line
         put it after tPartText
      end if
      repeat for each line varLineData in tPartText
         if "<title>" is among the chars of varLineData then
            -- position of "<title>" counted from the start of the file
            put (varCounter + offset("<title>",varLineData)) into varCharLoc
            -- index entry: the title text, a "|", then the position
            put char (offset("<title>",varLineData) + 7) to (offset("</title>",varLineData) -1 ) of varLineData & "|" & varCharLoc & cr after varIndexData
            put varTally + 1 into varTally
         end if
         -- the + 1 accounts for the line delimiter that "repeat for each line" strips
         put the number of chars of varLineData + varCounter + 1 into varCounter
         if varTally is 1000 then
            -- every 1000 titles: update the status field and flush the index to disk
            put field "LabelStatusCount" + 1000 into field "LabelStatusCount"
            wait 0 milliseconds
            put 0 into varTally
            write varIndexData to file varIndexFile at eof
            put empty into varIndexData
         end if
      end repeat
      --put varCounter - 1 into varCounter
   end repeat
   set the enabled of button "Generate Indexing" to true
   set the enabled of field "ListLog" to true
   put cr & "Complete: " & the short system date & " - " & the short system time after field "ListLog"
   put field "LabelStatusCount" + varTally into field "LabelStatusCount"
   -- write whatever is left over since the last flush
   write varIndexData to file varIndexFile at eof
   close file varFileToRead
   close file varIndexFile
   answer "Wikipedia data indexed and ready for use."
end mouseUp
When I access the data at the indexed char number, it always starts with <title>theTitleOfTheArticle...
There might still be problems that I don't see, but these can probably be worked out for the indexing. The XML file does indeed use the linefeed character (ASCII 10) as its line delimiter.
regards
Bernd
Posted: Thu Sep 17, 2009 6:20 pm
by FourthWorld
Reading until a specific character like a line ending will be slower than reading for a specified number of bytes, because the engine has to compare each incoming byte as it reads. So in addition to bypassing the issue with determining the correct line endings, you could speed things up by turning your algo inside out:
Rather than reading until a line ending, grab as much data as your buffer can reasonably hold each time through and process all of its lines but the last one, repeating this until you hit EOF, and then process the last line.
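A minimal sketch of that inside-out loop, written in Python rather than Rev purely so it can be run and checked in isolation. The function name iter_lines_chunked and the chunk size are illustrative assumptions, not part of anyone's script in this thread; it also assumes linefeed (ASCII 10) as the line delimiter.

```python
import io

def iter_lines_chunked(stream, chunk_size=1_000_000):
    """Yield complete lines (without the trailing LF) from a binary stream.

    Each pass reads a fixed number of bytes, processes every line except
    the last (which may be cut off), and carries that last piece over
    to the next pass.
    """
    carry = b""  # unfinished last line from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # EOF reached
            break
        lines = (carry + chunk).split(b"\n")
        carry = lines.pop()  # possibly incomplete, hold it back
        yield from lines
    if carry:
        yield carry  # the final line, if the file does not end with LF

# Tiny chunk size to force lines to straddle chunk boundaries.
sample = io.BytesIO(b"alpha\nbeta\ngamma\ndelta")
for line in iter_lines_chunked(sample, chunk_size=4):
    print(line)
```

The only state carried between passes is the partial last line, so no read ever has to scan for a delimiter byte by byte.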
Posted: Thu Sep 17, 2009 8:04 pm
by Garrett
Here's an extract of the 23 gig file, it's near 100 megs uncompressed, and compressed in a zip it's a 35 meg download.
http://www.paraboliclogic.com/misc/enwi ... -chunk.zip
I'll read both replies above this in a bit...
Thanks a bunch

~Garrett
Posted: Thu Sep 17, 2009 9:14 pm
by Philhold
I don't know if this helps. I unzipped the file on Mac OSX and dropped it on BBedit. The line endings are \r.
Cheers
Phil
Posted: Fri Sep 18, 2009 12:16 am
by bn
Garrett,
I think I know what happens:
When I download from
http://en.wikipedia.org/wiki/Special:Export
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople|106366
AfghanistanCommunications
AfghanistanTransportations
AfghanistanMilitary
AfghanistanTransnationalIssues
I get an XML file that uses ASCII 10 as the line delimiter, as on Unix.
In the chunk file you posted the line delimiter is ASCII 13 + ASCII 10, as on Windows. That is why the position number is off. (On top of that, the first 3 chars are high-ASCII chars that don't show up in the direct download from Wikipedia, but they don't bother you.)
In the last working script I posted, change the line that updates varCounter to:
Code:
put the number of chars of varLineData + varCounter + 2 into varCounter
i.e. you add _2_ instead of _1_. Then the script works, and for your whole chunk file the indexed position lands exactly on <title>xxxx.
If it is a Unix file, add 1.
If it is a Windows file, add 2.
BUT if you do a binary read on your Windows file (as I just found out), you only have to add 1 and the script works. Rev then apparently treats return and linefeed each as a line delimiter, which in the script shows up as an empty line, so the counter goes up by one char.
You can look at the ASCII values if you do a _binary_ read. In a simple (non-binary) read, Rev changes the line delimiter to ASCII 10 (linefeed) and gets rid of the ASCII 13 (return).
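The add-1 versus add-2 arithmetic can be checked on a tiny sample. The sketch below is Python, not Rev, and makes two assumptions: a simple read is modelled by replacing CR+LF with LF, and a binary read is modelled by leaving the bytes untouched and splitting on LF only, so that the CR stays in each line's char count. That second assumption is one possible reading of why adding 1 is enough in binary mode; it is not confirmed Rev behaviour. The name title_offsets and its parameter are hypothetical.

```python
def title_offsets(data, line_end_width=1):
    """Return (title, 1-based position of "<title>") pairs using a running counter.

    line_end_width is how many chars to add per line for the delimiter
    that splitting the text into lines has removed.
    """
    index = []
    counter = 0  # chars consumed before the current line
    for line in data.split(b"\n"):
        pos = line.find(b"<title>")
        if pos != -1:
            end = line.find(b"</title>")
            title = line[pos + 7:end].decode("utf-8")
            index.append((title, counter + pos + 1))
        counter += len(line) + line_end_width
    return index

raw_windows = b"<page>\r\n  <title>Anarchism</title>\r\n</page>\r\n"
simple_read = raw_windows.replace(b"\r\n", b"\n")  # models the CR being dropped

print(title_offsets(raw_windows, 1))   # binary-style read, add 1
print(title_offsets(simple_read, 2))   # simple read of a Windows file, add 2
print(title_offsets(simple_read, 1))   # simple read, add 1: drifts by one per line
```

In the first two calls the reported position points at "<title>" in the raw file; in the third it falls one char short for every line that came before.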
Make a stack and, with the following script, import 2000 chars and look for ASCII 10; if an ASCII 13 comes right before it, it is a Windows file.
Code:
on mouseUp
   put "/Users/berndnig/Desktop/enwiki-20090902-pages-chunk.xml" into varFileToRead
   put empty into field 2
   open file varFileToRead for binary read
   read from file varFileToRead for 2000
   put it into temp
   close file varFileToRead
   lock screen
   repeat with i = 1 to 2000
      if charToNum(char i of temp) = 10 then
         put i && ":" && charToNum(char i of temp) && char i of temp after field 2
      else
         put i && ":" && charToNum(char i of temp) && char i of temp & return after field 2
      end if
   end repeat
end mouseUp
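The same check can be automated instead of read off by eye. This is a rough Python sketch, not Rev; detect_line_ending is a hypothetical helper, and it only classifies whatever sample it is given (for a file, the first 2000 bytes read in binary mode, as in the script above).

```python
def detect_line_ending(sample):
    """Classify the line ending found in a sample of raw bytes."""
    if b"\r\n" in sample:
        return "CRLF (Windows)"
    if b"\n" in sample:
        return "LF (Unix)"
    if b"\r" in sample:
        return "CR (classic Mac)"
    return "none found"

def detect_file_line_ending(path, sample_size=2000):
    """Read the first sample_size bytes of a file in binary mode and classify them."""
    with open(path, "rb") as f:
        return detect_line_ending(f.read(sample_size))

print(detect_line_ending(b"<page>\r\n<title>A</title>\r\n"))
print(detect_line_ending(b"<page>\n<title>A</title>\n"))
```

A sample that happens to end between the CR and the LF of its only line break would be reported as CR, so a larger sample is safer for files with very long lines.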
So the script works; it just has to take the line delimiter into account, as you suspected earlier. If you go for a binary read you do not have to worry about what the delimiter is, and it is even a little faster: you always add just 1 for each line,
as in
Code:
open file varFileToRead for binary read
I hope I didn't confuse you as much as this confused me while looking into it.
You should hopefully be able to index your file now.
regards
Bernd
Posted: Sat Sep 19, 2009 7:08 pm
by Garrett
Bernd.... I can't thank you enough! Your code further above seems to have hit the nail right on the head. Not only are character positions coming up exactly as they should, the code is still blazing fast.
I did a few tests looping for 100, 200, 500, 1000 and 10000 times, checking various entries in the resulting index and all was coming up as it should. The last test of 10000 times took only 11 minutes to process on my pc which read and processed near half of the 23 gig file!
File size in bytes:
24,560,823,780
Chars read in:
10,002,099,113
(characters would represent bytes, wouldn't they? My mind fails me sometimes, so maybe I'm spacing out on this.)
Time Started:
10:45:03 AM
Complete:
10:55:56 AM
So if I'm not spacing out here, the end result should be that the entire file will be indexed in under 30 minutes... That's a far cry from what I started with and even expected. I was aiming to at least match that of the other programs I had tried, all of which took about 3 hours to process their own index files.
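The estimate can be checked from the figures above. This small Python calculation assumes one char equals one byte and that throughput stays constant for the rest of the file; both are simplifications, and the variable names are only for illustration.

```python
def to_seconds(hours, minutes, seconds):
    """Convert a clock time to seconds since midnight."""
    return hours * 3600 + minutes * 60 + seconds

file_size = 24_560_823_780    # file size in bytes, from the post
chars_read = 10_002_099_113   # chars read in the 10000-pass test

elapsed = to_seconds(10, 55, 56) - to_seconds(10, 45, 3)  # seconds for the test run
rate = chars_read / elapsed                               # chars per second
est_minutes = file_size / rate / 60                       # projected total run time

print(f"elapsed: {elapsed} s")
print(f"rate: {rate / 1_000_000:.1f} million chars/s")
print(f"estimated full run: {est_minutes:.1f} minutes")
```

Under those assumptions the projection lands a little below the 30 minute mark mentioned above.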
Well... now for the final test then..... Letting it run its course through the entire file.
Back in a little while.
~Garrett
Posted: Sat Sep 19, 2009 7:59 pm
by Garrett
Final run results!!!!
* drum roll please *
It took a mere 32 minutes total to completely index the entire 23 gig XML file. The end result is an index listing 9,013,937 entries, of which only around 3 million are actual articles; the rest are extraneous entries which serve as redirects to the articles or are image references or category references. In my final version I'll be implementing a filter which will weed out as much of the extraneous stuff as possible. The index file only weighs in at 312 MB.
This type of result is all thanks to every one of you who helped me out on this project. Without all of you helping me out, I'd still be sitting here for days indexing this monster.
Not only did you help me beat my goal of indexing in 3 hours, you beat my goal by 2 and a half hours!
I will eventually be making this entire project available in source for anyone else interested.
I still have many things to be done, such as automatic downloading of the 5 gig zipped file of the database, uncompress it and such. I want to see if I can also add support for pausing the download and such.
I may also attempt to write code to convert the XML file to hopefully something more compact than 23 gigs.
I had already started on the search/viewer part of this, so already have made a lot of headway on that side.. Parsing the data from the articles and presenting them in a minimal formatted view is near done.
Again.... Thank you all so very much for the help you have given me on this. You guys seriously went out of your way to help me and I can't thank you enough for that.

Posted: Sat Sep 19, 2009 11:28 pm
by bn
Garrett,
glad it worked out for you.
I never thought you would get it down to 32 minutes on a 23 GB file.
Congratulations!
regards
Bernd