What is the best/fastest way to extract strings of text?
Posted: Tue Aug 02, 2011 8:22 am
by keithglong
Hello,
I am still playing with LiveCode and am now exploring chunks...
My question is as follows. Suppose I have a variable with a lot of text. Throughout the text I have various strings, separated by consistent tags, that I need to extract.
For example, the following text is in the variable myVar:
The boy <#B>went to the store<#E>. He enjoyed his day out.
The woman loves <#B>shopping at the mall<#E>. So do I.
The girl loves <#B>eating at the restaurant<#E>. So does he.
I am looking for the most efficient way to extract each of the strings of text between the <#B> and <#E> tags... I presume I will have to use a loop and the matchChunk function? I have experimented but am having a problem putting starting and ending positions into variables.
Is there a better way to accomplish the above?
Thanks!
- Boo
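A minimal sketch of the matchChunk idea raised in the question, assuming the tags always come in matched pairs. The function name extractWithMatchChunk and the variable names are made up for illustration. matchChunk fills the position variables with the start and end of the parenthesized group, and "(?s)" lets the group span line breaks:

```livecode
function extractWithMatchChunk pText
   local tStart, tEnd, tResult
   -- matchChunk returns true while the pattern is found and puts the
   -- character positions of the captured group into tStart and tEnd
   repeat while matchChunk(pText, "(?s)<#B>(.*?)<#E>", tStart, tEnd)
      put char tStart to tEnd of pText & return after tResult
      -- drop everything up to and including the 4-char closing tag,
      -- so the next pass finds the next tagged string
      delete char 1 to tEnd + 4 of pText
   end repeat
   delete char -1 of tResult -- remove the trailing return
   return tResult
end extractWithMatchChunk
```

It could be called as: put extractWithMatchChunk(myVar) into field "theExtracts". Deleting from the front of a long text on every pass is likely to be slow, so this is meant to show how the position variables work rather than to be the fastest option.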
Re: What is the best/fastest way to extract strings of text?
Posted: Tue Aug 02, 2011 9:11 am
by jmburnod
Hi keithglong,
You can use offset in a loop (43 ticks for 100000 repeats):
Code:
on mouseUp
   put the ticks into old
   repeat 100000 -- to see the faster way
      put fld "myField" into pText
      put "<#B>" into tStart
      put "<#E>" into tEnd
      put the length of tStart into tLength
      put the itemdel into OID
      set the itemdel to "#"
      put empty into tExtract
      repeat for each line tLine in pText
         if tLine = empty then
            next repeat
         end if
         put offset(tStart,tLine) into depText
         if depText <> 0 then
            put offset(tEnd,tLine) into endText
            -- text between the end of the opening tag and the start of the closing tag
            put char (depText + tLength) to (endText - 1) of tLine & return after tExtract
         end if
      end repeat
      set the itemdel to OID
      delete char -1 of tExtract -- drop the trailing return
   end repeat
   put tExtract into fld "theExtracts"
   put the ticks - old
end mouseUp
Or use replace before the loop (44 ticks for 100000 repeats):
Code:
on mouseUp
   put the ticks into old
   put the itemdel into OID -- save the delimiter once, before it is changed
   repeat 100000 -- to see the faster way
      put fld "myField" into pText
      put "<#B>" into tStart
      put "<#E>" into tEnd
      -- turn both tags into the same single-char delimiter
      replace tStart with numtochar(1) in pText
      replace tEnd with numtochar(1) in pText
      set the itemdel to numtochar(1)
      put empty into tExtract
      repeat for each line tLine in pText
         if tLine = empty then
            next repeat
         end if
         -- item 2 is the text that was between the two tags
         put item 2 of tLine & return after tExtract
      end repeat
   end repeat
   set the itemdel to OID
   delete char -1 of tExtract -- drop the trailing return
   put tExtract into fld "theExtracts"
   put the ticks - old
end mouseUp
Best regards
Jean-Marc
Re: What is the best/fastest way to extract strings of text?
Posted: Wed Aug 03, 2011 7:39 am
by keithglong
Thanks!
Re: What is the best/fastest way to extract strings of text?
Posted: Wed Aug 03, 2011 6:14 pm
by BvG
If you can ensure that the data is correct (no empty tags), this code works equally fast in my test:
Code:
on mouseUp
   put the milliseconds into theTime
   put field 1 into theText
   set the itemdelimiter to "#"
   put "" into theResult
   repeat for each item theItem in theText
      add one to theCount
      if theCount mod 2 = 0 then
         put char 3 to -2 of theItem & return after theResult
      end if
   end repeat
   delete char -1 of theResult
   put the milliseconds - theTime & return & return & theResult
end mouseUp
Milliseconds are more precise than ticks. Also, the given script does not actually store all results; instead it throws the results away 100000 times. But not doing those put-afters actually has a huge impact on speed (about doubling the time taken), so I removed the time-inflating repeats and instead created a field with 800000 lines containing your example lines.
My script takes between 970 and 990 ms, while Jean-Marc's edited version takes about 960 to 990 ms. The variations are most likely due to background noise rather than to the scripts changing in speed.
I also tried a version with arrays (split theText by "#", then my "mod 2 & item" approach on each key), but that one took about 1650 to 1670 ms, so almost twice as long as the others.
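The array version mentioned above is not shown in the thread. The following is only a guess at what such a version might look like, under the same assumption that "#" appears nowhere outside the tags. The function name extractViaArray is hypothetical:

```livecode
function extractViaArray pText
   local tResult, tCount, i
   -- split turns the text into an array keyed 1, 2, 3, ...
   -- with one element per "#"-delimited piece
   split pText by "#"
   put the number of elements in pText into tCount
   -- every second piece looks like "B>wanted text<"
   repeat with i = 2 to tCount step 2
      put char 3 to -2 of pText[i] & return after tResult
   end repeat
   delete char -1 of tResult -- remove the trailing return
   return tResult
end extractViaArray
```

The repeated key lookups are one plausible reason an approach like this would be slower than walking the items directly with "repeat for each item", but that is not something the thread confirms.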
Here is the modified version of Jean-Marc's script:
Code:
on mouseUp
   put fld 1 into pText
   put the milliseconds into old
   put "<#B>" into tStart
   put "<#E>" into tEnd
   put the length of tStart into tLength
   put the itemdel into OID
   set the itemdel to "#"
   repeat for each line tLine in pText
      if tLine = empty then
         next repeat
      end if
      put offset(tStart,tLine) into depText
      if depText <> 0 then
         put offset(tEnd,tLine) into endText
         put char (depText + tLength) to (endText - 1) of tLine & return after tExtract
      end if
   end repeat
   set the itemdel to OID
   delete char -1 of tExtract
   put the milliseconds - old & return & return & tExtract
end mouseUp
Re: What is the best/fastest way to extract strings of text?
Posted: Wed Aug 03, 2011 7:06 pm
by jmburnod
Hi BvG,
Thanks once more for your comments.
They are very useful for me.
and
is a clever solution in this case
Best regards
Jean-Marc
Re: What is the best/fastest way to extract strings of text?
Posted: Wed Aug 03, 2011 8:52 pm
by keithglong
Hi Guys,
If I am not mistaken, the methods above will not extract strings containing lines (i.e., carriage returns). In the example text below, I think that only a part of each string will actually be extracted (i.e., not the entire string in the tags if the string contains lines):
The boy <#B>went to the store
and had
a lot of fun<#E>.
He enjoyed his day out.
The woman loves <#B>shopping at
the mall<#E>. So do I.
The girl loves <#B>eating at the restaurant<#E>. So does he.
(For the first two tagged strings above, I think only "went to the store" and "shopping at" will be extracted.)
Here is another way that extracts all of the strings (maintaining any lines):
Code:
on mouseUp
   put fld 1 into myVar
   put empty into varExtracted
   repeat
      put offset("<#B>",myVar) into varBegin
      if varBegin = 0 then
         -- no more opening tags: report what was collected and stop
         if varExtracted is empty then
            answer "No strings found!"
         else
            answer varExtracted
         end if
         exit repeat
      end if
      put varBegin + 4 into varBegin -- first char after the opening tag
      put offset("<#E>",myVar) into varEnd
      put varEnd - 1 into varEnd -- last char before the closing tag
      get char varBegin to varEnd of myVar
      put it after varExtracted
      put CR after varExtracted
      -- remove the extracted string together with both tags
      delete char varBegin - 4 to varEnd + 4 of myVar
   end repeat
end mouseUp
Cheers!
- Boo
Re: What is the best/fastest way to extract strings of text?
Posted: Wed Aug 03, 2011 11:11 pm
by BvG
Nope, my method ignores all chars unless they're #, so lines work fine.
Re: What is the best/fastest way to extract strings of text?
Posted: Wed Aug 03, 2011 11:36 pm
by keithglong
Cool. (But what happens if there are other # characters in the text that are not part of tags?)
Try this text with your example:
The boy <#B>went to the store<#E>. He enjoyed his day out.
The #1 woman loves <#B>shopping at the mall<#E>. So do I.
The girl loves <#B>eating at the #1 restaurant<#E>. So does he.
Again, new to this. Just experimenting and learning...
Cheers,
- Boo
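One way around stray # characters, sketched here by combining two ideas already in the thread (replacing the tags with numToChar(1), and taking every second item): the whole tag becomes the delimiter, so a lone # in the text is just ordinary content. This assumes the text never contains the character numToChar(1) and that the tags strictly alternate. The function name extractTagged is hypothetical:

```livecode
function extractTagged pText
   local tResult, tCount
   -- both tags become the same single-char delimiter
   replace "<#B>" with numToChar(1) in pText
   replace "<#E>" with numToChar(1) in pText
   set the itemDelimiter to numToChar(1)
   put 0 into tCount
   repeat for each item tItem in pText
      add 1 to tCount
      -- even-numbered items are the pieces that sat between the tags
      if tCount mod 2 = 0 then put tItem & return after tResult
   end repeat
   delete char -1 of tResult -- remove the trailing return
   return tResult
end extractTagged
```

Because it walks items rather than lines, this sketch should also keep line breaks inside a tagged string. The itemDelimiter is local to the handler, so it does not need to be restored afterwards.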
Re: What is the best/fastest way to extract strings of text?
Posted: Thu Aug 04, 2011 12:14 am
by bn
Hi Boo,
Code:
on mouseUp
   put the milliseconds into theTime
   put field 1 into theText
   put "<#B>" into tStartChunk
   put length (tStartChunk) into tStartChunkLength
   put "<#E>" into tEndChunk
   put 0 into tStartOff
   put 0 into tEndOff
   put 0 into tCharsToSkip
   put "" into theResult
   repeat
      put offset (tStartChunk,theText,tCharsToSkip) into tStartOff
      if tStartOff = 0 then exit repeat -- not found -> exit
      add tStartOff to tCharsToSkip
      put offset (tEndChunk, theText, tCharsToSkip) into tEndOff
      put char (tCharsToSkip + tStartChunkLength) to (tCharsToSkip + tEndOff -1) of theText & cr after theResult
   end repeat
   delete char -1 of theResult
   put the milliseconds - theTime & return & return & theResult into field 2
end mouseUp
This will collect everything between the startChunk and the endChunk, as long as they alternate.
This is about 5 percent slower than Björnke's solution, but it can be used for arbitrary start/endChunks (again, only if they alternate).
Kind regards
Bernd
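A possible variation on the offset loop above, sketched as a reusable function that stops when an opening tag has no closing tag after it, instead of relying on the tags always alternating. The function name extractBetween and its parameter names are made up for illustration:

```livecode
function extractBetween pText, pStartTag, pEndTag
   local tSkip, tStartOff, tEndOff, tResult
   put 0 into tSkip
   repeat
      -- offset counts from the char after the skipped ones
      put offset(pStartTag, pText, tSkip) into tStartOff
      if tStartOff = 0 then exit repeat -- no further opening tag
      -- tSkip now points at the last char of the opening tag
      add tStartOff + length(pStartTag) - 1 to tSkip
      put offset(pEndTag, pText, tSkip) into tEndOff
      if tEndOff = 0 then exit repeat -- opening tag without a closing tag
      put char (tSkip + 1) to (tSkip + tEndOff - 1) of pText & return after tResult
      -- continue searching after the closing tag
      add tEndOff + length(pEndTag) - 1 to tSkip
   end repeat
   delete char -1 of tResult -- remove the trailing return
   return tResult
end extractBetween
```

It could be called as: put extractBetween(field 1, "<#B>", "<#E>") into field 2. Skipping past each closing tag before the next search means a closing tag is never matched twice, at the cost of one extra addition per pass.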