What is the best/fastest way to extract strings of text?


Post by keithglong » Tue Aug 02, 2011 8:22 am

Hello,

I am still playing with LiveCode and am now exploring chunks...

My question is as follows. Suppose I have a variable with a lot of text. Throughout the text I have various strings, separated by consistent tags, that I need to extract.

For example, the following text is in the variable myVar:

The boy <#B>went to the store<#E>. He enjoyed his day out.

The woman loves <#B>shopping at the mall<#E>. So do I.

The girl loves <#B>eating at the restaurant<#E>. So does he.


I am looking for the most efficient way to extract each of the strings of text between the <#B> and <#E> tags... I presume I will have to use a loop and the matchChunk function? I have experimented but am having a problem putting starting and ending positions into variables.

Is there a better way to accomplish the above?
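On the matchChunk part of the question: a minimal sketch of how the start and end positions could be captured in variables. The function name extractWithMatchChunk and all variable names are made up for illustration, and the tags are assumed to be exactly "<#B>" and "<#E>".

```livecode
function extractWithMatchChunk pText
   local tSkip, tStart, tEnd, tResult
   put 0 into tSkip
   -- (?s) lets "." match line breaks; (.*?) is a non-greedy capture group
   repeat while matchChunk(char (tSkip + 1) to -1 of pText, "(?s)<#B>(.*?)<#E>", tStart, tEnd)
      -- tStart and tEnd are relative to the substring that was searched,
      -- so tSkip is added back to address the original text
      put char (tSkip + tStart) to (tSkip + tEnd) of pText & return after tResult
      add (tEnd + 4) to tSkip -- continue after the 4-character end tag
   end repeat
   delete char -1 of tResult -- drop the trailing return
   return tResult
end extractWithMatchChunk
```

It could be called as put extractWithMatchChunk(myVar) into tExtracts. Because the remaining text is copied on every pass, this is likely to be slower than an offset-based loop on large inputs.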

Thanks!

- Boo


Post by jmburnod » Tue Aug 02, 2011 9:11 am

Hi keithglong,

You can use offset in a loop (43 ticks for 100000 repetitions):

Code: Select all

on mouseUp
   put the ticks into old
   repeat 100000 -- repeat the whole extraction to get a measurable time
      put fld "myField" into pText
      put "<#B>" into tStart
      put "<#E>" into tEnd
      put the length of tStart into tlength
      put the itemdel into OID -- saved and restored below; offset itself does not use the itemdel
      set the itemdel to "#"
      put empty into tExtract -- the result is rebuilt on every pass
      repeat for each line tLine in pText
         if tLine = empty then
            next repeat
         end if
         put offset(tStart,tLine) into depText -- position of the start tag
         if depText <> 0 then
            put offset(tEnd,tLine) into endText -- position of the end tag
            put char (depText + tlength) to (endText - 1) of tLine & return after tExtract
         end if
      end repeat
      set the itemdel to OID
      delete char -1 of tExtract -- drop the trailing return
   end repeat
   put tExtract into fld "theExtracts"
   put the ticks - old
end mouseUp

Or use replace before the loop (44 ticks for 100000 repetitions):

Code: Select all

on mouseUp
   put the ticks into old
   put the itemdel into OID -- save once, before the timing loop changes it
   repeat 100000 -- repeat the whole extraction to get a measurable time
      put fld "myField" into pText
      put "<#B>" into tStart
      put "<#E>" into tEnd
      -- turn both tags into one delimiter character (assumed not to occur in the text)
      replace tStart with numToChar(1) in pText
      replace tEnd with numToChar(1) in pText
      set the itemdel to numToChar(1)
      put empty into tExtract
      repeat for each line tLine in pText
         if tLine = empty then
            next repeat
         end if
         put item 2 of tLine & return after tExtract -- item 2 is the text between the tags
      end repeat
   end repeat
   set the itemdel to OID
   delete char -1 of tExtract -- drop the trailing return
   put tExtract into fld "theExtracts"
   put the ticks - old
end mouseUp

Best regards

Jean-Marc
https://alternatic.ch


Post by keithglong » Wed Aug 03, 2011 7:39 am

Thanks!


Post by BvG » Wed Aug 03, 2011 6:14 pm

If you can ensure that the data is correct (no empty tags), this code is equally fast in my test:

Code: Select all

on mouseUp
   put the milliseconds into theTime
   put field 1 into theText
   set the itemdelimiter to "#"
   put "" into theResult
   put 0 into theCount
   repeat for each item theItem in theText
      add one to theCount
      -- every even item starts with "B>" and ends with "<"; strip both
      if theCount mod 2 = 0 then
         put char 3 to -2 of theItem & return after theResult
      end if
   end repeat
   delete char -1 of theResult -- drop the trailing return
   put the milliseconds - theTime & return & return & theResult
end mouseUp

Milliseconds are more precise than ticks. Also, the given script does not actually store all results; it throws them away 100000 times. Accumulating the results with those put-afters actually has a huge impact on speed (about doubling the time taken), so I removed the time-inflating repeats and instead created a field with 800000 lines containing your example lines.

My script takes between 970 and 990 ms, while Jean-Marc's edited version takes about 960 to 990 ms. The variations are most likely due to background noise rather than to the scripts differing in speed. I also tried a version with arrays (split theText by "#", then my "mod 2 & item" approach on each key), but that one took about 1650 to 1670 ms, so almost twice as long as the others.
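The array variant mentioned above was not posted. The following is only a guess at its general shape, not BvG's actual script; the function name extractBySplit and the variable names are hypothetical. It walks the numeric keys in order rather than using "repeat for each key", since key order is not guaranteed.

```livecode
function extractBySplit pText
   local tParts, tCount, tResult
   put pText into tParts
   split tParts by "#" -- tParts becomes an array with keys 1, 2, 3, ...
   put the number of lines of the keys of tParts into tCount
   -- even-numbered elements look like "B>some text<"
   repeat with i = 2 to tCount step 2
      put char 3 to -2 of tParts[i] & return after tResult
   end repeat
   delete char -1 of tResult -- drop the trailing return
   return tResult
end extractBySplit
```

Like the item-based script, this sketch assumes that "#" occurs only inside the tags.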


Here is the modified version of Jean-Marc's script:

Code: Select all

on mouseUp
   put fld 1 into pText
   put the milliseconds into old
   put "<#B>" into tStart
   put "<#E>" into tEnd
   put the length of tStart into tlength
   put the itemdel into OID
   set the itemdel to "#"
   put empty into tExtract
   repeat for each line tLine in pText
      if tLine = empty then
         next repeat
      end if
      put offset(tStart,tLine) into depText -- position of the start tag
      if depText <> 0 then
         put offset(tEnd,tLine) into endText -- position of the end tag
         put char (depText + tlength) to (endText - 1) of tLine & return after tExtract
      end if
   end repeat
   set the itemdel to OID
   delete char -1 of tExtract -- drop the trailing return
   put the milliseconds - old & return & return & tExtract
end mouseUp

Various teststacks and stuff:
http://bjoernke.com

Chat with other RunRev developers:
chat.freenode.net:6666 #livecode


Post by jmburnod » Wed Aug 03, 2011 7:06 pm

Hi BvG,

Thanks once more for your comments.
They are very useful to me.

and

Code: Select all

if theCount mod 2 = 0 then
is a clever solution in this case.

Best regards

Jean-Marc
https://alternatic.ch


Post by keithglong » Wed Aug 03, 2011 8:52 pm

Hi Guys,

If I am not mistaken, the methods above will not extract strings that span several lines (i.e., that contain carriage returns). In the example text below, I think only part of each such string will be extracted, not the entire string between the tags:

The boy <#B>went to the store

and had

a lot of fun<#E>.

He enjoyed his day out.

The woman loves <#B>shopping at

the mall<#E>. So do I.

The girl loves <#B>eating at the restaurant<#E>. So does he.



(For the first two tagged strings above, I think only "went to the store" and "shopping at" will be extracted.)

Here is another way that extracts all of the strings (preserving any line breaks):

Code: Select all

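on mouseUp
   put fld 1 into myVar
   put empty into varExtracted -- start empty rather than relying on an unset variable
   repeat
      put offset("<#B>",myVar) into varBegin
      if varBegin = 0 then exit repeat -- no more start tags
      put varBegin + 4 into varBegin -- first char after the start tag
      put offset("<#E>",myVar) into varEnd
      if varEnd < varBegin then exit repeat -- end tag missing or out of order
      put varEnd - 1 into varEnd -- last char before the end tag
      get char varBegin to varEnd of myVar
      put it after varExtracted
      put CR after varExtracted
      delete char (varBegin - 4) to (varEnd + 4) of myVar -- remove both tags and the text between them
   end repeat
   if varExtracted is empty then
      answer "No strings found!"
   else
      answer varExtracted
   end if
end mouseUp

Cheers!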

- Boo


Post by BvG » Wed Aug 03, 2011 11:11 pm

Nope, my method ignores all chars unless they're #, so multi-line strings work fine.


Post by keithglong » Wed Aug 03, 2011 11:36 pm

Cool. (But what happens if there are other # characters in the text that are not part of tags?)

Try this text with your example:

The boy <#B>went to the store<#E>. He enjoyed his day out.

The #1 woman loves <#B>shopping at the mall<#E>. So do I.

The girl loves <#B>eating at the #1 restaurant<#E>. So does he.


Again, new to this. Just experimenting and learning...

Cheers,

- Boo


Post by bn » Thu Aug 04, 2011 12:14 am

Hi Boo,

Code: Select all

on mouseUp
   put the milliseconds into theTime
   put field 1 into theText
   put "<#B>" into tStartChunk
   put length(tStartChunk) into tStartChunkLength
   put "<#E>" into tEndChunk
   put 0 into tStartOff
   put 0 into tEndOff
   put 0 into tCharsToSkip
   put "" into theResult
   
   repeat
      put offset(tStartChunk,theText,tCharsToSkip) into tStartOff
      if tStartOff = 0 then exit repeat -- not found -> exit
      add tStartOff to tCharsToSkip -- now the absolute position of the start chunk
      put offset(tEndChunk,theText,tCharsToSkip) into tEndOff -- relative to tCharsToSkip
      if tEndOff = 0 then exit repeat -- start chunk without a matching end chunk
      put char (tCharsToSkip + tStartChunkLength) to (tCharsToSkip + tEndOff - 1) of theText & cr after theResult
   end repeat
   delete char -1 of theResult -- drop the trailing return
   put the milliseconds - theTime & return & return & theResult into field 2
end mouseUp

This collects everything between the startChunk and the endChunk, as long as they alternate.

This is about 5 percent slower than Björnke's solution but can be used for arbitrary start/endChunks (again, only if they alternate).
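To illustrate the "arbitrary start/endChunks" point, here is a minimal sketch that wraps the same offset technique in a reusable function. The name extractBetween and its parameters are hypothetical, and it assumes, as above, that start and end chunks strictly alternate.

```livecode
function extractBetween pText, pStartChunk, pEndChunk
   local tSkip, tStartOff, tEndOff, tResult
   put 0 into tSkip
   repeat
      put offset(pStartChunk, pText, tSkip) into tStartOff
      if tStartOff = 0 then exit repeat -- no further start chunk
      add tStartOff to tSkip -- absolute position of the start chunk
      put offset(pEndChunk, pText, tSkip) into tEndOff -- relative to tSkip
      if tEndOff = 0 then exit repeat -- no matching end chunk
      put char (tSkip + the length of pStartChunk) to (tSkip + tEndOff - 1) of pText & cr after tResult
   end repeat
   delete char -1 of tResult -- drop the trailing return
   return tResult
end extractBetween
```

A call such as put extractBetween(field 1, "<#B>", "<#E>") into field 2 would cover the example in this thread. Since only the full chunks are searched for, stray "#" characters in the text should not matter.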

Kind regards

Bernd
