Trying to group phrases in text...

RevDevelopment
Posts: 13
Joined: Mon Feb 16, 2009 7:39 pm

Post by RevDevelopment » Mon Feb 16, 2009 7:47 pm

I'm trying to take a block of text and grab all the two-word and three-word phrases from it.

So, say for example I have this chunk of text:

Mr. Dennis went to the flower shop to find a bucket of roses.

I was hoping to find a way to grab all the two-word phrases...and then grab all the three-word phrases.

Example:

2 Word Phrases:

Mr. Dennis
Dennis went
went to
to the
the flower
flower shop
shop to
to find
find a
a bucket
bucket of
of roses

3 Word Phrases:

Mr. Dennis went
Dennis went to
went to the
to the flower
the flower shop
flower shop to
shop to find
to find a
find a bucket
a bucket of
bucket of roses

I've got some of the first steps and am trying to put this all in an array...but the code for looping this in Revolution has got me stumped.

Thanks for any tips/advice.

Janschenkel
VIP Livecode Opensource Backer
Posts: 977
Joined: Sat Apr 08, 2006 7:47 am

Post by Janschenkel » Tue Feb 17, 2009 6:10 am

Ah, the perfect way to kick-start your brain in the morning whilst eating breakfast :-) Off the top of my head:

Code:

on mouseUp
   put field 1 into theText
   --
   put the number of words in theText into theWordCount
   put 0 into theTwoWordCount
   put 0 into theThreeWordCount
   repeat with theWordIndex = 1 to theWordCount
      if theWordIndex > 1 then
         add 1 to theTwoWordCount
         put word theWordIndex - 1 to theWordIndex of theText into theTwoWordArray[theTwoWordCount]
      end if
      if theWordIndex > 2 then
         add 1 to theThreeWordCount
         put word theWordIndex - 2 to theWordIndex of theText into theThreeWordArray[theThreeWordCount]
      end if
   end repeat
   --
   combine theTwoWordArray using return
   put theTwoWordArray into field 2
   combine theThreeWordArray using return
   put theThreeWordArray into field 3
end mouseUp

If you don't need to preserve the number of spaces between words, you can probably optimize this with something like:

Code:

on mouseUp
   put field 1 into theText
   --
   put 0 into theWordIndex
   put 0 into theTwoWordCount
   put 0 into theThreeWordCount
   put empty into theTwoWordBuffer
   put empty into theThreeWordBuffer
   repeat for each word theWord in theText
      add 1 to theWordIndex
      if theWordIndex = 1 then
         put theWord into theTwoWordBuffer
         put theWord into theThreeWordBuffer
      else 
         put space & theWord after theTwoWordBuffer
         put space & theWord after theThreeWordBuffer
         if theWordIndex > 1 then
            add 1 to theTwoWordCount
            if theWordIndex > 2 then
               delete word 1 of theTwoWordBuffer
            end if
            put theTwoWordBuffer into theTwoWordArray[theTwoWordCount]
         end if
         if theWordIndex > 2 then
            add 1 to theThreeWordCount
            if theWordIndex > 3 then
               delete word 1 of theThreeWordBuffer
            end if
            put theThreeWordBuffer into theThreeWordArray[theThreeWordCount]
         end if
      end if
   end repeat
   --
   combine theTwoWordArray using return
   put theTwoWordArray into field 2
   combine theThreeWordArray using return
   put theThreeWordArray into field 3
end mouseUp

While you may not see much difference in execution time between the two approaches with short texts, the speed advantage of the second approach will become noticeable as the texts grow.

HTH,

Jan Schenkel.
Quartam Reports & PDF Library for LiveCode
www.quartam.com

Post by RevDevelopment » Tue Feb 17, 2009 6:40 pm

:o Holy smokes...you rock!

Off to test this...

Are you available for offline consultation of Revolution? (Am I even allowed to ask this in the forum???...sorry if no)

mwieder
VIP Livecode Opensource Backer
Posts: 3581
Joined: Mon Jan 22, 2007 7:36 am

Post by mwieder » Tue Feb 17, 2009 7:41 pm

Jan does indeed rock. Now that the coffee's hit I rise to the occasion and post an alternative way to do things:

Code:

    repeat for each word tWord in field "fldText"
        put tWord & space after tTwoWords
        put tWord & space after tThreeWords
        if the number of words in tTwoWords is 2 then
            put tTwoWords after tTwoWordList
            put cr into char -1 of tTwoWordList
            delete word 1 of tTwoWords
        end if
        if the number of words in tThreeWords is 3 then
            put tThreeWords after tThreeWordList
            put cr into char -1 of tThreeWordList
            delete word 1 of tThreeWords
        end if
    end repeat

Post by RevDevelopment » Tue Feb 17, 2009 8:12 pm

Awesome....you both rock.

Now to perform a little speed test and see who Super Rocks :)

Post by RevDevelopment » Tue Feb 17, 2009 8:40 pm

@mwieder

Ok...so I've hit just a small snag with the code.

It works perfectly...except I was hoping to place the values in an array so that I could reference them individually.

I tried this:

Code:

on mouseUp
   local loopvalue
   put empty into loopvalue
   
   repeat for each word tWord in field "fldText"
      add 1 to loopvalue
      put tWord & space after tTwoWords
      put tWord & space after tThreeWords
      if the number of words in tTwoWords is 2 then
         put tTwoWords into tTwoWordList[loopvalue]
         put cr into char -1 of tTwoWordList
         delete word 1 of tTwoWords
      end if
      if the number of words in tThreeWords is 3 then
         put tThreeWords into tThreeWordList[loopvalue]
         put cr into char -1 of tThreeWordList
         delete word 1 of tThreeWords
      else
      end if
   end repeat 
   
   answer tTwoWordList[1]
end mouseUp

But that's not working...what is wrong with my array here?

Thanks!

Post by RevDevelopment » Tue Feb 17, 2009 8:59 pm

@Jan,
That top code worked perfectly...thanks for your help on this.

SparkOut
Posts: 2947
Joined: Sun Sep 23, 2007 4:58 pm

Post by SparkOut » Tue Feb 17, 2009 9:13 pm

I have played with this a little just for academic reasons and Mark's is the super rocker from my trials. I was doing the same sort of process in a "repeat for each word" loop. Given a source text of the "Dennis went" sentence repeated a bunch of times (about 50 or 60) the results were:

Mark's "repeat for each": 8 to 14 milliseconds

My "repeat for each" (similar to Mark's but not as efficient, obviously): 14 to 19 milliseconds.

Jan's routine 1: 16 to 21 milliseconds

Jan's routine 2: 28 to 36 milliseconds

Jan's routines seem to give some odd results too. I'm getting peculiar gaps in the list of two and three words - is that because of the combining of the arrays?


Just out of curiosity, to see how much slower a "repeat with i = 1 to..." loop is than the "repeat for each" approach, I tried that too, and the results were 28 to 31 milliseconds.
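
For anyone who wants to repeat that comparison, here is a minimal timing sketch. It assumes the source text sits in field 1, and every variable name in it is made up for illustration. The gap should widen on longer texts, because each "word i of tText" lookup has to count words from the start of the text, while "repeat for each" walks the text once.

```livecode
on mouseUp
   local tText, tStart, tWord, tCount, i, tIndexedTime, tForEachTime
   put field 1 into tText   -- assumption: the source text is in field 1
   
   -- indexed loop: every "word i of tText" counts words from the start
   put the milliseconds into tStart
   repeat with i = 1 to the number of words in tText
      put word i of tText into tWord
      add 1 to tCount
   end repeat
   put the milliseconds - tStart into tIndexedTime
   
   -- "repeat for each" visits each word in a single pass
   put 0 into tCount
   put the milliseconds into tStart
   repeat for each word tWord in tText
      add 1 to tCount
   end repeat
   put the milliseconds - tStart into tForEachTime
   
   answer "repeat with:" && tIndexedTime && "ms" & cr & \
         "repeat for each:" && tForEachTime && "ms"
end mouseUp
```

Timings on a short sample are noisy, so it is worth running this several times and on a larger text before drawing conclusions.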

Post by SparkOut » Tue Feb 17, 2009 9:20 pm

And to get the array - just leave Mark's script as it is, don't add the loop counters or anything. Just let it build up the list and then at the very end

Code:

split tTwoWordList by cr
split tThreeWordList by cr

That will give you two arrays each numerically indexed from 1.
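
A minimal sketch of that step on its own, using a hard-coded three-line list in place of the loop output (the sample data is only for illustration):

```livecode
on mouseUp
   local tTwoWordList
   -- stand-in for the return-delimited list the loop builds
   put "Mr. Dennis" & cr & "Dennis went" & cr & "went to" into tTwoWordList
   
   -- turn the list into an array keyed 1, 2, 3
   split tTwoWordList by cr
   
   answer tTwoWordList[2]   -- shows "Dennis went"
end mouseUp
```

This also hints at why the earlier array attempt came up empty. The counter in that script advances on every word, so the first two-word phrase is stored under key 2 and tTwoWordList[1] is never set. On top of that, "put cr into char -1 of tTwoWordList" treats the array as text, which as far as I know replaces the array with a plain string.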

Post by SparkOut » Tue Feb 17, 2009 9:22 pm

Code:

on mouseUp
   put the milliseconds into tNow
   repeat for each word tWord in field "Field1" 
      put tWord & space after tTwoWords 
      put tWord & space after tThreeWords 
      if the number of words in tTwoWords is 2 then 
         put tTwoWords after tTwoWordList 
         put cr into char -1 of tTwoWordList 
         delete word 1 of tTwoWords 
      end if 
      if the number of words in tThreeWords is 3 then 
         put tThreeWords after tThreeWordList 
         put cr into char -1 of tThreeWordList 
         delete word 1 of tThreeWords 
      end if 
   end repeat
   split tTwoWordList by cr
   split tThreeWordList by cr
   put tTwoWordList[1] into field "Field2"
   put tThreeWordList[1] into field "Field3"
   put the milliseconds - tNow into field "fldTime"
end mouseUp

Mark's version above looks like the best result to me, based on the (relatively small) text sample. YMMV with a different sample text length, so you might need to do some more experiments of your own.
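
The same sliding-window idea can be wrapped in a function that handles any phrase length. This is only a sketch: the handler name phraseList and its parameters are hypothetical, not something posted earlier in the thread.

```livecode
-- Returns a return-delimited list of every pSize-word phrase in pText.
-- phraseList, pText and pSize are illustrative names, not built-ins.
function phraseList pText, pSize
   local tWindow, tList, tWord
   repeat for each word tWord in pText
      put tWord & space after tWindow
      if the number of words in tWindow is pSize then
         -- copy the window without its trailing space
         put (char 1 to -2 of tWindow) & cr after tList
         -- slide the window forward by one word
         delete word 1 of tWindow
      end if
   end repeat
   delete char -1 of tList   -- drop the final return
   return tList
end phraseList
```

It could be called as "put phraseList(field "Field1", 2) into field "Field2"" for two-word phrases, and with 3 for three-word phrases. Calling it once per phrase length means one pass over the text each time, which is a simplicity trade-off against the single combined loop above.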

Post by RevDevelopment » Tue Feb 17, 2009 9:25 pm

SparkOut,
Thanks so much for this info!

I'm going to test this out now...

Post by mwieder » Wed Feb 18, 2009 10:45 pm

Well, I wasn't really aiming for speed here, just that the overhead of putting everything into an array along the way seemed unnecessary. Out of curiosity, why do you want the results to end up in an array rather than just keeping them in a variable? Seems like you could use the line number just as easily as the array key. I don't see the advantage, so I must be missing out on something.

Post by RevDevelopment » Wed Feb 18, 2009 10:48 pm

mwieder,
There probably isn't an advantage...it's just that I wasn't aware you could call a line of text with a line number in Revolution.

Good idea...I'll try that.

Post by mwieder » Wed Feb 18, 2009 11:13 pm

put line x of tVariable into tLine

Arrays are one of the most powerful features of rev. Normally I would think of using arrays for speed, I just didn't see that there was anything to be gained in this case with numeric keys.
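
A short sketch of working with the list by line number, using made-up sample data in place of the real loop output:

```livecode
on mouseUp
   local tTwoWordList, tPhrase, tLongest
   -- stand-in for the return-delimited list built by the loop
   put "Mr. Dennis" & cr & "Dennis went" & cr & "went to" into tTwoWordList
   
   -- fetch a single phrase by its line number
   answer line 2 of tTwoWordList   -- shows "Dennis went"
   
   -- count the phrases
   answer the number of lines of tTwoWordList   -- shows 3
   
   -- visit every phrase in order, here to find the longest one
   repeat for each line tPhrase in tTwoWordList
      if the length of tPhrase > the length of tLongest then
         put tPhrase into tLongest
      end if
   end repeat
   answer tLongest   -- shows "Dennis went"
end mouseUp
```

One caveat: fetching "line x" of a very long list means counting lines from the top each time, so for many random lookups on a large list an array key may still be the quicker route.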

Post by RevDevelopment » Wed Feb 18, 2009 11:59 pm

Awesome...thanks so much for your help!

I really appreciate it.
