Trying to group phrases in text...

Posted: Mon Feb 16, 2009 7:47 pm
by RevDevelopment
I'm trying to take a block of text and grab all the 2 word and 3 word phrases from it.

So, say for example I have this chunk of text:

Mr. Dennis went to the flower shop to find a bucket of roses.

I was hoping to find a way to grab all the two word phrases...and then grab all the three word phrases.

Example:

2 Word Phrases:

Mr. Dennis
Dennis went
went to
to the
the flower
flower shop
shop to
to find
find a
a bucket
bucket of
of roses

3 Word Phrases:

Mr. Dennis went
Dennis went to
went to the
to the flower
the flower shop
flower shop to
shop to find
to find a
find a bucket
a bucket of
bucket of roses

I've got some of the first steps and am trying to put this all in an array...but the code for looping this in Revolution has got me stumped.

Thanks for any tips/advice.

Posted: Tue Feb 17, 2009 6:10 am
by Janschenkel
Ah, the perfect way to kick-start your brain in the morning whilst eating breakfast :-) Off the top of my head:

Code: Select all

on mouseUp
   put field 1 into theText
   --
   put the number of words in theText into theWordCount
   put 0 into theTwoWordCount
   put 0 into theThreeWordCount
   repeat with theWordIndex = 1 to theWordCount
      if theWordIndex > 1 then
         add 1 to theTwoWordCount
         put word theWordIndex - 1 to theWordIndex of theText into theTwoWordArray[theTwoWordCount]
      end if
      if theWordIndex > 2 then
         add 1 to theThreeWordCount
         put word theWordIndex - 2 to theWordIndex of theText into theThreeWordArray[theThreeWordCount]
      end if
   end repeat
   --
   combine theTwoWordArray using return
   put theTwoWordArray into field 2
   combine theThreeWordArray using return
   put theThreeWordArray into field 3
end mouseUp
If you don't need to preserve the number of spaces between words, you can probably optimize this with something like:

Code: Select all

on mouseUp
   put field 1 into theText
   --
   put 0 into theWordIndex
   put 0 into theTwoWordCount
   put 0 into theThreeWordCount
   put empty into theTwoWordBuffer
   put empty into theThreeWordBuffer
   repeat for each word theWord in theText
      add 1 to theWordIndex
      if theWordIndex = 1 then
         put theWord into theTwoWordBuffer
         put theWord into theThreeWordBuffer
      else 
         put space & theWord after theTwoWordBuffer
         put space & theWord after theThreeWordBuffer
         if theWordIndex > 1 then
            add 1 to theTwoWordCount
            if theWordIndex > 2 then
               delete word 1 of theTwoWordBuffer
            end if
            put theTwoWordBuffer into theTwoWordArray[theTwoWordCount]
         end if
         if theWordIndex > 2 then
            add 1 to theThreeWordCount
            if theWordIndex > 3 then
               delete word 1 of theThreeWordBuffer
            end if
            put theThreeWordBuffer into theThreeWordArray[theThreeWordCount]
         end if
      end if
   end repeat
   --
   combine theTwoWordArray using return
   put theTwoWordArray into field 2
   combine theThreeWordArray using return
   put theThreeWordArray into field 3
end mouseUp
While you may not see much difference in execution time between the two approaches with short texts, the speed advantage of the second approach will become noticeable as the texts grow.

HTH,

Jan Schenkel.

Posted: Tue Feb 17, 2009 6:40 pm
by RevDevelopment
:o Holy smokes...you rock!

Off to test this...

Are you available for offline consultation on Revolution? (Am I even allowed to ask this in the forum???...sorry if not)

Posted: Tue Feb 17, 2009 7:41 pm
by mwieder
Jan does indeed rock. Now that the coffee's hit I rise to the occasion and post an alternative way to do things:

Code: Select all

    repeat for each word tWord in field "fldText"
        put tWord & space after tTwoWords
        put tWord & space after tThreeWords
        if the number of words in tTwoWords is 2 then
            put tTwoWords after tTwoWordList
            put cr into char -1 of tTwoWordList
            delete word 1 of tTwoWords
        end if
        if the number of words in tThreeWords is 3 then
            put tThreeWords after tThreeWordList
            put cr into char -1 of tThreeWordList
            delete word 1 of tThreeWords
        end if
    end repeat

Posted: Tue Feb 17, 2009 8:12 pm
by RevDevelopment
Awesome....you both rock.

Now to perform a little speed test and see who Super Rocks :)

Posted: Tue Feb 17, 2009 8:40 pm
by RevDevelopment
@mwieder

Ok...so I've hit just a small snag with the code.

It works perfectly...except I was hoping to place the values in an array so that I could reference them individually.

I tried this:

Code: Select all

on mouseUp
   local loopvalue
   put empty into loopvalue
   
   repeat for each word tWord in field "fldText"
      add 1 to loopvalue
      put tWord & space after tTwoWords
      put tWord & space after tThreeWords
      if the number of words in tTwoWords is 2 then
         put tTwoWords into tTwoWordList[loopvalue]
         put cr into char -1 of tTwoWordList
         delete word 1 of tTwoWords
      end if
      if the number of words in tThreeWords is 3 then
         put tThreeWords into tThreeWordList[loopvalue]
         put cr into char -1 of tThreeWordList
         delete word 1 of tThreeWords
      else
      end if
   end repeat 
   
   answer tTwoWordList[1]
end mouseUp
But that's not working...what is wrong with my array here?

Thanks!

Posted: Tue Feb 17, 2009 8:59 pm
by RevDevelopment
@Jan,
That top code worked perfectly...thanks for your help on this.

Posted: Tue Feb 17, 2009 9:13 pm
by SparkOut
I have played with this a little just for academic reasons and Mark's is the super rocker from my trials. I was doing the same sort of process in a "repeat for each word" loop. Given a source text of the "Dennis went" sentence repeated a bunch of times (about 50 or 60) the results were:

Mark's "repeat for each": 8 to 14 milliseconds

My "repeat for each" (similar to Mark's but not as efficient, obviously): 14 to 19 milliseconds.

Jan's routine 1: 16 to 21 milliseconds

Jan's routine 2: 28 to 36 milliseconds

Jan's routines seem to give some odd results too. I'm getting peculiar gaps in the list of two and three words - is that because of the combining of the arrays?


Just out of curiosity, to see how much slower a "repeat with i = 1 to..." loop is than the "repeat for each" approach, I tried that too, and the results were 28 to 31 milliseconds.

Posted: Tue Feb 17, 2009 9:20 pm
by SparkOut
And to get the array - just leave Mark's script as it is, don't add the loop counters or anything. Just let it build up the list and then at the very end

Code: Select all

split tTwoWordList by cr
split tThreeWordList by cr
That will give you two arrays each numerically indexed from 1.
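As a small illustration of what split does to a return-delimited list, here is a minimal sketch. The variable name tPhrases and the sample phrases are made up for the example and are not part of Mark's script:

```livecode
on mouseUp
   -- tPhrases is a hypothetical variable used only for this illustration
   put "Mr. Dennis" & cr & "Dennis went" & cr & "went to" into tPhrases
   -- turn the three lines into an array with the numeric keys 1, 2 and 3
   split tPhrases by cr
   -- each phrase can now be referenced individually by its key
   answer tPhrases[2]
end mouseUp
```

With this sample text the answer dialog should show "Dennis went", the second line of the original list.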

Posted: Tue Feb 17, 2009 9:22 pm
by SparkOut

Code: Select all

on mouseUp
   put the milliseconds into tNow
   repeat for each word tWord in field "Field1" 
      put tWord & space after tTwoWords 
      put tWord & space after tThreeWords 
      if the number of words in tTwoWords is 2 then 
         put tTwoWords after tTwoWordList 
         put cr into char -1 of tTwoWordList 
         delete word 1 of tTwoWords 
      end if 
      if the number of words in tThreeWords is 3 then 
         put tThreeWords after tThreeWordList 
         put cr into char -1 of tThreeWordList 
         delete word 1 of tThreeWords 
      end if
   end repeat
   split tTwoWordList by cr
   split tThreeWordList by cr
   put tTwoWordList[1] into field "Field2"
   put tThreeWordList[1] into field "Field3"
   put the milliseconds - tNow into field "fldTime"
end mouseUp
Mark's version above looks like the best result to me, based on the (relatively small) text sample. YMMV with a different sample text length, so you might need to do some more experiments of your own.

Posted: Tue Feb 17, 2009 9:25 pm
by RevDevelopment
SparkOut,
Thanks so much for this info!

I'm going to test this out now...

Posted: Wed Feb 18, 2009 10:45 pm
by mwieder
Well, I wasn't really aiming for speed here, just that the overhead of putting everything into an array along the way seemed unnecessary. Out of curiosity, why do you want the results to end up in an array rather than just keeping them in a variable? Seems like you could use the line number just as easily as the array key. I don't see the advantage, so I must be missing out on something.

Posted: Wed Feb 18, 2009 10:48 pm
by RevDevelopment
mwieder,
There probably isn't an advantage...it's just that I wasn't aware you could call a line of text with a line number in Revolution.

Good idea...I'll try that.

Posted: Wed Feb 18, 2009 11:13 pm
by mwieder
put line x of tVariable into tLine

Arrays are one of the most powerful features of rev. Normally I would think of using arrays for speed; I just didn't see that there was anything to be gained in this case with numeric keys.
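To make the line-number idea concrete, here is a rough sketch of reading a return-delimited list without converting it to an array. The names tList, tCount, tLine, tPhrase and tOutput are placeholders invented for the example:

```livecode
on mouseUp
   -- tList is a hypothetical return-delimited list, one phrase per line
   put "Mr. Dennis" & cr & "Dennis went" & cr & "went to" into tList
   -- count the phrases and fetch one of them by its line number
   put the number of lines in tList into tCount
   put line 2 of tList into tLine
   -- visit every phrase in order without indexing by number
   repeat for each line tPhrase in tList
      put tPhrase & " / " after tOutput
   end repeat
   answer tCount & cr & tLine & cr & tOutput
end mouseUp
```

For a one-off lookup, "line x of" is all you need. If you are walking a long list from top to bottom, "repeat for each line" is usually the better choice, since "line x of" has to count lines from the start of the variable each time it is called.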

Posted: Wed Feb 18, 2009 11:59 pm
by RevDevelopment
Awesome...thanks so much for your help!

I really appreciate it.