Trying to group phrases in text...
Posted: Mon Feb 16, 2009 7:47 pm
by RevDevelopment
I'm trying to take a block of text and grab all the 2 word and 3 word phrases from it.
So, say for example I have this chunk of text:
Mr. Dennis went to the flower shop to find a bucket of roses.
I was hoping to find a way to grab all the two word phrases...and then grab all the three word phrases.
Example:
2 Word Phrases:
Mr. Dennis
Dennis went
went to
to the
the flower
flower shop
shop to
to find
find a
a bucket
bucket of
of roses
3 Word Phrases:
Mr. Dennis went
Dennis went to
went to the
to the flower
the flower shop
flower shop to
shop to find
to find a
find a bucket
a bucket of
bucket of roses
I've got some of the first steps and am trying to put this all in an array...but the code for looping this in Revolution has got me stumped.
Thanks for any tips/advice.
Posted: Tue Feb 17, 2009 6:10 am
by Janschenkel
Ah, the perfect way to kick-start your brain in the morning whilst eating breakfast.

Off the top of my head:
Code: Select all
on mouseUp
   put field 1 into theText
   --
   put the number of words in theText into theWordCount
   put 0 into theTwoWordCount
   put 0 into theThreeWordCount
   repeat with theWordIndex = 1 to theWordCount
      if theWordIndex > 1 then
         add 1 to theTwoWordCount
         put word theWordIndex - 1 to theWordIndex of theText into theTwoWordArray[theTwoWordCount]
      end if
      if theWordIndex > 2 then
         add 1 to theThreeWordCount
         put word theWordIndex - 2 to theWordIndex of theText into theThreeWordArray[theThreeWordCount]
      end if
   end repeat
   --
   combine theTwoWordArray using return
   put theTwoWordArray into field 2
   combine theThreeWordArray using return
   put theThreeWordArray into field 3
end mouseUp
If you don't need to preserve the number of spaces between words, you can probably optimize this with something like:
Code: Select all
on mouseUp
   put field 1 into theText
   --
   put 0 into theWordIndex
   put 0 into theTwoWordCount
   put 0 into theThreeWordCount
   put empty into theTwoWordBuffer
   put empty into theThreeWordBuffer
   repeat for each word theWord in theText
      add 1 to theWordIndex
      if theWordIndex = 1 then
         put theWord into theTwoWordBuffer
         put theWord into theThreeWordBuffer
      else
         put space & theWord after theTwoWordBuffer
         put space & theWord after theThreeWordBuffer
         if theWordIndex > 1 then
            add 1 to theTwoWordCount
            if theWordIndex > 2 then
               delete word 1 of theTwoWordBuffer
            end if
            put theTwoWordBuffer into theTwoWordArray[theTwoWordCount]
         end if
         if theWordIndex > 2 then
            add 1 to theThreeWordCount
            if theWordIndex > 3 then
               delete word 1 of theThreeWordBuffer
            end if
            put theThreeWordBuffer into theThreeWordArray[theThreeWordCount]
         end if
      end if
   end repeat
   --
   combine theTwoWordArray using return
   put theTwoWordArray into field 2
   combine theThreeWordArray using return
   put theThreeWordArray into field 3
end mouseUp
While you may not see much difference in execution time between the two approaches with short texts, the speed advantage of the second approach will become noticeable as the texts grow.
HTH,
Jan Schenkel.
Posted: Tue Feb 17, 2009 6:40 pm
by RevDevelopment

Holy smokes...you rock!
Off to test this...
Are you available for offline consultation of Revolution? (Am I even allowed to ask this in the forum???...sorry if no)
Posted: Tue Feb 17, 2009 7:41 pm
by mwieder
Jan does indeed rock. Now that the coffee's hit I rise to the occasion and post an alternative way to do things:
Code: Select all
repeat for each word tWord in field "fldText"
   -- grow both sliding windows by one word
   put tWord & space after tTwoWords
   put tWord & space after tThreeWords
   if the number of words in tTwoWords is 2 then
      put tTwoWords after tTwoWordList
      -- swap the trailing space for a return so each phrase gets its own line
      put cr into char -1 of tTwoWordList
      delete word 1 of tTwoWords
   end if
   if the number of words in tThreeWords is 3 then
      put tThreeWords after tThreeWordList
      put cr into char -1 of tThreeWordList
      delete word 1 of tThreeWords
   end if
end repeat
Posted: Tue Feb 17, 2009 8:12 pm
by RevDevelopment
Awesome....you both rock.
Now to perform a little speed test and see who Super Rocks.

Posted: Tue Feb 17, 2009 8:40 pm
by RevDevelopment
@mwieder
Ok...so I've hit just a small snag with the code.
It works perfectly...except I was hoping to place the values in an array so that I could reference them individually.
I tried this:
Code: Select all
on mouseUp
   local loopvalue
   put empty into loopvalue
   repeat for each word tWord in field "fldText"
      add 1 to loopvalue
      put tWord & space after tTwoWords
      put tWord & space after tThreeWords
      if the number of words in tTwoWords is 2 then
         put tTwoWords into tTwoWordList[loopvalue]
         put cr into char -1 of tTwoWordList
         delete word 1 of tTwoWords
      end if
      if the number of words in tThreeWords is 3 then
         put tThreeWords into tThreeWordList[loopvalue]
         put cr into char -1 of tThreeWordList
         delete word 1 of tThreeWords
      else
      end if
   end repeat
   answer tTwoWordList[1]
end mouseUp
But that's not working...what is wrong with my array here?
Thanks!
Posted: Tue Feb 17, 2009 8:59 pm
by RevDevelopment
@Jan,
That top code worked perfectly...thanks for your help on this.
Posted: Tue Feb 17, 2009 9:13 pm
by SparkOut
I have played with this a little just for academic reasons and Mark's is the super rocker from my trials. I was doing the same sort of process in a "repeat for each word" loop. Given a source text of the "Dennis went" sentence repeated a bunch of times (about 50 or 60), the results were:
Mark's "repeat for each": 8 to 14 milliseconds
My "repeat for each" (similar to Mark's but not as efficient, obviously): 14 to 19 milliseconds.
Jan's routine 1: 16 to 21 milliseconds
Jan's routine 2: 28 to 36 milliseconds
Jan's routines seem to give some odd results too. I'm getting peculiar gaps in the list of two and three words - is that because of the combining of the arrays?
Just to see how much slower a "repeat with i = 1 to..." loop is than the "repeat for each" approach, I tried that too, and the results were 28 to 31 milliseconds.
Posted: Tue Feb 17, 2009 9:20 pm
by SparkOut
And to get the array - just leave Mark's script as it is, don't add the loop counters or anything. Just let it build up the list and then at the very end:
Code: Select all
split tTwoWordList by cr
split tThreeWordList by cr
That will give you two arrays each numerically indexed from 1.
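If you would rather fill the arrays inside the loop, here is a rough sketch of how the earlier attempt could be adjusted. It rests on two guesses about why that attempt misbehaved, neither of which I have tested here: the single shared counter puts the first two-word phrase at key 2 rather than key 1, and putting cr into a char of a variable that holds an array appears to turn it back into plain text. The counter names tTwoCount and tThreeCount are made up for this example.

```livecode
on mouseUp
   local tTwoWords, tThreeWords, tTwoWordList, tThreeWordList
   local tTwoCount, tThreeCount -- hypothetical names, one counter per array
   put 0 into tTwoCount
   put 0 into tThreeCount
   repeat for each word tWord in field "fldText"
      put tWord & space after tTwoWords
      put tWord & space after tThreeWords
      if the number of words in tTwoWords is 2 then
         add 1 to tTwoCount
         -- store the phrase without its trailing space; no cr needed in an array
         put char 1 to -2 of tTwoWords into tTwoWordList[tTwoCount]
         delete word 1 of tTwoWords
      end if
      if the number of words in tThreeWords is 3 then
         add 1 to tThreeCount
         put char 1 to -2 of tThreeWords into tThreeWordList[tThreeCount]
         delete word 1 of tThreeWords
      end if
   end repeat
   answer tTwoWordList[1]
end mouseUp
```

Each array only gets a new key when a full phrase is ready, so both should end up numbered from 1 with no holes.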
Posted: Tue Feb 17, 2009 9:22 pm
by SparkOut
Code: Select all
on mouseUp
   -- note the start time so the run can be timed
   put the milliseconds into tNow
   repeat for each word tWord in field "Field1"
      put tWord & space after tTwoWords
      put tWord & space after tThreeWords
      if the number of words in tTwoWords is 2 then
         put tTwoWords after tTwoWordList
         put cr into char -1 of tTwoWordList
         delete word 1 of tTwoWords
      end if
      if the number of words in tThreeWords is 3 then
         put tThreeWords after tThreeWordList
         put cr into char -1 of tThreeWordList
         delete word 1 of tThreeWords
      end if
   end repeat
   -- turn the return-delimited lists into numerically keyed arrays
   split tTwoWordList by cr
   split tThreeWordList by cr
   put tTwoWordList[1] into field "Field2"
   put tThreeWordList[1] into field "Field3"
   put the milliseconds - tNow into field "fldTime"
end mouseUp
Mark's version above looks like the best result to me, based on the (relatively small) text sample. YMMV with a different sample text length, so you might need to do some more experiments of your own.
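If phrases of other lengths are ever wanted, the same sliding-window idea could be wrapped in a function. This is only a minimal sketch: wordPhrases, pText and pSize are names invented for the example, not anything from the scripts above, and it has not been timed against them.

```livecode
-- Hypothetical helper: returns every run of pSize consecutive words
-- in pText, one phrase per line.
function wordPhrases pText, pSize
   local tWindow, tList
   repeat for each word tWord in pText
      put tWord & space after tWindow
      -- keep at most pSize words in the window
      if the number of words in tWindow > pSize then
         delete word 1 of tWindow
      end if
      if the number of words in tWindow = pSize then
         -- char 1 to -2 drops the trailing space before adding the line
         put char 1 to -2 of tWindow & cr after tList
      end if
   end repeat
   delete char -1 of tList -- drop the final return
   return tList
end wordPhrases
```

It could then be called once per length, for example `put wordPhrases(field "fldText", 2) into field "Field2"`, with the field names being whatever your stack uses.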
Posted: Tue Feb 17, 2009 9:25 pm
by RevDevelopment
SparkOut,
Thanks so much for this info!
I'm going to test this out now...
Posted: Wed Feb 18, 2009 10:45 pm
by mwieder
Well, I wasn't really aiming for speed here, just that the overhead of putting everything into an array along the way seemed unnecessary. Out of curiosity, why do you want the results to end up in an array rather than just keeping them in a variable? Seems like you could use the line number just as easily as the array key. I don't see the advantage, so I must be missing out on something.
Posted: Wed Feb 18, 2009 10:48 pm
by RevDevelopment
mwieder,
There probably isn't an advantage...it's just that I wasn't aware you could call a line of text with a line number in Revolution.
Good idea...I'll try that.
Posted: Wed Feb 18, 2009 11:13 pm
by mwieder
put line x of tVariable into tLine
Arrays are one of the most powerful features of Rev. Normally I would think of using arrays for speed; I just didn't see that there was anything to be gained in this case with numeric keys.
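To make the line-number idea concrete, here is a small sketch. The sample phrases and the variable names tPhrases, tCount and tReport are made up for illustration; in practice the list would come out of the loop posted earlier in the thread.

```livecode
on mouseUp
   local tPhrases, tLine, tCount, tPhrase, tReport
   -- sample data for illustration only: three phrases, one per line
   put "Mr. Dennis" & cr & "Dennis went" & cr & "went to" into tPhrases
   -- pick out one entry by line number, much like an array key
   put line 2 of tPhrases into tLine -- tLine now holds "Dennis went"
   -- count the entries
   put the number of lines in tPhrases into tCount -- 3 for this sample
   -- walk every entry in order
   repeat for each line tPhrase in tPhrases
      put tPhrase & " / " after tReport
   end repeat
   answer tLine & cr & tCount & cr & tReport
end mouseUp
```

So a return-delimited variable already gives you lookup by number, a count, and in-order iteration without any split step.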
Posted: Wed Feb 18, 2009 11:59 pm
by RevDevelopment
Awesome...thanks so much for your help!
I really appreciate it.