
Get Number of Words Within Quotation Marks

Posted: Thu Oct 27, 2016 7:46 pm
by tonymac
I appeal to the forum experience and intellect...

Task: I need to download a .txt file from my website, put it in a variable, and then show the text "one WORD at a time". The text has hundreds and hundreds of quotes with quotation marks in it. Livecode treats all text between quotation marks as a single word. This does not work for me.

If LiveCode comes across this text in a variable, "This is a quoted comment", it treats everything between the quotes as a single word. I need to figure out how to get LiveCode to see it as five separate words. I can see that the single-word behavior might be useful in some circumstances, but if you type the text above into a Microsoft Word document, it reports five words, which I think should be the default.

While I am still working on it, I have not had success. Tried treating the text in the variable as items and a space as the delimiter but that did not work.

Appreciate any thoughts or comments on what I could try.

Re: Get Number of Words Within Quotation Marks

Posted: Thu Oct 27, 2016 7:50 pm
by shaosean
Look at the keyword "token" (or maybe "tokens")..
Grab the quoted string as a word and then go over each token of that word..

Re: Get Number of Words Within Quotation Marks

Posted: Thu Oct 27, 2016 7:53 pm
by FourthWorld
See the trueWord chunk type in the Dictionary.

Token is useful for parsing LiveCode expressions, but will likely return a much higher number of elements than there are actual words.

TrueWord takes advantage of IBM's natural language algorithms in the standard Unicode libraries to provide a good word count that takes punctuation into account.
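
To see the difference between the two chunk types, here is a minimal sketch. It assumes LiveCode 7 or later (where trueWord is available) and uses the sample sentence from the first post; the handler is only an illustration.

```livecode
on mouseUp
   local tSample
   -- build the sample with its surrounding quotation marks
   put quote & "This is a quoted comment" & quote into tSample
   -- word: text inside double quotes is treated as a single word
   -- trueWord: Unicode word-boundary rules, punctuation is not part of the word
   answer "words: " & the number of words of tSample & return & \
         "trueWords: " & the number of trueWords of tSample
end mouseUp
```

Comparing the two numbers on your own text is a quick way to check whether trueWord matches the definition of "word" you have in mind.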

Re: Get Number of Words Within Quotation Marks

Posted: Thu Oct 27, 2016 8:00 pm
by dunbarx
Hi.

The old-fashioned way might be to replace all spaces with some obscure character, set the itemDel, and count the items. You would have to eliminate all empty items, though, resulting from consecutive spaces in the original text.

But as you see, there are more modern methods...

Craig Newman

EDIT, of course, you could use space as a delimiter too, I suppose. I am still in v6x, so there are no trueWords yet. Setting the itemDel to space will count "trueWords".
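
A minimal sketch of the space-as-delimiter count described above. The function name countSpaceDelimitedWords is made up for illustration, and treating tabs and line breaks as spaces is an assumption about what the text contains.

```livecode
function countSpaceDelimitedWords pText
   local tCount, tItem
   put 0 into tCount
   -- treat line breaks and tabs as ordinary spaces
   replace return with space in pText
   replace tab with space in pText
   set the itemDelimiter to space
   repeat for each item tItem in pText
      -- consecutive spaces produce empty items; skip them
      if tItem is not empty then add 1 to tCount
   end repeat
   return tCount
end countSpaceDelimitedWords
```

Called as countSpaceDelimitedWords(fld 1), this counts "This is a quoted comment" (with its quotation marks) as five words, because the quotation marks simply travel with the first and last items.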

Re: Get Number of Words Within Quotation Marks

Posted: Thu Oct 27, 2016 9:24 pm
by tonymac
Well the problem with using the keyword "trueword" to determine the number of words between quotation marks is that it returns the first and last words between the quotes without the quotation mark itself.

Example: "Here is some-text between quotes."

The first LiveCode trueword is: Here... not "Here. I need the quotation mark to stay with the word... it's not separated by a space anyway, so it should go with the word. Also, even though "some-text" contains no space, trueword treats it as two words, ignoring the dash between "some" and "text".

Text editors use the space character in determining the number of words. I was hoping I wasn't going to have to massage the text before displaying words... one word at a time in a field... separated by spaces.

Continuing to search for a solution...

Re: Get Number of Words Within Quotation Marks

Posted: Thu Oct 27, 2016 9:42 pm
by dunbarx
Hmmm.

Does the old-fashioned way then seem more apt? Quotes would travel with any chars between spaces, since they are just chars, after all. Spaces, again making sure you condense any strings of continuous spaces, don't care what is around them.

Craig

Re: Get Number of Words Within Quotation Marks

Posted: Thu Oct 27, 2016 9:45 pm
by FourthWorld
Whether punctuation surrounding a word is also part of the word may differ among software, but linguistically it would be seen as separate.

If IBM's Unicode libraries for natural language parsing won't cover your use case, it may be specific enough to expect to write a custom solution.

Do you want the number of words, or the words (or words+punctuation) themselves?

Re: Get Number of Words Within Quotation Marks

Posted: Thu Oct 27, 2016 10:20 pm
by dunbarx
Richard.

Old fashioned aside, doesn't massaging the text with space delimiters cut through any possible issue? Punctuation can be explicitly excluded by replacing specific chars with empty, and consecutive spaces can be readily discovered and shortened. What remains is a string containing a custom definition of a "word", which can then be processed as desired.

Craig
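
One way the "custom definition of a word" idea could look, as a rough sketch. The function name customWords and the parameter pStripChars (the punctuation you choose to discard) are hypothetical, not anything built into LiveCode.

```livecode
function customWords pText, pStripChars
   local tChar, tItem, tList
   -- remove every character the caller does not want counted as part of a word
   repeat for each char tChar in pStripChars
      replace tChar with empty in pText
   end repeat
   -- treat line breaks and tabs as ordinary spaces
   replace return with space in pText
   replace tab with space in pText
   set the itemDelimiter to space
   repeat for each item tItem in pText
      -- empty items come from consecutive spaces; leave them out
      if tItem is not empty then put tItem & return after tList
   end repeat
   delete the last char of tList -- drop the trailing return
   return tList
end customWords
```

For example, customWords(fld 1, ",.;:!?") would return one word per line with that punctuation removed, while quotation marks stay attached to their words because they are not in the strip list.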

Re: Get Number of Words Within Quotation Marks

Posted: Thu Oct 27, 2016 10:42 pm
by [-hh]
Hi all.
IMHO, "token" (shaosean's post) is clearly a way to go. But "tokens" are hard to understand and to remember, since one uses the keyword rather seldom.

Yet another method could be:

Code: Select all

replace quote with numToChar(1) in str
-- act on words in str
replace numToChar(1) with quote in str
@Craig. Sadly numToChar(42) is not usable without additional efforts because there may be more "*" in str ...
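
A fuller sketch of the placeholder idea, for illustration only. The function name wordsKeepingQuotes is made up, and the sketch assumes the text never contains numToChar(1) itself.

```livecode
function wordsKeepingQuotes pText
   local tPlaceholder, tWord, tList
   put numToChar(1) into tPlaceholder
   -- hide the quotation marks so the word chunk no longer groups quoted text
   replace quote with tPlaceholder in pText
   repeat for each word tWord in pText
      put tWord & return after tList
   end repeat
   delete the last char of tList -- drop the trailing return
   -- restore the quotation marks in the finished list
   replace tPlaceholder with quote in tList
   return tList
end wordsKeepingQuotes
```

Each line of the result is then one space-separated word with its original quotation marks, ready to be shown one at a time.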

Re: Get Number of Words Within Quotation Marks

Posted: Fri Oct 28, 2016 8:35 am
by Thierry
Hi tonymac,

Not sure if this will make you happy, but following your idea of space as an item delimiter,
here is a quick and dirty working test:

in fld 1 (with some extra spaces and a return ):

Code: Select all

Example:    "Here is some-text between quotes."
Does this help?
the script:

Code: Select all

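on mouseUp
   local T, R, tCount
   -- collapse every run of whitespace (spaces, tabs, line breaks) into one space
   put replaceText( fld 1 , "[\t\s\r\n]+", space) into T
   set the itemdel to space
   put the number of items in T into tCount
   repeat with n=1 to tCount
      put n &": " & item n of T &cr after R
   end repeat
   put R into fld 2
   -- report tCount rather than n, which may have been stepped past the last item
   answer "Found " & tCount & " words."
end mouseUp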
Result:

Code: Select all

1: Example:
2: "Here
3: is
4: some-text
5: between
6: quotes."
7: Does
8: this
9: help?
Thierry

Re: Get Number of Words Within Quotation Marks

Posted: Fri Oct 28, 2016 11:52 am
by richmond62
Strip out the quotation marks, then Livecode won't have the problem you described.
anti-quote.livecode.zip
Here's the stack

Re: Get Number of Words Within Quotation Marks

Posted: Fri Oct 28, 2016 1:13 pm
by [-hh]
Hi Richmond,
as I understand it, after the text is partitioned into words the OP wants the quotes back where they were before. For example,
"Here I am!" should translate to the three 'parts': <"Here> and <I> and <am!">.

Re: Get Number of Words Within Quotation Marks

Posted: Fri Oct 28, 2016 2:02 pm
by dunbarx
Old fashioned.

Code: Select all

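on mouseUp
   local temp
   get fld 1 --with the text in it
   set the itemDel to space
   -- walk backwards so that deleting a char does not shift positions still to be checked
   repeat with y = the number of chars of it down to 1
      if char y of it = space and char y - 1 of it = space then delete char y of it
   end repeat
   -- each space-delimited item is now one "word", with quotes and punctuation attached
   repeat for each item tItem in it
      put tItem & return after temp
   end repeat
   answer temp
end mouseUp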
Now this may take a bit of time with a large body of text. I bet there is a regex that can winnow out multiple spaces. That still makes it old-fashioned.

There is a hidden issue if a lone quote is surrounded by multiple spaces on both sides, or if two quotes sit together. Just another line or two to fix that.

Craig

Edit. Even with a half meg of text, only a second or two is required to finish.

Re: Get Number of Words Within Quotation Marks

Posted: Fri Oct 28, 2016 3:40 pm
by FourthWorld
dunbarx wrote:Richard.

Old fashioned aside, doesn't massaging the text with space delimiters cut through any possible issue?
I don't know. We'll need to hear back from tonymac on my question about what specifically he needs to do. His OP mentions needing to retain the punctuation surrounding words, but as outcomes he mentions only word counts.

If he's just looking for word counts it'll be hard to beat the simplicity and efficiency of leveraging the industry-standard Unicode parsing available to us with trueWords.

If he needs something else, the best solution will depend on what that something else is.

Re: Get Number of Words Within Quotation Marks

Posted: Fri Oct 28, 2016 9:12 pm
by AxWald
Craig.
dunbarx wrote: I bet there is a regex that can winnow out multiple spaces.
Bah. Brute force for the win!

Code: Select all

function killSpaces MyStr
   repeat until offset("  ", MyStr) = 0
      replace "  " with " " in MyStr
   end repeat
   return MyStr
end killSpaces
"A clockwork orange", converted from epub to text:

Code: Select all

Kill Spaces: Initial size: 319316 Bytes; Reduced by: 5334 Bytes; Milliseconds used: 38
;-)

Have fun!
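
For anyone who wants to reproduce a timing line like the one above, here is a minimal sketch. It assumes the text sits in fld 1 and that the killSpaces function is in the same script; sizes are reported in chars rather than bytes.

```livecode
on mouseUp
   local tText, tBefore, tStart, tElapsed
   put fld 1 into tText
   put length(tText) into tBefore
   -- time only the call to killSpaces
   put the milliseconds into tStart
   put killSpaces(tText) into tText
   put the milliseconds - tStart into tElapsed
   answer "Kill Spaces: Initial size: " & tBefore & " chars; Reduced by: " & \
         (tBefore - length(tText)) & " chars; Milliseconds used: " & tElapsed
end mouseUp
```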

PS: Found no peace until I tried this word list thingie. Again, poor "A clockwork orange" had to suffer:

Code: Select all

Kill Spaces: Initial size: 319316 Bytes; Reduced by: 5334 Bytes; Milliseconds used: 39
RubbishRem: Initial size: 313982 Bytes; Reduced by: 9245 Bytes; Milliseconds used: 166
ListMake: Initial size: 304737 Bytes; Resulting words: 5412; Milliseconds used: 3060
-----------------------------------------------------------------------------------
Time spent for all of this: 3.265 Seconds
List looks like:

Code: Select all

"Aaaaaaarhgh"
"About
"After
"Ah
"Ah"
"Aha
"Alekth
"All
"Am
...
Stack attached.