itemDelimiter misbehaving

richmond62 · Post by **richmond62** » Wed Oct 23, 2019 8:26 am

This is dead clunky:

if char 1 of word KOUNT2 of TEKST is "u" and char 2 of word KOUNT2 of TEKST is "n" then

especially if one wants to vary the number of letters in one's prefix.

Obviously this needs to be changed to something like:

Code: Select all

if char(1-2) of word KOUNT2 of TEKST is "un" then

or

Code: Select all

if char(1-2) of word KOUNT2 of TEKST contains "un" then

BUT neither of these works.

bogs · Post by **bogs** » Wed Oct 23, 2019 8:57 am

I suspect that last post is just having a go at us, as I've never seen
if char 1 to 2 of KOUNT2 is "un" then put "true"
(which does work) put down like
if char(1-2) of word KOUNT2...

*Edit - it surprises me more that this doesn't work
put character 2 to 1 of KOUNT2

I would expect to see "nu"

richmond62 · Post by **richmond62** » Wed Oct 23, 2019 11:44 am

I suspect that last post is just having a go at us

No I was not "having a go" at you: I just got the code wrong.

Klaus · Post by **Klaus** » Wed Oct 23, 2019 12:06 pm

LC first evaluates everything in parens, so I would expect to get the last character (1-2 = -1) with -> char (1-2) of xyz
Just tested and that actually works! However, not the way Richmond expected.

Edit - it surprises me more that this doesn't work
put character 2 to 1 of KOUNT2

In LC the "startxxx" MUST be <= the "endxxx", where xxx stands for char, word, line etc.
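
To make the range rule concrete, here is a minimal sketch of a prefix test whose length follows the prefix itself. The variable names tPrefix and tWord and the sample values are made up for illustration and are not from the stack discussed above.

```livecode
on mouseUp
   local tPrefix, tWord
   put "un" into tPrefix      -- illustrative; change to any prefix
   put "unhappy" into tWord   -- illustrative sample word
   
   -- A range is written "char <start> to <end>", with start <= end.
   -- The end of the range follows the length of the prefix,
   -- so the same line works for "un", "dis", "anti" and so on.
   if char 1 to (length(tPrefix)) of tWord is tPrefix then
      put tWord && "starts with" && tPrefix
   end if
end mouseUp
```

Writing char (1-2) instead asks for a single character at position -1, the last one, which is why that form compiles but never matches a two-letter prefix.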

richmond62 · Post by **richmond62** » Wed Oct 23, 2019 12:36 pm

This allows you to set the prefix that you are searching your text for.

richmond62 · Post by **richmond62** » Wed Oct 23, 2019 3:42 pm

So, my next problem is how to refer to a number of chars at the end of a word of unknown length.

This is for analysing texts for suffixes.

i.e. I should be able to "pick up" words such as:

playful, forgetful, successful and joyful because they all end in -ful, but they are of differing lengths.

Of course this could be done by measuring the length of each word (put the number of chars) as it is processed and
then doing some simple Mathematics, but that does seem unnecessarily clunky.

Klaus · Post by **Klaus** » Wed Oct 23, 2019 3:50 pm

Code: Select all

...
put char -3 to -1 of "thisisaververyverylongword" into the_last_three_chars
## -> ord
...
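
Building on the negative indexes above, a suffix of any length could be checked without measuring each word. This is only a sketch: the function name wordsWithSuffix and its parameters are hypothetical, and it uses the classic word chunk, so a word followed directly by punctuation (such as "joyful.") would not match.

```livecode
-- Hypothetical helper: returns the words of pText that end in pSuffix,
-- one per line.
function wordsWithSuffix pText, pSuffix
   local tStart, tFound, tWord
   -- For "ful" this gives -3, so the range below is char -3 to -1
   put 0 - length(pSuffix) into tStart
   repeat for each word tWord in pText
      if char tStart to -1 of tWord is pSuffix then
         put tWord & CR after tFound
      end if
   end repeat
   delete char -1 of tFound   -- drop the trailing CR
   return tFound
end wordsWithSuffix
```

Called as wordsWithSuffix("a playful but forgetful child", "ful"), it should list playful and forgetful.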

richmond62 · Post by **richmond62** » Wed Oct 23, 2019 4:06 pm

Thanks!

But you know what happens when you give the boy something, don't you?

He asks for more!

I am processing texts that consist of sentences ending in a variety of symbols including "." and "!"
so I should like to set the itemDelimiter to more than one symbol . . .

Klaus · Post by **Klaus** » Wed Oct 23, 2019 4:15 pm

richmond62 wrote: ↑
Wed Oct 23, 2019 4:06 pm
But you know what happens when you give the boy something, don't you?
He asks for more!

Really? Go figure, did not notice so far...

richmond62 wrote: ↑
Wed Oct 23, 2019 4:06 pm
I am processing texts that consist of sentences ending in a variety of symbols including "." and "!" so I should like to set the itemDelimiter to more than one symbol . . .

We can have an itemdelimiter consisting of one or more characters, but not different itemdelimiters in one run.
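
One possible workaround, sketched under some assumptions: mark every sentence terminator with the same marker character first, then use that one marker as the itemDelimiter. The function name sentencesToLines is hypothetical, and the sketch assumes that numToChar(1) never occurs in the text being processed.

```livecode
-- Hypothetical helper: returns pText with one sentence per line,
-- keeping the terminating ".", "!" or "?".
function sentencesToLines pText
   local tMark, tEnd, tItem, tSentence, tResult
   put numToChar(1) into tMark   -- assumed to be absent from pText
   replace CR with space in pText
   
   -- The itemDelimiter is still the default comma here,
   -- so this loops over ".", "!" and "?"
   repeat for each item tEnd in ".,!,?"
      replace tEnd with (tEnd & tMark) in pText
   end repeat
   
   set the itemDelimiter to tMark
   repeat for each item tItem in pText
      put tItem into tSentence
      -- Remove the spaces left over from the sentence gap
      repeat while char 1 of tSentence is space
         delete char 1 of tSentence
      end repeat
      if tSentence is not empty then put tSentence & CR after tResult
   end repeat
   delete char -1 of tResult   -- drop the trailing CR
   return tResult
end sentencesToLines
```

This treats every ".", "!" and "?" as a sentence end, so abbreviations and decimal numbers would be split as well.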

If you put the text into a variable, avoid updating any field during the loop (sic!) and use REPEAT FOR EACH, you will be surprised how fast LC can handle your tasks!

Klaus · Post by **Klaus** » Wed Oct 23, 2019 4:24 pm

I just made a little test with my first script on page 1:

Code: Select all

on mouseUp
   set itemDel to "."
   ## Accessing variables is a thousandfold faster than accessing fields
   put fld "noCR" into tText
   
   put the millisecs into tMS
   repeat 1000
      
      ## Remove unwanted linebreaks first, but add a SPACE,
      ## since some lines have a CR instead of the necessary SPACE in it
      replace CR with " " in tText
      
      ## Now collect all items in a CR delimited list
      repeat for each item tItem in tText
         put tItem & "." & CR after tNewText
      end repeat
      
      ## Remove possible leading SPACES in text
      replace (CR & " ") with CR in tNewText
   end repeat
   
   put the millisecs - tMS
   
   put tNewText into fld 2
end mouseUp

And got 628 in the message box, for 1000 (THOUSAND) repeats.
See what I mean?

richmond62 · Post by **richmond62** » Wed Oct 23, 2019 4:27 pm

Klaus

You told me about speeding things up by avoiding fields in about 2004 . . .

However, as I do not have eyes in the back of my head, until I am sure this sort of thing works
I will do the work with fields: once the functionality is "there" I can swap over to variables
without any trouble.

--------------

That is not my fixation at the moment.

I had a "waking dream" like this:

pseudocode

set the itemDelimiter to (".","!","?")

jacque · Post by **jacque** » Wed Oct 23, 2019 5:13 pm

Handy hints:

You can use "begins with" and "ends with"

Code: Select all

if word 2 of tData begins with "un" then... 

if last word of tData ends with "est" then...

If the second integer in a character range reference is smaller than the first, it indicates the position of the insertion point in a field. Specifying char 2 to 1 of a field will place the insertion point between those two characters. Doesn't apply to variables for obvious reasons.

FourthWorld · Post by **FourthWorld** » Wed Oct 23, 2019 5:31 pm

richmond62 wrote: ↑
Wed Oct 23, 2019 4:06 pm
I am processing texts that consist of sentences ending in a variety of symbols including "." and "!"
so I should like to set the itemDelimiter to more than one symbol . . .

…and "?", and sometimes "…", and likely others I'm not thinking of right now.

Parsing natural language is a very difficult problem, which is why much of the programming world has contributed to and relies on the ICU libraries now part of LC's Unicode implementation.

There have been comments earlier that using LC's built-in sentence chunk type would not be a good fit for parsing sentences, but I missed the explanation as to why. It already handles the full range of punctuation and structural indicators of sentence boundaries based on the collective several decades of experience of contributing programmers who specialize in such work.

Similarly, using the newer trueWord chunk type instead of word will avoid the many difficulties that arise from HC's definition of a word, which can include punctuation and other characters not actually part of the word, and treats any string of any length enclosed in double quotes as a single word.

For language analysis tasks, using modern language analysis tooling now built into LC can greatly simplify code, and will likely run much faster as it moves a lot of the heavy lifting of parsing from scripts into optimized machine code.
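
As a rough illustration of the chunk types mentioned above, the sketch below walks a text sentence by sentence and trueWord by trueWord and collects the words with a given suffix. The function name suffixedTrueWords and its parameters are hypothetical, and it assumes a LiveCode version that has the sentence and trueWord chunks (7.0 or later).

```livecode
-- Hypothetical helper: returns every trueWord of pText that ends in
-- pSuffix, one per line.
function suffixedTrueWords pText, pSuffix
   local tFound, tSentence, tWord
   repeat for each sentence tSentence in pText
      repeat for each trueWord tWord in tSentence
         -- trueWord leaves out adjoining punctuation, so "joyful!"
         -- is seen here as "joyful"
         if tWord ends with pSuffix then
            put tWord & CR after tFound
         end if
      end repeat
   end repeat
   delete char -1 of tFound   -- drop the trailing CR
   return tFound
end suffixedTrueWords
```

The outer sentence loop is not strictly needed just to collect words; it is kept to show where per-sentence processing would go.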

Related: I saw that you're using this to identify suffixes, is that correct? Are you making a stemmer? I have an old Porter stemmer (written by Ken Ray and Andrew Meit) lying around here somewhere. It's so old that I wouldn't use it in production today without rewriting it to take advantage of newer language features, but it may be helpful in providing some inspiration for a stemmer if that's what you're working on.

Klaus · Post by **Klaus** » Wed Oct 23, 2019 5:38 pm

FourthWorld wrote: ↑
Wed Oct 23, 2019 5:31 pm
...
There have been comments earlier that using LC's built-in sentence chunk type would not be a good fit for parsing sentences, but I missed the explanation as to why.

I only said that it would not fit in THIS situation!
https://forums.livecode.com/viewtopic.p ... 45#p184542

FourthWorld · Post by **FourthWorld** » Wed Oct 23, 2019 5:44 pm

Klaus wrote: ↑
Wed Oct 23, 2019 5:38 pm

FourthWorld wrote: ↑
Wed Oct 23, 2019 5:31 pm
...
There have been comments earlier that using LC's built-in sentence chunk type would not be a good fit for parsing sentences, but I missed the explanation as to why.
I only said that it would not fit in THIS situation!
https://forums.livecode.com/viewtopic.p ... 45#p184542

Thank you. I'd seen that, but it wasn't clear to me then, and even re-reading it now I still don't understand how it makes a case for writing new scripts that attempt to do a better job of natural language parsing than the ICU lib we have available in LC's newer chunk types.

Maybe I need more coffee.

LiveCode Forums.
