Adventures into matchChunk and regex

bn · Post by bn » Tue Dec 09, 2025 12:01 pm

Unfortunately "matchChunk" is not well explained in the dictionary.

The task was to break up short texts into 3, 4 or 5 paragraphs. I manged that by counting full stops "." and determine the number of sentences that should make up a paragraph. However the texts could contain decimal numbers and the decimal delimiter is a ".". Furthermore abbreviations used "." as in "e.g." which I had to navigate around. Doing this in native LiveCode would have taken quite some coding, and I figured that matchChunk with a regular expression could be helpful. Regex is something I had avoided until now, since my needs could always be met with LiveCode syntax.

Stam recommends the "Regex101" website. I followed his advice and found it a very sensible testing ground for regular expressions.

I want to post my solution here: it cleans up a text by temporarily replacing the dots that are not full stops, so that they can be reintroduced after formatting (the formatting itself is not shown here).

Since I did not find any working examples for "matchChunk", I am posting my function here simply as an example of its use.

Code: Select all

function escapeAbbrAndDezimalpoints pText
   local tStart, tBegin, tEnd, tVar, tRegex, tList
   local tFrom, tTo
   
   put pText into tVar
   
   ## this catches "e.g." and "2.7" 
   put "([a-zA-Z]\.[a-zA-Z]\.|[0-9]\.[0-9])" into tRegex
   
   ## define search range; tFrom will be changed if a hit is found, excluding part up to a hit
   put 1 into tFrom
   put the number of chars of tVar into tTo
   
   repeat
      if matchChunk(char tFrom to tTo of tVar, tRegex, tStart, tEnd) is true then
         
         ## build list of hits (offsets relative to the whole text)
         put tStart + tFrom - 1, tEnd + tFrom - 1 & return after tList
         
         ## update search range
         add tEnd to tFrom
         
         if tFrom > tTo then
            exit repeat
         end if
      else
         exit repeat
      end if
   end repeat
   
   if tList is empty then return tVar
   
   ## descending sort is not needed here, but it keeps earlier offsets valid when a replacement changes the character count
   sort tList descending numeric by item 1 of each
    
   repeat for each line aLine in tList
      put item 1 of aLine into tBegin
      put item 2 of aLine into tEnd
      replace "." with "±" in char tBegin to tEnd of tVar
   end repeat
   return tVar
end escapeAbbrAndDezimalpoints
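
A minimal usage sketch, assuming the function above lives in the same script. The handler name, the sample sentence and the restore step are illustrative and not part of the original post. It masks the dots, counts the remaining full stops as item delimiters, and then puts the masked dots back:

```livecode
on testEscape
   local tText, tEscaped, tCount
   put "It costs 2.7 units, e.g. at noon. Done." into tText
   
   -- mask the dots in "2.7" and "e.g." with "±"
   put escapeAbbrAndDezimalpoints(tText) into tEscaped
   
   -- the dots still present should be sentence ends, so count them as item delimiters
   set the itemDelimiter to "."
   put the number of items of tEscaped into tCount
   
   -- restore the masked dots once the paragraph formatting is done
   replace "±" with "." in tEscaped
   
   -- show the count and the restored text in the message box
   put tCount & return & tEscaped
end testEscape
```

Note that the pattern only covers two-letter abbreviations and single decimal points, so something like "S.A.B." would be masked only in part.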

Kind regards
Bernd

stam · Post by **stam** » Tue Dec 09, 2025 12:29 pm

Hi Bernd,

Yes, I swear by Regex101 for all regex testing. The only problem with it is that it defaults to JS regex, which has a 'global' match flag (i.e. all matches are returned), whereas LiveCode uses PCRE regex, which doesn't have this (only the first match is returned).

As a workaround I use this function, which returns an array with all matches, and the start and end offsets for each match. It only works if you have 1 pattern to match, but that's the usual case anyway.

Code: Select all

function globalMatch pSource, pRegex
    local tMatchesA, tVarCount, tVar1, tVar2, tSource, tTemp, tRemoved
    -- count occurrences
    put replaceText(pSource, pRegex, "regexMatch" & return) into tTemp
    filter lines of tTemp with "*regexMatch"
    put the number of lines of tTemp into tVarCount
    
    -- get occurrences by locating offsets with matchChunk(), loop to store values and delete occurrence in copy of source.
    -- pRegex needs one pair of parentheses, since matchChunk() reports the position of the parenthesized part.
    put pSource into tSource
    put 0 into tRemoved
    repeat with x = 1 to tVarCount
        if matchChunk(tSource, pRegex , tVar1, tVar2) then
            put char tVar1 to tVar2 of tSource into tMatchesA[x]["value"]
            -- add the number of chars already deleted so that offsets refer to pSource
            put tVar1 + tRemoved into tMatchesA[x]["start"]
            put tVar2 + tRemoved into tMatchesA[x]["end"]
            add tVar2 - tVar1 + 1 to tRemoved
            delete char tVar1 to tVar2 of tSource
        end if
    end repeat
    return tMatchesA
end globalMatch
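
A short usage sketch, assuming globalMatch is in the same script. The handler name, the sample string and the pattern are made up for illustration. It lists every match together with its stored offsets:

```livecode
on testGlobalMatch
   local tMatchesA, tReport, tIndex
   -- the pattern carries one capture group, which is what matchChunk reports on
   put globalMatch("a1b22c333", "([0-9]+)") into tMatchesA
   
   -- one line per match: value, start offset, end offset
   repeat with tIndex = 1 to the number of elements of tMatchesA
      put tMatchesA[tIndex]["value"] && tMatchesA[tIndex]["start"] && tMatchesA[tIndex]["end"] & return after tReport
   end repeat
   
   -- show the report in the message box
   put tReport
end testGlobalMatch
```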

Not sure if that adds anything to your solution, but may be helpful for others...

richmond62 · Post by **richmond62** » Wed Mar 25, 2026 9:58 am

I manged (sic) that by counting full stops "."

However the texts could contain decimal numbers and the decimal delimiter is a ".". Furthermore abbreviations used "." as in "e.g." which I had to navigate around.

Not such a problem as it first appears for the simple reason that a full stop is ALWAYS followed by a space: so, if you do this sort of thing:

Code: Select all

set the itemDelimiter to ". "

this should weed out the decimal separator and the stuff such as 'v.g.', 'e.g.' and so forth.

Although a phrase such as 'v.g. Sandy Smith' may cause probs because of the space after the 'v.g.'
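
To try this out, here is a minimal sketch, assuming LiveCode 7 or later, where the itemDelimiter may be longer than one character. The function name is hypothetical:

```livecode
function countDotSpaceItems pText
   -- multi-character item delimiters need LiveCode 7 or later
   set the itemDelimiter to ". "
   -- every ". " starts a new item, whether or not it ends a sentence
   return the number of items of pText
end countDotSpaceItems
```

Called with "It costs 2.7 units, e.g. at noon. Done.", the decimal point is skipped, but the "e.g. " still counts as a break, which is the caveat just mentioned.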

Personally I'd trawl through my text first looking for 'v.g.', 'e.g.', and so forth

Code: Select all

set the itemDelimiter to ".g. "

Geddit?

and replace all those spaces with some 'daft' character a bit like this:

'v.g.⏃'

having chopped up my text as per the OP, I'd then go back and replace 'that thing' with a space again.
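
The steps above can be sketched roughly as follows. This assumes the list of abbreviations is known in advance and passed in one per line. The function name, the parameter names and the choice of placeholder are illustrative only:

```livecode
function splitProtected pText, pAbbrevs
   local tAbbr, tMark
   -- tMark is an arbitrary placeholder that is assumed not to occur in the text
   put "⏃" into tMark
   
   -- glue each known abbreviation to the word that follows it
   repeat for each line tAbbr in pAbbrevs
      replace tAbbr & space with tAbbr & tMark in pText
   end repeat
   
   -- break at the remaining ". " sequences, one sentence per line
   replace ". " with "." & return in pText
   
   -- turn the placeholders back into spaces
   replace tMark with space in pText
   return pText
end splitProtected
```

A call such as splitProtected(tText, "e.g." & return & "v.g.") only protects the abbreviations it is given, and an abbreviation that happens to end a sentence is glued to the next sentence as well.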

bn · Post by bn » Wed Mar 25, 2026 10:47 am

However:

A knockout (abbreviated to KO or K.O.) is a fight-ending...

(From Wikipedia)

Since not all abbreviations might end in ".g. " and it is hard to tell which abbreviations might appear in text I resorted to Regex.

That is the whole point of my post.

Kind regards
Bernd

richmond62 · Post by **richmond62** » Wed Mar 25, 2026 11:09 am

Since not all abbreviations might end in ".g. "

No, they don't, but ALL abbreviations (in English) that end in a full stop are followed by a SPACE.

And a little thought can weed out abbreviations before spotting sentence ends.

And I assume (???) all the 'chunk' stuff in xTalk either:

1. was put there so that xTalk users did NOT have to use Regex, or

2. was put there before Regex became available in xTalk.

Regex was invented in the middle 1950s.

My Danny Goodman HyperCard book (1993) has a section on Chunk stuff, but no mention of Regex.

I have absolutely no objection to using Regex whatsoever, but IFF something can be done in xTalk 'pure and simple' why resort to it?

richmond62 · Post by **richmond62** » Wed Mar 25, 2026 11:27 am

https://dn721601.ca.archive.org/0/items ... ndbook.pdf
Page 729

bn · Post by bn » Wed Mar 25, 2026 12:54 pm

I have absolutely no objection to using Regex whatsoever, but IFF something can be done in xTalk 'pure and simple' why resort to it?

How do you resolve this?

El Banco Nacional de México, S.A. ("Banamex"), Teléfonos de México, S.A.B. de C.V. ("TELMEX"), Operadora de Cinemas, S.A. de C.V. ("Cinemex"). That is O.K. and e.g. no simple K.O.

using chunks?

Kind regards
Bernd

richmond62 · Post by **richmond62** » Wed Mar 25, 2026 1:58 pm

How do you resolve this?

If you could explain to me what you mean by 'resolve' I may have a chance.

bn · Post by bn » Wed Mar 25, 2026 2:28 pm

The task was to break up short texts into 3, 4 or 5 paragraphs. I manged that by counting full stops "." and determine the number of sentences that should make up a paragraph

The question is how many sentences are there.
Kind regards
Bernd

richmond62 · Post by **richmond62** » Wed Mar 25, 2026 2:55 pm

There are 2 sentences: and I can work that out in about 5 seconds without the questionable benefit of a computer.

AND:

dunbarx · Post by **dunbarx** » Wed Mar 25, 2026 4:19 pm

Richmond.

Bernd's example is, er, an example of a string of chars that would fail when asking for the number of sentences.

What I (and Bernd) mean is this: if you put Bernd's text in a field 1 and:

Code: Select all

on mouseUp
   answer the number of sentences in fld 1
end mouseUp

you get five, not 2. This can be understood when reading the dictionary entry for "sentence".

There is too much flexibility in the construction of English sentences to be able to rely on any simple delimiter, whether "sentence" or, as you mentioned, ". "

Craig

richmond62 · Post by **richmond62** » Wed Mar 25, 2026 5:05 pm

One of the salient things I notice about this:

El Banco Nacional de México, S.A. ("Banamex"), Teléfonos de México, S.A.B. de C.V. ("TELMEX"), Operadora de Cinemas, S.A. de C.V. ("Cinemex"). That is O.K. and e.g. no simple K.O.

is that, apart from 'e.g.', all the full stops that are NOT sentence delimiters are preceded by capital letters.

Code: Select all

set the caseSensitive to true

FourthWorld · Post by **FourthWorld** » Wed Mar 25, 2026 5:52 pm

bn wrote: ↑
Tue Dec 09, 2025 12:01 pm
The task was to break up short texts into 3, 4 or 5 paragraphs. I manged that by counting full stops "." and determine the number of sentences that should make up a paragraph.

I would start this exercise at the beginning, asking, "For what purpose?"

A paragraph isn't a quantity, but a semantic unit.

"Wow!" can be a valid paragraph, and contains no period.

Semantic analysis is complex stuff, probably more than one would want to undertake in a scripting language alone.

Yet the value of semantic analysis suggests that someone out there has probably written an algorithm to at least determine sentence boundaries. That much is somewhat mechanical, and a good algo would need to account for not only exclamation and question marks but also occasional ellipses and trailing hyphens, and possibly others as well, in addition to full-stops used for things other than sentence stops.

Once sentence boundaries become known, the rest of the question remains: by what means can we construct the unit of meaning of a paragraph out of them?

That part I'd sooner leave to a linguistics philosopher, or, if quality is not a concern, an LLM.

richmond62 · Post by **richmond62** » Wed Mar 25, 2026 6:33 pm

As a qualified Linguist (M.A. Applied Linguistics, SIUC) I would like to point out (echoing FourthWorld's point) that what this discussion really comes down to is that the human mind is not a binary computer, and concepts such as 'sentence' and 'paragraph' are products of the human mind, and they are often human-language specific.

At the moment a lot of people who are not pausing to think are imagining that binary computers are capable of behaving in the same way as human minds.

It is possible that, in the future, clever programmers will make computer systems that are very good at imitating human minds: but it will only be imitation because (presupposing that humans are ONLY material things) a brain neuron has a multiplicity of synapses so the human brain is not binary.

dunbarx · Post by **dunbarx** » Wed Mar 25, 2026 9:09 pm

Richmond.

So you are saying that you could work out the parsing of sentences in five seconds, but a computer cannot.

OK, but if we are using computers, then what is the solution to parsing these, what Richard calls "semantic units", into semantic chunks, call them "sentences"?

You cannot in all cases.

If you are using a computer you have to abide by computer stuff. I would say if you are the author of certain text and need this functionality, insert an invisible character at the end of each sentence and use the itemDelimiter. Otherwise you are oftentimes out of luck.
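
A minimal sketch of that idea, assuming the author of the text has ended every sentence with an invisible marker. The unit separator (code point 31) is an arbitrary choice here and the function name is hypothetical. numToCodepoint needs LiveCode 7 or later:

```livecode
function countMarkedSentences pText
   local tMark
   -- assumption: each sentence in pText ends with the unit separator character
   put numToCodepoint(31) into tMark
   
   -- with the marker as delimiter, every item is one sentence
   set the itemDelimiter to tMark
   return the number of items of pText
end countMarkedSentences
```

Dots inside abbreviations and decimal numbers no longer matter, because the full stop is not used as a delimiter at all.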

Craig
