Adventures into matchChunk and regex

Posted: Tue Dec 09, 2025 12:01 pm
by bn
Unfortunately "matchChunk" is not well explained in the dictionary.

The task was to break up short texts into 3, 4 or 5 paragraphs. I managed that by counting full stops "." and determining the number of sentences that should make up a paragraph. However, the texts could contain decimal numbers, where the decimal delimiter is a ".". Furthermore, abbreviations such as "e.g." also use ".", which I had to navigate around. Doing this in native LiveCode would have required quite some coding, and I figured that matchChunk with a regular expression could be helpful. Regex is something I had avoided so far, since my needs up to now could be met with LiveCode syntax.

Stam recommends the "Regex101" website; I followed his advice and found it a very sensible testing ground for regular expressions.

I want to post my solution here to the problem of cleaning up a text: dots that are not full stops are temporarily replaced, and the dots are reintroduced after formatting (the formatting itself is not shown here).

Since I did not find any working examples of "matchChunk", I am posting my function here simply as an example of its use.

Code: Select all

function escapeAbbrAndDezimalpoints pText
   local startMatch, endMatch, tStart, tBegin, tEnd, tVar, tRegex, tList
   local tFrom, tTo
   
   put pText into tVar
   
   ## this catches "e.g." and "2.7" 
   put "([a-zA-Z]\.[a-zA-Z]\.|[0-9]\.[0-9])" into tRegex
   
   ## define search range; tFrom will be changed if a hit is found, excluding part up to a hit
   put 1 into tFrom
   put the number of chars of tVar into tTo
   
   repeat
      if matchChunk(char tFrom to tTo of tVar, tRegex, tStart, tEnd) is true then
         
         ## build list of hits
         put tStart + tFrom - 1, tEnd + tFrom - 1 & return after tList
         
         ## update search range
         add tEnd to tFrom
         
         if tFrom > tTo then
            exit repeat
         end if
      else
         exit repeat
      end if
   end repeat
   
   if tList is empty then return tVar
   
   ## sorting is not necessary here, but it matters in cases where replace changes the character count
   sort tList descending numeric by item 1 of each
    
   repeat for each line aLine in tList
      put item 1 of aLine into tBegin
      put item 2 of aLine into tEnd
      replace "." with "±" in char tBegin to tEnd of tVar
   end repeat
   return tVar
end escapeAbbrAndDezimalpoints
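
A minimal usage sketch, assuming the function above is in the message path. The handler name testEscape and the sample text are made up for illustration. It escapes the dots inside decimals and abbreviations and, once the paragraph formatting is done, puts them back with a plain replace.

```livecode
on testEscape
   local tText, tEscaped
   put "The sample weighed 2.7 g, e.g. after drying. It was then stored." into tText
   
   ## dots inside "2.7" and "e.g." become "±"; the sentence-ending dots stay
   put escapeAbbrAndDezimalpoints(tText) into tEscaped
   
   ## ... count the remaining "." and split into paragraphs here ...
   
   ## restore the escaped dots afterwards
   replace "±" with "." in tEscaped
   put tEscaped
end testEscape
```

This only round-trips cleanly if the text does not already contain "±"; any other character that cannot occur in the input would do as a placeholder.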

Kind regards
Bernd

Re: Adventures into matchChunk and regex

Posted: Tue Dec 09, 2025 12:29 pm
by stam
Hi Bernd,

Yes, I swear by Regex101 for all regex testing. The only problem with it is that it defaults to JS regex, which has a 'global' match flag (i.e. all matches are returned), whereas LiveCode uses PCRE regex, which doesn't have this (only the first match is returned).
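
A small sketch of the first-match-only behaviour; the handler name testFirstMatchOnly and the sample string are made up for illustration. Note that the pattern needs a capture group in parentheses so that matchChunk has something to report positions for.

```livecode
on testFirstMatchOnly
   local tStart, tEnd
   ## the sample string contains two decimals, "1.5" and "2.7"
   if matchChunk("a 1.5 and 2.7", "([0-9]\.[0-9])", tStart, tEnd) then
      ## only the first match is reported: chars 3 to 5, i.e. "1.5"
      put tStart & comma & tEnd
   end if
end testFirstMatchOnly
```

To get at "2.7" as well, the search has to be repeated on the remainder of the string.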

As a workaround I use this function, which returns an array with all matches, and the start and end offsets for each match. It only works if you have 1 pattern to match, but that's the usual case anyway.

Code: Select all

function globalMatch pSource, pRegex
    local tMatchesA, tVarCount, tVar1, tVar2, tSource, tTemp, tDeleted
    -- count occurrences
    put replaceText(pSource, pRegex, "regexMatch" & return) into tTemp
    filter lines of tTemp with "*regexMatch"
    put the number of lines of tTemp into tVarCount
    
    -- get occurrences by locating offsets with matchChunk(), loop to store values and delete occurrence in copy of source.
    put pSource into tSource
    -- tDeleted counts the chars already removed from tSource, so that the stored offsets refer to pSource
    put 0 into tDeleted
    repeat with x = 1 to tVarCount
        if matchChunk(tSource, pRegex , tVar1, tVar2) then
            put char tVar1 to tVar2 of tSource into tMatchesA[x]["value"]
            put tVar1 + tDeleted into tMatchesA[x]["start"]
            put tVar2 + tDeleted into tMatchesA[x]["end"]
            delete char tVar1 to tVar2 of tSource
            add tVar2 - tVar1 + 1 to tDeleted
        end if
    end repeat
    return tMatchesA
end globalMatch
Not sure if that adds anything to your solution, but may be helpful for others...
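
A usage sketch for globalMatch, assuming the function above is in the message path; the handler name testGlobalMatch and the sample string are made up for illustration. It lists every match on its own line with its value, start and end, separated by tabs.

```livecode
on testGlobalMatch
   local tMatchesA, tReport
   ## the single capture group is what matchChunk reports positions for
   put globalMatch("a 1.5 and 2.7", "([0-9]\.[0-9])") into tMatchesA
   
   ## the array is keyed 1..n, one entry per match
   repeat with x = 1 to the number of elements of tMatchesA
      put tMatchesA[x]["value"] & tab & tMatchesA[x]["start"] & tab & tMatchesA[x]["end"] & return after tReport
   end repeat
   put tReport
end testGlobalMatch
```

If nothing matches, the returned array is empty and the loop simply does not run.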