The number and variety of rules defining English sentence boundaries are probably knowable, and certainly within the reach of modern computing horsepower to apply.
The question is: who has the expertise in both linguistics and computer science to pull it off in any usefully complete form?
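To see why the rules are nontrivial, here's a minimal sketch of the naive approach failing; the sample text and regex are my own illustration, not anyone's proposed solution:

```python
import re

text = "Dr. Smith arrived at 3 p.m. He left early."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive = re.split(r"(?<=[.!?])\s+", text)

# The abbreviations "Dr." and "p.m." trigger false boundaries, so this
# yields three fragments instead of the two real sentences:
# ['Dr.', 'Smith arrived at 3 p.m.', 'He left early.']
```

Every abbreviation, decimal number, and ellipsis needs its own exception, which is exactly the kind of rule inventory the paragraph above is asking someone to compile.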
This pursuit reminds me of the story of the Porter stemmer: a task many had failed at until Martin Porter came along, and even his stemmer isn't perfect, just good enough for most uses.
Stemming is the process of finding a root word, useful in information retrieval for indexing related terms that may have different spellings.
For example, it's easy to see "run" in "running", but what do you do with "ran"? Different tense, sure, but it's really the same word. How would you index "run" with "ran"?
Lemmatization is the set of cognitive processes native speakers pick up to understand these word relationships, but it's too much to try to put into a machine too stupid to count past 1.
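One blunt mechanical substitute for that native-speaker knowledge is an exception table consulted before any suffix rules; the entries below are a hypothetical fragment for illustration, not part of any standard algorithm:

```python
# Hypothetical exception table: irregular inflections that no
# suffix-stripping rule can recover, mapped straight to their lemma.
IRREGULARS = {"ran": "run", "went": "go", "better": "good", "mice": "mouse"}

def lemma(word: str) -> str:
    """Check the exception table first; otherwise return the word as-is."""
    return IRREGULARS.get(word.lower(), word.lower())

# "ran" and "run" now index under the same key: lemma("ran") == "run"
```

A table like this only covers the forms someone thought to list, which is why full lemmatization is so much harder than stemming.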
Porter looked at the problem from a purely mechanical perspective that focused on the observable permutations of word endings, then set about developing an algorithm to encode the most generalizable transformation rules useful for common indexing tasks.
https://en.wikipedia.org/wiki/Stemming
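A toy sketch of that mechanical approach: a few ordered suffix rules in the spirit of Porter's design. These rules are a simplified illustration of mine, not the real algorithm, which conditions each rule on a "measure" of the remaining stem:

```python
# Toy, Porter-inspired stemmer: try ordered suffix rules and keep the
# first rewrite that leaves a plausible stem. Not the real Porter algorithm.
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word: str) -> str:
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if len(candidate) >= 3:  # crude guard against over-stripping
                return candidate
    return word

# stem("caresses") -> "caress", stem("indexed") -> "index"
# stem("running") -> "runn": the real algorithm also un-doubles the
# final consonant, one of many refinements this sketch omits.
```

The point is the shape of the solution: no dictionary, no grammar, just a small ordered rule list that is wrong sometimes and good enough most of the time.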
We may find that elsewhere in computer science history lies a similar answer for finding sentence boundaries. And as with the Porter stemmer, finding and adopting it will save years of the trial and error it would take to reinvent it.
