richmond62 wrote: ↑Wed Oct 23, 2019 4:06 pm
I am processing texts that consist of sentences ending in a variety of symbols including "." and "!"
so I should like to set the itemDelimiter to more than one symbol . . .
…and "?", and sometimes "…", and likely others I'm not thinking of right now.
Parsing natural language is a very difficult problem, which is why much of the programming world has contributed to and relies on the ICU libraries, which are now part of LC's Unicode implementation.
There have been comments earlier that using LC's built-in sentence chunk type would not be a good fit for parsing sentences, but I missed the explanation as to why. It already handles the full range of punctuation and structural indicators of sentence boundaries based on the collective several decades of experience of contributing programmers who specialize in such work.
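As a rough illustration of the sentence chunk, here is a minimal sketch. The handler name listSentences is made up for this example, and the trimming step is only a convenience I've assumed you might want; how well the boundaries match your texts depends on the ICU rules, so test it against your own data.

```livecode
-- Sketch only: listSentences is a hypothetical helper name.
-- Returns the sentences of pText, one per line, using the built-in
-- sentence chunk instead of a custom itemDelimiter.
function listSentences pText
   local tSentence, tList
   repeat for each sentence tSentence in pText
      -- "word 1 to -1 of" trims leading and trailing whitespace
      put word 1 to -1 of tSentence & return after tList
   end repeat
   -- drop the trailing return added by the last pass of the loop
   delete the last char of tList
   return tList
end listSentences
```

Called as put listSentences("Stop! Who goes there? A friend.") in the message box, I would expect one sentence per line without having to name "!", "?" or "." as delimiters anywhere in the script.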
Similarly, using the newer trueWord chunk type instead of word will avoid the many difficulties that arise from HC's definition of a word, which can include punctuation and other characters not actually part of the word, and treats any string of any length enclosed in double quotes as a single word.
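To see the difference for yourself, something along these lines may help. It is only a sketch: compareWordChunks is a hypothetical handler name, and it does nothing more than list the same text by word and then by trueWord so the two results can be compared side by side.

```livecode
-- Sketch only: compareWordChunks is a hypothetical helper name.
-- Lists pText chunk by chunk, first with the classic word chunk,
-- then with the trueWord chunk, one chunk per line.
function compareWordChunks pText
   local tChunk, tWords, tTrueWords
   repeat for each word tChunk in pText
      put tChunk & return after tWords
   end repeat
   repeat for each trueWord tChunk in pText
      put tChunk & return after tTrueWords
   end repeat
   return "word:" & return & tWords & return & "trueWord:" & return & tTrueWords
end compareWordChunks
```

Try it on a field whose text contains a quoted phrase followed by a comma. The word list should keep the quoted phrase together with its quotes and attached punctuation, while the trueWord list should contain only the words themselves.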
For language analysis tasks, using modern language analysis tooling now built into LC can greatly simplify code, and will likely run much faster as it moves a lot of the heavy lifting of parsing from scripts into optimized machine code.
Related: I saw that you're using this to identify suffixes. Is that correct? Are you making a stemmer? I have an old Porter stemmer (written by Ken Ray and Andrew Meit) lying around here somewhere. It's so old that I wouldn't use it in production today without rewriting it to take advantage of newer language features, but it may be helpful in providing some inspiration for a stemmer if that's what you're working on.