RegEx filter with accented characters?

Zax · Post by **Zax** » Fri Jul 22, 2022 6:39 am

stam, you're right concerning the use of it, after answer or ask. Consider this script as a beta version.
Concerning its use in several functions, there is no problem: it is local to each function.

Thank you Jacqueline for pointing out the normalizeText() function, which I didn't know.

stam · Post by **stam** » Fri Jul 22, 2022 8:30 am

jacque wrote: ↑
Fri Jul 22, 2022 5:02 am
A two stroke keypress is actually how we do it on Mac OS, at least on Roman language keyboards. The option key prints diacriticals. So you type the primary character, like "e", and then option-type a diacritical. The OS combines them into a single accented character in the text.

I believe unicode allows two different representations for accented characters, one way uses a single character with a diacritical attached, and another way represents a plain character followed by a diacritical. But I'm no unicode expert.

Thanks Jacque, but my point was really that this has nothing to do with the tilde key, and that you can’t produce this character by combining the tilde with an ‘n’, though I was willing to be proven wrong. But tilde is not a diacritical.

On MacOS there is a better way to produce accented characters: do a long press of the ‘base’ char and you’ll be presented with a pop up menu with all the accented options, which you pick with the associated numerical key…
Gone are the days when you had to memorise keyboard combos for this.

richmond62 · Post by **richmond62** » Fri Jul 22, 2022 8:53 am

But tilde is not a diacritical.

That statement is like "Scottish is not a language." It depends on which side of the border you are sitting.

I believe unicode allows two different representations for accented characters, one way uses a single character with a diacritical attached, and another way represents a plain character followed by a diacritical.

Thanks Jacque for stating something I had already explained.

On MacOS there is a better way

Ah, subjective statements . . .

The problem, surely, is NOT whether inputting a character that carries a combining mark with a unitary keystroke is better (?)
than inputting it with a combination of keystrokes that take various characters from the Unicode lexicon and combine them [this is why
Unicode defines many components as combining characters], but what chummy, our end-user, does; because it is what our end-user does
that we have to anticipate and build stuff into our system to cope with.

This jollification with combining characters is a current headache of mine:

stam · Post by **stam** » Fri Jul 22, 2022 10:15 am

richmond62 wrote: ↑
Fri Jul 22, 2022 8:53 am
That statement is like "Scottish is not a language." It depends on which side of the border you are sitting.

Errm... If you want to nitpick, the officially recognised language is Scots, not Scottish

This is close enough to English as to be considered a dialect, but after years of moaning about it, the UK now recognises it as a regional language, even though in reality it is an umbrella term for many 'languages' and 'dialects'.

But that falls in the area of many dubious practices in Scotland, such as 3 different banks printing their own 3 different versions of UK money, which are all visually different from the Bank of England design which is used everywhere (including Scotland). This used to be a real pain in the @ss - i worked in Dundee 25 years ago and would travel south - 90% of vendors would not accept these home-brewed Scottish bank notes as legal tender and often i had to exchange them at a bank for 'real' money. And the less said about the role of the Royal Bank of Scotland in the financial meltdown of 2008, the better.

You may well wonder what this has to do with the current discussion, and you'd be right to.

The point of this discussion is not how you create accented characters, but if this is an issue for regex.
I'll re-state the question i posed earlier: if you feed a string to regex either using the accented char or a 'unicode' combination, will regex treat this differently?
I don't think so because a) regex doesn't read hex and b) regex doesn't read unicode codes, just text strings. However, i'm willing to be proven wrong, but I'd bet good money that this is not something you have tested and found to be a problem

richmond62 wrote: ↑
Fri Jul 22, 2022 8:53 am
Ah, subjective statements . . .

My comment is based on the fact that you no longer have to figure out or remember keyboard combos, which are different on different keyboard layouts and languages. The long-press has been there for a while and allows a more uniform and easy access regardless of keyboard setup. You may well be ultra-familiar with the keyboard combos to produce the desired characters on your keyboard, but what you do does not necessarily apply to anyone else. 'Better' means that everyone can access the same functionality with ease. Although not sure why i'd need to clarify that...

richmond62 wrote: ↑
Fri Jul 22, 2022 8:53 am
The problem, surely, is NOT whether inputting a character that carries a combining mark with a unitary keystroke is better (?)
than inputting it with a combination of keystrokes that take various characters from the Unicode lexicon and combine them [this is why
Unicode defines many components as combining characters], but what chummy, our end-user, does; because it is what our end-user does
that we have to anticipate and build stuff into our system to cope with.

This is exactly my point. You can't produce a 'combination' version with the keyboard. The keyboard combos are just a trigger to the OS to produce the accented character. It may be possible to create the combo unicode in code, but a) does this produce two different distinct unicode codes? and b) i just can't see how this would be an issue for regex - where the developer will either be using text in handler or user-entered in a field.

And even if you could produce this somehow in the keyboard (i.e. visually the same accented character but with different unicode codes), would this be an issue for regex which doesn't read unicode codes?

Edit:
=============
Reading more about this, i'd also point out that tilde (ascii 126) is not the same as the unicode version called 'combining tilde' (U+0303) - the latter is designed specifically to be combined with other characters and its placeholder is above the char, not in line with it like ascii 126.

The characters produced with a 'combining tilde' have their own discrete unicode codes, for example Ñ produced with a combining tilde has the unicode code of U+00D1. I can't see anywhere that Ñ can be represented with 2 different unicode codes although if that is the case please do point to a source for further reading.

LCMark · Post by **LCMark** » Fri Jul 22, 2022 12:42 pm

The point of this discussion is not how you create accented characters, but if this is an issue for regex.
I'll re-state the question i posed earlier: if you feed a string to regex either using the accented char or a 'unicode' combination, will regex treat this differently?

In an ideal world, no it wouldn't - in the real world things are a great deal more complicated - especially if you start looking at writing systems which aren't Roman-like (i.e. glyphs for a small number of letters, with some letters being base+diacritical).

I don't think so because a) regex doesn't read hex and b) regex doesn't read unicode codes, just text strings. However, i'm willing to be proven wrong, but I'd bet good money that this is not something you have tested and found to be a problem

The regex we use in LC (PCRE) deals with codepoints - these are the 'codes' in the Unicode character table which are used to build up characters, which then build up into strings.

The difference between codepoints and characters is precisely the topic that has given rise to this question - the different ways you can represent a single character - e.g. n-tilde can be the single codepoint n-tilde, or two codepoints n, then combining-tilde.
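
To make the codepoint/character distinction concrete, here is a minimal sketch. It is written in Python rather than LiveCode, purely so it is easy to run anywhere; the assumption is that Python's `re` module, like PCRE, compares codepoints, and the variable names are only illustrative.

```python
import re

composed = "\u00F1"        # ñ as the single codepoint U+00F1
decomposed = "n\u0303"     # n followed by U+0303 COMBINING TILDE

# Both render as the same character, but the codepoint sequences differ.
same_string = composed == decomposed
lengths = (len(composed), len(decomposed))

# A pattern written with the composed form does not match decomposed text.
hit_composed = re.search(composed, "a\u00F1o") is not None
hit_decomposed = re.search(composed, "an\u0303o") is not None

print(same_string, lengths, hit_composed, hit_decomposed)
```

So the two spellings look identical on screen but are not interchangeable to a codepoint-based regex engine unless the text is normalized first.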

My comment is based on the fact that you no longer have to figure out or remember keyboard combos, which are different on different keyboard layouts and languages. The long-press has been there for a while and allows a more uniform and easy access regardless of keyboard setup. You may well be ultra-familiar with the keyboard combos to produce the desired characters on your keyboard, but what you do does not necessarily apply to anyone else. 'Better' means that everyone can access the same functionality with ease. Although not sure why i'd need to clarify that...

It's probably best not to make assumptions about how different people around the world enter their text - nor indeed, how the text they need to manipulate was constructed, or where it came from. Nor indeed what language they are typing in.

Sure - if you are typing English, then the rare need for accented characters is probably well served by long-press on recent Apple IMEs; however, if you are typing in a language which uses such things (or indeed different things) frequently then it would probably become incredibly tiresome quickly. (e.g. Vietnamese vowels typically have multiple tone marks on each one...).

This is exactly my point. You can't produce a 'combination' version with the keyboard. The keyboard combos are just a trigger to the OS to produce the accented character. It may be possible to create the combo unicode in code, but a) does this produce two different distinct unicode codes? and b) i just can't see how this would be an issue for regex - where the developer will either be using text in handler or user-entered in a field.

Again, best not to assume what a particular 'keyboard' on a particular platform might produce - or indeed what the OS does with the text after you've typed it and it's gone and been stored in its particular context.

For example, macOS filenames will always use decomposed characters however they are entered - if you change the name of a file in Finder to `foé`, then copy the filename again - it will always be `f,o,e,combining-acute` and not `f,o,e-acute` (i.e. it will be four codepoints and not three).
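
The three-versus-four codepoint count can be reproduced with a short Python sketch (not LiveCode). It uses the standard NFD normalization form as a stand-in for whatever decomposition the file system applies; that equivalence is an assumption made for illustration, not a statement about Finder's internals.

```python
import unicodedata

name = "fo\u00E9"                                # "foé" with a composed é (U+00E9)
decomposed = unicodedata.normalize("NFD", name)  # canonical decomposition

# List the codepoints of the decomposed form in U+XXXX notation.
codepoints = [f"U+{ord(ch):04X}" for ch in decomposed]

print(len(name), len(decomposed), codepoints)
```

The last two entries in the list are the plain `e` and the combining acute accent, i.e. the `e,combining-acute` pair described above.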

The characters produced with a 'combining tilde' have their own discrete unicode codes, for example Ñ produced with a combining tilde has the unicode code of U+00D1. I can't see anywhere that Ñ can be represented with 2 different unicode codes although if that is the case please do point to a source for further reading.

Spanish has n-tilde - the tilde is a diacritic - it's called a virgulilla (in Spanish). It is equally valid to represent it as n-tilde, or n+combining-tilde (in Unicode).
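
For anyone who wants to check this against the Unicode character database, the following Python sketch (again not LiveCode; the variable names are illustrative) looks up both tildes and the canonical decomposition of Ñ.

```python
import unicodedata

ascii_tilde = "~"            # U+007E, an ordinary spacing character
combining_tilde = "\u0303"   # U+0303, attaches to the preceding base character

names = (unicodedata.name(ascii_tilde), unicodedata.name(combining_tilde))

# Canonical combining class: 0 for ordinary characters, non-zero for combining marks.
classes = (unicodedata.combining(ascii_tilde), unicodedata.combining(combining_tilde))

# Canonical decomposition of U+00D1 (Ñ) as recorded in the Unicode database.
decomposition = unicodedata.decomposition("\u00D1")

# NFC recomposes N + combining tilde into the single codepoint U+00D1.
recomposed = unicodedata.normalize("NFC", "N\u0303") == "\u00D1"

print(names, classes, decomposition, recomposed)
```

The decomposition field is where the database records that U+00D1 and the two-codepoint sequence are canonically equivalent.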

As an aside, the 'composed characters' which exist in Unicode today are the only ones which will ever exist in the standard (for the languages it already encodes) - there will never be any more (if there were, it would change the normalization rules in a backwards-incompatible way). Any other characters (for any script) which are added which can be seen (by whatever metric the Unicode consortium uses) as 'base-character + overlaid marks' - will only be representable in a decomposed form. All combined forms which exist, previously existed in some 'legacy' language encoding and are there to ensure 100% round-tripping is possible (with a 1-1 mapping). You can see this rule has been applied already in Unicode - there are (polytonic?) Greek characters missing some composed forms.

So back to regex:

It should be noted that on the whole this detail of *how* a character is represented in LiveCode is mostly hidden from you - it's the domain of the 'formSensitive' property - the only place where the difference can become apparent (assuming you haven't fettled with that property) is the regex functionality (and array key access when caseSensitive is true - array keys are either exact match by codepoint, or case-folded match by character).

We did try numerous things to make matchText/Chunk and friends play well with caseSensitive and formSensitive - however, it ended up causing issues with some things - so the regex functions we have are a minimal wrapper around PCRE - you need to explicitly specify whether you want case-insensitivity (using (?i) at the front); and have to normalize unicode text first (if you are dealing with Unicode text - which you quite probably are not if you've just got Western European languages to deal with):

Code: Select all

   put normalizeText(tPattern, "nfc") into tNormPattern
   put normalizeText(tText, "nfc") into tNormText
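
The same normalize-then-match idea can be sketched outside LiveCode. The Python below is only an illustration: `search_normalized` is a hypothetical helper, not an existing library function, and it is not how the LiveCode engine works internally.

```python
import re
import unicodedata

def search_normalized(pattern, text, form="NFC"):
    """Normalize pattern and text to the same form, then run the regex."""
    norm_pattern = unicodedata.normalize(form, pattern)
    norm_text = unicodedata.normalize(form, text)
    return re.search(norm_pattern, norm_text)

text = "Espan\u0303a"          # decomposed: n + combining tilde
pattern = "(?i)ESPA\u00D1A"    # composed Ñ, case-insensitive via (?i)

raw_match = re.search(pattern, text)           # forms differ, so no match
norm_match = search_normalized(pattern, text)  # both in NFC, so it matches

print(raw_match is not None, norm_match is not None)
```

Note that any offsets or captured substrings then refer to the normalized copy rather than to the original text.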

I'm sure we'll return to these functions at some point to try and figure out how to make them a little more friendly and consistent with the rest of LC's string handling - but given that issues with (or uses of!) matching patterns containing characters against strings containing characters which have variant forms in Unicode have come up about two or three times in the last 8+ years, you can appreciate that it isn't exactly a huge priority.

EDIT: I should point out the above is only relevant when using matchText/matchChunk/replaceText - if using 'filter with regex pattern' - then the engine does 'the right thing' - it handles formSensitive and caseSensitive for you - the reason *it* can do that is because it is only asserting the existence of a match on each line (or whatever chunk you requested) and thus what it has to do internally to make things work does not affect what script sees. The regex functions are lower-level - in particular, they (can) return and reference substrings, or change the input text, and as such if the engine itself normalized (and indeed casefolded) the text it would not be consistent with how it operates on non-Unicode text - thus it is best left up to the scripter to control (for the time being, at least).

richmond62 · Post by **richmond62** » Fri Jul 22, 2022 12:58 pm

the 'composed characters' which exist in Unicode today are the only ones which will ever exist in the standard (for the languages it already encodes) - there will never be any more

Um . . . well, as you will see above, I posted a screenshot of some of the Glagolitic combining characters for
the Unicode 15 proposal due for release in September of this year.

There may be more, BUT, once they are rolled into the Unicode standard they will NOT move so can be relied on for ever
(and let's NOT discuss what 'for ever' means because that is another whole can of worms).

kaveh1000 · Post by **kaveh1000** » Fri Jul 22, 2022 1:00 pm

As we are talking regex, can I remind LiveCode that a big element missing in LiveCode is the replace function of regex, i.e.

Search: (.)(.)
Replace: \2\1

to reverse two characters. The replace part is missing. We would love to have it in LiveCode.
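
For readers unfamiliar with the idea, this is what such a backreference replace looks like in Python's `re` module. It is shown only to illustrate the requested behaviour and is not LiveCode syntax; `swap_pairs` is a made-up name.

```python
import re

def swap_pairs(text):
    # (.)(.) captures two adjacent characters; \2\1 writes them back reversed.
    return re.sub(r"(.)(.)", r"\2\1", text)

print(swap_pairs("abcd"))   # badc
```

A trailing unpaired character is left as it is, since the pattern needs two characters to match.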

richmond62 · Post by **richmond62** » Fri Jul 22, 2022 1:01 pm

I can't see anywhere that Ñ can be represented with 2 different unicode codes although if that is the case please do point to a source for further reading.

Well, if you want to stick your neck out like that, here's 'Uncle Richmond' sharpening the blade . . .

As I wrote earlier, there are 2 ways of producing a tilded n, and the attached stack will illustrate this:

I am sorry, I made a mistake in my earlier posting, as a combining tilde 'makes its nest' at point 0x0303 in the Unicode standard.

LCMark · Post by **LCMark** » Fri Jul 22, 2022 1:19 pm

richmond62 wrote: ↑
Fri Jul 22, 2022 12:58 pm

the 'composed characters' which exist in Unicode today are the only ones which will ever exist in the standard (for the languages it already encodes) - there will never be any more
Um . . . well, as you will see above, I posted a screenshot of some of the Glagolitic combining characters for
the Unicode 15 proposal due for release in September of this year.

I said 'composed' characters (i.e. those which *also* have a representation as base+combiners) - there's nothing in the proposal docs for those Glagolitic characters I can see which suggests that sequences of them would 'normalize' to an existing codepoint.

IIRC that rule was introduced in Unicode 3.x - as part of ensuring that the normalization processes for Unicode were frozen forever. At that time they ensured that all characters in Unicode at the time which could be thought of as base+combiner, could be decomposed into such (by adding combining marks and normalization mappings as needed).

richmond62 · Post by **richmond62** » Fri Jul 22, 2022 1:21 pm

You are, as always, Mark, entirely correct.

richmond62 · Post by **richmond62** » Fri Jul 22, 2022 1:24 pm

If you want to nitpick, the officially recognised language is Scots

Aye, well, we always need non-Scots to teach us how to suck eggs.

We can call our language whatever we choose, and whether it is "officially recognised" by the English and their
colonial attitudes is neither here nor there.

Zax · Post by **Zax** » Fri Jul 22, 2022 2:06 pm

LCMark wrote: ↑
Fri Jul 22, 2022 12:42 pm
Code: Select all
   put normalizeText(tPattern, "nfc") into tNormPattern
   put normalizeText(tText, "nfc") into tNormText
[...]
I should point out the above is only relevant when using matchText/matchChunk/replaceText - if using 'filter with regex pattern' - then the engine does 'the right thing' - it handles formSensitive and caseSensitive for you.

Thank you very much for these accurate explanations.
