RegEx filter with accented characters?
Re: RegEx filter with accented characters?
stam, you're right concerning the use of it, after answer or ask. Consider this script as a beta version.
Concerning its use in several functions, there is no problem: it is local to each function.
Thank you, Jacqueline, for pointing out the normalizeText() function, which I didn't know about.
Re: RegEx filter with accented characters?
jacque wrote: ↑Fri Jul 22, 2022 5:02 am
A two stroke keypress is actually how we do it on Mac OS, at least on Roman language keyboards. The option key prints diacriticals. So you type the primary character, like "e", and then option-type a diacritical. The OS combines them into a single accented character in the text.
I believe unicode allows two different representations for accented characters, one way uses a single character with a diacritical attached, and another way represents a plain character followed by a diacritical. But I'm no unicode expert.

Thanks Jacque, but my point was really that this has nothing to do with the tilde key and you can’t produce this character by combining the tilde with an ‘n’, though I was willing to be proven wrong. But tilde is not a diacritical.
On macOS there is a better way to produce accented characters: do a long press of the ‘base’ char and you’ll be presented with a pop up menu with all the accented options which you pick with the associated numerical key…
Gone are the days when you had to memorise keyboard combos for this

Re: RegEx filter with accented characters?
Quote:
But tilde is not a diacritical.

That statement is like "Scottish is not a language." It depends on which side of the border you are sitting.

Quote:
I believe unicode allows two different representations for accented characters, one way uses a single character with a diacritical attached, and another way represents a plain character followed by a diacritical.

Thanks Jacque for stating something I had already explained.

Quote:
On MacOS there is a better way

Ah, subjective statements . . .
The problem, surely, is NOT whether some way of inputting some character with a combining character is input with a unitary keystroke
or with a combination of keystrokes taking various characters from the Unicode lexicon and combining them [this is why Unicode
defines many components as combining characters] is better (?), but what chummy, our end-user does; because it is what our end-user does
that we have to anticipate and build stuff into our system to cope with.
This jollification with combining characters is a current headache of mine:
[Screenshot not preserved.]
Re: RegEx filter with accented characters?
richmond62 wrote: ↑Fri Jul 22, 2022 8:53 am
That statement is like "Scottish is not a language." It depends on which side of the border you are sitting.

Errm... If you want to nitpick, the officially recognised language is Scots, not Scottish

This is close enough to English as to be considered a dialect but after years of moaning about it, the UK now recognises it as a regional language even though in reality it is an umbrella term for many 'languages' and 'dialects'.

But that falls in the area of many dubious practices in Scotland, such as 3 different banks each printing their own version of UK money, which are all visually different from the Bank of England design which is used everywhere (including Scotland). This used to be a real pain in the @ss - i worked in Dundee 25 years ago and would travel south - 90% of vendors would not accept these home-brewed Scottish bank notes as legal tender and often i had to exchange them at a bank for 'real' money. And the less said about the role of the Royal Bank of Scotland in the financial meltdown of 2008, the better.
You may be correct in wondering what this has to do with the current discussion and you'd be right.
The point of this discussion is not how you create accented characters, but if this is an issue for regex.
I'll re-state the question i posed earlier: if you feed a string to regex either using the accented char or a 'unicode' combination, will regex treat this differently?
I don't think so because a) regex doesn't read hex and b) regex doesn't read unicode codes, just text strings. However, i'm willing to be proven wrong, but I'd bet good money that this is not something you have tested and found to be a problem

My comment is based on the fact that you no longer have to figure out or remember keyboard combos, which are different on different keyboard layouts and languages. The long-press has been there for a while and allows a more uniform and easy access regardless of keyboard setup. You may well be ultra-familiar with the keyboard combos to produce the desired characters on your keyboard, but what you do does not necessarily apply to anyone else. 'Better' means that everyone can access the same functionality with ease. Although not sure why i'd need to clarify that...
richmond62 wrote: ↑Fri Jul 22, 2022 8:53 am
The problem, surely, is NOT whether some way of inputting some character with a combining character is input with a unitary keystroke
or with a combination of keystrokes taking various characters from the Unicode lexicon and combining them [this is why Unicode
defines many components as combining characters] is better (?), but what chummy, our end-user does; because it is what our end-user does
that we have to anticipate and build stuff into our system to cope with.

This is exactly my point. You can't produce a 'combination' version with the keyboard. The keyboard combos are just a trigger to the OS to produce the accented character. It may be possible to create the combo unicode in code, but a) does this produce two different distinct unicode codes? and b) i just can't see how this would be an issue for regex - where the developer will either be using text in handler or user-entered in a field.
And even if you could produce this somehow on the keyboard (i.e. visually the same accented character but with different unicode codes), would this be an issue for regex which doesn't read unicode codes?
Edit:
=============
Reading more about this, i'd also point out that tilde (ASCII 126) is not the same as the unicode version called 'combining tilde' (U+0303) - the latter is designed specifically to be combined with other characters and its placeholder is above the char, not in line with it like ASCII 126.
The characters produced with a 'combining tilde' have their own discrete unicode codes, for example Ñ produced with a combining tilde has the unicode code of U+00D1. I can't see anywhere that Ñ can be represented with 2 different unicode codes although if that is the case please do point to a source for further reading.
Re: RegEx filter with accented characters?
Quote:
The point of this discussion is not how you create accented characters, but if this is an issue for regex.
I'll re-state the question i posed earlier: if you feed a string to regex either using the accented char or a 'unicode' combination, will regex treat this differently?

In an ideal world, no it wouldn't - in the real world things are a great deal more complicated - especially if you start looking at writing systems which aren't Roman-like (i.e. glyphs for a small number of letters, with some letters being base+diacritical).
Quote:
I don't think so because a) regex doesn't read hex and b) regex doesn't read unicode codes, just text strings. However, i'm willing to be proven wrong, but I'd bet good money that this is not something you have tested and found to be a problem

The regex we use in LC (PCRE) deals with codepoints - these are the 'codes' in the Unicode character table which are used to build up characters, which then build up into strings.
The difference between codepoints and characters is precisely the topic that has given rise to this question - the different ways you can represent a single character - e.g. n-tilde can be the single codepoint n-tilde, or two codepoints n, then combining-tilde.
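To make the codepoint/character distinction concrete, here is a minimal LiveCode sketch. It is an illustration only: the handler name showTildeForms is made up for this example, and it assumes LiveCode 7 or later, where numToCodepoint(), normalizeText() and the codepoint chunk are available.

```livecode
on showTildeForms
   local tComposed, tDecomposed, tNormalized, tReport

   -- U+00F1 (decimal 241): n-tilde as a single precomposed codepoint
   put numToCodepoint(241) into tComposed

   -- "n" followed by U+0303 (decimal 771), the COMBINING TILDE
   put "n" & numToCodepoint(771) into tDecomposed

   -- Fold the decomposed form into its composed (NFC) form
   put normalizeText(tDecomposed, "NFC") into tNormalized

   -- Report how many codepoints each string is built from
   put "composed:" && (the number of codepoints in tComposed) & return into tReport
   put "decomposed:" && (the number of codepoints in tDecomposed) & return after tReport
   put "decomposed after NFC:" && (the number of codepoints in tNormalized) after tReport

   -- With no destination, put writes to the message box
   put tReport
end showTildeForms
```

Both strings display as the same character in a field; the difference only shows up when you count or compare codepoints.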
Quote:
My comment is based on the fact that you no longer have to figure out or remember keyboard combos, which are different on different keyboard layouts and languages. The long-press has been there for a while and allows a more uniform and easy access regardless of keyboard setup. You may well be ultra-familiar with the keyboard combos to produce the desired characters on your keyboard, but what you do does not necessarily apply to anyone else. 'Better' means that everyone can access the same functionality with ease. Although not sure why i'd need to clarify that...

It's probably best not to make assumptions about how different people around the world enter their text - nor indeed, how the text they need to manipulate was constructed, or where it came from. Nor indeed what language they are typing in.

Sure - if you are typing English, then the rare need for accented characters is probably well served by long-press on recent Apple IMEs, however, if you are typing in a language which uses such things (or indeed different things) frequently then it would probably become incredibly tiresome quickly. (e.g. Vietnamese vowels typically have multiple tone marks on each one...).
Quote:
This is exactly my point. You can't produce a 'combination' version with the keyboard. The keyboard combos are just a trigger to the OS to produce the accented character. It may be possible to create the combo unicode in code, but a) does this produce two different distinct unicode codes? and b) i just can't see how this would be an issue for regex - where the developer will either be using text in handler or user-entered in a field.

Again best not to assume what a particular 'keyboard' on a particular platform might produce - or indeed what the OS does with the text after you've typed it and it's gone and been stored in its particular context.
For example, macOS filenames will always use decomposed characters however they are entered - if you change the name of a file in Finder to `foé`, then copy the filename again - it will always be `f,o,e,combining-acute` and not `f,o,e-acute` (i.e. it will be four codepoints and not three).
Quote:
The characters produced with a 'combining tilde' have their own discrete unicode codes, for example Ñ produced with a combining tilde has the unicode code of U+00D1. I can't see anywhere that Ñ can be represented with 2 different unicode codes although if that is the case please do point to a source for further reading.

Spanish has n-tilde - the tilde is a diacritic - it's called a virgulilla (in Spanish). It is equally valid to represent it as n-tilde, or n+combining-tilde (in Unicode).
As an aside, the 'composed characters' which exist in Unicode today are the only ones which will ever exist in the standard (for the languages it already encodes) - there will never be any more (if there were, it would change the normalization rules in a backwards-incompatible way). Any other characters (for any script) which are added which can be seen (by whatever metric the Unicode consortium uses) as 'base-character + overlaid marks' - will only be representable in a decomposed form. All combined forms which exist, previously existed in some 'legacy' language encoding and are there to ensure 100% round-tripping is possible (with a 1-1 mapping). You can see this rule has been applied already in Unicode - there are (polytonic?) Greek characters missing some composed forms.
So back to regex:
It should be noted that on the whole this detail of *how* a character is represented in LiveCode is mostly hidden from you - it's the domain of the 'formSensitive' property - the only place where the difference can become apparent (assuming you haven't fettled with that property) is the regex functionality (and array key access when caseSensitive is true - array keys are either exact match by codepoint, or case-folded match by character).
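A rough way to see the effect of formSensitive on ordinary comparisons is sketched below. The handler name compareTildeForms is hypothetical, and the handler makes no claim about what your engine version prints; going by the description above, the two forms should compare as equal while formSensitive is false and as different once it is true.

```livecode
on compareTildeForms
   local tComposed, tDecomposed, tReport

   -- Same visible character, two different codepoint sequences
   put numToCodepoint(241) into tComposed          -- U+00F1
   put "n" & numToCodepoint(771) into tDecomposed  -- n + U+0303

   -- formSensitive is a local property: it resets when the handler ends
   set the formSensitive to false
   put "formSensitive false:" && (tComposed is tDecomposed) & return into tReport

   set the formSensitive to true
   put "formSensitive true:" && (tComposed is tDecomposed) after tReport

   put tReport
end compareTildeForms
```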
We did try numerous things to make matchText/matchChunk and friends play well with caseSensitive and formSensitive - however, it ended up causing issues with some things - so the regex functions we have are a minimal wrapper around PCRE - you need to explicitly specify whether you want case-insensitivity (using (?i) at the front), and you have to normalize unicode text first (if you are dealing with Unicode text - which you quite probably are not if you've just got Western European languages to deal with):
Code:
put normalizeText(tPattern, "nfc") into tNormPattern
put normalizeText(tText, "nfc") into tNormText
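
For anyone who wants the two lines above wrapped up for reuse, here is one possible sketch. The function name matchTextNormalized and its pIgnoreCase parameter are invented for this example and are not part of LiveCode; only normalizeText(), matchText() and the PCRE (?i) prefix come from the post.

```livecode
function matchTextNormalized pText, pPattern, pIgnoreCase
   local tNormText, tNormPattern

   -- Bring both the text and the pattern to the same normal form (NFC),
   -- so composed and decomposed spellings use the same codepoints
   put normalizeText(pText, "NFC") into tNormText
   put normalizeText(pPattern, "NFC") into tNormPattern

   -- matchText is a thin wrapper around PCRE, so case-insensitivity
   -- has to be requested inside the pattern itself
   if pIgnoreCase is true then
      put "(?i)" & tNormPattern into tNormPattern
   end if

   return matchText(tNormText, tNormPattern)
end matchTextNormalized

on testMatchTextNormalized
   local tDecomposedWord, tComposedPattern

   -- "mañana" spelled with n + combining tilde (U+0303)
   put "man" & numToCodepoint(771) & "ana" into tDecomposedWord

   -- the same word spelled with the precomposed U+00F1
   put "ma" & numToCodepoint(241) & "ana" into tComposedPattern

   put matchTextNormalized(tDecomposedWord, tComposedPattern, false)
end testMatchTextNormalized
```

The wrapper only returns whether there was a match. If you need captured substrings, remember they come from the normalized copy of the text, not from the original.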

EDIT: I should point out the above is only relevant when using matchText/matchChunk/replaceText - if using 'filter with regex pattern' - then the engine does 'the right thing' - it handles formSensitive and caseSensitive for you - the reason *it* can do that is because it is only asserting the existence of a match on each line (or whatever chunk you requested) and thus what it has to do internally to make things work does not affect what script sees. The regex functions are lower-level - in particular, they (can) return and reference substrings, or change the input text and as such if the engine normalized (and indeed casefolded the text itself) it would not be consistent with how it operates on non-Unicode text thus is best left up to the scripter to control (for the time being, at least).
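A small sketch of the 'filter with regex pattern' route, for comparison. The handler name filterTildeLines and the sample words are made up; the comments describe what should happen if filter honours formSensitive as explained above, which is worth checking on your own engine version.

```livecode
on filterTildeLines
   local tComposed, tDecomposed, tLines

   -- "señor" in composed and decomposed spellings
   put "se" & numToCodepoint(241) & "or" into tComposed
   put "sen" & numToCodepoint(771) & "or" into tDecomposed

   -- Three candidate lines: two accented spellings and one plain word
   put tComposed & return & tDecomposed & return & "senior" into tLines

   -- The pattern is written with the composed spelling only.
   -- formSensitive is false by default, so filter is expected to keep
   -- both accented lines and drop "senior".
   filter lines of tLines with regex pattern ("^" & tComposed & "$")

   put the number of lines in tLines
end filterTildeLines
```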
Last edited by LCMark on Fri Jul 22, 2022 12:59 pm, edited 1 time in total.
Re: RegEx filter with accented characters?
Quote:
the 'composed characters' which exist in Unicode today are the only ones which will ever exist in the standard (for the languages it already encodes) - there will never be any more

Um . . . well, as you will see above, I posted a screenshot of some of the Glagolitic combining characters for
the Unicode 15 proposal due for release in September of this year.
There may be more, BUT, once they are rolled into the Unicode standard they will NOT move so can be relied on for ever
(and let's NOT discuss what 'for ever' means because that is another whole can of worms).
Last edited by richmond62 on Fri Jul 22, 2022 1:14 pm, edited 1 time in total.
Re: RegEx filter with accented characters?
As we are talking regex, can I remind LiveCode that a big element missing in LiveCode is the replace function of regex, i.e.
Search: (.)(.)
Replace: \2\1
to reverse two characters. The replace part is missing. We would love to have it in LiveCode.
Kaveh
Re: RegEx filter with accented characters?
Quote:
I can't see anywhere that Ñ can be represented with 2 different unicode codes although if that is the case please do point to a source for further reading.

Well, if you want to stick your neck out like that, here's 'Uncle Richmond' sharpening the blade . . .

As I wrote earlier, there are 2 ways of producing a tilded n, and the attached stack will illustrate this:
I am sorry, I made a mistake in my earlier posting, as a combining tilde 'makes its nest' at point 0x0303 in the Unicode standard.
Attachments:
- nyah-nyah-nyah.livecode.zip (Stack, 38.61 KiB)
Re: RegEx filter with accented characters?
richmond62 wrote: ↑Fri Jul 22, 2022 12:58 pm
[Quote: the 'composed characters' which exist in Unicode today are the only ones which will ever exist in the standard (for the languages it already encodes) - there will never be any more]
Um . . . well, as you will see above, I posted a screenshot of some of the Glagolitic combining characters for the Unicode 15 proposal due for release in September of this year.

I said 'composed' characters (i.e. those which *also* have a representation as base+combiners) - there's nothing in the proposal docs for those Glagolitic characters I can see which suggests that sequences of them would 'normalize' to an existing codepoint.
IIRC that rule was introduced in Unicode 3.x - as part of ensuring that the normalization processes for Unicode were frozen forever. At that time they ensured that all characters in Unicode at the time which could be thought of as base+combiner, could be decomposed into such (by adding combining marks and normalization mappings as needed).
Re: RegEx filter with accented characters?
You are, as always, Mark, entirely correct. 

Re: RegEx filter with accented characters?
Quote:
If you want to nitpick, the officially recognised language is Scots

Aye, well, we always need non-Scots to teach us how to suck eggs.
We can call our language whatever we choose, and whether it is "officially recognised" by the English and their
colonial attitudes is neither here nor there.
Re: RegEx filter with accented characters?
LCMark wrote: ↑Fri Jul 22, 2022 12:42 pm
[...]
Code:
put normalizeText(tPattern, "nfc") into tNormPattern
put normalizeText(tText, "nfc") into tNormText
I should point out the above is only relevant when using matchText/matchChunk/replaceText - if using 'filter with regex pattern' - then the engine does 'the right thing' - it handles formSensitive and caseSensitive for you.

Thank you very much for these accurate explanations.