RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 8:30 am
by Zax
Hello,
Consider this list of French months:
Janvier 520
Février 15
Mars 48
Avril 110
Août 63
Sometimes, if the writer is lazy, the accented characters are omitted, even though that is a spelling error:
Janvier 520
Fevrier 15
Mars 48
Avril 110
Aout 63
I would like to filter the lines for a given month, with or without accented characters. So if, for example, the given filter is "Fevrier", I would like to match both "Fevrier" and "Février".
I guess this can be done with a RegEx, and then use:
Code:
filter lines of myMonthesList with regex pattern regexPattern
Does anyone know how to write such a regexPattern?
Thank you.
Re: RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 9:05 am
by stam
I don't think regex has a built-in option that treats accented characters the same as their 'base' character.
I think you would probably just have to spell out all the variants in your pattern.
https://regex101.com is a good testing ground for regex patterns.
Re: RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 9:21 am
by stam
Thinking about this more, if you want to search for both Fevrier and Février (or even Fèvrier or Fêvrier or whatever), you can just substitute the accented character with a '.', which will match any character. E.g. F.vrier
will match all of the above. Looking online, there are more complex patterns if you want to be more specific, but I'm not sure whether they are supported by LC, and this should suffice for your use case, I think...
S.
--------------------------------------------------
EDIT: if you want to make the match a bit wider (e.g. case-insensitive), then use modifiers, for example (?i)F.vrier
The modifier i makes the match case-insensitive (regex is case-sensitive by default), so that F.vrier will also match fevrier.
so this should work in LC (untested):
filter lines of myMonthesList with regex pattern "(?i)f.vrier"
Other modifiers I frequently use (they can all be combined, e.g. (?xis)):
m - multiline (^ matches at the start of each line, $ at the end of each line)
x - ignore white space in the pattern
s - single line: treats everything as one line, so that the dot matches newline as well
--------------------------------------------------
EDIT 2
Just to be on the safe side, I tested it with some simple code in a button:
Code:
on mouseUp
   local tLines, myMonthesList
   -- build the sample list, one month per line
   put "Janvier 520" & return into myMonthesList
   put "Février 15" & return after myMonthesList
   put "Mars 48" & return after myMonthesList
   put "Avril 110" & return after myMonthesList
   put "Août 63" after myMonthesList
   -- keep only the lines that match the pattern; the result goes into tLines
   filter lines of myMonthesList with regex pattern "(?i)f.vrier" into tLines
end mouseUp
tLines now contains "Février 15"
Re: RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 10:16 am
by Zax
Thank you stam for your answers.
stam wrote: ↑Thu Jul 21, 2022 9:21 am
will match all of the above. Looking online there are more complex patterns if you want to be more specific, but not sure if supported by LC and this should suffice for your use-case i think...
Sorry if my question wasn't very accurate. The months list was just an example. The "real" list can contain anything, including weird things like "FFvrier", and I don't want "FFvrier" to be returned in the filtered list.
I'm actually trying to build a regex on the fly with things like (e|é|è|ê|ë) but it still doesn't work. I will post the code when it is done.
Re: RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 10:44 am
by stam
If you want to match a list of alternatives for one character, you should separate them, as you say, with a bar | ('or' in regex-speak), but inside a non-capturing group of the form (?: ). If you leave out the ?: and use just the parentheses, it will be treated as a capturing group. It took me a little while to figure out non-capturing groups.
So you can change the filter statement above to:
Code:
filter lines of myMonthesList with regex pattern "(?i)f(?:e|è|é|ë|ê)vrier" into tLines
Tested and works with all the alternatives in the pattern
HTH
Stam
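A side note on the pattern above: since every alternative is a single character, a character class should do the same job more compactly. The sketch below is untested and only illustrative; the sample lines are made up from the lists in this thread, and the ^ anchor is an added assumption that the word to match sits at the start of the line.

```livecode
on mouseUp
   local tLines, myMonthesList
   -- small sample list, including the "FFvrier" line that must not be returned
   put "Janvier 520" & return & "Février 15" & return & "Fevrier 15" & return & "FFvrier 15" into myMonthesList
   -- [eèéëê] matches exactly one of the listed characters, like (?:e|è|é|ë|ê)
   -- ^ requires the match to begin at the start of the line (assumption about the data)
   filter lines of myMonthesList with regex pattern "(?i)^f[eèéëê]vrier" into tLines
   answer tLines
end mouseUp
```

With this pattern the "Février" and "Fevrier" lines should be kept and the "FFvrier" line dropped, because the character after the F has to be one of the listed forms of e.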
Re: RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 11:31 am
by richmond62
One of the things that needs to be considered is that accented characters can come about in different ways:
Consider ñ
This can be n (0x006E) followed by a combining tilde (0x0303), or
ñ (0x00F1) as a single character.
Now, you have no way of knowing in advance which way an ñ is encoded in the data you are going to process.
Re: RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 11:34 am
by stam
The OP wants to search for prespecified patterns where a character may have a number of different variations.
I don’t see why you can’t add ñ to the pattern above if that’s a variant to be included in the search.
Given that regex works with strings presented to it, and not hex code, how it’s constructed should be irrelevant.
Unless of course you tested this and found a problem?
Re: RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 11:47 am
by Zax
stam wrote: ↑Thu Jul 21, 2022 10:44 am
so you can change the filter statement above to:
Code:
filter lines of myMonthesList with regex pattern "(?i)f(?:e|è|é|ë|ê)vrier" into tLines
Tested and works with all the alternatives in the pattern
Interesting.
I was successfully using as regex pattern:
But I don't understand anything about regex, so I don't know which one is better.
EDIT : and yes, I think "ñ" will be managed in the same way but I will have to run more tests.
Re: RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 11:53 am
by richmond62
Given that regex works with strings presented to it, and not hex code, how it’s constructed should be irrelevant.
Well, erm, possibly . . .
If one is searching for single characters (i.e. a tilded n), then characters followed by a combining character would be excluded;
if one is pattern matching there should be no problem . . .
although for my example one would probably have to present 'tilded n' and 'n + tilde' as 2 patterns.
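To make the "2 patterns" idea concrete, here is an untested sketch that builds both encodings explicitly with numToCodepoint and offers them as alternatives in one pattern. The variable names and the three sample lines are illustrative assumptions; whether the regex engine sees the two forms as different may depend on the LiveCode version, so treat this as something to try rather than a confirmed result.

```livecode
on mouseUp
   local tComposed, tDecomposed, tPattern, tList, tResult
   -- precomposed form: the single code point U+00F1
   put numToCodepoint(0x00F1) into tComposed
   -- decomposed form: "n" followed by U+0303 COMBINING TILDE
   put "n" & numToCodepoint(0x0303) into tDecomposed
   -- accept plain n or either encoding of the accented letter
   put "(?i)ni(?:" & tDecomposed & "|" & tComposed & "|n)a" into tPattern
   -- three test lines: unaccented, precomposed, decomposed
   put "Nina" & return & "Ni" & tComposed & "a" & return & "Ni" & tDecomposed & "a" into tList
   filter lines of tList with regex pattern tPattern into tResult
   answer tResult
end mouseUp
```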
Re: RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 12:46 pm
by Zax
This is my code. It works but could certainly be shortened or optimized:
Code:
local arrChars

on mouseup
   ask "Show lines with:" with "fevrier"
   if it = "" then exit to top
   initArrChars
   put fld 2 into myList
   filter lines of myList with regex pattern "(?i)(" & getAccentedRegex(deAccent(it)) & ")" into myResult
   put "" into fld 3
   put myResult into fld 3
end mouseup

on initArrChars // init replacement chars array
   // inside [ ] each character is already an alternative; a "|" there would be matched literally
   put "[àâä]" into arrChars["a"]
   put "[éèêë]" into arrChars["e"]
   put "[îï]" into arrChars["i"]
   put "[ôö]" into arrChars["o"]
   put "[ùûü]" into arrChars["u"]
   put "[ÿ]" into arrChars["y"]
   put "[ç]" into arrChars["c"]
   put "[ñ]" into arrChars["n"]
end initArrChars

function deAccent str // replace all accented chars with 'normal' chars in given string
   get str
   repeat for each key thisKey in arrChars
      get replaceText(it, arrChars[thisKey], thisKey)
   end repeat
   return it
end deAccent

function getAccentedRegex str // build regex
   repeat with c = (the length of str) down to 1
      get char c of str
      if (arrChars[it] <> "") then // regex must be built for this char
         get arrChars[it]
         put char c of str after char 1 of it // e.g. "[éèêë]" becomes "[eéèêë]"
         put it into char c of str
      end if
   end repeat
   return str
end getAccentedRegex
Testing list (fld 2 in the code):
Code:
Janvier
Février
Fevrier
FFvrier ERROR
Mars
Avril
Août
Aout
sécurité
sécurite
securité
securite
Nina
Niña
français
francais
Thanks all for your help

Re: RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 3:10 pm
by stam
Very nice code, glad you got it to work!
Only one comment: I'd be a bit cautious with the prodigious use of it, as any intermediate step can assign a value to it...
For example:
Code:
ask "Show lines with:" with "fevrier"
if it = "" then exit to top
initArrChars
put fld 2 into myList
filter lines of myList with regex pattern "(?i)(" & getAccentedRegex(deAccent(it)) & ")" into myResult
it is being set in the ask statement but is being used in several lines (and, importantly, several functions) below.
While this is probably OK since your code works, it is a potential source of error.
It's more resilient just to assign the it from the ask statement to a local variable, so it won't be accidentally changed if in the future you change your code and introduce something that uses it as well somewhere in between...
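A minimal sketch of that suggestion, reusing the initArrChars, deAccent and getAccentedRegex handlers from Zax's post (untested; tQuery, tList and tResult are illustrative names, not part of the original code):

```livecode
on mouseUp
   local tQuery, tList, tResult
   ask "Show lines with:" with "fevrier"
   -- copy the answer out of 'it' right away, before another command can overwrite it
   put it into tQuery
   if tQuery is empty then exit to top
   initArrChars
   put fld 2 into tList
   filter lines of tList with regex pattern "(?i)(" & getAccentedRegex(deAccent(tQuery)) & ")" into tResult
   put tResult into fld 3
end mouseUp
```

From this point on the handler no longer depends on it, so a later command that sets it (another ask, an answer, a get) cannot change the search term.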
Re: RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 5:49 pm
by jacque
richmond62 wrote: ↑Thu Jul 21, 2022 11:31 am
One of the things that needs to be considered is that accented characters can come about in different ways:
Consider ñ
This can be n (0x006E) followed by a combining tilde (0x0303), or
ñ (0x00F1) as a single character.
Now, you have no way of knowing in advance which way an ñ is encoded in the data you are going to process.
LC provides normalizeText to account for that. See also the formSensitive property.
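As a rough, untested sketch of how normalizeText might be applied to this thread's problem: bring both the list and the pattern to the same normal form (NFC, the composed form) before filtering, so that a base letter plus a combining mark and the precomposed character end up identical. filterAccentSafe is a hypothetical helper name, not a built-in.

```livecode
function filterAccentSafe pList, pPattern
   local tResult
   -- NFC folds "base letter + combining mark" into the precomposed character where one exists
   put normalizeText(pList, "NFC") into pList
   put normalizeText(pPattern, "NFC") into pPattern
   -- with "into", the filtered lines go to tResult and pList itself is left as it is
   filter lines of pList with regex pattern pPattern into tResult
   return tResult
end filterAccentSafe
```

It could then be called as, for example, put filterAccentSafe(fld 2, "(?i)ni[nñ]a") into fld 3, assuming the same fields as in Zax's test stack.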
Re: RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 6:54 pm
by stam
Well it's easy to test and easy to cater for even if there is a difference.
If there is an actual difference that regex can detect (although personally I'm still wondering how in practice you combine a tilde sign with a letter; how is that even done?), then you can simply add both versions to the pattern, which will account for both.
So this is not an actual problem, and is presumably a thought experiment rather than an actual issue.
I'm still curious as to how you actually produced a ñ by combining a tilde with an n! Please do explain Richmond, preferably with a concrete example we can all try.

Re: RegEx filter with accented characters?
Posted: Thu Jul 21, 2022 8:08 pm
by richmond62
Well, unfortunately, I have had neither the time nor the self-discipline to read the Dictionary from cover to cover, and, as has been pointed out several times before, how the fudge does one know what to search for until one has found it?
Re: RegEx filter with accented characters?
Posted: Fri Jul 22, 2022 5:02 am
by jacque
I'm still curious as to how you actually produced a ñ by combining a tilde with an n!
A two-stroke keypress is actually how we do it on Mac OS, at least on Roman-language keyboards. The option key prints diacriticals. So you option-type a diacritical, like option-e for the acute accent, and then type the primary character, like "e". The OS combines them into a single accented character in the text.
I believe Unicode allows two different representations for accented characters: one uses a single character with the diacritical attached, and the other represents a plain character followed by a diacritical. But I'm no Unicode expert.
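For anyone who wants to look at the two representations directly, this untested sketch builds both forms of é and reports how many code points each one contains. It assumes a LiveCode version with the codepoint chunk type and numToCodepoint (7.0 or later); the variable names are illustrative.

```livecode
on mouseUp
   local tComposed, tDecomposed, tCountComposed, tCountDecomposed
   -- precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE
   put numToCodepoint(0x00E9) into tComposed
   -- decomposed form: plain "e" followed by U+0301 COMBINING ACUTE ACCENT
   put "e" & numToCodepoint(0x0301) into tDecomposed
   -- count the code points each string is made of
   put the number of codepoints in tComposed into tCountComposed
   put the number of codepoints in tDecomposed into tCountDecomposed
   answer "composed:" && tCountComposed & return & "decomposed:" && tCountDecomposed
end mouseUp
```

Both strings display as the same accented letter, which is why the difference is easy to miss in data coming from outside.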