RegEx filter with accented characters?


Zax
Posts: 519
Joined: Mon May 28, 2007 10:12 am

RegEx filter with accented characters?

Post by Zax » Thu Jul 21, 2022 8:30 am

Hello,

Consider this list of French months:
Janvier 520
Février 15
Mars 48
Avril 110
Août 63
Sometimes, if the writer is lazy, accented characters are omitted, even though that's a spelling error:
Janvier 520
Fevrier 15
Mars 48
Avril 110
Aout 63
I would like to filter the lines for a given month, with or without accented characters. For example, if the given filter is "Fevrier", I would like to match both "Fevrier" and "Février".
I guess this can be done with a RegEx, and then use:

Code: Select all

filter lines of myMonthesList with regex pattern regexPattern
Does anyone know how to write such a regexPattern?

Thank you.

stam
Posts: 3069
Joined: Sun Jun 04, 2006 9:39 pm

Re: RegEx filter with accented characters?

Post by stam » Thu Jul 21, 2022 9:05 am

I don't think regex has a feature that treats accented characters the same as their 'base' character by default.
I think you would probably just have to add all the contingencies to your query.

https://regex101.com is a good test ground for regex formulae

stam
Posts: 3069
Joined: Sun Jun 04, 2006 9:39 pm

Re: RegEx filter with accented characters?

Post by stam » Thu Jul 21, 2022 9:21 am

Thinking about this more, if you want to search for both Fevrier and Février (or even Fèvrier or Fêvrier or whatever), you can just substitute the accented character with a '.', which will match any character. E.g.:

Code: Select all

F.vrier
will match all of the above. Looking online, there are more complex patterns if you want to be more specific, but I'm not sure they are supported by LC, and this should suffice for your use case, I think...
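To make the trade-off concrete, here is a minimal sketch in Python rather than LiveCode (an assumption made only so the behaviour can be run and checked; LiveCode's regex engine may differ in details). It shows that the dot accepts every accent variant, but also any other character in that position:

```python
import re

# "." matches any single character (except a newline), so it accepts
# every accented variant, and any unrelated character as well.
pattern = re.compile("F.vrier")

candidates = [
    "Fevrier 15",
    "F\u00e9vrier 15",   # Février
    "F\u00e8vrier 15",   # Fèvrier
    "FFvrier 15",        # not a month, but still matched
    "Mars 48",
]

matches = [line for line in candidates if pattern.search(line)]
print(matches)
```

So the dot is fine when the list is known to be clean, and too permissive when it is not.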

S.


--------------------------------------------------
EDIT: if you want to make the match a bit wider (e.g. case-insensitive), then use modifiers:

Code: Select all

(?i)F.vrier
The modifier i makes the match case-insensitive (by default regex is case-sensitive), so that (?i)F.vrier will also match fevrier.

So this should work in LC (untested):
filter lines of myMonthesList with regex pattern "(?i)f.vrier"

Other modifiers I frequently use (they can all be used together, e.g. (?xis)):
m - multiline (^ signifies start of line, $ signifies end of line)
x - ignore whitespace in the pattern
s - single line: treats everything as one line, so that the dot matches newline as well
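A small sketch of what each modifier changes, written in Python as a stand-in (Python's re module accepts the same inline (?i), (?m), (?s) and (?x) syntax at the start of a pattern; whether LiveCode behaves identically in every detail is an assumption):

```python
import re

text = "janvier 520\nF\u00e9vrier 15\nMARS 48"

# (?i): ignore case.
case_hits = re.findall("(?i)f.vrier", text)

# (?m): ^ and $ anchor at every line, not only at the ends of the text.
line_starts = re.findall(r"(?m)^\w+", text)

# (?s): "." also matches the newline, so a match can span several lines.
dot_default = re.search("janvier.*MARS", text)
dot_all = re.search("(?s)janvier.*MARS", text)

# (?x): whitespace inside the pattern is ignored; flags can be combined.
verbose = re.search("(?xi) f . vrier", text)

print(case_hits, line_starts, dot_default is None, dot_all is not None)
```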

--------------------------------------------------
EDIT 2
Just to be on the safe side, I tested it with some simple code in a button:

Code: Select all

on mouseUp
    local tLines, myMonthesList
    put "Janvier 520" & return into myMonthesList
    put "Février 15" & return after myMonthesList
    put "Mars 48" & return after myMonthesList
    put "Avril 110" & return after myMonthesList
    put "Août 63" after myMonthesList
    
    filter lines of myMonthesList with regex pattern "(?i)f.vrier" into tLines
end mouseUp
tLines now contains "Février 15"

Zax
Posts: 519
Joined: Mon May 28, 2007 10:12 am

Re: RegEx filter with accented characters?

Post by Zax » Thu Jul 21, 2022 10:16 am

Thank you stam for your answers.
stam wrote:
Thu Jul 21, 2022 9:21 am

Code: Select all

F.vrier
will match all of the above. Looking online there are more complex patterns if you want to be more specific, but not sure if supported by LC and this should suffice for your use-case i think...
Sorry if my question wasn't very accurate. The list of months was just an example. The "real" list can contain anything, including weird things like "FFvrier", and I don't want "FFvrier" to be returned in the filtered list.

I'm currently trying to build a regex on the fly with things like (e|é|è|ê|ë), but it still doesn't work. I will post the code when it's done.

stam
Posts: 3069
Joined: Sun Jun 04, 2006 9:39 pm

Re: RegEx filter with accented characters?

Post by stam » Thu Jul 21, 2022 10:44 am

If you want to match a list of alternatives for one character, you should separate them, as you say, with a bar | ('or' in regex-speak), but put them in a non-capturing group of the form (?: ). If you leave out the ?: and use just the parentheses, the group will be treated as a capturing group. It took me a little while to figure out non-capturing groups ;)

so you can change the filter statement above to:

Code: Select all

filter lines of myMonthesList with regex pattern "(?i)f(?:e|è|é|ë|ê)vrier" into tLines
Tested and works with all the alternatives in the pattern
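For anyone wondering what the ?: actually changes, here is a minimal sketch in Python (used only as a runnable stand-in; the group syntax is the same, but LiveCode's handling of captures is not shown here). Both patterns match the same lines; the only difference is whether the alternative that matched is remembered as a group:

```python
import re

line = "F\u00e9vrier 15"  # Février 15

# Alternatives: e, è, é, ë, ê
capturing = re.search("(?i)f(e|\u00e8|\u00e9|\u00eb|\u00ea)vrier", line)
non_capturing = re.search("(?i)f(?:e|\u00e8|\u00e9|\u00eb|\u00ea)vrier", line)

print(capturing.group(0), capturing.groups())          # whole match + captured letter
print(non_capturing.group(0), non_capturing.groups())  # whole match, nothing captured
```

For a plain filter the result is the same either way. The non-capturing form just avoids storing a group you never use.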

HTH
Stam

richmond62
Livecode Opensource Backer
Posts: 10096
Joined: Fri Feb 19, 2010 10:17 am

Re: RegEx filter with accented characters?

Post by richmond62 » Thu Jul 21, 2022 11:31 am

One of the things that needs to be considered is that accented characters can come about in different ways:

Consider ñ

This can be n (0x006E) followed by a combining char [tilde] (0x0303), or

ñ (0x00F1)

so you have no way of knowing, in advance, which way a ñ might be encoded in the data you are going to process.
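A minimal sketch of the two encodings, in Python for the sake of a runnable check (it takes U+0303 COMBINING TILDE as the combining mark). The two strings look identical on screen, but they differ in length and do not compare equal code point by code point:

```python
import unicodedata

precomposed = "\u00f1"   # single code point: LATIN SMALL LETTER N WITH TILDE
decomposed = "n\u0303"   # two code points: "n" + COMBINING TILDE

print(len(precomposed), len(decomposed))
print(precomposed == decomposed)

names = [unicodedata.name(ch) for ch in decomposed]
print(names)
```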

stam
Posts: 3069
Joined: Sun Jun 04, 2006 9:39 pm

Re: RegEx filter with accented characters?

Post by stam » Thu Jul 21, 2022 11:34 am

The OP wants to search for prespecified patterns where a character may be any of a number of different variations.

Don’t see why you can’t add ñ to the formula above if that’s a variant to be included in the search.

Given that regex works with strings presented to it, and not hex code, how it’s constructed should be irrelevant.

Unless of course you tested this and found a problem?

Zax
Posts: 519
Joined: Mon May 28, 2007 10:12 am

Re: RegEx filter with accented characters?

Post by Zax » Thu Jul 21, 2022 11:47 am

stam wrote:
Thu Jul 21, 2022 10:44 am
so you can change the filter statement above to:

Code: Select all

filter lines of myMonthesList with regex pattern "(?i)f(?:e|è|é|ë|ê)vrier" into tLines
Tested and works with all the alternatives in the pattern
Interesting.
I was successfully using as regex pattern:

Code: Select all

(?i)(f[e|é|è|ê|ë]vrier)
But I don't understand anything in regex, so I don't know which one is better.
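On which one is better: both accept the accented variants, but inside square brackets the bar is an ordinary character rather than "or", so the bracket version also accepts a literal |. A small sketch in Python (a stand-in used only to make this checkable; character classes work this way in regex engines generally, but LiveCode itself was not tested here):

```python
import re

# Members: e é è ê ë (plus "|" in the first pattern)
bracket_with_bars = re.compile("(?i)f[e|\u00e9|\u00e8|\u00ea|\u00eb]vrier")
bracket_plain = re.compile("(?i)f[e\u00e9\u00e8\u00ea\u00eb]vrier")
group = re.compile("(?i)f(?:e|\u00e9|\u00e8|\u00ea|\u00eb)vrier")

for name, pattern in [("bars", bracket_with_bars), ("plain", bracket_plain), ("group", group)]:
    print(name,
          bool(pattern.search("F\u00e9vrier")),   # accented month
          bool(pattern.search("f|vrier")))        # stray bar
```

Writing the class without the bars, as in [eéèêë], gives the same matches as the (?:...) group.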

EDIT : and yes, I think "ñ" will be managed in the same way but I will have to run more tests.

richmond62
Livecode Opensource Backer
Posts: 10096
Joined: Fri Feb 19, 2010 10:17 am

Re: RegEx filter with accented characters?

Post by richmond62 » Thu Jul 21, 2022 11:53 am

Given that regex works with strings presented to it, and not hex code, how it’s constructed should be irrelevant.
Well, erm, possibly . . .

If one is searching for single characters (i.e. tilded n), then characters followed by a combining character would be excluded,

if one is pattern matching there should be no problem . . .

although for my example one would probably have to present 'tilded n' and 'n + tilde' as 2 patterns.

Zax
Posts: 519
Joined: Mon May 28, 2007 10:12 am

Re: RegEx filter with accented characters?

Post by Zax » Thu Jul 21, 2022 12:46 pm

This is my code. It works but could certainly be shortened or optimized:

Code: Select all

local arrChars

on mouseup
   ask "Show lines with:" with "fevrier"
   if it = "" then exit to top
   
   initArrChars
   put fld 2 into myList
   filter lines of myList with regex pattern "(?i)(" & getAccentedRegex(deAccent(it)) & ")" into myResult
   
   put "" into fld 3
   put myResult into fld 3
end mouseup

on initArrChars // init replacement chars array
   put "[à|â|ä]" into arrChars["a"]
   put "[é|è|ê|ë]" into arrChars["e"]
   put "[î|ï]" into arrChars["i"]
   put "[ô|ö]" into arrChars["o"]
   put "[ù|û|ü]" into arrChars["u"]
   put "[ÿ]" into arrChars["y"]
   put "[ç]" into arrChars["c"]
   put "[ñ]" into arrChars["n"]
end initArrChars

function deAccent str // replace all accented chars with 'normal' chars in given string
   get str
   repeat for each key thisKey in arrChars
      get replaceText(it, arrChars[thisKey], thisKey)
   end repeat
   return it
end deAccent

function getAccentedRegex str // build regex
   repeat with c = (the length of str) down to 1
      get char c of str
      if (arrChars[it] <> "") then // regex must be built for this char
         get arrChars[it]
         put char c of str & "|" after char 1 of it
         put it into char c of str
      end if
   end repeat
   
   return str
end getAccentedRegex
Testing list (fld 2 in the code):

Code: Select all

Janvier
Février
Fevrier
FFvrier ERROR
Mars
Avril
Août
Aout
sécurité
sécurite
securité
securite
Nina
Niña
français
francais
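For comparison, the same idea can be sketched in Python. This is not a translation of the handlers above: the names VARIANTS, de_accent, accent_insensitive_pattern and filter_lines are made up for illustration, the accent stripping uses Unicode decomposition instead of a replacement table, and the lines are normalized to NFC before matching (an extra step not present in the LiveCode version):

```python
import re
import unicodedata

# Hypothetical variant table, mirroring the array used in the post.
VARIANTS = {
    "a": "\u00e0\u00e2\u00e4",        # à â ä
    "e": "\u00e9\u00e8\u00ea\u00eb",  # é è ê ë
    "i": "\u00ee\u00ef",              # î ï
    "o": "\u00f4\u00f6",              # ô ö
    "u": "\u00f9\u00fb\u00fc",        # ù û ü
    "y": "\u00ff",                    # ÿ
    "c": "\u00e7",                    # ç
    "n": "\u00f1",                    # ñ
}

def de_accent(text):
    """Strip accents by decomposing (NFD) and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def accent_insensitive_pattern(query):
    """Build a case-insensitive regex where each base letter also accepts its variants."""
    parts = []
    for ch in de_accent(query):
        variants = VARIANTS.get(ch.lower())
        if variants:
            parts.append("[" + ch + variants + "]")   # no "|" inside the class
        else:
            parts.append(re.escape(ch))               # escape regex metacharacters
    return re.compile("(?i)" + "".join(parts))

def filter_lines(lines, query):
    """Keep the lines that contain the query, with or without accents."""
    pattern = accent_insensitive_pattern(query)
    return [line for line in lines
            if pattern.search(unicodedata.normalize("NFC", line))]

sample = ["Janvier", "F\u00e9vrier", "Fevrier", "FFvrier ERROR", "Mars",
          "s\u00e9curit\u00e9", "securite", "Ni\u00f1a", "Nina"]
print(filter_lines(sample, "fevrier"))
```

Escaping the non-accented characters means a query containing something like a dot or a bracket is matched literally instead of being interpreted as regex syntax.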
Thanks all for your help :)

stam
Posts: 3069
Joined: Sun Jun 04, 2006 9:39 pm

Re: RegEx filter with accented characters?

Post by stam » Thu Jul 21, 2022 3:10 pm

Very nice code, glad you got it to work!

Only one comment: I'd be a bit cautious with the liberal use of it, as any intermediate step can assign a new value to it...

for example:

Code: Select all

   ask "Show lines with:" with "fevrier"
   if it = "" then exit to top
   
   initArrChars
   put fld 2 into myList
   filter lines of myList with regex pattern "(?i)(" & getAccentedRegex(deAccent(it)) & ")" into myResult
it is being set in the ask statement but is being used in several lines (and importantly several functions) below.
While this is probably OK since your code works, it is a potential source of error.

It's more resilient just to assign the it from the ask statement to a local variable so it won't be accidentally changed if in the future you change your code and introduce something that uses it as well somewhere in-between...

jacque
VIP Livecode Opensource Backer
Posts: 7390
Joined: Sat Apr 08, 2006 8:31 pm

Re: RegEx filter with accented characters?

Post by jacque » Thu Jul 21, 2022 5:49 pm

richmond62 wrote:
Thu Jul 21, 2022 11:31 am
One of the things that needs to be considered is that accented characters can come about in different ways:

Consider ñ

This can be n (0x006E) followed by a combining char [tilde] (0x0303), or

ñ (0x00F1)

so you have no way of knowing, in advance, which way a ñ might be encoded in the data you are going to process.
LC provides normalizeText to account for that. See also the formSensitive property.
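The general idea behind normalization can be sketched with Python's unicodedata.normalize, used here only as a runnable illustration; the exact arguments and behaviour of LiveCode's normalizeText are not shown and should be checked in the Dictionary. Converting both strings to the same normal form before comparing or filtering removes the ambiguity:

```python
import unicodedata

precomposed = "Ni\u00f1a"   # ñ as one code point
decomposed = "Nin\u0303a"   # n followed by a combining tilde

print(precomposed == decomposed)   # differ code point by code point

# NFC composes where possible; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(nfc == precomposed, nfd == decomposed)
```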
Jacqueline Landman Gay | jacque at hyperactivesw dot com
HyperActive Software | http://www.hyperactivesw.com

stam
Posts: 3069
Joined: Sun Jun 04, 2006 9:39 pm

Re: RegEx filter with accented characters?

Post by stam » Thu Jul 21, 2022 6:54 pm

Well it's easy to test and easy to cater for even if there is a difference.

If there is an actual difference that regex can detect (although personally I'm still wondering how you, in practice, combine a tilde sign with a letter - how is that even done?), then you can simply add both versions:

Code: Select all

put "[ñ|ñ]" into arrChars["n"]
will account for both.
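One caveat, sketched in Python as a stand-in (LiveCode was not tested here, so treat this as an assumption about how character classes generally behave): if the second ñ in the brackets is the decomposed form, it is two code points, and a character class treats them as two separate members, "n" and the combining tilde. An alternation keeps the two-character sequence together:

```python
import re

precomposed = "\u00f1"   # ñ as one code point
decomposed = "n\u0303"   # n + COMBINING TILDE

# Class members end up being: ñ, n, and the bare combining tilde.
char_class = re.compile("^Ni[" + precomposed + decomposed + "]a$")
# Alternation: either ñ, or n immediately followed by the combining tilde.
alternation = re.compile("^Ni(?:" + precomposed + "|" + decomposed + ")a$")

for word in ["Ni\u00f1a", "Nin\u0303a", "Nina"]:
    print(repr(word), bool(char_class.search(word)), bool(alternation.search(word)))
```

In this sketch the class version accepts plain "Nina" and misses the decomposed spelling, while the alternation accepts both encodings of ñ and nothing else.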

So this is not an actual problem; it is presumably a thought experiment rather than a real issue.
I'm still curious as to how you actually produced a ñ by combining a tilde with an n! Please do explain Richmond - preferably with a concrete example we can all try ;)

richmond62
Livecode Opensource Backer
Posts: 10096
Joined: Fri Feb 19, 2010 10:17 am

Re: RegEx filter with accented characters?

Post by richmond62 » Thu Jul 21, 2022 8:08 pm

Well, unfortunately, I have had neither the time nor the self-discipline to read the Dictionary from cover to cover, and, as has been pointed out several times before, how the fudge does one know what to search for until one has found it?

jacque
VIP Livecode Opensource Backer
Posts: 7390
Joined: Sat Apr 08, 2006 8:31 pm

Re: RegEx filter with accented characters?

Post by jacque » Fri Jul 22, 2022 5:02 am

I'm still curious as to how you actually produced a ñ by combining a tilde with an n!
A two stroke keypress is actually how we do it on Mac OS, at least on Roman language keyboards. The option key prints diacriticals. So you type the primary character, like "e", and then option-type a diacritical. The OS combines them into a single accented character in the text.

I believe Unicode allows two different representations for accented characters: one uses a single character with the diacritical attached, and the other represents a plain character followed by a diacritical. But I'm no Unicode expert.
