MatchChunk and UTF-8?

Zax
Posts: 519
Joined: Mon May 28, 2007 10:12 am
Contact:

MatchChunk and UTF-8?

Post by Zax » Fri Jan 07, 2011 5:15 pm

Hello,

I don't understand how to use the matchChunk() function when both the string and the regularExpression are UTF-8 encoded.
In PHP, I know the "u" modifier (http://www.php.net/manual/en/reference. ... ifiers.php), but I don't know how to do something equivalent with matchChunk in Rev Studio 4.0.

Thanks for your help.

Mark
Livecode Opensource Backer
Posts: 5150
Joined: Thu Feb 23, 2006 9:24 pm
Contact:

Re: MatchChunk and UTF-8?

Post by Mark » Mon Jan 10, 2011 12:50 am

Hi Zax,

Unicode is NOT a beginner's subject.

You can't tell Revolution/LiveCode that you are working with UTF8 text. It just doesn't handle UTF8 at all. If you want to display it, you need to convert it to LC's UTF16 equivalent first, using the uniencode function.

Fortunately, UTF8 mostly works just like regular text. If you really want to handle UTF8 text and don't need to display the text in a field, then you can just ignore the fact that it is UTF8 and use matchChunk as if you are using regular text. There might be a few small exceptions, which require you to escape a few characters.

Kind regards,

Mark
The biggest LiveCode group on Facebook: https://www.facebook.com/groups/livecode.developers
The book "Programming LiveCode for the Real Beginner"! Get it here! http://tinyurl.com/book-livecode

Re: MatchChunk and UTF-8?

Post by Zax » Mon Jan 10, 2011 10:29 am

Thanks for your reply Mark, and sorry to post in the wrong forum.

Though I don't need to display UTF-8 texts, I encountered strange regex behavior with accented characters and UTF-8 encoded text files.
For example:

Code:

  answer file "Choose a text file"
  if it <> "" then
    put it into myFile -- contains "sample string : et voilà" - Mac OS Roman encoded
    open file myFile for binary read
    read from file myFile until EOF
    close file myFile
    put it into data
    answer matchChunk(data,"voilà") -- true
  end if

Now if the text file to read is UTF-8 encoded, matchChunk(data,"voilà") returns false :(
In this case, I tried

Code:

put uniEncode(it,"UTF8") into data

but the result is the same.
I also tried to open the file for text read, but the result is still the same.
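
A byte-level view may make this result easier to see. The sketch below is Python, not LiveCode, and it assumes (as Mark explains later in the thread) that the literal "voilà" typed in the script is stored as Mac OS Roman. All variable names are illustrative only.

```python
import re

text = "sample string : et voilà"
pattern = "voilà"

# The same text saved in two different encodings.
mac_roman_file = text.encode("mac_roman")
utf8_file = text.encode("utf-8")

# Assumption: the script literal is held as Mac OS Roman bytes.
mac_roman_pattern = pattern.encode("mac_roman")

print(mac_roman_pattern.hex(" "))        # 76 6f 69 6c 88
print(pattern.encode("utf-8").hex(" "))  # 76 6f 69 6c c3 a0

# A byte-oriented regex compares raw bytes, so the encodings must agree.
found_in_mac = re.search(re.escape(mac_roman_pattern), mac_roman_file) is not None
found_in_utf8 = re.search(re.escape(mac_roman_pattern), utf8_file) is not None

print(found_in_mac)   # True
print(found_in_utf8)  # False
```

In Mac OS Roman "à" is the single byte 0x88, while in UTF-8 it is the two bytes 0xC3 0xA0. The byte 0x88 never occurs in the UTF-8 file, so a byte-for-byte match cannot succeed there.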

Re: MatchChunk and UTF-8?

Post by Mark » Tue Jan 11, 2011 3:20 pm

Hi Zax,

uniEncode(it,"UTF8") converts the variable it from UTF8 to Unicode (LiveCode's UTF16 flavour). However, it seems that the it variable contains Mac OS Roman text, which isn't UTF8 but what LiveCode calls "English" (also called "ANSI") text.

The function matchChunk(data,"voilà") compares the contents of data with the string "voilà", where "voilà" is encoded as either Mac Roman or Windows text, depending on the platform.

So, you are comparing a MacRoman string with something that you think is UTF8 but isn't. Do you see in how many ways this goes wrong?

If you want to know whether a MacRoman encoded string from a file contains a particular string, you can use plain text on the Mac and you might want to convert it to ISO on Windows (although this goes wrong in some cases).

Try this:

Code:

open file myFile for binary read
read from file myFile until EOF
close file myFile
-- it now contains MacRoman data
if the platform is not "MacOS" then
    get macToIso(it)
end if
put matchChunk(it,"voilà") into myMatch

Let me know if this works for you.

Kind regards,

Mark

Re: MatchChunk and UTF-8?

Post by Zax » Wed Jan 12, 2011 12:48 pm

Maybe I don't understand your latest answer Mark but I think I wasn't accurate in my previous post.

I use rev studio 4.0 on Mac OS X.

The first code I posted works as expected:

Code:

  answer file "Choose a text file"
  if it <> "" then
    put it into myFile -- contains "sample string : et voilà" - Mac OS Roman encoded
    open file myFile for binary read
    read from file myFile until EOF
    close file myFile
    put it into data
    put "voilà" into expression
    answer matchChunk(data,expression) -- true
  end if

The problem starts with the second case, when the text file to read is UTF-8 encoded.
The code is now:

Code:

  answer file "Choose a text file"
  if it <> "" then
    put it into myFile -- contains "sample string : et voilà" - UTF-8 encoded
    open file myFile for binary read
    read from file myFile until EOF
    close file myFile
    put it into data
    -- data contains UTF-8 text
    put "voilà" into expression
    -- expression contains Unicode text
    answer matchChunk(data,expression) -- false !!!!
  end if

As it's a variable, expression is Unicode encoded, so I understand why matchChunk(data,expression) returns false.

In order to have Unicode UTF-16 on both sides, I tried:

Code:

put uniEncode(it,"UTF8") into data

but matchChunk(data,expression) still returns false... and I don't understand why.

Re: MatchChunk and UTF-8?

Post by Mark » Wed Jan 12, 2011 12:55 pm

Zax,

You have to convert expression to UTF16 as well, before you can compare data and expression.

Kind regards,

Mark

Re: MatchChunk and UTF-8?

Post by Zax » Thu Jan 13, 2011 11:01 am

OK Mark, it works.
Thank you very much :)

Now I encounter another problem: "voilà" is found, even if the text file contains "et voil à".

Besides, I'm unable to use the pair of parentheses I need: the following snippets raise errors when matchChunk runs.

Code:

put "(" & uniEncode(expression) & ")" into expression

Code:

put "(voilà)" into expression
put uniEncode(expression) into expression

Re: MatchChunk and UTF-8?

Post by Mark » Thu Jan 13, 2011 11:06 am

Zax, when you mention an error, always include the error message.

Mark

Re: MatchChunk and UTF-8?

Post by Zax » Thu Jan 13, 2011 3:33 pm

OK, here are the errors:

Code:

put "(" & uniEncode(expression) & ")" into expression
--> execution error at line 25 (matchChunk: error in pattern expression) near "bad escape sequence", char 3

Code:

put "(voilà)" into expression
put uniEncode(expression) into expression
--> button "Button": execution error at line 25 (matchChunk: error in pattern expression) near "bad escape sequence", char 3

Re: MatchChunk and UTF-8?

Post by Mark » Thu Jan 13, 2011 4:26 pm

Hi Zax,

One problem here is that you are doing a matchChunk on binary data. That's why I think it would be better to convert the unicode data to ANSI instead of the other way around.

Still, the following works for me:

Code:

put matchchunk(uniencode("sample string : et (voilà)"),uniencode("voilà"))

and I'm not sure why you would get an error. I'm not sure whether encoded parentheses make any difference. With and without parentheses, the matchChunk function returns true.

I think it would be better to convert your unicode text to ANSI and do a matchChunk with the ANSI encoded strings.
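
The conversion suggested here can be sketched at the byte level. The example below is Python, not LiveCode; the helper name to_single_byte and the choice of Mac OS Roman as the single-byte target are assumptions made for illustration. In LiveCode of that era the analogous conversion was commonly written as uniDecode(uniEncode(data,"UTF8")), though nothing in this thread confirms that form.

```python
import re

# Hypothetical stand-in for the bytes read from the UTF-8 text file.
utf8_file = "sample string : et voilà".encode("utf-8")


def to_single_byte(utf8_data: bytes, encoding: str = "mac_roman") -> bytes:
    """Re-encode UTF-8 bytes into a single-byte encoding.

    Raises UnicodeEncodeError for characters the target encoding lacks.
    """
    return utf8_data.decode("utf-8").encode(encoding)


data = to_single_byte(utf8_file)

# With one byte per character, a capturing group works as usual.
pattern = "(voilà)".encode("mac_roman")
match = re.search(pattern, data)

# Convert Python's 0-based, end-exclusive span to 1-based, inclusive
# start and end positions, the style matchChunk reports.
start, end = match.start(1) + 1, match.end(1)
print(start, end)  # 20 24

# Near misses no longer match.
print(re.search("(voila)".encode("mac_roman"), data))  # None
spaced = to_single_byte("sample string : et voil à".encode("utf-8"))
print(re.search(pattern, spaced))  # None
```

The trade-off is that the conversion is lossy: any character with no equivalent in the single-byte encoding cannot be represented, so this only suits text known to fit that character set.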

Best,

Mark

Re: MatchChunk and UTF-8?

Post by Zax » Fri Jan 14, 2011 10:15 am

Hello Mark,
Mark wrote:
Hi Zax,
Still, the following works for me:

Code:

put matchchunk(uniencode("sample string : et (voilà)"),uniencode("voilà"))

For me too. The problem is here:

Code:

matchChunk(uniencode("sample string : et voilà"),uniencode("(voilà)")) -- error

As the LiveCode documentation says: "If the regularExpression includes a pair of parentheses, the position of the substring matching the part of the regular expression inside the parentheses is placed in the variables in the positionVarsList."

I ran several simple tests:

Code:

matchChunk("sample string : et voilà","voilà") -- true
matchChunk("sample string : et voilà","voila") -- false

matchChunk(uniencode("sample string : et voilà"),uniencode("voilà")) -- true : seems OK, but wrong because it matches on "voil". See below:
matchChunk(uniencode("sample string : et voilà"),uniencode("voila")) -- true ! should be false
matchChunk(uniencode("sample string : et voilà","ANSI"),uniencode("voila","ANSI")) -- true ! should be false

matchChunk("sample string : et voilà","(voilà)") -- true
matchChunk(uniencode(data,"ANSI"),"(" & uniencode("voila","ANSI") & ")") -- error
matchChunk(uniencode(data,"ANSI"),uniencode("(voila)","ANSI")) -- error

Problems seem to start when using uniEncode(), whatever the language (UTF16, UTF8 or ANSI).
Maybe I don't understand anything, or maybe matchChunk() is buggy... Anyhow, thanks a lot for your time Mark.
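
Looking at the raw UTF-16 bytes may help narrow down these results. The sketch below is Python, not LiveCode, and it assumes uniEncode yields UTF-16 in little-endian byte order (typical of an Intel Mac), which the thread does not state. The helper truncate_at_nul models one possible explanation only, namely that the regex engine receives the pattern as a NUL-terminated string. That is a hypothesis, not something confirmed here.

```python
def utf16_bytes(s: str) -> bytes:
    # Assumption: little-endian UTF-16 without a byte-order mark.
    return s.encode("utf-16-le")


def truncate_at_nul(b: bytes) -> bytes:
    # Hypothetical model: keep only the bytes before the first zero byte.
    return b.split(b"\x00", 1)[0]


haystack = utf16_bytes("sample string : et voilà")
accented = utf16_bytes("voilà")
plain = utf16_bytes("voila")
grouped = utf16_bytes("(voilà)")

print(accented.hex(" "))  # 76 00 6f 00 69 00 6c 00 e0 00
print(plain.hex(" "))     # 76 00 6f 00 69 00 6c 00 61 00

# A comparison that honours every byte tells the two words apart.
print(accented in haystack)  # True
print(plain in haystack)     # False

# Under the NUL-termination hypothesis the patterns shrink drastically.
print(truncate_at_nul(plain))    # b'v'
print(truncate_at_nul(grouped))  # b'('
```

If the hypothesis held, the pattern "voila" would reduce to "v", which matches almost any text, and "(voilà)" would reduce to a lone opening parenthesis, which is not a valid regular expression. Both outcomes would be consistent with the tests above, but the thread offers no confirmation either way.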

Re: MatchChunk and UTF-8?

Post by Mark » Fri Jan 14, 2011 11:44 am

Hi Zax,

Nowhere in your code did you use those position variables, so I assume you can do without them and don't need the parentheses in the regular expression.

Kind regards,

Mark