MatchChunk and UTF-8?
Posted: Fri Jan 07, 2011 5:15 pm
by Zax
Hello,
I don't understand how to use the matchChunk() function when both the string and the regular expression are UTF-8 encoded.
In PHP, I know the "u" modifier (http://www.php.net/manual/en/reference. ... ifiers.php), but I don't know how to do something equivalent with matchChunk in Rev Studio 4.0.
Thanks for your help.
Re: MatchChunk and UTF-8?
Posted: Mon Jan 10, 2011 12:50 am
by Mark
Hi Zax,
Unicode is NOT a beginner's subject.
You can't tell Revolution/LiveCode that you are working with UTF8 text. It just doesn't handle UTF8 at all. If you want to display it, you need to convert it to LC's UTF16 equivalent first, using the uniencode function.
Fortunately, UTF8 mostly works just like regular text. If you really want to handle UTF8 text and don't need to display the text in a field, then you can just ignore the fact that it is UTF8 and use matchChunk as if you are using regular text. There might be a few small exceptions, which require you to escape a few characters.
Kind regards,
Mark
Re: MatchChunk and UTF-8?
Posted: Mon Jan 10, 2011 10:29 am
by Zax
Thanks for your reply Mark, and sorry to post in the wrong forum.
Though I don't need to display UTF-8 texts, I encountered strange regex behavior with accented characters and UTF-8 encoded text files.
For example:
Code:
    answer file "Choose a text file"
    if it <> "" then
        put it into myFile -- contains "sample string : et voilà" - Mac OS Roman encoded
        open file myFile for binary read
        read from file myFile until EOF
        close file myFile
        put it into data
        answer matchChunk(data,"voilà") -- true
    end if
Now if the text file to read is UTF-8 encoded, matchChunk(data,"voilà") returns false.

In this case, I tried
Code:
    put uniEncode(it,"UTF8") into data
but the result is the same.
I also tried to open the file for text read, but the result is still the same.
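
A byte-level view may help explain the false result. The sketch below uses Python rather than LiveCode, purely to make the bytes visible. It assumes the script literal "voilà" is stored as Mac OS Roman, which the thread suggests but does not prove, and all variable names are illustrative.

```python
# Illustrative only: Python stands in for a byte-level view; this is not LiveCode.
text = "sample string : et voilà"

mac_roman_data = text.encode("mac_roman")  # bytes of a Mac OS Roman file
utf8_data = text.encode("utf-8")           # bytes of a UTF-8 file

# Assumption: the literal typed in the script is Mac OS Roman.
pattern = "voilà".encode("mac_roman")

print(pattern)                    # b'voil\x88'
print(pattern in mac_roman_data)  # True: same encoding on both sides
print(pattern in utf8_data)       # False: UTF-8 stores "à" as the two bytes C3 A0
```

The accented character is one byte (0x88) in Mac OS Roman and two bytes in UTF-8, so a byte-wise comparison cannot succeed until both sides use the same encoding.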
Re: MatchChunk and UTF-8?
Posted: Tue Jan 11, 2011 3:20 pm
by Mark
Hi Zax,
uniEncode(it,"UTF8") converts the variable it from UTF8 to Unicode (LiveCode's UTF16 flavour). However, it seems that the it variable contains Mac OS Roman text, which isn't UTF8 but a single-byte native encoding (loosely called ANSI).
The function matchChunk(data,"voilà") compares the contents of data with the string "voilà", where "voilà" is in the platform's native encoding, either Mac Roman or Windows.
So, you are comparing a MacRoman string with something you think is UTF8 but isn't. Do you see in how many ways this goes wrong?
If you want to know whether a MacRoman encoded string from a file contains a particular string, you can use plain text on the Mac and you might want to convert it to ISO on Windows (although this goes wrong in some cases).
Try this:
Code:
    open file myFile for binary read
    read from file myFile until EOF
    close file myFile
    -- it now contains MacRoman data
    if the platform is not "MacOS" then
        get macToIso(it)
    end if
    put matchChunk(it,"voilà") into myMatch
Let me know if this works for you.
Kind regards,
Mark
Re: MatchChunk and UTF-8?
Posted: Wed Jan 12, 2011 12:48 pm
by Zax
Maybe I don't understand your latest answer Mark but I think I wasn't accurate in my previous post.
I use rev studio 4.0 on Mac OS X.
The first code I posted works as expected:
Code:
    answer file "Choose a text file"
    if it <> "" then
        put it into myFile -- contains "sample string : et voilà" - Mac OS Roman encoded
        open file myFile for binary read
        read from file myFile until EOF
        close file myFile
        put it into data
        put "voilà" into expression
        answer matchChunk(data,expression) -- true
    end if
The problem starts with the second case, when the text file to read is UTF-8 encoded.
The code is now:
Code:
    answer file "Choose a text file"
    if it <> "" then
        put it into myFile -- contains "sample string : et voilà" - UTF-8 encoded
        open file myFile for binary read
        read from file myFile until EOF
        close file myFile
        put it into data
        -- data contains UTF-8 text
        put "voilà" into expression
        -- expression contains Unicode text
        answer matchChunk(data,expression) -- false !!!!
    end if
As it's a variable, expression is Unicode encoded, so I understand why matchChunk(data,expression) returns false.
In order to have Unicode UTF-16 on both sides, I tried:
Code:
    put uniEncode(it,"UTF8") into data
but matchChunk(data,expression) always returns false... and I don't understand why.
Re: MatchChunk and UTF-8?
Posted: Wed Jan 12, 2011 12:55 pm
by Mark
Zax,
You have to convert expression to UTF16 as well, before you can compare data and expression.
Kind regards,
Mark
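
The advice to convert both sides can be pictured at the byte level. This is a minimal sketch in Python, not LiveCode. It assumes little-endian UTF-16 (LiveCode's UTF-16 follows the host byte order) and a Mac OS Roman script literal; both are assumptions, and the variable names are illustrative.

```python
# Illustrative only: Python stands in for a byte-level view; this is not LiveCode.
utf8_data = "sample string : et voilà".encode("utf-8")  # bytes read from the file
native_pattern = "voilà".encode("mac_roman")            # assumed script literal

# Roughly what uniEncode(it,"UTF8") does: UTF-8 in, UTF-16 out.
data16 = utf8_data.decode("utf-8").encode("utf-16-le")

# Converting only the data leaves the two sides in different encodings.
only_data_converted = native_pattern in data16

# Converting the pattern too puts both sides in UTF-16.
pattern16 = native_pattern.decode("mac_roman").encode("utf-16-le")
both_converted = pattern16 in data16

print(only_data_converted)  # False
print(both_converted)       # True
```

This is a plain byte search, not a regular expression match, so it only shows that the encodings line up once both sides are converted.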
Re: MatchChunk and UTF-8?
Posted: Thu Jan 13, 2011 11:01 am
by Zax
OK Mark, it works.
Thank you very much.
Now I encounter another problem: "voilà" is found, even if the text file contains "et voil à".
Besides, I'm unable to use the pair of parentheses I need: both of the following snippets cause an error when matchChunk is executed.
Code:
    put "(" & uniEncode(expression) & ")" into expression
Code:
    put "(voilà)" into expression
    put uniEncode(expression) into expression
Re: MatchChunk and UTF-8?
Posted: Thu Jan 13, 2011 11:06 am
by Mark
Zax, when you mention "error", always include the error.
Mark
Re: MatchChunk and UTF-8?
Posted: Thu Jan 13, 2011 3:33 pm
by Zax
OK, here are the errors:
Code:
    put "(" & uniEncode(expression) & ")" into expression
    --> execution error at line 25 (matchChunk: error in pattern expression) near "bad escape sequence", char 3
Code:
    put "(voilà)" into expression
    put uniEncode(expression) into expression
    --> button "Button": execution error at line 25 (matchChunk: error in pattern expression) near "bad escape sequence", char 3
Re: MatchChunk and UTF-8?
Posted: Thu Jan 13, 2011 4:26 pm
by Mark
Hi Zax,
One problem here is that you are doing a matchChunk on binary data. That's why I think it would be better to convert the unicode data to ANSI instead of the other way around.
Still, the following works for me:
Code:
    put matchchunk(uniencode("sample string : et (voilà)"),uniencode("voilà"))
and I'm not sure why you would get an error. I'm not sure whether encoded parentheses make any difference. With and without parentheses, the matchChunk function returns true.
I think it would be better to convert your unicode text to ANSI and do a matchChunk with the ANSI encoded strings.
Best,
Mark
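
The suggestion to match in a single-byte encoding can be sketched as follows. This is Python, not LiveCode, and it assumes Mac OS Roman as the native encoding. Python's re module stands in for matchChunk here, and the variable names are illustrative.

```python
# Illustrative only: Python's re module stands in for matchChunk; this is not LiveCode.
import re

utf8_data = "sample string : et voilà".encode("utf-8")  # bytes read from the file

# Convert the UTF-8 data to a single-byte encoding (assumed: Mac OS Roman).
native_data = utf8_data.decode("utf-8").encode("mac_roman")

# With one byte per character, a parenthesised group works as expected.
match = re.search("(voilà)".encode("mac_roman"), native_data)
start, end = match.span(1)  # 0-based start, exclusive end

# The unaccented spelling no longer matches.
no_match = re.search("(voila)".encode("mac_roman"), native_data)

print(start + 1, end)  # 20 24 as 1-based, inclusive positions
print(no_match)        # None
```

The conversion only works when every character in the file exists in the single-byte encoding; text outside that repertoire would need a different approach.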
Re: MatchChunk and UTF-8?
Posted: Fri Jan 14, 2011 10:15 am
by Zax
Hello Mark,
Mark wrote:
Hi Zax,
Still, the following works for me:
Code:
    put matchchunk(uniencode("sample string : et (voilà)"),uniencode("voilà"))
It works for me too. The problem is here:
Code:
    matchChunk(uniencode("sample string : et voilà"),uniencode("(voilà)")) -- error
As the LiveCode documentation says: "If the regularExpression includes a pair of parentheses, the position of the substring matching the part of the regular expression inside the parentheses is placed in the variables in the positionVarsList."
I made several simple tests:
Code:
    matchChunk("sample string : et voilà","voilà") -- true
    matchChunk("sample string : et voilà","voila") -- false
    matchChunk(uniencode("sample string : et voilà"),uniencode("voilà")) -- true: seems OK, but wrong because it matches on "voil". See below:
    matchChunk(uniencode("sample string : et voilà"),uniencode("voila")) -- true ! should be false
    matchChunk(uniencode("sample string : et voilà","ANSI"),uniencode("voila","ANSI")) -- true ! should be false
    matchChunk("sample string : et voilà","(voilà)") -- true
    matchChunk(uniencode(data,"ANSI"),"(" & uniencode("voila","ANSI") & ")") -- error
    matchChunk(uniencode(data,"ANSI"),uniencode("(voila)","ANSI")) -- error
Problems seem to start when using uniEncode(), whatever the encoding (UTF16, UTF8 or ANSI).
Maybe I don't understand anything, or maybe matchChunk() is buggy... Anyhow, thanks a lot for your time Mark.
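
One possible explanation for these results, offered as a guess that nothing in this thread confirms: UTF-16 text contains zero bytes, and a regex engine that reads its pattern as a NUL-terminated C string would stop at the first one. The Python sketch below (not LiveCode) only demonstrates that byte layout. The helper as_c_string is hypothetical, and little-endian UTF-16 is an assumption.

```python
# Illustrative only: Python stands in for a byte-level view; this is not LiveCode.
# Assumption: little-endian UTF-16 (LiveCode uses the host byte order).
pattern16 = "voila".encode("utf-16-le")
print(pattern16)  # b'v\x00o\x00i\x00l\x00a\x00'


def as_c_string(raw: bytes) -> bytes:
    """Hypothetical helper: what a NUL-terminated C string reader would see."""
    return raw.split(b"\x00", 1)[0]


# Hypothesis only: the pattern is cut at the first zero byte.
print(as_c_string(pattern16))                      # b'v'
print(as_c_string("(voila)".encode("utf-16-le")))  # b'('
```

If that guess is right, the pattern would shrink to "v", which is found in both "voilà" and "voil à", and a lone "(" is not a valid pattern. The exact "bad escape sequence" message reported earlier is not explained by this sketch.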
Re: MatchChunk and UTF-8?
Posted: Fri Jan 14, 2011 11:44 am
by Mark
Hi Zax,
Nowhere in your code do you use those position variables, so I assume you can do without them and don't need the parentheses in the regular expression.
Kind regards,
Mark