BOM in UTF-8 files
Posted: Sat May 14, 2011 4:41 am
I'm following the pointers in Devin Asay's article on Unicode at /spaces/lessons/buckets/1412/lessons/20441-Unicode.
Here is his example in Step 5:
put url ("binfile:/path/to/file/myUniText.ut8") into tRawTxt
set the unicodetext of fld "display" to uniencode(tRawTxt,"UTF8")
I am finding that this process has an unmentioned twist to it, at least when the file (Devin's myUniText.ut8) is a Windows UTF-8 file, e.g. saved from Notepad as a UTF-8 text file.
The first three bytes of such files are the so-called BOM (byte order mark). The useful text of the field (Devin's "display") really starts at character 3, because when I follow his example the first two characters in that field are always extraneous bytes that I think come from that BOM. They display as ÿ (y with an umlaut) and þ (a non-English letter that looks a little like a lowercase p). These two extraneous characters are not visible in the field itself, but you can see them if you do this in the Message box, after duplicating Devin's example:
put char 1 to 2 of field "display"
ÿþ
or:
put charToNum(char 1 of field "display")
255
put charToNum(char 2 of field "display")
254
Question:
In my application, I use a field similar to Devin's "display" as a container for Unicode strings that serve as constants in my scripts. Should I always discard the first two characters of the field, no matter what OS the application runs on? Or will that trip me up some day? Or is the issue here something altogether different?
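One alternative to trimming the field afterwards would be to drop the BOM from the raw data before converting it. The following is only a minimal sketch, assuming the BOM is the standard three UTF-8 bytes EF BB BF (decimal 239, 187, 191) at the very start of the file. The name stripUTF8BOM is a hypothetical helper, not something from Devin's lesson, and the sketch leaves files without a BOM unchanged:

```livecode
-- Hypothetical helper: returns pData without a leading UTF-8 BOM, if one is present.
-- Assumes pData holds the raw bytes read through a "binfile:" URL.
function stripUTF8BOM pData
   -- compare the bytes exactly, not case-insensitively
   set the caseSensitive to true
   -- EF BB BF is the UTF-8 encoding of the byte order mark
   if byte 1 to 3 of pData is numToByte(239) & numToByte(187) & numToByte(191) then
      delete byte 1 to 3 of pData
   end if
   return pData
end stripUTF8BOM
```

With that helper in the script, Devin's Step 5 might become:

```livecode
put url ("binfile:/path/to/file/myUniText.ut8") into tRawTxt
set the unicodeText of fld "display" to uniEncode(stripUTF8BOM(tRawTxt),"UTF8")
```

Because the check looks at the file's bytes rather than at the first two characters of the field, it should not remove real text from files that were saved without a BOM.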
Thanks,
Slava