BOM in UTF-8 files

Posted: Sat May 14, 2011 4:41 am
by sp27
I'm following the pointers in Devin Asay's article on Unicode at /spaces/lessons/buckets/1412/lessons/20441-Unicode.

Here is his example in Step 5:

put url ("binfile:/path/to/file/myUniText.ut8") into tRawTxt
set the unicodetext of fld "display" to uniencode(tRawTxt,"UTF8")

I am finding that this process has an unmentioned twist to it, at least when the file (Devin's myUniText.ut8) is a Windows UTF-8 file, e.g. saved from Notepad as a UTF-8 text file.

The first three bytes of such files are the so-called BOM (byte order mark). The useful text of the field (Devin's "display") really starts at character 3, because the first two characters in that field (when I follow his example) are always extraneous bytes that I think come from that BOM. They display as ÿ (a y with an umlaut) and þ (a non-English letter that looks a little like a lower-case p). These two extraneous characters are not visible in the field, but you can see them if you run this in the Message box after duplicating Devin's example:

put char 1 to 2 of field "display"
ÿþ

or:
put charToNum(char 1 of field "display")
255

put charToNum(char 2 of field "display")
254

Question:
In my application, I use a field similar to Devin's "display" as a container for Unicode strings that serve as constants in my scripts. Should I always discard the first two characters of the field, no matter what OS the application runs in? Or will that trip me up some day? Or is the issue here something altogether different?

Thanks,

Slava

Re: BOM in UTF-8 files

Posted: Sat May 14, 2011 3:48 pm
by Mark
Hi Slava,

First, you need to make sure that you are dealing with a UTF8 file. If and only if you know this, you should check whether the data start with a UTF8 BOM. If they do, you should discard the BOM.
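
To make that check concrete, here is a minimal sketch of the read-and-display example quoted above with a BOM test added. It assumes the data really is UTF-8 and that your LiveCode version supports the byte chunk and byteToNum(); the path and field name are the placeholders from that example.

```livecode
-- Read the raw bytes (placeholder path from the example above)
put url ("binfile:/path/to/file/myUniText.ut8") into tRawTxt

-- A UTF-8 BOM is the three bytes EF BB BF (239, 187, 191 in decimal)
if the length of tRawTxt >= 3 then
   if byteToNum(byte 1 of tRawTxt) is 239 and \
         byteToNum(byte 2 of tRawTxt) is 187 and \
         byteToNum(byte 3 of tRawTxt) is 191 then
      -- Discard the BOM before converting, so it never reaches the field
      delete byte 1 to 3 of tRawTxt
   end if
end if

set the unicodeText of fld "display" to uniEncode(tRawTxt,"UTF8")
```

Because the BOM is stripped from the raw data before conversion, nothing has to be removed from the field afterwards, and a file without a BOM passes through unchanged.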

Do NOT assume that a BOM tells you whether a file is a Unicode file. A non-Unicode file may happen to start with the same bytes as a BOM, simply because that's how its data begin. A Unicode file may also lack a BOM altogether. However, once you know that a file is a Unicode file and a BOM is present at the start of the data, the BOM provides useful information about that file, such as its encoding and byte order.
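
As a rough illustration of the information a BOM carries, the sketch below maps the common BOM byte sequences to encoding names. bomEncoding is a hypothetical helper, not a built-in function. It assumes the byte chunk and byteToNum() are available, covers only UTF-8 and UTF-16, and is only meaningful once you already know the data is Unicode.

```livecode
-- Hypothetical helper: names the encoding suggested by a leading BOM,
-- or returns empty if no known BOM is found.
function bomEncoding pData
   local tB1, tB2, tB3
   if the length of pData < 2 then return empty
   put byteToNum(byte 1 of pData) into tB1
   put byteToNum(byte 2 of pData) into tB2
   if the length of pData >= 3 then put byteToNum(byte 3 of pData) into tB3

   -- EF BB BF: UTF-8
   if tB1 is 239 and tB2 is 187 and tB3 is 191 then return "UTF-8"
   -- FF FE: UTF-16, little-endian (a UTF-32 little-endian BOM also
   -- starts with these two bytes; this sketch does not tell them apart)
   if tB1 is 255 and tB2 is 254 then return "UTF-16LE"
   -- FE FF: UTF-16, big-endian
   if tB1 is 254 and tB2 is 255 then return "UTF-16BE"
   return empty
end bomEncoding
```

Calling bomEncoding(tRawTxt) on data that starts with the bytes EF BB BF returns "UTF-8"; an empty result means only that no BOM was found, not that the data is not Unicode.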

Kind regards,

Mark