BOM in UTF-8 files

Post by sp27 » Sat May 14, 2011 4:41 am

I'm following the pointers in Devin Asay's article on Unicode at /spaces/lessons/buckets/1412/lessons/20441-Unicode.

Here is his example in Step 5:

put url ("binfile:/path/to/file/myUniText.ut8") into tRawTxt
set the unicodetext of fld "display" to uniencode(tRawTxt,"UTF8")

I am finding that this process has an unmentioned twist to it, at least when the file (Devin's myUniText.ut8) is a Windows UTF-8 file, e.g. saved from Notepad as a UTF-8 text file.

The first three bytes of such files are the so-called BOM (byte order mark). When I follow his example, the useful text of the field (Devin's "display") really starts at character 3, because the first two characters in that field are always extraneous bytes that I think come from that BOM. They show up as ÿ (y with an umlaut) and þ (a non-English letter that looks a little like a lowercase English p). These two extraneous characters are not visible in the field itself, but you can see them by running the following in the Message box after duplicating Devin's example:

put char 1 to 2 of field "display"
ÿþ

or:
put charToNum(char 1 of field "display")
255

put charToNum(char 2 of field "display")
254
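
The values 255 and 254 are consistent with the three-byte UTF-8 BOM having been converted to UTF-16 along with the rest of the text. The following is a minimal Python sketch of that conversion, not LiveCode; it assumes that uniencode produces little-endian UTF-16 on this machine, and the sample text "Hi" is made up for illustration:

```python
import codecs

# Hypothetical sample: the bytes an editor would write for "Hi" saved as UTF-8 with a BOM.
raw = codecs.BOM_UTF8 + "Hi".encode("utf-8")

# The UTF-8 BOM is three bytes: EF BB BF.
print(list(raw[:3]))                 # [239, 187, 191]

# Decoding as plain UTF-8 keeps the BOM as the single character U+FEFF.
text = raw.decode("utf-8")
print(hex(ord(text[0])))             # 0xfeff

# Re-encoded as little-endian UTF-16 (assumed byte order), U+FEFF becomes FF FE.
utf16 = text.encode("utf-16-le")
print(list(utf16[:2]))               # [255, 254]

# Read one byte at a time as Latin-1, those two bytes display as "ÿþ".
print(utf16[:2].decode("latin-1"))
```

So the three BOM bytes in the file become one UTF-16 character, which appears as two single-byte characters when the field is read byte by byte.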

Question:
In my application, I use a field similar to Devin's "display" as a container for Unicode strings that serve as constants in my scripts. Should I always discard the first two characters of the field, no matter what OS the application runs on? Or will that trip me up some day? Or is the issue here something altogether different?

Thanks,

Slava

Re: BOM in UTF-8 files

Post by Mark » Sat May 14, 2011 3:48 pm

Hi Slava,

First, you need to make sure that you are dealing with a UTF-8 file. If and only if you know this, you should check whether the data start with a UTF-8 BOM. If they do, you should discard the BOM.

Do NOT assume that a BOM tells you whether a file is a Unicode file. A file may happen to start with the same bytes as a BOM simply because that is what its data contain. A file may also lack a BOM even though it is a Unicode file. However, once you know that a file is a Unicode file and a BOM is present at the start of the data, the BOM provides useful information about the file.
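
The check-then-discard step can be sketched as follows. This is an illustrative Python sketch rather than LiveCode, and the function names `strip_utf8_bom` and `decode_known_utf8` are hypothetical. It assumes the caller already knows the data are UTF-8, and it tests the raw bytes before any conversion:

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def decode_known_utf8(data: bytes) -> str:
    """Decode bytes that the caller already knows are UTF-8, ignoring a BOM."""
    return strip_utf8_bom(data).decode("utf-8")

# Made-up sample data: the same text with and without a BOM.
with_bom = codecs.BOM_UTF8 + "Hello".encode("utf-8")
without_bom = "Hello".encode("utf-8")

print(decode_known_utf8(with_bom) == decode_known_utf8(without_bom))
```

Because the test runs on the raw bytes of the file, it does not depend on the byte order of the converted text, and files saved without a BOM pass through unchanged.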

Kind regards,

Mark