
length of a Unicode string and field

Posted: Wed May 18, 2011 4:09 am
by sp27
This is not a problem report. I'm asking for confirmation of my findings, because they're just too crazy, though it's probably by design. I want to be sure I'm not misinterpreting some basics. I've only spent a few days playing with LC.

This is about how the LC function length() measures mixed-language strings encoded in UTF-8 and in UTF-16. In my case, the string is English and Russian.

I have a tiny text file that contains three Russian letters and three English letters, АБВABC.

--I read that file into a variable.
--I call length() to determine the length of that variable.
--I convert that variable to UTF16 and put it into a field. I see the text correctly displayed in the field.
--I call length() to determine the length of that field.
--I look at the two length results, and try to interpret them.

Here it is, copied from the Message Box:

Code:

put URL "binfile:E:/LiveCode/test.txt" into locFromFile
--test.txt was saved from Notepad as a UTF-8 file; it contains 3 Russian and 3 English letters
delete char 1 to 3 of locFromFile --this removes the BOM
put length(locFromFile) into locFileLength
set the unicodeText of field "TestField" of this card to uniEncode(locFromFile, "UTF8")
--I see the text of the file correctly displayed in the field

put length(the unicodeText of field "TestField" of this card) into locFieldLength
put "length of variable: " & locFileLength & " length of field: " & locFieldLength
--the Results pane shows: length of variable: 9 length of field: 12
That is a length of 9 for the variable that holds the 6 letters in UTF-8, and a length of 12 for the same letters in UTF-16.
I explain the 9 as follows: each English letter counts as 1 and each Russian letter counts as 2.
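
A quick check of that arithmetic in the Message Box (a minimal sketch; the numbers assume the 6-letter file АБВABC from above):

Code:

put (3 * 1 + 3 * 2) && (6 * 2)
--3 English letters at 1 byte plus 3 Russian letters at 2 bytes gives 9; 6 letters at 2 bytes each gives 12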

I open my little text file and delete one Russian letter and one English letter, and run the same code from the Message Box. I see:
length of variable: 6 and length of field: 8
My explanation is still the same: 2 English letters plus 2*2 for the Russian letters.

I open my little text file again and add one Russian letter, and run the same code from the Message Box. I see:
length of variable: 8 and length of field: 10.
The same formula explains the 8.

Now when my text file contains only three English letters, the results are predictable: 3 and 6.

The length() of the UTF-16 string is always double the number of letters, whether they are Russian or English. That's cool and reliable, and I can live with it. But the length of the UTF-8 variable depends on the mix of Russian and English letters it contains. How do you work with that?
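
The only workaround I can think of so far is to convert to UTF-16 first and take half the byte length as the letter count. A sketch of what I mean (tUTF16 and tCharCount are just my names, and this assumes no characters outside the Basic Multilingual Plane, which would need 4 bytes in UTF-16):

Code:

put uniEncode(locFromFile, "UTF8") into tUTF16
--every character is now 2 bytes, so half the byte length is the number of letters
put length(tUTF16) div 2 into tCharCount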

Slava

Re: length of a Unicode string and field

Posted: Wed May 18, 2011 6:27 am
by dglass
Seems like 'length' isn't really returning the number of characters in a string, but rather the number of bytes.

For UTF-16 that is going to be 2 bytes for every character.
For UTF-8 that is going to be 1 byte for characters in the ASCII range, and 2 (or more) bytes for characters outside it (your Cyrillic letters take 2).
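
A quick way to see that (a sketch re-using the locFromFile variable from your test; I haven't run it, so take it as an assumption that it behaves this way):

Code:

put uniEncode(locFromFile, "UTF8") into tUTF16
--length() should report 9 for the UTF-8 bytes and 12 for the UTF-16 bytes: byte counts, not character counts
put length(locFromFile) && length(tUTF16)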

I'm not sure what you mean by 'How do you work with that?', but if you need your lengths to be consistent you're probably better off working with UTF-16 encoded strings.

Otherwise you'll probably have to parse the string to determine how many ASCII characters are present so you can do whatever math you require.
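
Something along these lines might do it (a sketch, not tested; it assumes the engine treats 'char' and charToNum as single bytes when useUnicode is false, and utf8CharCount is just a name I made up):

Code:

function utf8CharCount pUTF8
   --in UTF-8, bytes 128 to 191 are continuation bytes; every character
   --starts with a byte below 128 (ASCII) or 192 and above, so count those
   local tCount, tByte
   put 0 into tCount
   repeat with i = 1 to length(pUTF8)
      put charToNum(char i of pUTF8) into tByte
      if tByte < 128 or tByte >= 192 then
         add 1 to tCount
      end if
   end repeat
   return tCount
end utf8CharCount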

Re: length of a Unicode string and field

Posted: Wed May 18, 2011 7:58 am
by sp27
Thanks for thinking about this, dglass. Yes, length() returns the number of bytes, but all string functions (like offset(), etc.) work with characters, don't they? If I don't know whether a character in my variable is one byte or two, how do I use those functions?

You said "you're better off working with UTF-16 encoded strings." I understand. Every character is two bytes then. When my application reads text from a file or retrieves data from my database, I can be reasonably sure that it's a UTF-8 file and convert its contents immediately after reading. ASCII characters will get a null byte added to them, so they're all two bytes long. But what happens when the user types into a field? Or pastes something into a field? Or when my script retrieves something from the Web?

Some of what she types may be English and some may be Russian, or any mixture of the two languages, and then, let's say, I need to find out whether there are any spaces in her input, or any double quotes, etc. Can I always treat her input as a UTF-16 string? To test whether the second character she typed was a space, do I set useUnicode to true and test charToNum(char 3 to 4 of the unicodeText of field X)? Or charToNum(char 2 of field X)?
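
Concretely, my guess for the space test looks like this (a sketch, assuming the field's unicodeText really is UTF-16 and that useUnicode makes charToNum read two-byte characters):

Code:

set the useUnicode to true
put the unicodeText of field "X" into tUTF16
--char 3 to 4 is the second two-byte character; 32 is the code for a space
if charToNum(char 3 to 4 of tUTF16) = 32 then
   answer "the second character is a space"
end if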

That's what I meant by "how do you work with that": how do you work without knowing whether the text in your variables is ASCII, ANSI, or Unicode? In other environments that I use (JavaScript, ColdFusion, Lingo), I never have to worry about that. All text is treated the same way in variables, functions, fields, etc. I'm very new to LC, as you have probably noticed. Still struggling.

Appreciate your help,

Slava

Re: length of a Unicode string and field

Posted: Wed May 18, 2011 2:44 pm
by dglass
sp27 wrote:Yes, length() returns the number of bytes, but all string functions (like offset(), etc.) work with characters, don't they? If I don't know whether a character in my variable is one byte or two, how do I use those functions?
I don't know.

Given that the documentation on 'length' is wrong, IMO, I'm hard-pressed to trust any other documentation that says a command works on characters. I'd like to think that if 'characters' really means 'bytes' in the description of 'length' then it means 'bytes' everywhere, but that would surprise me.

I don't know how you should proceed. The first thing I would do is get clarification that either 'length' is working improperly, or the documentation is wrong.

You'll probably just have to test the other string functions to see what results you get.
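
For instance, you could start with offset() on the 6-letter test data (a sketch, re-using locFromFile from the first post; whatever positions come back tell you whether it is counting bytes or characters):

Code:

put offset("A", locFromFile) && offset("A", uniEncode(locFromFile, "UTF8"))
--the first is the position of "A" in the UTF-8 data, the second in the UTF-16 data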