This is about how the LiveCode function length() measures mixed-language strings encoded in UTF-8 and in UTF-16. In my case, the string is English and Russian.
I have a tiny text file that contains three Russian letters and three English letters, АБВABC.
--I read that file into a variable.
--I call length() to determine the length of that variable.
--I convert that variable to UTF16 and put it into a field. I see the text correctly displayed in the field.
--I call length() to determine the length of that field.
--I look at the two length results, and try to interpret them.
Here it is, copied from the Message Box:
put URL ("binfile:E:/LiveCode/test.txt") into locFromFile
--test.txt was saved from Notepad as a UTF-8 file; it contains 3 Russian and 3 English letters
delete char 1 to 3 of locFromFile --this removes the BOM
put length(locFromFile) into locFileLength
set the unicodeText of field "TestField" of this card to uniEncode(locFromFile, "UTF8")
--I see the text of the file correctly displayed in the field
put length(the unicodeText of field "TestField" of this card) into locFieldLength
put "length of variable: " & locFileLength & " length of field: " & locFieldLength
--the Results pane shows: length of variable: 9 length of field: 12
I explain the 9 as follows: each English letter counts as 1 and each Russian letter counts as 2, so 3*1 + 3*2 = 9.
I open my little text file and delete one Russian letter and one English letter, and run the same code from the Message Box. I see:
length of variable: 6 and length of field: 8
My explanation is still the same: 2 English letters plus 2*2 for the Russian letters.
I open my little text file again and add one Russian letter, and run the same code from the Message Box. I see:
length of variable: 8 and length of field: 10.
The same formula explains the 8.
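The formula can also be turned around to recover the letter counts from the two lengths. This is only a minimal sketch: it assumes the text contains nothing but 1-byte (English) and 2-byte (Russian) UTF-8 characters, it reuses locFileLength and locFieldLength from the code above, and the other variable names are made up for illustration. It is meant to be appended to the same Message Box code:

```livecode
-- assumes every letter is either 1 byte (English) or 2 bytes (Russian) in UTF-8
put locFieldLength div 2 into locLetterCount -- UTF-16 gave 2 per letter
put locFileLength - locLetterCount into locRussianCount -- each Russian letter adds 1 extra byte
put locLetterCount - locRussianCount into locEnglishCount
put "letters: " & locLetterCount & " Russian: " & locRussianCount & " English: " & locEnglishCount
```

For the first run (9 and 12) this works out to 6 letters, 3 Russian and 3 English.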
When my text file contains only three English letters, the results are predictable: 3 and 6.
The length() of the UTF-16 string is always double the number of letters, whether they are Russian or English. That's cool and reliable, and I can live with it. But the length of the UTF-8 variable depends on the ratio of Russian vis-a-vis English letters in the variable. How do you work with that?
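One possible way to get a letter count that does not depend on the mix, sketched with the same functions used above: convert to UTF-16 first and halve the result. This assumes every character fits in a single 2-byte UTF-16 unit, which holds for English and Russian letters but not for every Unicode character. locLetterCount is a name made up here, and locFromFile is the variable from the code above with the BOM already removed:

```livecode
-- convert the UTF-8 bytes to UTF-16, then count 2 per letter
put length(uniEncode(locFromFile, "UTF8")) div 2 into locLetterCount
put "letters: " & locLetterCount
```

I don't know whether this is the recommended approach, so treat it as a workaround rather than the answer.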
Slava