word breaks in Russian Unicode text

Posted: Sat May 14, 2011 6:30 am
by sp27
Step 4 in Devin Asay's article on Unicode at /spaces/lessons/buckets/1412/lessons/20441-Unicode "moves" two Russian words from one field to another:

set the unicodeText of fld "other" to word 1 to 2 of the unicodeText of fld "this"

I can reproduce his result with the words that he used, but when I use the complete Russian alphabet, the word breaks are not where I expect them. For instance, word 1 to 2 of this string:

АБВГДЕЖ ЗИЙКЛМНОПРСТУ ФХЦЧШЩЪЫЬЭЮЯ

should contain 7 letters in word 1 and 13 letters in word 2. However, Devin's command, when applied to this text in field "this", results in this display in field "other":

АБВГДЕЖ ЗИЙКЛМНОП

As you can see, the second word is chopped off just before its 10th letter, Р. That character is just another letter of the alphabet, not punctuation or anything like that.

With this string of lower case letters in field "this":

абв где жзи йкл мно прс туф хцч шщъ ыьэ юя

the script

set the unicodeText of fld "other" to word 1 to 2 of the unicodeText of fld "this"
produces the expected result:

абв где

but if I change "word 1 to 2" to be "word 2 to 3", the display in field "other" is in Chinese characters:
㌀㐄㔄 㘀㜄㠄

Am I doing something terribly wrong? If anyone is working with Unicode, can they share their experience or perhaps confirm my findings?

This is with LC 4.6.1, Windows 7 64-bit.

Thanks,

Slava

Re: word breaks in Russian Unicode text

Posted: Sat May 14, 2011 10:06 am
by Mark
Hi Slava,

Looks like Devin made a small mistake.

set the unicodeText of fld "other" to word 1 to 2 of the unicodeText of fld "this"
should be

set the unicodeText of fld "other" to the unicodeText of word 1 to 2 of fld "this"
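
A likely reason the order matters, sketched in Python rather than LiveCode: if the unicodeText is UTF-16 little-endian data and the word chunk is evaluated on the raw bytes, then any byte 0x20 counts as a space. The letter Р is U+0420, so its low byte looks like a word delimiter, and a real space (20 00) leaves a stray 00 byte that shifts everything after it. Both points are assumptions about the engine, the function names are made up for the illustration, and only the space delimiter is modelled.

```python
import re

UPPER = "АБВГДЕЖ ЗИЙКЛМНОПРСТУ ФХЦЧШЩЪЫЬЭЮЯ"
LOWER = "абв где жзи йкл мно прс туф хцч шщъ ыьэ юя"


def naive_words(text: str, first: int, last: int) -> str:
    """Model of 'word first to last of the unicodeText of fld':
    encode to UTF-16LE first, then split the raw bytes on 0x20."""
    raw = text.encode("utf-16-le")
    # Spans of consecutive bytes that are not 0x20 (the assumed delimiter).
    spans = [m.span() for m in re.finditer(rb"[^\x20]+", raw)]
    chunk = raw[spans[first - 1][0]:spans[last - 1][1]]
    # Drop a dangling odd byte so the chunk can be decoded again.
    chunk = chunk[:len(chunk) - len(chunk) % 2]
    return chunk.decode("utf-16-le")


def text_words(text: str, first: int, last: int) -> str:
    """Model of 'the unicodeText of word first to last of fld':
    split on real characters first, encode afterwards."""
    return " ".join(text.split()[first - 1:last])


if __name__ == "__main__":
    print(naive_words(UPPER, 1, 2))  # word 2 stops before Р (bytes 20 04)
    print(naive_words(LOWER, 2, 3))  # byte pairs are misaligned by one
    print(text_words(UPPER, 1, 2))
    print(text_words(LOWER, 2, 3))
```

Under these assumptions the byte-level version reproduces both of Slava's results, while chunking the field first and converting afterwards keeps the words intact.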
Best,

Mark