Problems with text conversions

Posted: Sat May 28, 2011 11:29 am
by exheusden
I have a stack, written in RevMedia 4, that creates ePub formats from plain text files.

I have now discovered that if I try to use a Unicode UTF-8 format text as its input, the resulting text in the ePub document is to some extent garbled, with curly quotes, em-dashes and some other characters substituted incorrectly (at least displayed incorrectly in ePub readers such as Calibre, iBooks, etc.): for example a closing curly quote is shown thus �äô

The command used to encode the text is

put textToEntities(it) & return into chapterText

where it contains the text, which is read from its file with read from file textfile until "#!@" (The "#!@" being a marker I set myself in the original text file).

How can I achieve correct character display, even from a UTF-8 format text?

Re: Problems with text conversions

Posted: Sat May 28, 2011 2:42 pm
by BvG
I'm not sure I understand your question, but most likely you need one or both of these functions somewhere in your code:

Code: Select all

function createUtf8TextfromRevText theText
   -- native text -> UTF-16 -> UTF-8
   return unidecode(uniencode(theText),"utf8")
end createUtf8TextfromRevText

Code: Select all

function getRevTextFromUtf8 theText
   -- UTF-8 -> UTF-16 -> native text
   return unidecode(uniencode(theText,"utf8"))
end getRevTextFromUtf8

Re: Problems with text conversions

Posted: Sat May 28, 2011 5:24 pm
by exheusden
The variable "it" already contains UTF-8 text; that is the text that is read from a Unicode UTF-8 formatted file.

Is it then still necessary to go through this uniencode-unidecode process?

And what would happen with a plain text file (MacOS Roman, for example)?

Re: Problems with text conversions

Posted: Sat May 28, 2011 6:59 pm
by BvG
For UTF-8 text, that is the proper approach, weird, I know. Basically it converts the UTF-8 text to UTF-16 text, and then converts that to rev-field compatible (native) text.

No, for "normal" text files you do not decode them as if they were UTF-8.
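
To spell out the two steps, here is a minimal sketch. The handler name toNativeText and the pIsUtf8 flag are made up for illustration; only uniEncode and uniDecode are actual built-in functions. It assumes pData holds the raw bytes of the file, unchanged since reading.

```livecode
-- Hypothetical helper: the name and the pIsUtf8 flag are illustrative only.
function toNativeText pData, pIsUtf8
   local tUtf16
   if pIsUtf8 then
      -- step 1: interpret the bytes as UTF-8 and convert them to UTF-16
      put uniEncode(pData, "UTF8") into tUtf16
      -- step 2: convert UTF-16 to native single-byte text (MacRoman on a Mac)
      return uniDecode(tUtf16)
   else
      -- a "normal" (native encoding) file is returned untouched
      return pData
   end if
end toNativeText
```

With something like this, a UTF-8 file would go through toNativeText(theData, true) and a MacOS Roman file through toNativeText(theData, false) before any further processing.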

Re: Problems with text conversions

Posted: Sun May 29, 2011 11:56 am
by exheusden
I have tried using both functions. I tested one function at a time.

After having read the text, I then passed it to the function being tested and worked further with the text returned by that function.

The result was just the same as without the use of the functions: "special" characters, such as curly quotes, em-dashes, etc. remain garbled.

I expect I am not using the functions correctly. Perhaps I have to use them both: one after having read the text from the input file and one prior to writing the text to the output file. If so, which is which?

Re: Problems with text conversions

Posted: Sun May 29, 2011 2:00 pm
by BvG
No, most likely the data is not UTF-8 at all, or your approach to reading the file is garbling it, or your code is wrong. This works, and I've done it before. Do you use url or open file/close file? Do you use binfile or file? Do you put stuff into fields at some point (you shouldn't do that)? Etc.

Re: Problems with text conversions

Posted: Sun May 29, 2011 7:30 pm
by exheusden
The input file is certainly in UTF-8 encoding, as that is the way it is saved with TextEdit.

The file is opened with Open File and closed with Close File. No binfile.

Nothing is put into fields.

I didn't say your functions don't work; I said they didn't work when I tried them as I understood how they had to be used, which I expect is not the correct way.

What is the sequence? Am I to pass the read text to createUtf8TextfromRevText, process it and then pass it to getRevTextFromUtf8 prior to saving the result? Or perhaps something else…?

Re: Problems with text conversions

Posted: Sun May 29, 2011 8:08 pm
by bn
Hi Exheusden,

could you zip the text file and upload it to the forum? There is a tab called "Upload attachment"; there you choose the zipped file from your hard disk and then click "Add the file".

Unicode is a mess and it is best to see the text file in question.

Have a look here for the background on Unicode and LiveCode:

http://livecode.byu.edu/unicode/unicodeInRev.php

(not that I understand everything he explains :) )

Kind regards

Bernd

Re: Problems with text conversions

Posted: Sun May 29, 2011 10:29 pm
by BvG
Alright, try this then:

Code: Select all

answer file ""
put (url "binfile:" & it) into theData
put unidecode(uniencode(theData,"utf8")) into field 1

Garbled or not?
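
If you would rather keep the open file / read until approach from the first post, the same check might look roughly like the sketch below. It assumes the "#!@" marker mentioned earlier and a field 1 on the card for display; the mouseUp handler and the variable names are placeholders only. The file is opened in binary mode so that the bytes arrive unchanged before conversion.

```livecode
on mouseUp
   local tPath, tData
   answer file "Choose the UTF-8 text file"
   if it is empty then exit mouseUp -- dialog was cancelled
   put it into tPath -- keep the path, since "read" overwrites "it"
   open file tPath for binary read
   read from file tPath until "#!@"
   put it into tData
   close file tPath
   -- UTF-8 -> UTF-16 -> native text, then display for checking
   put uniDecode(uniEncode(tData, "UTF8")) into field 1
end mouseUp
```

If the field shows the curly quotes and em-dashes correctly here, the conversion itself is fine and the garbling would have to come from a later step.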