Encodings - why platform specific?
Posted: Sat Mar 14, 2015 1:10 pm
by append[x]
I have a project in which a customer delivers CSV files from Windows, and these are supposed to be converted into UTF-8-encoded XML files on an OSX machine.
However, textDecode, textEncode and the encoding form of the open file command restrict the list of available encodings depending on the OS. So I cannot convert the customer file, which I know is CP1252-encoded, when my converter is running on OSX.
Why is that? It makes absolutely no sense to me to have those restrictions. It is very likely that one will encounter any encoding on any platform in the wild (dealing with it, for example with BOMs etc, is another topic that can be left to the programmer).
Is there any workaround?
Re: Encodings - why platform specific?
Posted: Sat Mar 14, 2015 2:31 pm
by FourthWorld
It's often much easier to diagnose issues when we can see the code in question. Without the code, I can only guess that perhaps you're reading the files in text mode rather than binary.
Text mode is the default, and is a good choice for ASCII files, since it automatically converts line endings from the convention of the platform the script is running on to LiveCode's internal line endings, which use the Unix convention of a single line feed (ASCII 10, 0x0A).
But for non-ASCII encodings you'll want to use the binary format, which reads the data unaltered, e.g.:
Code:
open file tSomeFilePath for binary read
read from file tSomeFilePath until EOF
put it into gSomeData
close file tSomeFilePath
Or:
Code:
put url ("binfile:" & tSomeFilePath) into gSomeData
Re: Encodings - why platform specific?
Posted: Sat Mar 14, 2015 4:40 pm
by append[x]
OK, so I have the raw bytes in gSomeData. Now I have to declare that they are Windows-encoded (which LiveCode cannot know), and then change the encoding to UTF-8 for my target XML file.
But:
Code:
put textDecode(gSomeData,"CP1252") into gSomeDataWithDefinedEncoding
won't work, because "CP1252" is not accepted by textDecode on OSX. (Error message: "textDecode: could not decode data")
The Dictionary of 7.0.3 does not say so, but the Release Notes of 2/20/15 state that CP1252 is Windows only.
In fact, this works for me:
Code:
-- read the customer file, declaring its encoding as CP1252
open file tSomeFilePath for "CP1252" text read
read from file tSomeFilePath until EOF
put it into gSomeData
close file tSomeFilePath
-- ...
-- (do some processing)
-- ...
-- write the processed text out as UTF-8
open file tTargetFilePath for "UTF-8" text write
write gSomeData to file tTargetFilePath
close file tTargetFilePath
but this code only works when the stack is run on Windows, not when it is run on OSX.
It seems to me that I can never change the encoding of a Windows file to something else on the Mac, and I cannot quite understand the reason for that limitation.
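One possible workaround, sketched here as an untested assumption rather than a confirmed solution, is to read the file as binary and decode CP1252 by hand. CP1252 matches Unicode code points everywhere except bytes 0x80 to 0x9F, so a 32-entry lookup table is enough. The handler name cp1252ToText is hypothetical, and the sketch assumes the LiveCode 7 functions byteToNum and numToCodepoint behave as documented; the five bytes CP1252 leaves undefined are passed through unchanged.

```livecode
-- Hypothetical helper, not a built-in: decode CP1252 bytes into LiveCode text.
function cp1252ToText pBinaryData
   local tHighMap, tText, tNum, tIndex
   -- Unicode code points for bytes 0x80..0x9F (decimal 128..159), in order.
   -- Undefined positions (129, 141, 143, 144, 157) map to themselves.
   put "8364,129,8218,402,8222,8230,8224,8225,710,8240,352,8249,338,141,381,143," & \
         "144,8216,8217,8220,8221,8226,8211,8212,732,8482,353,8250,339,157,382,376" \
         into tHighMap
   repeat with tIndex = 1 to the number of bytes in pBinaryData
      put byteToNum(byte tIndex of pBinaryData) into tNum
      if tNum >= 128 and tNum <= 159 then
         -- byte 128 is item 1 of the table, byte 159 is item 32
         put item (tNum - 127) of tHighMap into tNum
      end if
      put numToCodepoint(tNum) after tText
   end repeat
   return tText
end cp1252ToText
```

Assuming the handler above, the conversion could then look like this, with tSomeFilePath and tTargetFilePath as in the code above:

```livecode
put url ("binfile:" & tSomeFilePath) into tRawData
put cp1252ToText(tRawData) into gSomeData
-- (do some processing)
put textEncode(gSomeData, "UTF-8") into url ("binfile:" & tTargetFilePath)
```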
Re: Encodings - why platform specific?
Posted: Sun Mar 15, 2015 6:28 pm
by jacque
Can you just process the raw data without converting it? Then feed the resulting data directly to the UTF8 conversion? That is, read it in as binary data and work with that.
It would depend on the data, but you might also be able to read it in as Latin-1 which is similar except for a few code points. That’s not really a perfectly safe option though.
You might want to submit a feature request for this in the QCC too.
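To make the Latin-1 caveat concrete: CP1252 and Latin-1 (ISO-8859-1) differ only in bytes 0x80 to 0x9F, where CP1252 places characters such as the euro sign and curly quotes. A minimal sketch for checking whether a given file uses any of those bytes follows; the handler name is hypothetical and the code is untested. If it returns false, reading the data as Latin-1 gives the same text as CP1252 would.

```livecode
-- Hypothetical helper: true if the data contains any byte on which
-- CP1252 and Latin-1 disagree (0x80..0x9F, decimal 128..159).
function containsCP1252SpecificBytes pBinaryData
   local tNum, tIndex
   repeat with tIndex = 1 to the number of bytes in pBinaryData
      put byteToNum(byte tIndex of pBinaryData) into tNum
      if tNum >= 128 and tNum <= 159 then return true
   end repeat
   return false
end containsCP1252SpecificBytes
```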
Re: Encodings - why platform specific?
Posted: Sun Mar 15, 2015 8:03 pm
by append[x]
jacque wrote: "Can you just process the raw data without converting it? Then feed the resulting data directly to the UTF8 conversion?"
This won't work. Any encoding conversion needs to know the source encoding (the one exception being source text that contains only 7-bit ASCII characters).
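A conversion tool that takes both encodings explicitly shows the same point. On OSX, one possible workaround is to hand the job to the system's iconv command through shell(), naming CP1252 as the source and UTF-8 as the target. This is an untested sketch: the handler name is hypothetical, it assumes iconv is on the PATH, and it assumes neither path contains a double quote.

```livecode
-- Hypothetical helper: convert a CP1252 file to a UTF-8 file with iconv.
-- Assumes OSX or Linux with iconv available and paths without double quotes.
function convertCP1252FileToUTF8 pSourcePath, pTargetPath
   local tCommand, tOutput
   put "iconv -f CP1252 -t UTF-8" && quote & pSourcePath & quote \
         && ">" && quote & pTargetPath & quote into tCommand
   put shell(tCommand) into tOutput
   -- The converted text goes to the target file, so anything returned
   -- here would be diagnostic output rather than file content.
   return tOutput
end convertCP1252FileToUTF8
```

The target file can then be read back as UTF-8 for further processing.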
This is pretty much like working with colorspace definitions in images. You cannot convert into a colorspace if you do not know the originating colorspace. (A super-common error in image processing: a colorspace is applied without conversion because the source has no colorspace embedded, so the color assignments are arbitrary.)
jacque wrote: "but you might also be able to read it in as Latin-1"
Latin-1 would be good, but there is no option for it. And ISO-8859-1 would be just as good, but is only available on Linux, not OSX.
jacque wrote: "You might want to submit a feature request for this in the QCC too."
Yes, I will do so! It should not be hard to implement - all the code is already there. Just some encodings have been limited to certain OSs for whatever reason.
Thanks anyway!
Andreas