
Unicode manipulation sans objects in the loop

Posted: Sun Feb 12, 2017 5:06 pm
by theotherbassist
I'm currently pulling text from web sources (XML) and then analysing it according to word frequencies. When I do this I don't want S.N.L.' and S.N.L.’, for example, to show up as different words. Using "trueword" won't remove the single quote on the end of the latter, because it's a "’".

So at the moment I'm putting everything into a field via uniEncode to mesh all the text into ASCII prior to analysis. This weeds out the differences between sources that use Unicode and sources that don't.

But it seems so silly, and I'm sure it slows everything down--I have all my data in arrays, then I put it into a field, and then back to arrays again. Is there a way to convert the unicode text using no objects and only variables? If there is, I can't seem to get the syntax right.

Is there some way to do it without employing charToNum() and numToChar()? The CPU cost of doing that with hundreds of keys averaging ~15 words each seems unnecessary.

Re: Unicode manipulation sans objects in the loop

Posted: Mon Feb 13, 2017 7:10 pm
by jacque
Whenever you pull data from an external source, run it through textDecode to convert the encoded bytes (UTF-8, for example) into LiveCode's internal text format (UTF-16). That should fix things. As of LC 7 you shouldn't need the old uniEncode/uniDecode functions any more.
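
For anyone landing here later, a minimal sketch of how that might look using only variables and arrays, with no field involved. It assumes the source delivers UTF-8 (check the XML declaration or HTTP headers to be sure); the handler name countWords and its parameter pRawData are made up for illustration.

```livecode
-- Hypothetical helper: pRawData is assumed to hold the raw UTF-8 bytes
-- fetched from the web source. Requires LC 7 or later.
function countWords pRawData
   local tText, tWord, tCounts
   -- Decode the bytes into LiveCode text; everything stays in variables.
   put textDecode(pRawData, "UTF-8") into tText
   -- Fold the curly apostrophe (U+2019, decimal 8217) into a plain ASCII one,
   -- so words such as don't and don’t end up under the same key.
   replace numToCodepoint(8217) with "'" in tText
   -- Tally each word in an array keyed by the word itself.
   repeat for each trueWord tWord in tText
      add 1 to tCounts[tWord]
   end repeat
   return tCounts
end countWords
```

The returned array can be merged into existing frequency arrays directly. If the sources mix encodings, the "UTF-8" argument would need to be chosen per source rather than hard-coded.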

Re: Unicode manipulation sans objects in the loop

Posted: Mon Feb 13, 2017 11:55 pm
by theotherbassist
Thanks. Didn't know about textDecode.