Unicode manipulation sans objects in the loop

Post by **theotherbassist** » Sun Feb 12, 2017 5:06 pm

I'm currently pulling text from web sources (XML) and then analysing it according to word frequencies. When I do this I don't want S.N.L.' and S.N.L.’, for example, to show up as different words. Using "trueWord" won't remove the single quote on the end of the latter, because it's a "’".

So at the moment I'm putting everything into a field via uniEncode to mesh all the text into ASCII prior to analysis. This weeds out the differences between sources that use Unicode and sources that don't.

But it seems so silly, and I'm sure it slows everything down--I have all my data in arrays, then I put it into a field, and then back to arrays again. Is there a way to convert the unicode text using no objects and only variables? If there is, I can't seem to get the syntax right.

Is there some way to do it without employing charToNum() and numToChar()? The CPU cost of doing that with hundreds of keys averaging ~15 words each seems unnecessary.

Post by **jacque** » Mon Feb 13, 2017 7:10 pm

Whenever you pull data from an external source, run it through textDecode to convert the unicode to UTF16. That should fix things. As of LC 7 you shouldn't need the old uniEncode/decode functions any more.
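
A minimal sketch of that suggestion, using variables only and no field. It assumes the source is encoded as UTF-8 (check the XML declaration of each feed), and the handler name `wordFrequencies` and its parameter are hypothetical, made up for illustration. The curly apostrophe is folded into the plain one with a single `replace` over the whole text, so there is no per-character charToNum/numToChar loop.

```livecode
-- Hypothetical helper: decode raw bytes and count word frequencies
-- using variables only (no fields or other objects).
-- ASSUMPTION: the incoming data is UTF-8; change the encoding name if not.
function wordFrequencies pRawData
   local tText, tCounts, tWord
   -- Convert the encoded bytes into LiveCode's native text (LC 7 and later)
   put textDecode(pRawData, "UTF-8") into tText
   -- Fold the right single quotation mark (U+2019, decimal 8217)
   -- into the plain ASCII apostrophe, once for the whole text
   replace numToCodepoint(8217) with "'" in tText
   -- Tally each trueWord chunk in an array keyed by the word
   repeat for each trueWord tWord in tText
      add 1 to tCounts[tWord]
   end repeat
   return tCounts
end wordFrequencies
```

Pass in the raw data exactly as it was fetched, before any other text handling, so that textDecode sees the original bytes. Other look-alike punctuation (left single quote, curly double quotes) could be folded the same way with one extra `replace` line each.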

Post by **theotherbassist** » Mon Feb 13, 2017 11:55 pm

Thanks. Didn't know about textDecode.

LiveCode Forums.