Unicode manipulation sans objects in the loop
Posted: Sun Feb 12, 2017 5:06 pm
I'm currently pulling text from web sources (XML) and then analysing it according to word frequencies. When I do this I don't want S.N.L.' and S.N.L.’, for example, to show up as different words. Using "trueword" won't remove the single quote on the end of the latter, because it's a "’".
So at the moment I'm putting everything into a field via uniEncode to mesh all the text into ASCII prior to analysis. This weeds out the differences between sources that use unicode and sources that don't.
But it seems so silly, and I'm sure it slows everything down--I have all my data in arrays, then I put it into a field, and then back to arrays again. Is there a way to convert the unicode text using no objects and only variables? If there is, I can't seem to get the syntax right.
Is there some way to do it without employing charToNum() and numToChar()? The CPU cost of doing that with hundreds of keys averaging ~15 words each seems unnecessary.
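To make the question concrete, here is a rough sketch of the kind of variables-only conversion I have in mind. It is untested and rests on assumptions: LiveCode 7 or later (where textDecode and numToCodepoint are available), source data arriving as UTF-8 bytes, and only the four curly-quote characters needing to be folded. The handler name foldToPlainQuotes is made up for illustration.

```livecode
-- Hypothetical helper: fold curly quotes to plain ASCII quotes using
-- only variables (no field involved).
-- Assumptions: LiveCode 7+, and pRawData holds UTF-8 encoded bytes.
function foldToPlainQuotes pRawData
   local tText
   -- Decode the raw bytes into LiveCode's internal Unicode text
   put textDecode(pRawData, "UTF-8") into tText
   -- U+2018 and U+2019: curly single quotes (8217 is the apostrophe form)
   replace numToCodepoint(8216) with "'" in tText
   replace numToCodepoint(8217) with "'" in tText
   -- U+201C and U+201D: curly double quotes
   replace numToCodepoint(8220) with quote in tText
   replace numToCodepoint(8221) with quote in tText
   return tText
end foldToPlainQuotes
```

Something like `put foldToPlainQuotes(tData) into tClean` could then run once per array element before the word count. Each replace works on the whole string at once, so there would be no per-character charToNum()/numToChar() loop. If the text is already decoded, the textDecode line would need to be dropped.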