is ASCII operator
Posted: Fri Jun 21, 2013 11:55 pm
by monte
This would be a helpful one for lcVCS... all custom properties need to be tested to see whether they are ASCII so they can be base64-encoded if necessary. I'm currently using this function:
Code:
function needsEncoding pData
   if pData contains null then return true
   repeat for each byte tByte in pData
      put charToNum(tByte) into tNum
      if tNum > 127 then
         return true
      end if
   end repeat
   return false
end needsEncoding
I think such a test would be considerably faster in the engine. What do we think?
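For anyone who wants to check the logic outside LiveCode, here is a minimal Python sketch of the same test: data is flagged when it contains a null byte or any byte above 127, and only flagged data gets base64-encoded. The names needs_encoding and encode_if_needed are invented for this illustration; they are not part of lcVCS or mergJSON.

```python
import base64


def needs_encoding(data: bytes) -> bool:
    """Return True if data holds a null byte or any byte above 127."""
    # Mirrors the LiveCode function above: null is rejected explicitly,
    # and any byte outside the 7-bit ASCII range also forces encoding.
    if b"\x00" in data:
        return True
    return any(byte > 127 for byte in data)


def encode_if_needed(data: bytes) -> bytes:
    """Hypothetical helper: base64-encode only when the test says so."""
    return base64.b64encode(data) if needs_encoding(data) else data
```

Because the scan stops at the first offending byte, the cost depends on where that byte sits; pure-ASCII data is always scanned to the end.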
Re: is ASCII operator
Posted: Sat Jun 22, 2013 3:06 am
by mwieder
Depends. Is this for xml? If so there's more than just ASCII to watch out for.
Also, ASCII is more restrictive if you're looking at printing characters, so you'll need something like
Code:
function needsEncoding pData
   if pData contains null then return true
   repeat for each byte tByte in pData
      -- if XML then
      --if tByte is ">" or tByte is "<" then
      --return true
      --end if
      put charToNum(tByte) into tNum
      if tNum < 32 then
         return true
      end if
      if tNum > 126 then
         return true
      end if
      -- also, don't put punctuation into xml tags
      --if tNum < 48 then
      --return true
      --end if
      -- switch tNum
      --case 58
      --case 59
      --case 60
      --case 61
      --case 62
      --case 63
      --case 64
      --case 91
      --case 92
      --case 93
      --case 94
      --case 96
      --case 123
      --case 124
      --case 125
      --case 126
      --return true
      --end switch
   end repeat
   return false
end needsEncoding
Re: is ASCII operator
Posted: Sat Jun 22, 2013 3:43 am
by monte
No, it's for mergJSON, which handles everything below 128 except NULL... But I expect the operator would be useful outside that...
Re: is ASCII operator
Posted: Sat Jun 22, 2013 6:29 am
by rkriesel
Since you want speed, I suggest replacing the function call for each byte with an array look-up for each byte.
Code:
function needsEncoding pData
   local tArray
   repeat with i = 0 to 255
      -- key the array by the byte itself so the look-up in the second loop matches
      if ( i < 48 ) or ( i >= 58 and i <= 64 ) or ( i >= 91 and i <= 96 ) or ( i >= 123 ) then put "true" into tArray[ numToChar( i ) ]
   end repeat
   repeat for each byte tByte in pData
      if tArray[ tByte ] then return "true"
   end repeat
   return "false"
end needsEncoding
I haven't tested the code above. If the idea works for you, please let us know how it affected the speed.
An "is ASCII" operator might be expected to return "false" for an empty input, so it would not always be the inverse of function "needsEncoding."
-- Dick
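The look-up idea can be sketched in Python as follows, assuming the table is built once and indexed by byte value. Iterating over a Python bytes object yields integers, so the index is the byte value itself; in a LiveCode version the array keys would have to match whatever "repeat for each byte" yields. FLAGGED and needs_encoding_lookup are hypothetical names, and the ranges are the ones from the post above (everything except 0-9, A-Z and a-z is flagged).

```python
# 256-entry table, built once: True for every byte value that forces encoding.
FLAGGED = [
    i < 48 or 58 <= i <= 64 or 91 <= i <= 96 or i >= 123
    for i in range(256)
]


def needs_encoding_lookup(data: bytes) -> bool:
    """Return True as soon as one byte is marked in the table."""
    # One list index per byte replaces the per-byte comparisons.
    return any(FLAGGED[byte] for byte in data)
```

Whether this beats the comparison loop would need measuring, as the post says.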
Re: is ASCII operator
Posted: Sat Jun 22, 2013 11:06 am
by monte
After looking at the operator code I'm hoping that's high on the refactoring agenda...
Re: is ASCII operator
Posted: Sat Jun 22, 2013 12:32 pm
by LCMark
How about 'x is an ascii string', 'x is a native string' (and later - when we have unicode - 'x is a unicode string')? That is, a test of whether the contents of a string require just ascii, the native charset (or, later, unicode) to represent.
Essentially this means extending MCIs.
In regards to refactoring - yes, all syntax is refactored to split syntax from implementation. Adding operators will be as easy as any other syntax.
Re: is ASCII operator
Posted: Sat Jun 22, 2013 11:28 pm
by monte
Well... obviously detecting 'is an ascii string' is easy... beyond that it gets a bit curly doesn't it?... If you knew the underlying encoding you could work out if it could be represented as ascii (like what we did for the unicode props in the properties) but without knowing that how do we go about it? I think I asked virtually the same question a few weeks back about the future unicode plans

Re: is ASCII operator
Posted: Sun Jun 23, 2013 1:42 am
by monte
I've just sent a pull request for "is [not] an ascii string" but would need some guidance on the native variant if you want me to look at that.
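One thing to watch when swapping a hand-written byte loop for an ASCII test: null (byte 0) is itself an ASCII character, so a pure ASCII test would not catch it. Below is a rough Python analogue that uses the built-in bytes.isascii() (Python 3.7+) as a stand-in for the proposed operator. How the LiveCode operator treats null or empty input is not stated in this thread, so this illustrates the general idea only; needs_encoding_builtin is a made-up name.

```python
def needs_encoding_builtin(data: bytes) -> bool:
    """Hypothetical replacement for the byte loop, built on a built-in ASCII test."""
    # bytes.isascii() is True for every byte in 0..127, including null,
    # so the null check has to stay as a separate step.
    return b"\x00" in data or not data.isascii()
```

Python's isascii() returns True for empty input, which is one possible answer to the question of what an empty value should give.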
Re: is ASCII operator
Posted: Sun Jun 23, 2013 4:01 pm
by DarScott
Some years ago I mentioned on the improve list using 'every' and 'some' for multiple array and chunk operations and comparisons. I don't remember what syntax I suggested. Maybe like this:
Code:
if charToNum( some char of x ) > 127 then
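For comparison, some existing languages already have this kind of quantifier. In Python, any() plays the role of the proposed 'some' and all() the role of 'every'. This is only an analogy for the semantics, not a proposal for LiveCode syntax, and the function names are made up.

```python
def some_char_above_127(x: str) -> bool:
    """True if at least one character has a code point above 127."""
    return any(ord(ch) > 127 for ch in x)


def every_char_at_most_127(x: str) -> bool:
    """True if all characters have code points of 127 or less."""
    return all(ord(ch) <= 127 for ch in x)
```

On empty input the 'some' form is false and the 'every' form is true, which is the usual convention for these quantifiers.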
Re: is ASCII operator
Posted: Sun Jun 23, 2013 4:05 pm
by DarScott
How does this get into the dictionary?
Re: is ASCII operator
Posted: Sun Jun 23, 2013 4:58 pm
by mwieder
Indefinite articles are a problem in natural-language processing. "Any" is an analog for the random function in xtalk, but I could also see "any" used for what you want here:
I could also see where "some" might be used in place of "any" in different contexts.
Re: is ASCII operator
Posted: Sun Jun 23, 2013 10:18 pm
by monte
DarScott wrote: How does this get into the dictionary?
I think someone... possibly me... needs to submit a pull request with a doc... wherever the docs are, haven't looked yet.
Re: is ASCII operator
Posted: Mon Jun 24, 2013 10:05 am
by LCMark
monte wrote: I think someone... possibly me... needs to submit a pull request with a doc... wherever the docs are, haven't looked yet.
Indeed - we've now merged the docs and a release note system into 'develop', so we are starting to ask that contributions come with release notes and dictionary entries. These things are under the 'docs' folder in the livecode repo. There's info about both these things in the contribution docs (http://livecode.com/community/contribute-to-livecode/).
Re: is ASCII operator
Posted: Mon Jun 24, 2013 10:13 am
by monte
hmm... well in this case I branched off master... so do I branch off develop to create the docs or what?
Re: is ASCII operator
Posted: Mon Jun 24, 2013 11:00 am
by LCMark
monte wrote: Well... obviously detecting 'is an ascii string' is easy... beyond that it gets a bit curly doesn't it?... If you knew the underlying encoding you could work out if it could be represented as ascii (like what we did for the unicode props in the properties) but without knowing that how do we go about it? I think I asked virtually the same question a few weeks back about the future unicode plans

My suggestion above was a little terse (I didn't have much time to post this weekend).
At the moment all strings in the engine are 'native strings' (sequences of bytes that are interpreted as being the native text encoding). Due to the 1-1 mapping between char and byte in the native encodings, all strings in the engine are also 'binary strings'. This duality works very well - until you want to manipulate text that is in a larger encoding than the native ones (i.e. one that takes more than 1 byte per char).
So, right now, 'is a native string' will always return true for values that convert to strings (which is all at the moment).
Moving forward, all strings in the engine will be replaced by an MCStringRef abstraction. This opaque type will be able to hold either a native/binary string or a unicode string. More abstractly, an MCStringRef represents a sequence of characters - there's no need (from the outside) to be concerned about the internal representation (or encoding).
At that point 'is a native string' might not return true, if the text contained within the string cannot be converted (losslessly) to the native encoding.
In fact, (in the future) a whole family of 'is a string' type operators would be useful:
- is a binary string - returns true if the string can convert to binary (i.e. is natively encoded) (and the value converts to a string)
- is a native string - returns true if the string can be encoded as native (and the value converts to a string)
- is a simple unicode string - returns true if the string can be encoded in unicode with no surrogate pairs (and the value converts to a string)
- is a unicode string - returns true if the value converts to a string
- is a string - returns true if the value converts to a string
So the above will probably cause more questions than it answers, but at least it's a start
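To make the proposed family a bit more concrete, here is a rough Python sketch of three of the tests applied to text. It rests on assumptions that are not from the post: ISO-8859-1 stands in for 'the native encoding' (the real one depends on the platform), 'simple unicode' is read as 'every code point fits in a single UTF-16 code unit', and all function names are hypothetical.

```python
def is_ascii_string(text: str) -> bool:
    """True if every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)


def is_native_string(text: str, native_encoding: str = "latin-1") -> bool:
    """True if text converts losslessly to the (assumed) native encoding."""
    try:
        # Encoding raises if any character has no native representation.
        text.encode(native_encoding)
    except UnicodeEncodeError:
        return False
    return True


def is_simple_unicode_string(text: str) -> bool:
    """True if no character would need a UTF-16 surrogate pair."""
    # Code points above U+FFFF are the ones UTF-16 stores as surrogate pairs.
    return all(ord(ch) <= 0xFFFF for ch in text)
```

Under these assumptions each test is less restrictive than the one before it: any ASCII string is also native, and any native string is also simple unicode.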
