LiveCode Forums.

Posted: **Mon Aug 04, 2014 12:03 pm**

I must have slept through the RegExp on LC class but...

I read in a file as binary, then want to parse it manually. In one test I read in an XML file. So the first line of the file is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

which dumps into a field correctly. So I first want to find any printable character strings, which should be the whole line in this case. I use this call with the regexp set for the entire set of ASCII printable characters via their Unicode value:

Code:

matchChunk(gFileContents, "([\x{0020}-\x{007e}]*)", tStart, tEnd)

The incomplete string I get back is:

1.0" encoding="UTF-8" standalone="yes

Pilot error somewhere? Note that I get the same result with non-Unicode "([\x20-\x7e]*)" regexp as well.

If I use the simpler "([A-Za-z]*)" it is even more interesting - I get back "yes". I would have expected to get "xml" instead.

Thanks, Walt

Posted: **Mon Aug 04, 2014 12:53 pm**

Walt,

Please, try this one:

([\<?\"\x{0020}-\x{007e}]*)

PS: You have to put the regex in a custom property or in a field;
otherwise you'll get stuck on the quote character inside the regex.
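
A minimal sketch of two ways around the quote problem, for anyone reading along. It is not from the posts above: the handler name, the custom property cRegex, and the field "Result" are hypothetical names chosen for illustration, and gFileContents is assumed to be the global from the first post. The backslashes before < and " are dropped here because neither character needs escaping inside a character class.

```livecode
global gFileContents

on findPrintableRun
   local tRegex, tStart, tEnd

   -- Option 1: splice the quote character in with the built-in quote constant,
   -- so the pattern can live in the script itself.
   put "([<?" & quote & "\x{0020}-\x{007e}]*)" into tRegex

   -- Option 2: keep the pattern in a custom property (cRegex is a hypothetical name)
   -- and read it at run time, so no quoting is needed in the script at all.
   -- put the cRegex of me into tRegex

   if matchChunk(gFileContents, tRegex, tStart, tEnd) then
      -- tStart and tEnd are the character positions of the parenthesized group
      put char tStart to tEnd of gFileContents into field "Result" -- hypothetical field
   end if
end findPrintableRun
```

LiveCode string literals have no escape sequences, so a sequence like \x{0020} reaches the regex engine unchanged; only the quote character itself has to be handled separately.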

As for your last question, the result is correct!
Look up the concept of greediness in regex to understand why.

HTH,

Thierry

Posted: **Tue Aug 05, 2014 3:37 pm**

. .

Posted: **Wed Aug 06, 2014 8:03 am**

[-hh] wrote: Sorry Thierry, I didn't ask you before writing this. But I have found no complicated regex thread where the solution is not from you. If you don't like this proposal, then take it as praise and a 'Dankeschön' (thank you) for your expert help for us in this field.

Hermann,

Don't worry, I appreciate your words.

As for your suggestion: why not?

Regards,

Thierry

Posted: **Fri Aug 08, 2014 1:30 pm**

Thanks Thierry. The problem was the processing of the field under the hood. I was entering my regex in a field without quotes and parentheses, and adding them to the front and back in the script. ([\x{0020}-\x{007e}]) without adding quotes gets the whole string as well, without the \<?\" component (I was not specifically looking for tags).
Walt

Posted: **Fri Aug 08, 2014 1:40 pm**

WaltBrown wrote: Thanks Thierry.
([\x{0020}-\x{007e}]) without adding quotes gets the whole string as well, without the \<?\" component (I was not specifically looking for tags).
Walt

Glad that it works.

In fact, my suggestion was only meant to check whether the Unicode pattern was accepted.
As you didn't specify which platform and which LC version you are using, it might have been a bug...

So, you definitely do not need these extra characters.

Regards,

Thierry

Posted: **Sun Aug 10, 2014 7:52 am**

Hello Thierry,
is it possible to use regex to find phrases in htmlText where the beginning and the end are the same, but the characters in the middle vary?

e.g. "zu xxx"
"zu yyy"

"xxx" or "yyy" are different words.
I have to find all words with this font and size in a text.
Could regex do that? How?

Thanks in advance,

chrisw

Posted: **Sun Aug 10, 2014 9:41 am**

chrisw wrote: e.g. "zu xxx"
"zu yyy"

"xxx" or "yyy" are different words.
I had to find all words with this font and size in a text.
Could regex do that? How?

Chrisw,

Absolutely, regex is *one* of the perfect tools to do that.

Umm, I have to leave right now, so I'll give you something to start with.
Please clarify your question if needed, and I'll come back tonight
or tomorrow; someone else could help too.

This will match the first occurrence: (\w+)
Parentheses capture whatever is matched by the pattern inside them.
\w matches any word character; the + means one or more times the preceding pattern.
In the regex, I replace the quotes with a dot (a dot matches any single character, so it also matches a quote). A trick to be able to simply write the regex *in* your script!

Code:

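on mouseUp
   local T, RX, theMatch

   -- T must hold the htmlText to search; field "Text" is a hypothetical name
   put the htmlText of field "Text" into T
   put "<font face=.Georgia. size=.10.>zu (\w+)</font>" into RX

   if matchText(T, RX, theMatch) then
      -- do whatever you like with theMatch
   end if
end mouseUp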

Umm, now you have to loop over your whole text.
Using matchChunk(), you get text indexes instead of the matched text.
You can use these indexes to drive your loop.
You can also use replaceText(), but that might be a bit more tricky
if you are not used to regex.
I have already shown most of the tricks for that in this forum.
Sorry, no time now to point to where, but searching this forum
for my name plus regex, pattern, match, ... might make the sun shine.
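
The loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not tested against a particular LC version: the function name wordsAfterZu and the parameter pHtml are made up here, and it assumes the htmlText really contains the font tag in the form used in the pattern above. It relies on matchChunk() filling its two position variables with the start and end of the parenthesized group, relative to the string it was given.

```livecode
function wordsAfterZu pHtml
   local tRegex, tStart, tEnd, tOffset, tRest, tWords

   -- Dots stand in for the quote characters, as in the pattern above.
   put "<font face=.Georgia. size=.10.>zu (\w+)</font>" into tRegex
   put 0 into tOffset

   repeat forever
      -- Search only the part of the text after the previous match.
      put char (tOffset + 1) to -1 of pHtml into tRest
      if not matchChunk(tRest, tRegex, tStart, tEnd) then exit repeat

      -- tStart and tEnd are positions of the (\w+) group within tRest.
      put (char tStart to tEnd of tRest) & return after tWords

      -- Move past the word just found so the next pass finds the next one.
      add tEnd to tOffset
   end repeat

   delete the last char of tWords -- drop the trailing return
   return tWords
end wordsAfterZu
```

Called as `put wordsAfterZu(the htmlText of field "Text")`, where "Text" is again a hypothetical field name, it would return the captured words one per line. Because \w+ always matches at least one character, the offset grows on every pass and the loop ends once no further match is found.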

Regards,

Thierry

MatchText Issue