MatchText Issue

WaltBrown
Posts: 466
Joined: Mon May 11, 2009 9:12 pm

MatchText Issue

Post by WaltBrown » Mon Aug 04, 2014 12:03 pm

I must have slept through the RegExp on LC class but...

I read in a file as binary, then want to parse it manually. In one test I read in an XML file. So the first line of the file is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

which dumps into a field correctly. So I first want to find any printable character strings, which should be the whole line in this case. I use this call with the regexp set for the entire set of ASCII printable characters via their Unicode value:

matchChunk(gFileContents, "([\x{0020}-\x{007e}]*)", tStart, tEnd)

The incomplete string I get back is:

1.0" encoding="UTF-8" standalone="yes

Pilot error somewhere? Note that I get the same result with non-Unicode "([\x20-\x7e]*)" regexp as well.

If I use the simpler "([A-Za-z]*)" it is even more interesting - I get back "yes". I would have expected to get "xml" instead.

Thanks, Walt
Walt Brown
Omnis traductor traditor

Thierry
VIP Livecode Opensource Backer
Posts: 875
Joined: Wed Nov 22, 2006 3:42 pm

Re: MatchText Issue

Post by Thierry » Mon Aug 04, 2014 12:53 pm

Walt,

Please, try this one:

([\<?\"\x{0020}-\x{007e}]*)


PS: you have to put the regex in a custom property or in a field;
otherwise you'll get stuck on the quote character inside the regex.
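
A minimal sketch of the custom-property approach, not a definitive implementation. It assumes the raw pattern is stored, without surrounding quotes, in a hypothetical custom property named cPattern on the button, that gFileContents already holds the file data, and that a field named "Result" exists to show the outcome:

```livecode
global gFileContents

on mouseUp
   local tPattern, tStart, tEnd
   -- cPattern is a hypothetical custom property holding the raw pattern,
   -- e.g. ([\x{0020}-\x{007e}]*) with no surrounding quotes
   put the cPattern of me into tPattern
   if matchChunk(gFileContents, tPattern, tStart, tEnd) then
      -- tStart and tEnd are the character positions of the first capture group
      put char tStart to tEnd of gFileContents into field "Result" -- hypothetical field
   end if
end mouseUp
```

Because the pattern never appears as a quoted literal in the script, any quote character inside it needs no special handling.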

As for your last question, the answer is correct!
Look up the concept of greediness in regex to understand this.

HTH,

Thierry
!
SUNNY-TDZ.COM doesn't belong to me since 2021.
To contact me, use the Private messages. Merci.
!

[-hh]
VIP Livecode Opensource Backer
Posts: 2262
Joined: Thu Feb 28, 2013 11:52 pm

Re: MatchText Issue

Post by [-hh] » Tue Aug 05, 2014 3:37 pm

. .
Last edited by [-hh] on Sat Sep 27, 2014 10:05 pm, edited 1 time in total.
shiftLock happens

Thierry
VIP Livecode Opensource Backer
Posts: 875
Joined: Wed Nov 22, 2006 3:42 pm

Re: MatchText Issue

Post by Thierry » Wed Aug 06, 2014 8:03 am

[-hh] wrote: Sorry Thierry, I didn't ask you before writing this. But I found no complicated regex thread where the solution is not from you. If you don't like this proposal then take it as praise and 'Dankeschön' for your expert help for us in this field.

Hermann,

Don't worry, I appreciate your words.

For your suggestion: Warum nicht :)

Regards,

Thierry

WaltBrown
Posts: 466
Joined: Mon May 11, 2009 9:12 pm

Re: MatchText Issue

Post by WaltBrown » Fri Aug 08, 2014 1:30 pm

Thanks Thierry. The problem was the processing of the field under the hood. I was entering my regex in a field without quotes and parentheses, and adding them to the front and back in the script. ([\x{0020}-\x{007e}]) without adding quotes gets the whole string as well, without the \<?\" component (I was not specifically looking for tags).
Walt

Thierry
VIP Livecode Opensource Backer
Posts: 875
Joined: Wed Nov 22, 2006 3:42 pm

Re: MatchText Issue

Post by Thierry » Fri Aug 08, 2014 1:40 pm

WaltBrown wrote: Thanks Thierry.
([\x{0020}-\x{007e}]) without adding quotes gets the whole string as well, without the \<?\" component (I was not specifically looking for tags).
Walt

Glad that it works.

In fact, my suggestion was only meant to check whether the Unicode pattern was accepted.
Since you didn't specify which platform and which LC version, it might have been a bug...

So you definitely do not need these extra characters.

Regards,

Thierry

chrisw
Posts: 11
Joined: Thu Jul 03, 2014 8:45 am

Re: MatchText Issue

Post by chrisw » Sun Aug 10, 2014 7:52 am

Hello Thierry,
is it possible to use regex to find phrases in htmlText where the beginning and the end are the same, but the characters in the middle vary?

e.g. "<font face="Georgia" size="10">zu xxx</font>"
"<font face="Georgia" size="10">zu yyy</font>"

"xxx" and "yyy" are different words.
I need to find all words with this font and size in a text.
Could regex do that? How?

Thanks in advance,

chrisw

Thierry
VIP Livecode Opensource Backer
Posts: 875
Joined: Wed Nov 22, 2006 3:42 pm

Re: MatchText Issue

Post by Thierry » Sun Aug 10, 2014 9:41 am

chrisw wrote: e.g. "<font face="Georgia" size="10">zu xxx</font>"
"<font face="Georgia" size="10">zu yyy</font>"

"xxx" or "yyy" are different words.
I had to find all words with this font and size in a text.
Could regex do that? How?

Chrisw,

Absolutely, regex is *one* of the perfect tools to do that.

Umm, I have to leave right now, so I'll give you something to start with.
Please clarify your question if needed, and I'll come back tonight
or tomorrow, or someone else could help too :)

This will match the first occurrence: (\w+)
Parentheses capture whatever is matched by the pattern inside them.
\w matches any word character; the + means one or more repetitions of the preceding pattern.
In the regex, I replaced the quotes with dots (a dot matches any single character). It's a trick that lets you simply write the regex *in* your script!

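on mouseUp
   local T, RX, theMatch

   -- T must hold the text to search; adjust the field reference to your stack
   put the htmlText of field 1 into T

   -- each dot stands in for a quote character (a dot matches any single character)
   put "<font face=.Georgia. size=.10.>zu (\w+)</font>" into RX

   if matchText( T, RX, theMatch ) then
      -- theMatch now holds the word captured by (\w+)
      -- do whatever you like with theMatch
   end if
end mouseUp
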
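
If you would rather match the quote characters exactly instead of using the dot trick, one option is to build the pattern with LiveCode's quote constant. This is only a sketch of that variant; field 1 is an assumption standing in for wherever your styled text lives:

```livecode
on mouseUp
   local T, RX, theMatch
   -- assumption: field 1 holds the styled text to search
   put the htmlText of field 1 into T
   -- build the pattern with the quote constant instead of dots
   put "<font face=" & quote & "Georgia" & quote & " size=" & quote & "10" & quote & ">zu (\w+)</font>" into RX
   if matchText(T, RX, theMatch) then
      -- theMatch holds the word captured by (\w+)
      answer theMatch
   end if
end mouseUp
```

The dot version is shorter to type, while this one will not accept some other character where a quote is expected.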
Now you have to loop over your whole text.
Using matchChunk(), you get character positions instead of the matched text.
You can use those positions to drive your loop.
You can also use replaceText(), but that might be a bit more tricky
if you are not used to regex.
I have already shown most of the tricks for that in this forum.
Sorry, no time right now to point to where, but if you search this forum
for my name plus regex, pattern, match, etc., the sun might shine.
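
The matchChunk() loop described above could be sketched roughly like this. The function name listMatches and its parameters are made up for illustration, and the sketch assumes the pattern contains exactly one capture group that never matches an empty string:

```livecode
function listMatches pText, pRegex
   local tStart, tEnd, tList
   repeat while matchChunk(pText, pRegex, tStart, tEnd)
      -- stop if the capture group matched nothing, to avoid looping forever
      if tEnd < tStart then exit repeat
      put char tStart to tEnd of pText & return after tList
      -- drop everything up to the end of this capture and search the rest
      delete char 1 to tEnd of pText
   end repeat
   delete the last char of tList -- remove the trailing return
   return tList
end listMatches
```

Called as listMatches(T, RX) with the T and RX from the handler above, it should return the captured words one per line. pText is a copy of the caller's text, so deleting from it does not touch the original.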

Regards,

Thierry
