
Regex

Posted: Sun Nov 30, 2014 4:45 pm
by Traxgeek
Hi,

I'm trying to use regex... and failing miserably. I have read a load of posts here (biggest contributor was Thierry), on Google (mostly StackOverflow) and read the Wikipedia 'how to' article but I still find understanding regex enough to create a simple 'search-for-and-remove-some-text' script eludes me !

I thought that if someone would be good enough to provide some help with my precise requirement I might be able to 'reverse engineer' it and apply the various instructions I've bookmarked to see precisely how the components of the specific regex statement work in my particular case.

An example of my (HTML) text :
<p></p>
<ul type="square">
<li>
<p firstindent="-36" leftindent="36" rightindent="15"><b><font face="Arial" size="12"
color="#262626" bgcolor="#FFFFFF">Free App of the Day (FAD) eligibility</font></b><font face="Arial" size="12" color="#262626" bgcolor="#FFFFFF">&nbsp;If the checkbox labeled&nbsp;<b>Yes, please consider this app for the program</b>&nbsp;is checked (this is the default), Amazon may select your app for this promotional program. If Amazon selects your app for FAD, Amazon will contact you with more details about what to expect as your app goes through the testing and approval process.</font></p>
</li>
</ul><p></p>


An example of what I'm trying to do :
(If I can work out how the regex statement works for this example then I should be able to expand it to do other things - well, that's my idea... :)
Isolate and remove all instances of, say, the color tag (red, bold text) but for ANY colour/value between the quotes, meaning that :
<p firstindent="-36" leftindent="36" rightindent="15"><b><font face="Arial" size="12" color="#262626" bgcolor="#FFFFFF">
becomes :
<p firstindent="-36" leftindent="36" rightindent="15"><b><font face="Arial" size="12" bgcolor="#FFFFFF">

I figure I can then start to work out the rest...

Can anyone enlighten me please ?
Thanks a million.

Trax.

Re: Regex

Posted: Sun Nov 30, 2014 7:23 pm
by Thierry
Traxgeek wrote: I'm trying to use regex...
An example of what I'm trying to do :
Isolate and remove all instances of, say, the color tag (red, bold text) but for ANY colour/value between the quotes, meaning that :
<p firstindent="-36" leftindent="36" rightindent="15"><b><font face="Arial" size="12" color="#262626" bgcolor="#FFFFFF">
becomes :
<p firstindent="-36" leftindent="36" rightindent="15"><b><font face="Arial" size="12" bgcolor="#FFFFFF">
Hi Traxgeek,

Here is one for a start:

Code:

   put replaceText( yourHtmlText, "\scolor=.#[A-F0-9]{6}.", empty) into whatever
PS: I'm using the dot to match the quote; I am a lazy man :roll:
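For anyone reverse engineering the pattern, here is a minimal sketch that checks it piece by piece. It uses Python's `re.sub` as a stand-in for LiveCode's `replaceText` (an assumption for illustration only; the thread itself is about LiveCode), and the variable names are hypothetical.

```python
import re

# Sample line from the original post.
html = ('<p firstindent="-36" leftindent="36" rightindent="15"><b>'
        '<font face="Arial" size="12" color="#262626" bgcolor="#FFFFFF">')

# \s            one whitespace character before the attribute name
# color=        the literal attribute name and the equals sign
# .             any single character (here: the opening quote)
# #             a literal hash
# [A-F0-9]{6}   exactly six characters, each an uppercase hex digit
# .             any single character (here: the closing quote)
pattern = r'\scolor=.#[A-F0-9]{6}.'

# Replace every match with the empty string.
cleaned = re.sub(pattern, '', html)
print(cleaned)
```

Note that `bgcolor` survives: the character before its `color` is a `g`, not whitespace, so `\s` cannot match there.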

HTH,
Thierry

Re: Regex

Posted: Sun Nov 30, 2014 11:20 pm
by Traxgeek
Hi Thierry,

Really, really appreciated.
AND I understand pretty much how it works (I think :D ).
I don't understand exactly what " I'm using the dot to match the quote; I am a lazy man" means but it's my homework !! :D
Not in my office right now (so can't try my theory) but I think by modifying your script to :
put replaceText( yourHtmlText, "\scolor|\sbgcolor=.#[A-F0-9]{6}.", empty) into whatever
should then remove BOTH the color AND the bgcolor tags ?
Anyways, I'm looking forward to trying it all out tomorrow.

Again, fantastic. Really appreciated. Thanks.

Trax.

Re: Regex

Posted: Mon Dec 01, 2014 6:00 am
by rkriesel
Hi, Trax.

Here's another way. This one avoids alternation and ranges, so I'd guess it'd be faster.

Code:

put replaceText(t1, "\s(?:bg)?color=\" & quote & "#[[:xdigit:]]{6}\" & quote, empty) into t2
The above (\" & quote & ") is more rigorous than the lazy (.) but probably equivalent in your data.
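The difference between the literal quote and the dot can be sketched as follows. This again uses Python's `re` module as a stand-in for `replaceText` (an assumption), with `[0-9A-Fa-f]` written out because Python's `re` has no POSIX bracket classes. The single-quoted input is hypothetical data, not something from the original post.

```python
import re

# The dot accepts any character around the colour value;
# the literal version insists on double quotes.
dot_version = r'\s(?:bg)?color=.#[0-9A-Fa-f]{6}.'
quote_version = r'\s(?:bg)?color="#[0-9A-Fa-f]{6}"'

double_quoted = '<font size="12" color="#262626">'
single_quoted = "<font size='12' color='#262626'>"  # hypothetical variant

results = {
    'dot/double': re.sub(dot_version, '', double_quoted),
    'quote/double': re.sub(quote_version, '', double_quoted),
    'dot/single': re.sub(dot_version, '', single_quoted),
    'quote/single': re.sub(quote_version, '', single_quoted),
}
for name, text in results.items():
    print(name, '->', text)
```

On double-quoted data both patterns behave the same; only the dot version also strips the single-quoted attribute.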

-- Dick

Re: Regex

Posted: Mon Dec 01, 2014 8:42 am
by Traxgeek
Hi Dick,

Thanks to you too !
I'm one happy bunny :D and off to 'play'... (and do my 'homework' - trying to figure out why these two scripts work...) Happy days...
Much appreciated, chaps.

Trax.

Re: Regex

Posted: Mon Dec 01, 2014 8:58 am
by Thierry
Traxgeek wrote: Hi Thierry,
Really, really appreciated.
AND I understand pretty much how it works (I think :D ).
Glad that you understand it :)
Traxgeek wrote: I don't understand exactly what " I'm using the dot to match the quote; I am a lazy man" means but it's my homework !! :D
I guess this needs a bit of clarification.
The first thought would be to put a quote instead of a dot, as that is exactly what you expect in your data.
But this is LiveCode, so you can't just write a quote inside a string literal; you have to type this instead of the dot: " & quote & "
And because it was Sunday evening, I was too lazy to type all this.
So it is not the dot that is lazy; in fact it does the job pretty well,
and is even a little faster than a quote (no test inside the regex engine).
But it is ridiculous to worry about this; most of the time you won't see any difference in terms of speed.
Traxgeek wrote: put replaceText( yourHtmlText, "\scolor|\sbgcolor=.#[A-F0-9]{6}.", empty) into whatever
should then remove BOTH the color AND the bgcolor tags ?
Almost.

Code:

put replaceText( yourHtmlText, "\s(?:color|bgcolor)=.#[A-F0-9]{6}.", empty) into whatever
or

Code:

put replaceText( yourHtmlText, "\s(?:bg)?color=.#[A-F0-9]{6}.", empty) into whatever
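Why the ungrouped version is only "almost" right: `|` has the lowest precedence in a regex, so it splits the whole pattern into two alternatives, `\scolor` on one side and `\sbgcolor=.#[A-F0-9]{6}.` on the other. A minimal sketch of the effect, using Python's `re.sub` in place of `replaceText` (an assumption for illustration; the variable names are hypothetical):

```python
import re

html = '<font face="Arial" size="12" color="#262626" bgcolor="#FFFFFF">'

# Ungrouped: matches either " color" alone, or the complete bgcolor attribute.
ungrouped = r'\scolor|\sbgcolor=.#[A-F0-9]{6}.'
# Grouped: the optional "bg" prefix is scoped by (?:...), so the
# =.#......  part applies to both attribute names.
grouped = r'\s(?:bg)?color=.#[A-F0-9]{6}.'

ungrouped_result = re.sub(ungrouped, '', html)
grouped_result = re.sub(grouped, '', html)

print(ungrouped_result)  # <font face="Arial" size="12"="#262626">
print(grouped_result)    # <font face="Arial" size="12">
```

The ungrouped pattern removes only the word ` color` and leaves its value behind, which corrupts the tag.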
There is also a POSIX way of writing this: [A-Fa-f0-9] -> [[:xdigit:]]
This is only syntactic sugar, as [[:xdigit:]] will be translated to [A-Fa-f0-9].
And you won't win a nanosecond, even with gigabytes of data; so it's a matter of what you like, nothing else.
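One practical difference worth checking: the patterns earlier in the thread use `[A-F0-9]`, which covers uppercase hex digits only, while `[A-Fa-f0-9]` (and its POSIX equivalent) also accepts lowercase. A small sketch, assuming Python's `re.sub` as a stand-in for `replaceText` and a hypothetical lowercase colour value; the classes are spelled out because Python's `re` does not support POSIX bracket classes.

```python
import re

upper_only = r'\scolor=.#[A-F0-9]{6}.'
any_case = r'\scolor=.#[A-Fa-f0-9]{6}.'

# Hypothetical input with a lowercase colour value.
lowercase_html = '<font size="12" color="#ffcc00">'

upper_result = re.sub(upper_only, '', lowercase_html)
any_case_result = re.sub(any_case, '', lowercase_html)

print(upper_result)     # unchanged: lowercase digits do not match [A-F0-9]
print(any_case_result)  # <font size="12">
```

If the HTML always carries uppercase colour values, as in the sample from the first post, both classes give the same result.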

About performance: writing (?:bg)?color or (?:color|bgcolor) won't make much difference in most situations!
The exception is when you know your data; then you can optimize your regex.
But in your case, I don't think you need to worry about that. If you really want to understand why,
just ask and I'll explain in more detail...

Happy regexing :)

Thierry

Re: Regex

Posted: Wed Dec 03, 2014 8:09 am
by Traxgeek
Hi Thierry,

Amazing. I've now spent quite a few hours on Regex since your help and it's a powerful (if a little confusing to read) topic. I've been practising removing multiple specific tags using the ' | ' (or) symbol and playing with the POSIX A-F method. A lot to learn (and retain).

Much obliged - especially for the explanations after your initial response. Really useful.

Trax

Re: Regex

Posted: Wed Dec 03, 2014 9:23 am
by Thierry
Traxgeek wrote: Hi Thierry,

Amazing. I've now spent quite a few hours on Regex since your help and it's a powerful (if a little confusing to read) topic. I've been practising removing multiple specific tags using the ' | ' (or) symbol and playing with the POSIX A-F method. A lot to learn (and retain).

Much obliged - especially for the explanations after your initial response. Really useful.
Hi Trax,

Thanks for the positive feedback.

Yes, the syntax is a bit terse when starting,
but there is not that much to learn; and obviously you are doing well :)

Good luck,

Thierry