Noob question on regexp

Posted: Wed May 13, 2009 11:05 pm
by WaltBrown
Hi!

I am trying to find URLs embedded in large text files. I want to search for occurrences of "://" and of "<stuff>.<stuff>.<stuff>" (three unknown chunks, without spaces, separated by dots) so that I can also identify URLs and IP addresses that don't have "http://" prepended.

Any suggestions?

Find chars "://" in field does not seem to work: I get no results on a small test document that is full of URLs. I also cannot see how to write a regexp for the dotted address, given that the contents and lengths of the chunks are unknown.

Thanks!

Walt

Posted: Wed May 13, 2009 11:10 pm
by WaltBrown
I got the first one to work, but any suggestions for the second form would be helpful.

Thanks,
Walt

Posted: Thu May 14, 2009 12:25 pm
by bn
Walt,
Here is my shot at it:

Code:

-- from the dictionary
-- offset(charsToFind,stringToSearch[,charsToSkip])

on mouseUp
   put field 1 into myVar
   put "" into tCollector
   put 0 into tCounter
   
   repeat -- repeats until first offset returns 0 i.e. not found
      
      --put offset(quote &"://",myVar,tCounter) into myHitStart -- if you start with a quote
      
      put offset("://",myVar,tCounter) into myHitStart -- without quote
      
      if myHitStart > 0 then 
         add myHitStart to tCounter
         put offset(quote,myVar, tCounter) into myEnd -- looking for next quote
         if myEnd = 0 then exit repeat -- second searchpattern not found
         
         -- select char tCounter to (tCounter + myEnd) of field 1 -- if you want to look at the selection
         -- wait 1 second -- if you want to look at the selection in field 1
         
         put char tCounter to (tCounter + myEnd - 1) of myVar & return after tCollector -- add the hit to the list, leaving out the closing quote
         add myEnd to tCounter
         
      else
         exit repeat
      end if
   end repeat
   if tCollector <> "" then delete last char of tCollector -- return
   put tCollector into field 2
end mouseUp
It gives you a list of hits. Within these hits you will still have to extract the things between the dots.
The next script gives you a stripped-down version of the hits:

Code:

on mouseUp
   put field 2 into myTemp
   replace "://" with empty in myTemp
   -- replace quote & "://" with empty in myTemp -- if you want to have a leading quote
   
   set the itemdelimiter to "/"
   repeat with i = 1 to the number of lines of myTemp
      put item 1 of line i of myTemp into line i of myTemp
   end repeat
   put myTemp into field 2
end mouseUp
To try it, make two fields and two buttons. Field 1 holds the source code of an HTML page, field 2 receives the results, and each button gets one of the scripts above. Give it a try.
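For the second form (dotted names with no "://" in front), a regular expression may be simpler than offset(). What follows is only a minimal sketch, not tested against your files. The function name dottedNames and the variable names are made up for illustration, and I am assuming that a "chunk" consists only of letters, digits and hyphens, which you may need to loosen. It uses matchChunk, which fills in the start and end positions of the parenthesized part of the pattern.

```livecode
-- Hypothetical helper, not part of the two button scripts.
-- Returns every run of three or more dot-separated chunks, one per line.
function dottedNames pText
   local tStart, tEnd, tHits, tPattern
   -- one chunk, followed by at least two more ".chunk" parts
   -- (assumption: a chunk is made of letters, digits and hyphens only)
   put "([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+){2,})" into tPattern
   repeat while matchChunk(pText, tPattern, tStart, tEnd)
      put char tStart to tEnd of pText & return after tHits
      delete char 1 to tEnd of pText -- keep searching after this hit
   end repeat
   if tHits is not empty then delete last char of tHits -- trailing return
   return tHits
end function
```

You could call it from a button with something like: put dottedNames(field 1) into field 2. Note that the pattern will also pick up things such as version numbers or file names with two dots in them, so the list probably needs filtering afterwards. Deleting the front of the text on every pass is simple but may be slow on very large files.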
regards
Bernd

Posted: Fri May 22, 2009 3:07 pm
by WaltBrown
Thanks, Bernd. I'm getting up to speed on the various flavors of regexp, especially on when some of the delimiting characters, like parentheses, are needed and when they are not. You've been a great help.