Internet Library Question - How to retrieve every link?

Bringing the internet highway into your project? Building FTP, HTTP, email, chat or other client solutions?

Moderators: FourthWorld, heatherlaine, Klaus, kevinmiller, robinmiller

MasterchiefJB
Posts: 76
Joined: Sat Nov 07, 2009 7:43 pm

Internet Library Question - How to retrieve every link?

Post by MasterchiefJB » Mon May 09, 2011 6:39 pm

So I was able to open revBrowser, and I can navigate to a specified website using the following code:

Code: Select all

revBrowserNavigate tBrowserId, "http://www.ybbo.de/member/index.php"
There are several links on this website and I would like to retrieve a list of all available links. Then my app should navigate to each link
and stay there for several seconds. Is this possible with Revolution?

My attempt to navigate to each retrieved link would look like this:

Code: Select all


revBrowserNavigate tBrowserId2, tRetrievedLink_1
wait 2 seconds
revBrowserNavigate tBrowserId2, tRetrievedLink_2
wait 2 seconds
revBrowserNavigate tBrowserId2, tRetrievedLink_3
wait 2 seconds
revBrowserNavigate tBrowserId2, tRetrievedLink_And_So_On
wait 2 seconds
The only problem for me is retrieving this list of links and putting them into variables.
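Once the links are in a return-delimited list, a single repeat loop replaces the numbered variables; a minimal sketch, with tLinks as a hypothetical variable holding one link per line:

Code: Select all

-- loop over a return-delimited list of links
-- (tLinks is hypothetical; it holds one link per line)
repeat for each line tLink in tLinks
   revBrowserNavigate tBrowserId2, tLink
   wait 2 seconds with messages -- let the browser keep rendering
end repeat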

Any kind of help is appreciated!
Regards,
Masterchief

Mark
Livecode Opensource Backer
Posts: 5150
Joined: Thu Feb 23, 2006 9:24 pm
Contact:

Re: Internet Library Question - How to retrieve every link?

Post by Mark » Sun May 15, 2011 12:06 am

Hi,

To get every link of a web page, you need to get the htmlText from the revBrowser and parse it.

To get the htmlText, use

Code: Select all

put revBrowserGet(gBrowserInstance,"htmlText") into myHtmlText
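For context, gBrowserInstance is the id returned when the browser instance was opened; a minimal sketch of the whole sequence, assuming the browser lives in the current stack window:

Code: Select all

-- open a browser in this stack's window, load the page,
-- then read back the rendered htmlText
put revBrowserOpen(the windowId of this stack, "http://www.ybbo.de/member/index.php") into gBrowserInstance
put revBrowserGet(gBrowserInstance, "htmlText") into myHtmlText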
Best,

Mark
The biggest LiveCode group on Facebook: https://www.facebook.com/groups/livecode.developers
The book "Programming LiveCode for the Real Beginner"! Get it here! http://tinyurl.com/book-livecode

MasterchiefJB
Posts: 76
Joined: Sat Nov 07, 2009 7:43 pm

Re: Internet Library Question - How to retrieve every link?

Post by MasterchiefJB » Mon May 23, 2011 3:27 pm

Thank you very much Mark,

but could you give me an example of how to parse HTML code? I am sorry, but this is totally new to me.

Best,
Masterchief

BvG
VIP Livecode Opensource Backer
Posts: 1239
Joined: Sat Apr 08, 2006 1:10 pm
Contact:

Re: Internet Library Question - How to retrieve every link?

Post by BvG » Tue May 24, 2011 12:08 pm

To parse text, you use the text parsing capabilities of the language. In LiveCode, these are described in the User Guide PDF, accessible from the Help menu. Look for chunk expressions, the itemDelimiter and lineDelimiter properties, as well as repeat and if structures.

I'm sorry to be unspecific, but your question basically boils down to "I don't know how to program, please tell me how to do that" ;)

Oh, and to get the HTML source of a page, it's easier to use "put URL" instead of the browser object:

Code: Select all

put url "http://google.com" into theHTML
Various teststacks and stuff:
http://bjoernke.com

Chat with other RunRev developers:
chat.freenode.net:6666 #livecode

MasterchiefJB
Posts: 76
Joined: Sat Nov 07, 2009 7:43 pm

Re: Internet Library Question - How to retrieve every link?

Post by MasterchiefJB » Wed Jul 27, 2011 2:40 pm

Hi again,

I completed another project and now I have time to work on this tool again!^^

I just needed some time to understand my own code, but I already came up with a new challenge for you!

So this code:

Code: Select all

put revBrowserGet(gBrowserInstance,"htmlText") into myHtmlText
and this code:

Code: Select all

put url "http://google.com" into theHTML
are both intended to retrieve the HTML source of a website.

The problem is: the HTML text I retrieve is just a small portion, not the entire source as I can review it in Firefox or IE with 'View Source'.

So is it possible to retrieve the entire HTML sourcecode of a website within Livecode?

And I am just curious whether the put and get commands also include JavaScript expressions if they are contained in the source code of the website.

Any help is greatly appreciated!

Cheers!

PS: It's good to be back again!

BrantfordBrands
Posts: 1
Joined: Tue May 31, 2011 1:55 am

Re: Internet Library Question - How to retrieve every link?

Post by BrantfordBrands » Wed Jan 25, 2012 8:28 pm

Not sure if you managed to solve this, but I'm facing the same problem, and I think I've found the solution; hoping you (or anybody) can help validate it.

It seems that revBrowser and its ability to get the source HTML are spotty at best.

I found that sometimes it worked, sometimes it didn't, and subsequent calls to the same instance returned different answers!

So, instead, (Since I don't need to actually RENDER the HTML), I'm using libURLDownloadToFile to retrieve the HTML directly from the server to a local file, which I then open, read & process.
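That call looks roughly like this; a minimal sketch, where the temp-file path and the "downloadDone" handler name are just examples:

Code: Select all

-- download the page to a local file, then read it back in
-- (the path and the handler name are only examples)
libURLDownloadToFile "http://www.example.com/", (specialFolderPath("temporary") & "/page.html"), "downloadDone"

on downloadDone pUrl
   put url ("file:" & specialFolderPath("temporary") & "/page.html") into field "htmlSource"
end downloadDone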

In my process, I split the HTML by ">"; then, for each element that starts with "<a", I split that element by quotes (to separate the names and values of the attributes):

Code: Select all

put field "htmlSource" into theElements

split theElements by ">"
repeat for each element thisElement in theElements
   -- comparisons are case-insensitive by default, so "<a" also matches "<A"
   if char 1 to 2 of thisElement = "<a" then
      split thisElement by space
      repeat for each element thisAttribute in thisElement
         if char 1 to 4 of thisAttribute = "href" then
            split thisAttribute by quote
            put thisAttribute[2] & return after theURLS
         end if
      end repeat
   end if
end repeat

put theURLS into field "listURLS"
Of course, many of the hrefs are like "/News" and "/Comments", etc.

So I'm adding the root URL in front to build out the entire URL:

Code: Select all

on getMyURLs

   put field "tbURL" into rootURL

   --Remove any trailing slash
   if rootURL ends with "/" then
      put char 1 to (the length of rootURL - 1) of rootURL into rootURL
   end if

   set the itemDelimiter to return
   repeat for each line thisURL in field "listURLS"

      --If it's already absolute, and we've not been there, grab it.
      if thisURL begins with rootURL then
         if thisURL is not among the items of field "listMyCollectedURLS" then
            put thisURL & return after field "listMyCollectedURLS"
         end if
      end if

      --If it's relative, fix it; if we've not been there, grab it.
      if thisURL begins with "/" then
         put rootURL & thisURL into newURL
         if newURL is not among the items of field "listMyCollectedURLS" then
            put newURL & return after field "listMyCollectedURLS"
         end if
      end if

      if thisURL begins with "./" then
         --drop the leading "." so we don't build rootURL & "/./..."
         put rootURL & char 2 to -1 of thisURL into newURL
         if newURL is not among the items of field "listMyCollectedURLS" then
            put newURL & return after field "listMyCollectedURLS"
         end if
      end if

   end repeat
end getMyURLs

So far this seems to be working; I'm now automating the process and validating it against more websites.

Of course this won't help if links are generated by anything other than pure HTML, so some sites may be scraped incompletely, but I doubt it's possible to handle ALL of those cases with any technology...

I'm VERY new to LiveCode, so I hope this helps, and if anybody has suggestions for improvements, I'm more than interested in hearing of them.

Thanks!

...Jeff

Nonsanity
Livecode Opensource Backer
Posts: 86
Joined: Thu May 17, 2007 9:15 pm
Contact:

Re: Internet Library Question - How to retrieve every link?

Post by Nonsanity » Fri Aug 09, 2013 5:08 pm

This thread is a bit old, but I thought I'd add some of the tool functions I use to parse any text chunk for particular bits, since I've been doing this a lot lately. I use these functions all the time to extract URLs from HTML source.

Code: Select all

function CutTo src, pat
   -- return everything after the first occurrence of pat,
   -- or the whole string if pat isn't found
   get offset( pat, src )
   if it > 0 then return char (it + number of chars in pat) to -1 of src
   else return src
end CutTo

function CopyTo src, pat
   -- return everything before the first occurrence of pat
   -- (if pat isn't found, offset gives 0, and char 1 to -1 is the whole string)
   get offset( pat, src ) - 1
   return char 1 to it of src
end CopyTo
The idea with this is to progressively throw away the leading characters of the data until you get to the bit you want. In the case of an HTML page, multiple CutTo's targeting semi-unique strings can get you where you want to go if there isn't a single unique string available.
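For instance, with a made-up string:

Code: Select all

-- CutTo keeps what follows the pattern; CopyTo keeps what precedes it
put CutTo("name=value;rest", "=") into t   -- t is now "value;rest"
put CopyTo(t, ";") into t                  -- t is now "value"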

For example, if you are scraping a Google search result page, this string (as of this posting, and with my settings -- your results may vary) marks the start of the links section:

Code: Select all

<div id="search
And this string marks the end:

Code: Select all

Searches related to
The following shows an example of how to use the above two functions to extract the links. Stick the code in a button and set up a field (here named "links") and you can get a list of all the results from a Google search page.

Note that I'm doing some extra processing on the results to get them into a useful format, as the actual link inside the <a> structure is an internal Google link that contains the final link in an encoded format. All this worked in my single test, but additional massaging of the returned data might be necessary.

Code: Select all

function CutTo src, pat
   get offset( pat, src ) 
   if it > 0 then return char (it + number of chars in pat) to -1 of src
   else return src
end CutTo

function CopyTo src, pat
   get offset( pat, src ) - 1
   return char 1 to it of src
end CopyTo

on mouseUp
   put "" into fld "links"
   put url "https://www.google.com/search?q=livecode" into xx
   put CutTo( xx, "<div id="&quote&"search" ) into xx
   put CopyTo( xx, "Searches related to" ) into xx
   repeat while "<a href=" is in xx
      put CutTo( xx, "<a href=" ) into xx
      set itemdel to quote
      put item 2 of xx into u
      if "webcache.googleusercontent.com" is not in u then
         delete char 1 to 7 of u
         put CopyTo( u, "&" ) into u
         put URLDecode( u ) into u
         put u & return after fld "links"
      end if
   end repeat
end mouseUp
If you use LiveCode to do web scraping and immediately follow the links you find, remember to add a delay, or the website may stop talking to you. Accessing links too fast makes you look like a DoS attacker. Putting a "wait 1 second" between each URL call will make all the difference.
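In loop form, that advice looks like this; a minimal sketch reusing fld "links" from the example above:

Code: Select all

-- fetch each collected link politely, one per second
repeat for each line tLink in fld "links"
   put url tLink into tPage
   -- ...process tPage here...
   wait 1 second with messages
end repeat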
~ Nonsanity
~ Chris Innanen

atout66
Posts: 266
Joined: Wed Feb 02, 2011 12:31 pm

Re: Internet Library Question - How to retrieve every link?

Post by atout66 » Mon Nov 04, 2013 6:35 pm

Nice job :wink:
Discovering LiveCode Community 6.5.2.

FourthWorld
VIP Livecode Opensource Backer
Posts: 10043
Joined: Sat Apr 08, 2006 7:05 am
Contact:

Re: Internet Library Question - How to retrieve every link?

Post by FourthWorld » Mon Nov 04, 2013 7:17 pm

When I started writing spiders I found this book very helpful:

Webbots, Spiders, and Screen Scrapers
A Guide to Developing Internet Agents with PHP/CURL
http://shop.oreilly.com/product/9781593273972.do

While the examples use PHP and CURL, most of the concepts can be adapted for LiveCode with reasonable effort.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

MaxV
Posts: 1580
Joined: Tue May 28, 2013 2:20 pm
Contact:

Re: Internet Library Question - How to retrieve every link?

Post by MaxV » Fri Nov 08, 2013 1:07 pm

This is a very interesting conversation, and I think that offset and a good regular expression do the best job.
First of all, put the following regular expression in a custom property of a button, since it's too difficult to write directly in the code editor. Call the property "regex":
(?i)href=(\"[^"]*\"|'[^']*'|[^'">\s]+)
Then you can use the following code:

Code: Select all

on mouseUp
   put URL "http://forums.runrev.com/index.php" into temp
   findLinks temp
end mouseUp

on findLinks temp
   #an HTML link looks like: <a href=www.google.it >
   #but it can also be a mess like: <a class='link' href="www.google.it"  color=green>
   #the correct regular expression to extract the link from a tag is: (?i)href=(\"[^"]*\"|'[^']*'|[^'">\s]+)
   #unfortunately it is a mess to write directly in the code, so I put it in the
   #regex property of this button
   put the regex of me into regex
   put "true" into test
   repeat while test is "true"
      #find the next "<a " and remove everything up to and including it
      put offset("<a ",temp) into tpos
      if tpos > 0 then
         delete char 1 to (tpos + 2) of temp #"<a " is 3 chars, ending at tpos + 2
         #the end of the link can be any of the following chars: space, ", ', >
         #so let the regex pick it out
         put matchText(temp,regex,tlink) into status
         #the link can be surrounded by quotes, so we remove them:
         if the first char of tlink is "'" or the first char of tlink is quote then
            delete first char of tlink
            delete last char of tlink
         end if
         #now we put the link in the list of links
         put tlink & return after ListOfLinks
      else
         put "false" into test
      end if
   end repeat
   answer ListOfLinks
end findLinks
Simple, efficient and clean. Useful for crawlers, spiders and extracting links. :wink:
Livecode Wiki: http://livecode.wikia.com
My blog: https://livecode-blogger.blogspot.com
To post code use this: http://tinyurl.com/ogp6d5w

Thierry
VIP Livecode Opensource Backer
Posts: 875
Joined: Wed Nov 22, 2006 3:42 pm

Re: Internet Library Question - How to retrieve every link?

Post by Thierry » Fri Nov 08, 2013 6:30 pm

Hello,

My try :)
I've put more logic into the regex:

Code: Select all

function findLinks myhtml
   local tList
   -- regex: (?msi)<a\s.*?href\s*=\s*["']?([^'">\s]+)["']?
   put the regex of me into regex
   repeat while matchChunk( myhtml, regex, pStart, pEnd)
      -- pStart/pEnd delimit the captured link; using it as an
      -- array key makes duplicates collapse automatically
      put 0 into tList[char pStart to pEnd of myhtml]
      delete char 1 to pEnd of myhtml
   end repeat
   return the keys of tList
end findLinks
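A usage sketch, assuming the regex custom property is set on the button that holds the function:

Code: Select all

-- the keys come back as a return-delimited list of unique links
put findLinks(url "http://forums.runrev.com/index.php") into tLinks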
Having posted something somewhat similar yesterday here:
http://forums.runrev.com/viewtopic.php?f=9&t=17877
I think this might interest some of you.

PS: I didn't test this code much, so I'm ready for cases that don't work :)

Have fun!

Thierry
!
SUNNY-TDZ.COM doesn't belong to me since 2021.
To contact me, use the Private messages. Merci.
!

[-hh]
VIP Livecode Opensource Backer
Posts: 2262
Joined: Thu Feb 28, 2013 11:52 pm

Re: Internet Library Question - How to retrieve every link?

Post by [-hh] » Fri Nov 08, 2013 10:00 pm

..........
Last edited by [-hh] on Wed Aug 13, 2014 12:03 pm, edited 1 time in total.
shiftLock happens

Thierry
VIP Livecode Opensource Backer
Posts: 875
Joined: Wed Nov 22, 2006 3:42 pm

Re: Internet Library Question - How to retrieve every link?

Post by Thierry » Sat Nov 09, 2013 9:25 am

[-hh] wrote: Great and very elegant function.
Thanks.

[-hh] wrote: The attribute 'href' may be expanded with the attributes 'id', 'target', and 'name', as these also give refs to jump to.
Yes, I know.
And even worse, you can have this kind of reference in your CSS too:

div#header {
   background-image: url(images/sunny.jpg);
}
:roll:
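Those could be caught with a second pass; a minimal sketch, where tCss is a hypothetical variable holding the stylesheet text:

Code: Select all

-- collect url(...) references from CSS text
-- (tCss is hypothetical; quotes inside the parentheses are stripped)
repeat while matchChunk(tCss, "(?i)url\(([^)]+)\)", tStart, tEnd)
   put char tStart to tEnd of tCss into tRef
   replace quote with empty in tRef
   replace "'" with empty in tRef
   put tRef & return after tRefs
   delete char 1 to tEnd of tCss
end repeat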

Regards,

Thierry
!
SUNNY-TDZ.COM doesn't belong to me since 2021.
To contact me, use the Private messages. Merci.
!
