[BEGINNER] I'm trying to retrieve the first 39 characters

Posted: Sun May 19, 2013 4:11 am
by shawnblc
I'm trying to retrieve the first 39 characters from a webpage. If someone could point me in the right direction, I'd appreciate it.

Code:

put url ("http://domain.com/examples.php") into field "fld1" 
put characters 1 to 39 into field "fld1"

Re: [BEGINNER] I'm trying to retrieve the first 39 characte

Posted: Sun May 19, 2013 4:25 am
by Simon
You're almost there:

Code:

put char 1 to 39 of field "fld1" into field "fld1"
You need a from-to: the characters have to come from somewhere (here, the field).

Simon

Re: [BEGINNER] I'm trying to retrieve the first 39 characte

Posted: Sun May 19, 2013 4:46 am
by shawnblc
Ah. Thank you Simon. That helps a lot. So when I do something like that I need a From ----> To, then I can limit the text. Got it. Thank you again.

Re: [BEGINNER] I'm trying to retrieve the first 39 characte

Posted: Sun May 19, 2013 8:01 pm
by dunbarx
Hi.

Simon showed you exactly what you were missing with your first attempt, and you both seem to call that "from...to". I see what you mean by that, and also think that you really understand it.

I just want you to go back to that first post, where you say:

put characters 1 to 39 into field "fld1"

and read this carefully. Even though just one line previously you loaded that very field with data, LiveCode will not know how to deal with this line. Characters 1 to 39 of what, exactly? In other words, you have to think like the engine does, and make sure that you can mentally parse a line of code into a sensible statement:

put characters 1 to 39 "OF SOME SOURCE OF DATA" into fld "fld1".

You had a valid target container, but you were missing an important part of a valid chunk expression.
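To make that concrete, here is a minimal sketch of one way to write it, assuming the page is first loaded into a variable (tPage is a name chosen only for this example) and that the URL from the first post stands in for a real one:

```livecode
on mouseUp
   -- Placeholder URL from the first post; substitute a real page.
   put url "http://domain.com/examples.php" into tPage
   -- The chunk expression now names its source: "of tPage".
   put char 1 to 39 of tPage into field "fld1"
end mouseUp
```

Using a variable as the source means the field only ever receives the 39 characters, rather than the whole page followed by a trimmed copy.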

Craig Newman

Re: [BEGINNER] I'm trying to retrieve the first 39 characte

Posted: Sat Sep 14, 2013 2:32 pm
by Maxiogee
I tried this when I wanted to obtain the contents of a webpage, but I got all the HTML coding.

Is there a way to just obtain the text displayed to the viewer of a web-page?

Re: [BEGINNER] I'm trying to retrieve the first 39 characte

Posted: Sat Sep 14, 2013 4:40 pm
by FourthWorld
Maxiogee wrote: Is there a way to just obtain the text displayed to the viewer of a web-page?
Sometimes. The challenge with web scraping is that HTML offers so much flexibility that really the only way to traverse its elements reliably is through the DOM, which would require using JavaScript in a LiveCode browser object.

But in many cases you can use this quick function to obtain the text without the head portion or styling attributes, though the result can sometimes still include body scripts, hidden div contents, etc.:

Code:

function HtmlToText pHtml
   -- Save the state of the templateField:
   put the properties of the templateField into tSaveProps
   -- Set the htmlText, obtain the text:
   set the htmlText of the templateField to pHtml
   put the text of the templateField into tText
   -- Restore the state of the templateField:
   set the properties of the templateField to tSaveProps
   -- Return the text:
   return tText
end HtmlToText
Note that htmlText is not designed to be true web-ready HTML; it's designed only as a way to represent all of a LiveCode field's contents and styles in a plain-text format for easy parsing and reproduction. So expect to find many differences between HTML and htmlText that won't account for the full range of true HTML tags (and conversely, there are some htmlText tags unique to LiveCode fields that are not found in HTML, such as the threeDBox style and others).
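As a rough usage sketch, the function above could be called from a button like this. It assumes HtmlToText is somewhere in the message path (for example the card or stack script), and it reuses the placeholder URL and field name from the first post; tHtml and tText are names chosen only for this example:

```livecode
on mouseUp
   -- Placeholder URL from the first post; substitute the page you want.
   put url "http://domain.com/examples.php" into tHtml
   -- After a URL fetch, the result is non-empty if something went wrong.
   if the result is not empty then
      answer "Could not load the page:" && the result
      exit mouseUp
   end if
   -- Convert to plain text, then keep only the first 39 characters.
   put HtmlToText(tHtml) into tText
   put char 1 to 39 of tText into field "fld1"
end mouseUp
```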

If you do a lot of web scraping on varying pages you may prefer to use a more complete regex-based solution. But I find this function takes care of most of the tags very quickly, leaving the remainder easy enough to pull out any unwanted elements through other means if needed.
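For comparison, a very small regex-based pass might look something like the sketch below. StripTags is a hypothetical name, the patterns are only a starting point rather than a complete solution, and it does not decode entities such as &amp;amp;:

```livecode
function StripTags pHtml
   -- Drop script and style blocks first, contents included.
   put replaceText(pHtml, "(?is)<script.*?</script>", empty) into tText
   put replaceText(tText, "(?is)<style.*?</style>", empty) into tText
   -- Remove whatever tags remain.
   put replaceText(tText, "<[^>]*>", empty) into tText
   return tText
end StripTags
```

Malformed markup, comments containing ">" and similar cases will still slip through, which is why the DOM route mentioned earlier is the more reliable one.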

Re: [BEGINNER] I'm trying to retrieve the first 39 characte

Posted: Mon Sep 16, 2013 10:15 am
by Maxiogee
Thanks Richard,

I am doing the 'scraping' (great word) on the pages of a site which lists the music Top 40 for every week from 1960 to date.
I'm trying to set up a substantial cross-referenced database from the info.

I have to manually open each page, select all, copy, and then return to the LiveCode stack and a button-click strips out the unwanted lines before and after the 'meat' I am after. The number of lines is always the same and I have coded the 'meat-handling' part to strip out irrelevant text.

I would have loved to be able to automate that open, select, and copy, but no matter how I tried, all I got was the HTML.

Aaah well, I'm already over half-way through.

Regards.
Tony