How to extract whole text from a PDF file with the PDF widget?
Hi,
I have a PDF file with text and pictures, but I just want the text.
I can do it manually with Ctrl-A and Ctrl-C by viewing the file with Preview on macOS.
I have a business licence and want to use the PDF widget but I cannot find a way to do it.
Can someone help me out?
Cheers,
Torsten
Re: How to extract whole text from a PDF file with the PDF widget?
TorstenHolmer wrote: ↑Fri Dec 10, 2021 5:08 pm
I can do it manually with Ctrl-A and Ctrl-C by viewing the file with Preview on macOS.
Do you like the result?
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: How to extract whole text from a PDF file with the PDF widget?
I am not sure what you mean. I opened the PDF file with the Preview app and then selected the whole text with Ctrl/Cmd-A.
Copying the text with Ctrl/Cmd-C puts it on the clipboard, from where I can import it into my LiveCode app as text.
This works fine; a minor problem is that pictures with appended text produce strange multiple occurrences of this text.
To make it simpler, the idea with the PDF widget is to load the PDF into the app and have the text extracted automatically into a separate field.
Re: How to extract whole text from a PDF file with the PDF widget?
There are command line tools that will extract the text of PDFs, most free, callable from LC's shell function. Super fast and easy. I can help turn up one for Linux if that's useful; haven't done that on other platforms, but I believe Mac includes one and there are tons for Win.
I asked about how happy you've been with the results because PDF is a one-way format, a Roach Motel in which text goes in well enough but is not designed to make getting it out again an easy task.
Sometimes you'll be quite pleased with the clean plain text that comes out of a PDF. Other times you'll see spaces, line breaks, and other artifacts you'd never expect. Sometimes those artifacts are easy to clean up, other times not so much, making the text anywhere from somewhat readable to OMG what is this?
The unpredictable range of both quality and the specifics of impaired quality make automating cleanup at best very challenging.
If you have no choice over the format you have to work with, best of luck and I hope the text in the documents you need to work with comes out looking sane.
But be prepared for cleanup. And where you have the option of obtaining the document in just about any other format (e.g., ePub is device-agnostic and super easy to work with) you'll likely be able to extract the text more reliably.
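To give a feel for the kind of cleanup mentioned above, here is a minimal sketch in Python. It is only an illustration, not something from this thread: the function name `tidy_extracted_text` is hypothetical, and the three rules it applies (rejoining words hyphenated across a line break, collapsing runs of spaces, squeezing stacked blank lines) are assumptions about which artifacts show up. Real PDFs can produce many others.

```python
import re


def tidy_extracted_text(raw: str) -> str:
    """Apply a few conservative cleanups to text pulled out of a PDF.

    Hypothetical helper for illustration; the rules are assumptions
    about common extraction artifacts, not a complete solution.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces left at the start or end of each line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Squeeze three or more newlines down to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Note that the hyphen rule will also merge genuine compounds that happen to break at a line end ("well-" / "known" becomes "wellknown"), which is one small example of why fully automatic cleanup is hard.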
Richard Gaskin
Re: How to extract whole text from a PDF file with the PDF widget?
You probably saw the recent thread on the mailing list about this, and apparently the widget does not include a way to extract text. There is an old external, now deprecated, that can do it.
https://www.mail-archive.com/use-liveco ... 14067.html
Jacqueline Landman Gay | jacque at hyperactivesw dot com
HyperActive Software | http://www.hyperactivesw.com
Re: How to extract whole text from a PDF file with the PDF widget?
I posted a short guide on how to extract the text from a PDF here: viewtopic.php?f=8&t=35280&p=201036&hili ... er#p201036
This method uses an open source command line tool, controlled from LC, to extract the text of the PDF to a text file, which is trivial to import back into LC.
Hope that helps
Stam
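For readers who cannot follow the link, the general shape of this approach can be sketched as below. This is not the linked guide: it assumes the tool is `pdftotext` (shipped with Xpdf and with Poppler), that it is installed and on the PATH, and the function names are hypothetical. The sketch is in Python for illustration; from LiveCode the same command line would be passed to the shell function instead.

```python
import subprocess
from pathlib import Path
from typing import List


def build_pdftotext_command(pdf_path: str, txt_path: str,
                            keep_layout: bool = True) -> List[str]:
    """Build the argument list for pdftotext (assumed to be on the PATH)."""
    # -enc UTF-8 asks for UTF-8 output regardless of the tool's default.
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def extract_pdf_text(pdf_path: str, txt_path: str) -> str:
    """Run pdftotext, then read back the text file it wrote."""
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Passing the arguments as a list avoids quoting problems with file paths that contain spaces. Whether `-layout` helps or hurts depends on the document, so it is worth trying both settings.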
Re: How to extract whole text from a PDF file with the PDF widget?
"an opensource command line tool"
But, as has been pointed out in the Use-List, there could be licensing issues here, as LiveCode (as it is at the moment) is closed source.
Re: How to extract whole text from a PDF file with the PDF widget?
Excuse me for not keeping tabs on the use-list. I find it too verbose to keep a close eye on all the time. And I hate email.
So the feeling is that use of open source is a problem? For software now? How bizarre.
That’s a bit hypocritical as LC itself uses several open source components and I’m pretty sure that hasn’t changed with LC itself dropping open source. In fact I’m sure that’s what they do with the PDF widget.
Surely it’s just a matter of doing what LC does: including the relevant licences and respecting the respective OSS requirements?
Re: How to extract whole text from a PDF file with the PDF widget?
richmond62 wrote: ↑Mon Dec 13, 2021 11:10 am
"Pop across" and check the Use-List:
http://lists.runrev.com/pipermail/use-l ... 66432.html
As the author of that post, please allow me to clarify:
My comment there starts with a conditional expression ("if"), regarding the suggestion that a compiled component might be a work derived from GPL-governed source.
If that were the case, the component would inherit the distribution rights and responsibilities of the work it was derived from.
That would seem unlikely, however, because LC was dual-licenced, so even during the period in which they also distributed an open source edition of LiveCode, they had to maintain a proprietary edition as well.
Though like most modern software they make extensive use of open source packages, they've shown good diligence in choosing only those with licenses like MIT which are well suited for both open source and proprietary distribution.
Two posts later the original producer of the component clarified that the LC external is based on PDFium, which is distributed under MIT license:
http://lists.runrev.com/pipermail/use-l ... 66435.html
So the "if" in my reply evaluates to false: since the component was not derived from the GPL'd xPDFReader, GPL license terms are not relevant to the component LC ships.
Moreover, the use case presented was one for an in-house tool, not anticipated to be distributed to others.
Since the GPL is a distribution license, any personal use of GPL-governed works carries no inherited rights and responsibilities from that license. The software is not being shared, so there is no obligation to provide source to a recipient.
This is useful in appreciating the value of Stam's solution he linked to. While it does make use of the GPL-governed xPDFReader, that use would not appear to require sharing source, because the use case is not part of a derivative work, as would be the case if LC had distributed a derivative work in the form of an external.
If all this seems complex, wait till you see what PDF does to the format of the text it contains as evidenced through extraction.

Richard Gaskin