How to extract whole text from a PDF file with the PDF widget?

Posted: Fri Dec 10, 2021 5:08 pm
by TorstenHolmer
Hi,

I have a PDF file with text and pictures, but I just want the text.

I can do it manually with Ctrl-A and Ctrl-C by viewing the file with Preview on macOS.

I have a business licence and want to use the PDF widget but I cannot find a way to do it.

Can someone help me out?

Cheers,
Torsten

Re: How to extract whole text from a PDF file with the PDF widget?

Posted: Sat Dec 11, 2021 3:10 am
by FourthWorld
TorstenHolmer wrote:
Fri Dec 10, 2021 5:08 pm
I can do it manually with Ctrl-A and Ctrl-C by viewing the file with Preview on macOS.
Do you like the result?

Re: How to extract whole text from a PDF file with the PDF widget?

Posted: Sat Dec 11, 2021 9:20 am
by TorstenHolmer
FourthWorld wrote:
Sat Dec 11, 2021 3:10 am
Do you like the result?
I am not sure what you mean. I opened the PDF file in the Preview app and then selected the whole text with Ctrl/Cmd-A.
Copying the text with Ctrl/Cmd-C puts it on the clipboard, from where I can import it into my LiveCode app as text.
This works fine; a minor problem is that pictures with attached text produce strange multiple occurrences of that text.

To make it simpler, the idea with the PDF widget is to load the PDF into the app and have the text extracted automatically into a separate field.

Re: How to extract whole text from a PDF file with the PDF widget?

Posted: Sat Dec 11, 2021 5:26 pm
by FourthWorld
There are command line tools that will extract the text of PDFs, most free, callable from LC's shell function. Super fast and easy. I can help turn up one for Linux if that's useful; haven't done that on other platforms, but I believe Mac includes one and there are tons for Win.

I asked about how happy you've been with the results because PDF is a one-way format, a Roach Motel in which text goes in well enough but is not designed to make getting it out again an easy task.

Sometimes you'll be quite pleased with the clean plain text that comes out of a PDF. Other times you'll see spaces, line breaks, and other artifacts you'd never expect. Sometimes those artifacts are easy to clean up, other times not so much, leaving the text anywhere from somewhat readable to OMG what is this? :) The unpredictable range of both quality and the specifics of impaired quality makes automating cleanup very challenging at best.

If you have no choice over the format you have to work with, best of luck and I hope the text in the documents you need to work with comes out looking sane.

But be prepared for cleanup. And where you have the option of obtaining the document in just about any other format (e.g. ePub, which is device-agnostic and super easy to work with) you'll likely be able to extract the text more reliably.
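
To give a rough idea of what that cleanup can look like, here is a minimal sketch in Python (not LiveCode script; the same steps could be written with LC's text chunks). The function name clean_extracted_text is made up for illustration, and the rules are assumptions about typical artifacts rather than a general fix: it normalizes line endings, treats form feeds as line breaks, rejoins words hyphenated across a line break, collapses runs of spaces, and drops lines that repeat the line directly above them.

```python
import re


def clean_extracted_text(raw):
    """Apply a few conservative cleanup rules to text pulled out of a PDF.

    Hypothetical helper for illustration; real documents will need their
    own rules, and some artifacts cannot be fixed automatically.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Treat form feeds (often used as page separators) as line breaks.
    text = text.replace("\f", "\n")
    # Rejoin words split by a hyphen at a line end ("extrac-\ntion").
    # Caveat: this also joins genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)

    cleaned = []
    for line in (part.strip() for part in text.split("\n")):
        # Skip a line identical to the previous one; this removes repeated
        # captions and also collapses runs of blank lines into one.
        if cleaned and line == cleaned[-1]:
            continue
        cleaned.append(line)
    return "\n".join(cleaned).strip()
```

Only consecutive duplicates are removed, so text that legitimately repeats elsewhere in the document is left alone. Anything beyond this, such as column order or broken tables, usually has to be handled per document.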

Re: How to extract whole text from a PDF file with the PDF widget?

Posted: Sat Dec 11, 2021 6:48 pm
by jacque
You probably saw the recent thread on the mailing list about this, and apparently the widget does not include a way to extract text. There is an old external, now deprecated, that can do it.

https://www.mail-archive.com/use-liveco ... 14067.html

Re: How to extract whole text from a PDF file with the PDF widget?

Posted: Mon Dec 13, 2021 2:22 am
by stam
I posted a short guide on how to extract the text from a PDF here: viewtopic.php?f=8&t=35280&p=201036&hili ... er#p201036

This method uses an open-source command-line tool, controlled from LC, to extract the text of the PDF to a text file, which is trivial to import back into LC.
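
As a rough sketch of the idea, and assuming the tool is pdftotext (shipped with Xpdf and with Poppler) and that it is installed on the PATH, the call looks something like the following. This is Python rather than LiveCode script, purely for illustration; in LC the same command line would go through the shell function. The function names and file paths are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Return the argument list for a pdftotext call.

    Assumes the Xpdf/Poppler usage: pdftotext [options] PDF-file text-file
    """
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        command.append("-layout")
    command += [str(pdf_path), str(txt_path)]
    return command


def extract_pdf_text(pdf_path, txt_path=None):
    """Run pdftotext and return the extracted text as a string."""
    pdf_path = Path(pdf_path)
    # Default to a .txt file next to the PDF.
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Calling extract_pdf_text("report.pdf") would write report.txt beside the PDF and return its contents. Pictures are simply skipped, so only the text comes back.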

Hope that helps
Stam

Re: How to extract whole text from a PDF file with the PDF widget?

Posted: Mon Dec 13, 2021 9:14 am
by richmond62
an open-source command-line tool
But, as has been pointed out in the Use-List, there could be licensing issues here, as LiveCode (as it is at the moment) is closed source.

Re: How to extract whole text from a PDF file with the PDF widget?

Posted: Mon Dec 13, 2021 9:29 am
by stam
Excuse me for not keeping tabs on the use-list. I find it too verbose to keep a close eye on all the time. And I hate email.

So the feeling is that use of open source is a problem? For software now? How bizarre.

That’s a bit hypocritical as LC itself uses several open source components and I’m pretty sure that hasn’t changed with LC itself dropping open source. In fact I’m sure that’s what they do with the PDF widget.

Surely it’s just a matter of doing what LC does: including the relevant licences and respecting the respective OSS requirements?

Re: How to extract whole text from a PDF file with the PDF widget?

Posted: Mon Dec 13, 2021 11:10 am
by richmond62
"Pop across" and check the Use-List:

http://lists.runrev.com/pipermail/use-l ... 66432.html

Re: How to extract whole text from a PDF file with the PDF widget?

Posted: Mon Dec 13, 2021 6:01 pm
by FourthWorld
richmond62 wrote:
Mon Dec 13, 2021 11:10 am
"Pop across" and check the Use-List:

http://lists.runrev.com/pipermail/use-l ... 66432.html
As the author of that post, please allow me to clarify:

My comment there starts with a conditional expression ("if"), regarding the suggestion that a compiled component might be a work derived from GPL-governed source.

If that were the case, the component would inherit the distribution rights and responsibilities of the work it was derived from.

That would seem unlikely, however, because LC was dual-licenced, so even during the period in which they also distributed an open source edition of LiveCode, they had to maintain a proprietary edition as well.

Though like most modern software they make extensive use of open source packages, they've shown good diligence in choosing only those with licenses like MIT which are well suited for both open source and proprietary distribution.

Two posts later the original producer of the component clarified that the LC external is based on PDFium, which is distributed under MIT license:
http://lists.runrev.com/pipermail/use-l ... 66435.html

So the "If" in my reply evaluates to false: since the component was not derived from the GPL'd xPDFReader, GPL license terms are not relevant to the component LC ships.

Moreover, the use case presented was one for an in-house tool, not anticipated to be distributed to others.

Since the GPL is a distribution license, any personal use of GPL-governed works carries no inherited rights and responsibilities from that license. The software is not being shared, so there is no obligation to provide source to a recipient.

This is useful in appreciating the value of the solution Stam linked to. While it does make use of the GPL-governed xPDFReader, that use would not appear to require sharing source, because the tool is called as a separate program rather than being built into a derivative work, as would be the case if LC had distributed a derivative work in the form of an external.

If all this seems complex, wait till you see what PDF does to the format of the text it contains as evidenced through extraction. ;)