Stripping text from word documents
Posted: Thu Feb 18, 2010 5:21 pm
by jpottsx1
Is it possible to read the contents of .DOC and .PDF files into text fields, stripping out the formatting on import? I just want the actual text with CR/LF line breaks and none of the other garbage that comes with the read anyfile demo.
Re: Stripping text from word documents
Posted: Thu Feb 18, 2010 6:25 pm
by Klaus
Hi Jeff,
No, this is not possible without extreme effort!
DOC and PDF are NOT plain text files, as you have seen, so this does not work right "out of the box".
If you are on OS X, you could use SHELL and "textutil" to convert a DOC file to plain text or RTF and work with that. (As far as I know, textutil does not read PDF, so PDF files would need a different tool.)
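A minimal sketch of the textutil route, for illustration only. The function name doc_to_text and the file names are hypothetical; the -convert and -output options are from textutil's own usage text. textutil's documented formats are txt, html, rtf, rtfd, doc, docx, wordml, odt and webarchive, so this sketch assumes a Word file as input and does not cover PDF.

```shell
#!/bin/sh
# Convert a Word document to plain text with macOS's textutil.
# Usage: doc_to_text input.doc output.txt
# (doc_to_text is a hypothetical helper name, not part of any library.)
doc_to_text() {
    src=$1
    dest=$2

    # textutil ships with macOS only, so fail clearly on other systems.
    if ! command -v textutil >/dev/null 2>&1; then
        echo "textutil not found (it is a macOS tool)" >&2
        return 1
    fi

    # -convert txt drops all styling; -output names the result file.
    textutil -convert txt -output "$dest" "$src"
}
```

Calling doc_to_text report.doc report.txt should leave the unstyled text in report.txt, which can then be read into a field like any other text file. The same textutil command line could presumably be passed to the shell function from a script instead of going through a wrapper like this.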
Best
Klaus
Re: Stripping text from word documents
Posted: Sat Apr 24, 2010 6:57 am
by Curry
The WordLib library can import Word files. Its forte is the newer DOCX format (Word 2007) and OpenOffice, but it does provide basic support for legacy Word DOC files, and it seems that's just what you're after. It does a pretty good job of stripping out the text.
To get the plain text with no styles, just import and then run, for example, "put field 1 into field 1" to clear any formatting.
(I hope to provide full formatting support for the legacy DOC files in a future version, and the more registered users I have, the more I will be able to develop the library!)