Re: Parsing Word files
Posted: Sun Jan 24, 2021 2:55 am
Embarrassingly, I posed the question without actually testing what simply importing the file would produce. I had just assumed it would be binary data... so I did what I should have done before posting the question and tested it.
I imported a typical Word file into a variable and examined the contents. About 70% of it is clearly binary data, but interestingly the file also seems to include the entire letter in plain ASCII text as well!
I suspect the binary data holds formatting/layout information for the included ASCII text. Since the whole text is there, and our systems automatically insert boilerplate text at the start and end of every letter, I can easily extract it by looking for that boilerplate... so no need to splash out on the plugin just yet.
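For anyone following along, the extraction idea described above could look roughly like the sketch below. It is in Python rather than LC purely for illustration, and it rests on assumptions not confirmed in this thread: the function name and the two marker strings are hypothetical placeholders for the real boilerplate, and the letter text is assumed to be stored as 8-bit Windows-1252 with carriage returns as paragraph breaks (files that store the text as 16-bit Unicode would not match).

```python
def extract_between(data: bytes, start_marker: bytes, end_marker: bytes) -> str:
    """Return the text between two boilerplate markers, or '' if either is missing.

    Hypothetical helper: the markers stand in for whatever boilerplate
    the letter system actually inserts.
    """
    start = data.find(start_marker)
    if start == -1:
        return ""
    start += len(start_marker)          # skip past the opening boilerplate

    end = data.find(end_marker, start)  # search only after the opening marker
    if end == -1:
        return ""

    # Assumption: the text run is 8-bit Windows-1252; undecodable bytes are replaced.
    text = data[start:end].decode("cp1252", errors="replace")

    # Assumption: paragraph breaks are stored as bare carriage returns.
    return text.replace("\r", "\n").strip()


if __name__ == "__main__":
    # Made-up bytes imitating a binary file with a plain-text run inside it.
    sample = (b"\x00\x01\xd0\xcfSTART-BOILERPLATE\rDear Dr Smith,"
              b"\rThank you.\rEND-BOILERPLATE\x00\xff")
    print(extract_between(sample, b"START-BOILERPLATE", b"END-BOILERPLATE"))
```

Returning an empty string when a marker is missing makes it easy to spot files that do not follow the expected layout, so they can be checked by hand.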
So sorry to waste your time with this... but on a related note, we also receive a small percentage of our letters in PDF format (especially when sent to us from other centres). These do contain extractable text (inasmuch as it can be copied and pasted), but I'm not sure whether it's possible to extract it in LC.
I did a cursory search, and the only useful answer I found was to use a command line tool.
Any suggestions for extracting text from PDF?