
[SOLVED] Parsing Word files

Posted: Fri Jan 22, 2021 12:09 pm
by stam
Hi all,

I have a large number of Word files, the text of which I'd like to import into a database.

Is there a way to do this easily with LC?
(These are the older .doc format, not .docx.)

Many thanks
Stam

-- EDIT: Long discussion, the solutions for me can be distilled to these 2 responses:
See comment about parsing Word files in this reply. This was an adequate solution for me.
See also this solution about extracting ASCII text from PDF files.

Re: Parsing Word files

Posted: Fri Jan 22, 2021 12:50 pm
by Klaus
Hi Stam,

I don't know of any free libraries, but you can buy an add-on for LC here:
https://livecode.com/extensions/wordlib/2-2-0/

Best

Klaus

Re: Parsing Word files

Posted: Fri Jan 22, 2021 12:52 pm
by stam
Thanks Klaus,
I was aware of this but was wondering if there was a built-in/free option... I guess not then :)

Re: Parsing Word files

Posted: Fri Jan 22, 2021 2:59 pm
by dunbarx
Hi.

Just wondering, since I never do anything like this: what is missing, or what unwanted baggage comes over, if you simply load the entire Word document into a field? Is it things like the fact that links lose their, er, links? Formatting seems intact, though a lot of work might still be necessary to transform the text into something more to your liking.

Craig

Re: Parsing Word files

Posted: Fri Jan 22, 2021 3:18 pm
by Klaus
Obviously you never tried this yourself! :-)
I exported an RTF file to DOC and imported it into an LC field, have fun:
[Attachments: rtf.jpg, word_in_lc.jpg]

Re: Parsing Word files

Posted: Fri Jan 22, 2021 3:35 pm
by dunbarx
Klaus.

What, that text is not readable?

So there is a big difference between reading a file with, say, the "read from file" command and simply copying the contents of a file and pasting them into LC.

I have used the "read from file" command for years without issue, but I only read "txt" documents that I made myself, rebuilding stuff "saved" by stacks I run. When reading "txt" files, the text comes back just fine.

Now I see what wordLib does.

Craig
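
A note on why the raw read comes back unreadable: legacy .doc files are not plain text but OLE2 compound (binary) files, which begin with a fixed 8-byte signature. The following is only a minimal sketch in Python (not LC), and the function name is made up for illustration. It checks for that signature, which is one way to tell a binary .doc apart from a "txt" file before deciding how to read it:

```python
# First 8 bytes of an OLE2 compound file, the container used by legacy .doc files
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_legacy_doc(path):
    """Return True if the file at `path` starts with the OLE2 signature."""
    with open(path, "rb") as f:
        return f.read(len(OLE2_MAGIC)) == OLE2_MAGIC
```

A file that passes this check needs a real converter (wordLib, for example) rather than a plain text read. Other old Office formats use the same container, so the check only says "binary OLE2 file", not "definitely Word".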

Re: Parsing Word files

Posted: Fri Jan 22, 2021 3:36 pm
by richmond62
Klaus wrote:
I exported an RTF file to DOC
Indeed . . . 8)

I tend to export DOC, ODT, or whatever files as RTF files and then import them using

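Code:

-- the field name and tRTFPath are placeholders for your own field and file path
set the rtfText of field "Letter" to URL ("file:" & tRTFPath)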

Re: Parsing Word files

Posted: Fri Jan 22, 2021 5:09 pm
by FourthWorld
That's an interestingly anomalous RTF conversion, Klaus.

Looks like the chars and styles are preserved, but alignment is off because of tab settings.

How does it look if you adjust the field's tabstops to match the source Word doc?

Re: Parsing Word files

Posted: Fri Jan 22, 2021 5:13 pm
by Klaus
No idea, I do not have any Office software installed.
I used the "Save as..." feature in TextEdit on my Mac to export an existing RTF file to DOC.

Funny how the actual text is reversed. :-D

Re: Parsing Word files

Posted: Fri Jan 22, 2021 5:59 pm
by FourthWorld
The reversal seems to be an encoding issue. Does it improve when you use textDecode to match the source encoding?

Re: Parsing Word files

Posted: Fri Jan 22, 2021 6:06 pm
by Klaus
No idea, I only wanted to show Craig that a WORD/DOC file is not per se "human readable". :D

Re: Parsing Word files

Posted: Fri Jan 22, 2021 6:44 pm
by FourthWorld
Stam, what format do you want to store it in?

Re: Parsing Word files

Posted: Sat Jan 23, 2021 8:28 pm
by stam
FourthWorld wrote:
Fri Jan 22, 2021 6:44 pm
Stam, what format do you want to store it in?
Thanks all for the interesting discussion, even though no free solution has been found.

To expand on the use case a bit: I want to create an app for work to help us database our patients correctly, which is something sorely lacking.
The electronic patient record system we use stores all our letters in Word .doc format, and they are moderately complex in structure (from a text parsing point of view).

Converting these to .rtf by hand is definitely out of the question. I'm talking about 10,000 letters here, possibly more.

I have already created an app that correctly parses the text of these letters to extract demographics, patient identifiers, diagnoses, medication, etc. I use it by copy/pasting the text of the Word document directly into the app, which is fine for single cases, but processing thousands of letters that way is just not feasible.

Hence, I'd like to do the equivalent by extracting the text programmatically.
It looks like it might be feasible with the Word plugin, but I was just wondering if there was another way. I'm going to guess not :)
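
One possible route for the batch step, sketched under assumptions that nobody in this thread has confirmed: the machine doing the conversion is a Mac, and the letters can be converted to plain text as a pre-processing pass outside LC. macOS ships a command-line tool, textutil, that converts .doc files to .txt. The Python sketch below (the function names and folder layout are made up for illustration) builds one textutil call per file and collects the names of any files that fail:

```python
import subprocess
from pathlib import Path


def textutil_command(doc_path, out_dir):
    """Build the textutil call that writes <out_dir>/<name>.txt for one .doc file."""
    out_path = Path(out_dir) / (Path(doc_path).stem + ".txt")
    return ["textutil", "-convert", "txt", "-output", str(out_path), str(doc_path)]


def convert_folder(src_dir, out_dir):
    """Convert every .doc file in src_dir (macOS only); return the names that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for doc in sorted(Path(src_dir).glob("*.doc")):
        result = subprocess.run(textutil_command(doc, out_dir), capture_output=True)
        if result.returncode != 0:
            failed.append(doc.name)
    return failed
```

The resulting .txt files could then be read into LC with the "read from file" command and handed to the existing parsing code. Whether the converted text matches what copy/paste from Word produces would need checking on a sample of letters first.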

Re: Parsing Word files

Posted: Sat Jan 23, 2021 10:08 pm
by FourthWorld
How uniformly structured is the info in the letters?

Re: Parsing Word files

Posted: Sun Jan 24, 2021 1:30 am
by stam
FourthWorld wrote:
Sat Jan 23, 2021 10:08 pm
How uniformly structured is the info in the letters?
Fairly uniform, but with enough unpredictability: one has to cater for different styles between doctors, different specialities, the secretaries' views of how the letter should be structured, random spelling errors, etc.

It was a real pain to create parsing algorithms that reliably handle text copy/pasted into the app, but it works 99% of the time now.

But I'm not sure how that helps extract the text from Word files?