Help with parsing a text file
Moderators: FourthWorld, heatherlaine, Klaus, kevinmiller
Help with parsing a text file
I have dabbled with LiveCode many years ago, but have only just started to work with it again. I am trying to write a simple program to parse some text from a text file.
The file isn't delimited and looks like this:
GENERAL 1 BANK PAGE 1
TRACING REPORT BY VENDOR RUN DATE 07/03/15
RUN TIME 00:24:33
================================================================================
VENDOR NO: 0000 COMPANY CO 07/02/14
================================================================================
ACCOUNT AMOUNT CUSTOMER NAME TRACE TXN-DATE
1369 220.56 Joe Hill 363288 07/02/15
1146 25.00 Mary Johnson 346866 07/02/15
PAGE SUMMARY: Total Amount 245.56 Total Transactions: 2
*********************************
*TOTAL $ 245.56 *
*FILE CREATION NO : 213 *
*********************************
I want to be able to extract the Account, Amount, Customer Name and Transaction Date from each line.
Unfortunately the formatting of the text file is different than the pasted version is showing on the thread here. Different lines have different numbers of spaces before the text begins and the spacing between the different fields is also variable. The lines containing the relevant details have 16 spaces before the Account number, but other lines have differing numbers of spaces before the text.
I can easily open the file with
on mouseUp
answer file "A text file" with type ("text files|txt|tTXT" & return & "all files|*|*")
if it <> "" then
put it into theFilePath
put URL ("file:" & theFilePath) into field "text"
else
--no file was selected, or cancel was pressed
beep
end if
end mouseUp
The first seven lines are irrelevant so they can be deleted.
on mouseUp
delete line 1 to 7 of field "text"
end mouseUp
I am having trouble thinking my way through parsing each line and extracting the relevant text. I am thinking that I need to use some sort or pattern matching, possibly regex? to parse each line until I get to the line that begins with PAGE SUMMARY.
I have been reading through the dictionary and looking at online examples and I am completely lost.
I would welcome any suggestions on how to proceed.
The file isn't delimited and looks like this:
GENERAL 1 BANK PAGE 1
TRACING REPORT BY VENDOR RUN DATE 07/03/15
RUN TIME 00:24:33
================================================================================
VENDOR NO: 0000 COMPANY CO 07/02/14
================================================================================
ACCOUNT AMOUNT CUSTOMER NAME TRACE TXN-DATE
1369 220.56 Joe Hill 363288 07/02/15
1146 25.00 Mary Johnson 346866 07/02/15
PAGE SUMMARY: Total Amount 245.56 Total Transactions: 2
*********************************
*TOTAL $ 245.56 *
*FILE CREATION NO : 213 *
*********************************
I want to be able to extract the Account, Amount, Customer Name and Transaction Date from each line.
Unfortunately the formatting of the text file is different than the pasted version is showing on the thread here. Different lines have different numbers of spaces before the text begins and the spacing between the different fields is also variable. The lines containing the relevant details have 16 spaces before the Account number, but other lines have differing numbers of spaces before the text.
I can easily open the file with
on mouseUp
answer file "A text file" with type ("text files|txt|tTXT" & return & "all files|*|*")
if it <> "" then
put it into theFilePath
put URL ("file:" & theFilePath) into field "text"
else
--no file was selected, or cancel was pressed
beep
end if
end mouseUp
The first seven lines are irrelevant so they can be deleted.
on mouseUp
delete line 1 to 7 of field "text"
end mouseUp
I am having trouble thinking my way through parsing each line and extracting the relevant text. I am thinking that I need to use some sort or pattern matching, possibly regex? to parse each line until I get to the line that begins with PAGE SUMMARY.
I have been reading through the dictionary and looking at online examples and I am completely lost.
I would welcome any suggestions on how to proceed.
Re: Help with parsing a text file
Hi dazza101,
Welcome to the forum!
What about using lineoffset +1?
Simon
Welcome to the forum!
What about using lineoffset +1?
Simon
I used to be a newbie but then I learned how to spell teh correctly and now I'm a noob!
Re: Help with parsing a text file
Thanks Simon, but I'm not following your suggestion. lineoffset returns the line number on which the specified search text is found. Correct? Are you suggesting using lineoffset with a regex for spaces, numeric chars, spaces, chars, spaces, etc?Simon wrote:Hi dazza101,
Welcome to the forum!
What about using lineoffset +1?
Simon
Re: Help with parsing a text file
Hi dazza101,
Sorry, was thinking you were just looking for the line with the relevant info. Try pasting and using code formatting (button at the top).
other;
I'll let you figure out the multiple entries after tLine.
Simon
Sorry, was thinking you were just looking for the line with the relevant info. Try pasting and using code formatting (button at the top).
other;
Code: Select all
on mouseUp
put fld 1 into temp
put lineoffset("ACCOUNT AMOUNT CUSTOMER NAME TRACE TXN-DATE",temp) into tLine
put line (tLine + 1) of fld 1 into tSent
repeat for each word tWord in tSent
answer tWord
end repeat
end mouseUp
Simon
I used to be a newbie but then I learned how to spell teh correctly and now I'm a noob!
Re: Help with parsing a text file
Thank you Simon, that helps a lot!Simon wrote:Hi dazza101,
Sorry, was thinking you were just looking for the line with the relevant info. Try pasting and using code formatting (button at the top).
other;I'll let you figure out the multiple entries after tLine.Code: Select all
on mouseUp put fld 1 into temp put lineoffset("ACCOUNT AMOUNT CUSTOMER NAME TRACE TXN-DATE",temp) into tLine put line (tLine + 1) of fld 1 into tSent repeat for each word tWord in tSent answer tWord end repeat end mouseUp
Simon
Re: Help with parsing a text file
So I have finished my little project.
To recap, I wanted to be able to read in a text file that I receive every day from a bank of transactions paid into our account and extract the relevant information, then copy it to the clipboard so it could be pasted into a spreadsheet.
The text file isn't delimited and is inconsistent in its fields. Quite often the second name field is missing and the number of spaces between words is inconsistent.
An example of the type of text file I receive is this:
Thanks to Simon for kickstarting my brain again, as I haven't done any programming for 6 years.
I am sure my code is a bit of a kludge and completely inelegant. But it works.
This is what I ended up with:
Thanks for the help and if anyone wants to suggest any modifications or simplifications to the code I would be most grateful. I basically feel like a complete Noob again with programming 
cheers
Darryl
To recap, I wanted to be able to read in a text file that I receive every day from a bank of transactions paid into our account and extract the relevant information, then copy it to the clipboard so it could be pasted into a spreadsheet.
The text file isn't delimited and is inconsistent in its fields. Quite often the second name field is missing and the number of spaces between words is inconsistent.
An example of the type of text file I receive is this:
Code: Select all
Big Bank One PAGE 1
TRACING REPORT BY VENDOR RUN DATE 05/12/15
RUN TIME 00:47:31
================================================================================
VENDOR NO: 2345 Our Business 05/11/15
================================================================================
ACCOUNT AMOUNT CUSTOMER NAME TRACE TXN-DATE
1031 103.50 TheBigBadBusiness 222333 05/10/15
1453 91.50 Randolph Reindeer 233371 05/10/15
1714 269.50 Johnny Depp 233335 05/10/15
1215 1317.59 HalloweenProductions 433309 05/10/15
1675 138.50 Betty Boop 433359 05/10/15
1234 210.56 Olive Oil 433308 05/10/15
PAGE SUMMARY: Total Amount 813.56 Total Transactions: 3
*********************************
*TOTAL $ 813.56 *
*FILE CREATION NO : 123 *
*********************************
I am sure my code is a bit of a kludge and completely inelegant. But it works.
This is what I ended up with:
Code: Select all
on mouseUp
# The first seven lines of the file are irrelevant, delete them.
delete line 1 to 7 of field "text"
put fld 1 into temp
put lineoffset("ACCOUNT AMOUNT CUSTOMER NAME TRACE TXN-DATE",temp) into tLine
put line (tLine +1) of fld 1 into tSent
put word 1 of fld 1 into tCheck
#continue until end of relevant data in file. Indicated by line beginning with PAGE
repeat while tCheck <> "PAGE"
# clear clipboard
set the clipboardData["text"] to ""
put 1 into count
repeat for each word tWord in tSent
# only import words that contain relevant text and characters
if matchText(tWord, "^[0-9A-Za-z-.-/]+$") then
if count = 1 then put tWord into tData[count, tLine]
if count = 2 then put tWord into tData[count, tLine]
if count = 3 then put tWord into tData[count, tLine]
if count = 4 then put tWord into tData[count, tLine]
if count = 5 then put tWord into tData[count, tLine]
if count = 6 then put tWord into tData[count, tLine]
add one to count
end if
end repeat
add 1 to tLine
put line (tLine +1) of fld 1 into tSent
put word 1 of tSent into tCheck
end repeat
repeat with x = 0 to tLine step 2
put 1 into y
# the name data fields are inconsistent. The 2nd name field is sometimes omitted. Make the data consistent and add a space into the second name if it is absent in the file
if tData[4,x] is a number then put tData[5,x] into tData[6,x]
if tData[4,x] is a number then put tData[4,x] into tData[5,x]
if tData[4,x] is a number then put " " into tData[4,x]
put tData[1,x] into AccountData[y]["Account"]
put tData[2,x] into AccountData[y]["Amount"]
put tData[3,x] && tData[4,x] into AccountData[y]["Name"]
put tData[5,x] into AccountData[y]["Txn"]
put tData[6,x] into AccountData[y]["Date"]
# Ignore lines that are blank and place the rest on the clipboard
If AccountData[y]["Account"] <> "" then set clipboardData to clipboardData & return & AccountData[y]["Account"] & tab & AccountData[y]["Amount"] & tab & AccountData[y]["Name"] & tab & AccountData[y]["Txn"] & tab & AccountData[y]["Date"]
add 1 to y
end repeat
end mouseUp

cheers
Darryl
Re: Help with parsing a text file
Good shot, you did some hard work. There's another way that relies on the fact that you can count backwards from the end of a text chunk by using negative numbers. For example, word -1 of a line is the last word, word -2 of a line is second from the last word, etc. With that in mind, you can calculate the name fields by extracting word 3 to -3 of the line, and if word -3 is empty it doesn't matter. Here's one way to write that:
That's the long form. You could actually combine the entire repeat section into one line that's harder to read but does the same thing. I also removed the "filter" line here because your example only has a single truly blank line. All the others have spaces in them, so we let the repeat loop deal with that exclusively:
Code: Select all
on mouseUp
parsefld fld "text"
end mouseUp
on parsefld pData
put line 8 to lineoffset("PAGE SUMMARY:",pData)-1 of pData into pData -- remove unwanted lines
filter pData without empty -- remove blank lines; only one in the example text is truly blank
repeat for each line tLine in pData -- process the lines
if the number of words in tLine < 1 then next repeat -- some lines contain only spaces
put word 1 of tLine into tAccount
put word 2 of tLine into tAmount
put word 3 to -3 of tLine into tName
put word -2 of tLine into tTrace
put word - 1 of tLine into tDate
put tAccount &tab& tAmount &tab& tName &tab& tTrace &tab& tDate & cr after tList
end repeat
set the clipboarddata["text"] to tList
end parsefld
Code: Select all
on parsefld pData
put line 8 to lineoffset("PAGE SUMMARY:",pData)-1 of pData into pData -- remove unwanted lines
repeat for each line tLine in pData -- process the lines
if the number of words in tLine < 1 then next repeat -- empty lines, or lines containing only spaces
put word 1 of tLine &tab& word 2 of tLine &tab& word 3 to -3 of tLine &tab& \
word -2 of tLine &tab& word - 1 of tLine & cr after tList
end repeat
set the clipboarddata["text"] to tList
end parsefld
Jacqueline Landman Gay | jacque at hyperactivesw dot com
HyperActive Software | http://www.hyperactivesw.com
HyperActive Software | http://www.hyperactivesw.com
Re: Help with parsing a text file
Hi Darryl,dazza101 wrote:if anyone wants to suggest any modifications or simplifications to the code I would be most grateful.
Here is another way to deal with extracting information from some text.
I'm using a regular expression to match all your customers' information lines:
Code: Select all
constant Rx = "(?m)^\s+(\d+)\s+([\d.]+)\s+([\w\s.-]+?)\s+(\d+)\s+([^\s]+)$"
Code: Select all
on mouseUp
put test1( field "fSource") &cr into fld "fResults2"
put test2( field "fSource") &cr after fld "fResults2"
put test3( field "fSource") &cr after fld "fResults2"
end mouseUp
Code: Select all
function test1 T
local R
local _account,_amount,_customer,_trace,_txndate
repeat while matchChunk( T, Rx, a1,a2,b1,b2,c1,c2,d1,d2,e1,e2) then
put char a1 to a2 of T into _account
put char b1 to b2 of T into _amount
put char c1 to c2 of T into _customer
put char d1 to d2 of T into _trace
put char e1 to e2 of T into _txndate
-- <...> used to check visually captured chars
get format("<%s>\t<%s>\t<%s>\t<%s>\t<%s>\n", _account,_amount,_customer,_trace,_txndate)
put IT after R
delete char 1 to e2 of T
end repeat
return R
end test1
Code: Select all
function test2 T
local R
repeat while matchChunk( T, Rx, a1,a2,b1,b2,c1,c2,d1,d2,e1,e2) then
put format("%s\t%s\t%s\t%s\t%s\n", \
char a1 to a2 of T, \
char b1 to b2 of T, \
char c1 to c2 of T, \
char d1 to d2 of T, \
char e1 to e2 of T ) after R
delete char 1 to e2 of T
end repeat
return R
end test2
Code: Select all
function test3 T
local R
if sunnYextract( T, Rx, "<\1>\t<\2>\t<\3>\t<\4>\t<\5>\n", R) then return R
end test3
Code: Select all
<1031> <103.50> <TheBigBadBusiness> <222333> <05/10/15>
<1453> <91.50> <Randolph Reindeer> <233371> <05/10/15>
<1714> <269.50> <Johnny Depp> <233335> <05/10/15>
<1215> <1317.59> <HalloweenProductions> <433309> <05/10/15>
<1675> <138.50> <Betty Boop> <433359> <05/10/15>
<1234> <210.56> <Olive Oil> <433308> <05/10/15>
<1077> <100.00> <Kathy Brown> <538118> <05/10/15>
<1986> <80.00> <PIFinancial> <638255> <05/10/15>
<1919> <100.00> <Mark Thespot> <651034> <05/10/15>
<2288> <100.00> <Okay Doaky> <751036> <05/10/15>
Thierry
Last edited by Thierry on Thu Nov 17, 2022 12:31 pm, edited 1 time in total.
!
SUNNY-TDZ.COM doesn't belong to me since 2021.
To contact me, use the Private messages. Merci.
!
SUNNY-TDZ.COM doesn't belong to me since 2021.
To contact me, use the Private messages. Merci.
!
Re: Help with parsing a text file
Thank you very much jacque and Thierry. You have both given me much to work with and made me realise how much I have to learn (again) about Livecode.
Much appreciated.
Darryl
Much appreciated.
Darryl
Re: Help with parsing a text file
But that's the fun part!dazza101 wrote:You have both given me much to work with and made me realise how much I have to learn (again) about Livecode.

Jacqueline Landman Gay | jacque at hyperactivesw dot com
HyperActive Software | http://www.hyperactivesw.com
HyperActive Software | http://www.hyperactivesw.com