Help with parsing a text file

dazza101 · Post by **dazza101** » Thu Oct 29, 2015 4:06 am

I have dabbled with LiveCode many years ago, but have only just started to work with it again. I am trying to write a simple program to parse some text from a text file.

The file isn't delimited and looks like this:

GENERAL 1 BANK PAGE 1
TRACING REPORT BY VENDOR RUN DATE 07/03/15
RUN TIME 00:24:33
================================================================================
VENDOR NO: 0000 COMPANY CO 07/02/14
================================================================================
ACCOUNT AMOUNT CUSTOMER NAME TRACE TXN-DATE
1369 220.56 Joe Hill 363288 07/02/15

1146 25.00 Mary Johnson 346866 07/02/15

PAGE SUMMARY: Total Amount 245.56 Total Transactions: 2
*********************************
*TOTAL $ 245.56 *
*FILE CREATION NO : 213 *
*********************************

I want to be able to extract the Account, Amount, Customer Name and Transaction Date from each line.

Unfortunately the formatting of the text file is different than the pasted version is showing on the thread here. Different lines have different numbers of spaces before the text begins and the spacing between the different fields is also variable. The lines containing the relevant details have 16 spaces before the Account number, but other lines have differing numbers of spaces before the text.

I can easily open the file with

on mouseUp
answer file "A text file" with type ("text files|txt|tTXT" & return & "all files|*|*")
if it <> "" then
put it into theFilePath
put URL ("file:" & theFilePath) into field "text"
else
--no file was selected, or cancel was pressed
beep
end if
end mouseUp

The first seven lines are irrelevant so they can be deleted.

on mouseUp
delete line 1 to 7 of field "text"
end mouseUp

I am having trouble thinking my way through parsing each line and extracting the relevant text. I am thinking that I need to use some sort or pattern matching, possibly regex? to parse each line until I get to the line that begins with PAGE SUMMARY.

I have been reading through the dictionary and looking at online examples and I am completely lost.

I would welcome any suggestions on how to proceed.

Simon · Post by **Simon** » Thu Oct 29, 2015 4:16 am

Hi dazza101,
Welcome to the forum!

What about using lineoffset +1?

Simon

dazza101 · Post by **dazza101** » Thu Oct 29, 2015 4:52 am

Simon wrote:Hi dazza101,
Welcome to the forum!

What about using lineoffset +1?

Simon

Thanks Simon, but I'm not following your suggestion. lineoffset returns the line number on which the specified search text is found. Correct? Are you suggesting using lineoffset with a regex for spaces, numeric chars, spaces, chars, spaces, etc?

Simon · Post by **Simon** » Thu Oct 29, 2015 6:01 am

Hi dazza101,
Sorry, was thinking you were just looking for the line with the relevant info. Try pasting and using code formatting (button at the top).
other;

Code: Select all

on mouseUp
   put fld 1 into temp
   put lineoffset("ACCOUNT AMOUNT CUSTOMER NAME TRACE TXN-DATE",temp) into tLine
   put line (tLine + 1) of fld 1 into tSent
   repeat for each word tWord in  tSent
      answer tWord
   end repeat
end mouseUp

I'll let you figure out the multiple entries after tLine.

Simon

dazza101 · Post by **dazza101** » Thu Oct 29, 2015 10:35 pm

Simon wrote:Hi dazza101,
Sorry, was thinking you were just looking for the line with the relevant info. Try pasting and using code formatting (button at the top).
other;
Code: Select all
on mouseUp
   put fld 1 into temp
   put lineoffset("ACCOUNT AMOUNT CUSTOMER NAME TRACE TXN-DATE",temp) into tLine
   put line (tLine + 1) of fld 1 into tSent
   repeat for each word tWord in  tSent
      answer tWord
   end repeat
end mouseUp
I'll let you figure out the multiple entries after tLine.

Simon

Thank you Simon, that helps a lot!

dazza101 · Post by **dazza101** » Sat Oct 31, 2015 6:38 am

So I have finished my little project.

To recap, I wanted to be able to read in a text file that I receive every day from a bank of transactions paid into our account and extract the relevant information, then copy it to the clipboard so it could be pasted into a spreadsheet.

The text file isn't delimited and is inconsistent in its fields. Quite often the second name field is missing and the number of spaces between words is inconsistent.

An example of the type of text file I receive is this:

Code: Select all

                         Big Bank One             PAGE 1
                         TRACING REPORT BY VENDOR        RUN DATE 05/12/15
                                                         RUN TIME 00:47:31
================================================================================
VENDOR NO:  2345   Our Business                            05/11/15
================================================================================
        ACCOUNT           AMOUNT CUSTOMER NAME                    TRACE TXN-DATE
                1031      103.50 TheBigBadBusiness         222333 05/10/15
	
                1453       91.50 Randolph Reindeer                  233371 05/10/15
	
                1714      269.50 Johnny Depp                233335 05/10/15

   1215     1317.59 HalloweenProductions          433309 05/10/15
	
                1675      138.50 Betty Boop               433359 05/10/15
	
                1234      210.56 Olive Oil                   433308 05/10/15
	

	

PAGE SUMMARY:  Total Amount 813.56         Total Transactions: 3
           *********************************
           *TOTAL $                813.56  *
           *FILE CREATION NO  :       123  *
           *********************************

Thanks to Simon for kickstarting my brain again, as I haven't done any programming for 6 years.

I am sure my code is a bit of a kludge and completely inelegant. But it works.

This is what I ended up with:

Code: Select all

on mouseUp
   
   # The first seven lines of the file are irrelevant, delete them.
   delete line 1 to 7 of field "text"
   
   put fld 1 into temp
   put lineoffset("ACCOUNT AMOUNT CUSTOMER NAME TRACE TXN-DATE",temp) into tLine
      
   put line (tLine +1) of fld 1 into tSent
   put word 1 of fld 1 into tCheck
   #continue until end of relevant data in file. Indicated by line beginning with PAGE
   repeat while tCheck <> "PAGE"
      
      # clear clipboard
      set the clipboardData["text"] to ""
      
      put 1 into count
      repeat for each word tWord in  tSent
         
         # only import words that contain relevant text and characters
         if matchText(tWord, "^[0-9A-Za-z-.-/]+$") then
            if count = 1 then put tWord into tData[count, tLine]
            if count = 2 then put tWord into tData[count, tLine]
            if count = 3 then put tWord into tData[count, tLine]
            if count = 4 then put tWord into tData[count, tLine]
            if count = 5 then put tWord into tData[count, tLine]
            if count = 6 then put tWord into tData[count, tLine]
            
            add one to count
         end if
         
      end repeat
      add 1 to tLine
      put line (tLine +1) of fld 1 into tSent
      put word 1 of tSent into tCheck
   end repeat
   
   repeat with x = 0 to tLine step 2
      put 1 into y
      
      # the name data fields are inconsistent. The 2nd name field is sometimes omitted. Make the data consistent and add a space into the second name if it is absent in the file
      if tData[4,x] is a number then put tData[5,x] into tData[6,x]
      if tData[4,x] is a number then put tData[4,x] into tData[5,x]
      if tData[4,x] is a number then put " " into tData[4,x]
      
      put  tData[1,x] into AccountData[y]["Account"]
      put  tData[2,x] into AccountData[y]["Amount"]
      put  tData[3,x] && tData[4,x] into AccountData[y]["Name"]
      put  tData[5,x] into AccountData[y]["Txn"]
      put  tData[6,x] into AccountData[y]["Date"]
      
      # Ignore lines that are blank and place the rest on the clipboard
      If AccountData[y]["Account"] <> "" then set  clipboardData to clipboardData & return & AccountData[y]["Account"] & tab & AccountData[y]["Amount"] & tab & AccountData[y]["Name"]  & tab & AccountData[y]["Txn"] & tab & AccountData[y]["Date"] 
      
      add 1 to y
   end repeat
   
end mouseUp

Thanks for the help and if anyone wants to suggest any modifications or simplifications to the code I would be most grateful. I basically feel like a complete Noob again with programming

cheers
Darryl

jacque · Post by **jacque** » Sun Nov 01, 2015 4:38 am

Good shot, you did some hard work. There's another way that relies on the fact that you can count backwards from the end of a text chunk by using negative numbers. For example, word -1 of a line is the last word, word -2 of a line is second from the last word, etc. With that in mind, you can calculate the name fields by extracting word 3 to -3 of the line, and if word -3 is empty it doesn't matter. Here's one way to write that:

Code: Select all

on mouseUp
  parsefld fld "text"
end mouseUp

on parsefld pData
  put line 8 to lineoffset("PAGE SUMMARY:",pData)-1 of pData into pData -- remove unwanted lines
  filter pData without empty -- remove blank lines; only one in the example text is truly blank
  repeat for each line tLine in pData -- process the lines
    if the number of words in tLine < 1 then next repeat -- some lines contain only spaces
    put word 1 of tLine into tAccount
    put word 2 of tLine into tAmount
    put word 3 to -3 of tLine into tName
    put word -2 of tLine into tTrace
    put word - 1 of tLine into tDate
    put tAccount &tab& tAmount &tab& tName &tab& tTrace &tab& tDate & cr after tList
  end repeat
  set the clipboarddata["text"] to tList
end parsefld

That's the long form. You could actually combine the entire repeat section into one line that's harder to read but does the same thing. I also removed the "filter" line here because your example only has a single truly blank line. All the others have spaces in them, so we let the repeat loop deal with that exclusively:

Code: Select all

on parsefld pData
  put line 8 to lineoffset("PAGE SUMMARY:",pData)-1 of pData into pData -- remove unwanted lines
  repeat for each line tLine in pData -- process the lines
    if the number of words in tLine < 1 then next repeat -- empty lines, or lines containing only spaces
    put word 1 of tLine &tab& word 2 of tLine &tab& word 3 to -3 of tLine &tab& \
          word -2 of tLine &tab& word - 1 of tLine & cr after tList
  end repeat
  set the clipboarddata["text"] to tList
end parsefld

Thierry · Post by **Thierry** » Sun Nov 01, 2015 10:34 am

dazza101 wrote:if anyone wants to suggest any modifications or simplifications to the code I would be most grateful.

Hi Darryl,

Here is another way to deal with extracting information from some text.
I'm using a regular expression to match all your customers' information lines:

Code: Select all

constant Rx = "(?m)^\s+(\d+)\s+([\d.]+)\s+([\w\s.-]+?)\s+(\d+)\s+([^\s]+)$"

and I'm giving you 3 samples:

Code: Select all

on mouseUp
   put test1( field "fSource") &cr into fld "fResults2"
   put test2( field "fSource") &cr after fld "fResults2"
   put test3( field "fSource")  &cr after fld "fResults2"
end mouseUp

Code: Select all

function test1 T
   local R
   local _account,_amount,_customer,_trace,_txndate
   
   repeat while matchChunk( T, Rx, a1,a2,b1,b2,c1,c2,d1,d2,e1,e2) then
      put char a1 to a2 of T into _account
      put char b1 to b2 of T into _amount
      put char c1 to c2 of T into _customer
      put char d1 to d2 of T into _trace
      put char e1 to e2 of T into _txndate
      -- <...> used to check visually captured chars
      get format("<%s>\t<%s>\t<%s>\t<%s>\t<%s>\n", _account,_amount,_customer,_trace,_txndate)
      put IT after R
      delete char 1 to e2 of T
   end repeat
   return R
end test1

This one is a bit more compact than the 1st but same logic:

Code: Select all

function test2 T
   local R
   repeat while matchChunk( T, Rx, a1,a2,b1,b2,c1,c2,d1,d2,e1,e2) then
      put format("%s\t%s\t%s\t%s\t%s\n", \
            char a1 to a2 of T, \
            char b1 to b2 of T, \
            char c1 to c2 of T, \
            char d1 to d2 of T, \
            char e1 to e2 of T ) after R
      delete char 1 to e2 of T
   end repeat
   return R
end test2

and the joker:

Code: Select all

function test3 T
   local R
   if sunnYextract( T, Rx, "<\1>\t<\2>\t<\3>\t<\4>\t<\5>\n", R) then return R
end test3

here are the output datas:

Code: Select all

<1031>	<103.50>	<TheBigBadBusiness>	<222333>	<05/10/15>
<1453>	<91.50>	<Randolph Reindeer>	<233371>	<05/10/15>
<1714>	<269.50>	<Johnny Depp>	<233335>	<05/10/15>
<1215>	<1317.59>	<HalloweenProductions>	<433309>	<05/10/15>
<1675>	<138.50>	<Betty Boop>	<433359>	<05/10/15>
<1234>	<210.56>	<Olive Oil>	<433308>	<05/10/15>
<1077>	<100.00>	<Kathy Brown>	<538118>	<05/10/15>
<1986>	<80.00>	<PIFinancial>	<638255>	<05/10/15>
<1919>	<100.00>	<Mark Thespot>	<651034>	<05/10/15>
<2288>	<100.00>	<Okay Doaky>	<751036>	<05/10/15>

Kind regards,

Thierry

dazza101 · Post by **dazza101** » Tue Nov 03, 2015 1:08 am

Thank you very much jacque and Thierry. You have both given me much to work with and made me realise how much I have to learn (again) about Livecode.

Much appreciated.

Darryl

jacque · Post by **jacque** » Tue Nov 03, 2015 8:01 am

dazza101 wrote:You have both given me much to work with and made me realise how much I have to learn (again) about Livecode.

But that's the fun part!

I still run into things in the dictionary I had no idea were there.

LiveCode Forums.

Help with parsing a text file

Help with parsing a text file

Re: Help with parsing a text file

Re: Help with parsing a text file

Re: Help with parsing a text file

Re: Help with parsing a text file

Re: Help with parsing a text file

Re: Help with parsing a text file

Re: Help with parsing a text file

Re: Help with parsing a text file

Re: Help with parsing a text file