Get the last two columns

Got a LiveCode personal license? Are you a beginner, hobbyist or educator that's new to LiveCode? This forum is the place to go for help getting started. Welcome!

Moderators: FourthWorld, heatherlaine, Klaus, kevinmiller

xoxiwe
Posts: 7
Joined: Fri May 28, 2021 12:45 pm

Get the last two columns

Post by xoxiwe » Sun Feb 23, 2025 5:10 am

Hello,

I'm trying to get some info from a text file, and the data of the text file looks like this:
Gainesville,Libra,5684,1987,Hurp1987,Bedford
Feather Sound,3513,1942,Jons1942,Flores
Marysville,Sagittarius,6638,1963,Subjes,Williams
Orange,Leo,5592,1986,38,Dandrall,Wolf
As you can see each line has different number of items; and there are more lines in the data that have more number of items.

I just want to get the last two items from each line which should look like this:
Hurp1987,Bedford
Jons1942,Flores
Subjes,Williams
Dandrall,Wolf
I know the result can be got by using the repeat loop, but I'm interested in using the split and combine command since the data I'm gonna process is huge, like 300-500 mb in size, so definitely repeat loop will take a longer time, and my pc wont be able to handle it I fear.

I already written the script like this:

Code: Select all

set the columndel to ","
split tText by column -- tText is the variable that holds the data
put tText[-2] into fld "f1"
put tText[-1] into fld "f2"
Unfortunately LC doesn't get the last columns like that :(

If anyone can tell me or show an example of how to get the last columns, or even if there's a quickest way to get the result, I'll be glad to learn, and your help will be so much appreciated :) but please remember, it's gonna handle a huge data, so the quickest way is welcome :)

jmburnod
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 2729
Joined: Sat Dec 22, 2007 5:35 pm
Contact:

Re: Get the last two columns

Post by jmburnod » Sun Feb 23, 2025 10:45 am

HI,

This is a way to do it, but there is a lot of ways to do the same

Code: Select all

   put fld "fText" into tText
   repeat with i = 1 to the num of lines of tText
      put item -2 to -1 of line i of tText & cr after tRes
   end repeat
   delete char -1 of tRes
   put tRes
Best regards
Jean-Marc
https://alternatic.ch

xoxiwe
Posts: 7
Joined: Fri May 28, 2021 12:45 pm

Re: Get the last two columns

Post by xoxiwe » Sun Feb 23, 2025 12:36 pm

thanks Jean-Marc for your solution; but as I said I actually trying to avoid repeat loop as my pc might not be able to handle 300-500 mb size data with repeat loop :( Do you have any idea to get the result using array?

richmond62
Livecode Opensource Backer
Livecode Opensource Backer
Posts: 10078
Joined: Fri Feb 19, 2010 10:17 am

Re: Get the last two columns

Post by richmond62 » Sun Feb 23, 2025 1:54 pm

Well, you're already going to have a large overhead if you import the whole text at once into either a field or variable.

Why not import each line of your comma-delimited text file, one-at-time, and strip out everything except the last 2 items and dump those into a list Field?
-
Screenshot 2025-02-23 at 15.01.52.png
-

Button 'STRIP':

Code: Select all

on mouseUp
   put empty into fld "f2"
   put 1 into CHAMPS
   repeat until line CHAMPS of fld "f1" is empty
    put line CHAMPS of fld "f1" into PAYSAGE
      put item -2 to -1 of PAYSAGE into line CHAMPS of fld "f2"
      add 1 to CHAMPS
   end repeat
end mouseUp
or, even shorter:

Code: Select all

on mouseUp
   put empty into fld "f2"
   put 1 into CHAMPS
   repeat until line CHAMPS of fld "f1" is empty
    put item -2 to -1 of line CHAMPS of fld "f1" into line CHAMPS of fld "f2"
      add 1 to CHAMPS
   end repeat
end mouseUp
Attachments
Jack the Stripper.livecode.zip
Stack
(1.21 KiB) Downloaded 381 times
Last edited by richmond62 on Sun Feb 23, 2025 2:06 pm, edited 2 times in total.

stam
Posts: 3061
Joined: Sun Jun 04, 2006 9:39 pm

Re: Get the last two columns

Post by stam » Sun Feb 23, 2025 2:01 pm

xoxiwe wrote:
Sun Feb 23, 2025 5:10 am
Unfortunately LC doesn't get the last columns like that :(
Your pseudocode won't work not just because 'it doesn't work like that' but because the keys you are wanting to obtain are sub-keys.
There is no command in LC that allows you to extract all the subkeys of all rows that I know of.

On the other hand, loops can be highly optimised and much faster than you think.
Specifically for each loops can be blindingly fast, and I would suggest trying a modification of the code proposed above, which can be abstraction to a function (lets call it getLastTwoColums(pText) ) which you can just put in your stack script and call for any source of data:

Code: Select all

function getLastTwoColumns pSource
   local tText
   repeat for each line tLine of pSource
      put item -2 to -1 of tLine & return after tText
   end repeat
   delete the last char of tText
   return tText
end getLastTwoColumns
Stam

bogs
Posts: 5480
Joined: Sat Feb 25, 2017 10:45 pm

Re: Get the last two columns

Post by bogs » Sun Feb 23, 2025 2:30 pm

xoxiwe wrote:
Sun Feb 23, 2025 5:10 am
Hello,

I'm trying to get some info from a text file, and the data of the text file looks like this:
Gainesville,Libra,5684,1987,Hurp1987,Bedford
As you can see each line has different number of items; and there are more lines in the data that have more number of items.

I just want to get the last two items from each line which should look like this:
Hurp1987,Bedford
<sic>
I already written the script like this:

Code: Select all

set the columndel to ","
split tText by column -- tText is the variable that holds the data
put tText[-2] into fld "f1"
put tText[-1] into fld "f2"
------------------
stam wrote:
Sun Feb 23, 2025 2:01 pm
Your pseudocode won't work not just because 'it doesn't work like that' but because the keys you are wanting to obtain are sub-keys.
There is no command in LC that allows you to extract all the subkeys of all rows that I know of.
Um, then I guess you don't think this would work...
arrayToFields.png
His main problem was the - sign in front of the key -
put tText[-2] into fld "f1"
put tText[-1] into fld "f2"

Maybe it is time for me to do another video on arrays :cry:

BTW Stam is right, the "for each" version of the loop would not sacrifice significant speed on such a small data set, unlike "repeat for" loops, for each doesn't have to iterate each loop from the beginning. Shoving your data into a variable and running that loop might add microseconds, although if your datasets get much MUCH larger, certainly splitting it into an array is noticeably faster.

A larger problem that I see is that if your data wasn't formatted exactly the same, it could cause issues, but likely you already know that. Looking at the 2nd line vs. all the rest, I'm not sure those are formatted exactly the same.

~Edit - after looking at the 4 rows again, I am pretty sure now that none of them are exactly like the other. I think your results are going to be interesting.

1st item appears to be consistent with a location.
2nd item appears to be consistent with an astrological sign
3rd item appears to be consistent with an index#
4th item appears to be consistent with a birthdate
5th item appears to be consistent with an age
etc

...but not all the lines appear to have all the items, or some are in the wrong places, and on this relatively microscopic version that took me a minute to figure out. If your larger set of data is likely scrambled as this, your going to have lots of fun.
Image

SparkOut
Posts: 2943
Joined: Sun Sep 23, 2007 4:58 pm

Re: Get the last two columns

Post by SparkOut » Sun Feb 23, 2025 4:55 pm

I think several valid observations have been made by everyone above.
1) the source data looks a mess. If the only consistency is that you need the last two items of a line, then you will have to use a repeat loop. If you can manage the data source better, then an array split could work if you can specify the particular keys needed.
2) a repeat for each loop is astronomically faster than a repeat with index loop over a large data set like this
3) with a data source as large as suggested, it's going to face challenges whatever method is used.

So is there a reason why the task has to be tackled like this? Is there a way the job can be broken down into smaller steps? Separating source data into ranges/managing the consistency of the data items in each line?

With the original problem as stated, I don't think there's a better solution available than stam's. Except maybe to add a line "wait 0 milliseconds with messages" inside the repeat loop, which will help the UI not to lock up and appear unresponsive, if it takes a long time to complete.

xoxiwe
Posts: 7
Joined: Fri May 28, 2021 12:45 pm

Re: Get the last two columns

Post by xoxiwe » Sun Feb 23, 2025 6:09 pm

So what I've learnt is, array thing is only possible when all the lines have the equal number of items, and subkeying with negative numbers isn't possible to start column counting from the last what is possible only in repeat loop.

Btw, it just put me in more tension after I was informed that the text file I'm messing around with is just a sample file, original files will be like 2-3 GB in size. So I think taking the whole data into a var to process further wont be a good idea since my pc doesnt have a large memory to handle such a big data at once. Therefore, instead of taking it into memory as a whole, we can open the source file to read it's lines one by one, and then write the desired data to the destination file.

To do that, I think I can start the script like this:

Code: Select all

open file dFile for write -- dFile is the destination text file LC will write the desired output
open file sFile for read -- sFile is the source text file that holds the raw data
set the itemdel to ","
put 1 into tCount
repeat until line tCount of sFile is empty
   put line tCount of sFile into tLine
   write item -2 to -1 of tLine & cr to file dFile
   add 1 to tCount
end repeat
close file dFile
close file sFile
With this script, I guess LC will use less memory to process the source file of 2-3 GB in size, although I assume it will take longer than taking the raw data into a var and then processing the data. But at least the process can be done without trouble. Pls tell me if there's anything wrong in the script above. Btw, I was wondering which is faster, "repeat until" or "repeat for"??

Thanks richmond and Stam both for giving the idea of the repeat loop :) (although I knew repeat loop can do this magic, just was trying do it using arrays)
bogs wrote:
Sun Feb 23, 2025 2:30 pm
His main problem was the - sign in front of the key -
that's what I tried to get the last two columns from the data. my silly mind thought it might work like the delimiter thing :lol:
bogs wrote:
Sun Feb 23, 2025 2:30 pm
after looking at the 4 rows again, I am pretty sure now that none of them are exactly like the other.
If the rows had the same number of columns, I could get the desired columns in the fastest way using arrays! :roll:
SparkOut wrote:
Sun Feb 23, 2025 4:55 pm
Is there a way the job can be broken down into smaller steps?
You mean splitting the source file into several pieces? That thing came to my mind too; but splitting a 3 GB file into 100 mb (say for example) will make a lot of pieces, and I have to manually import all of those in LC which will lead me to frustration I fear :(

bobcole
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 161
Joined: Tue Feb 23, 2010 10:53 pm

Re: Get the last two columns

Post by bobcole » Sun Feb 23, 2025 7:25 pm

As the Original Poster (OP) said the issue is the size of the data file:
I actually trying to avoid repeat loop as my pc might not be able to handle 300-500 mb size data
Therefore, is there a way to read from the file one line at a time? I realize this will be slow in comparison to the alternatives posted above.
Untested:

Code: Select all

on mouseUp
   put "/Users/myTest.txt" into myFilePath
   put empty into myList
   open file myFilePath
   repeat until EOF
      read from file myFilePath for 1 line
      put item -2 of it & comma & item -1 of it after myList
   end repeat
end mouseUp
Worth a try?
Bob

bogs
Posts: 5480
Joined: Sat Feb 25, 2017 10:45 pm

Re: Get the last two columns

Post by bogs » Sun Feb 23, 2025 10:07 pm

Answering to these in reverse order :P
bobcole wrote:
Sun Feb 23, 2025 7:25 pm
As the Original Poster (OP) said the issue is the size of the data file:
I actually trying to avoid repeat loop as my pc might not be able to handle 300-500 mb size data
Therefore, is there a way to read from the file one line at a time? I realize this will be slow in comparison to the alternatives posted above.
There are many many MANY ways to do it, and reading directly from the file is no worse in this case than any other. A simple way to find out what amount of time it would take for each method is to craft a simple timer, and put each of the various routines into a simple script which calls it before the routine starts, and stops it when the routine ends, subtract the start from the end, voila, you have the amount of time (in seconds) that each took. I'll leave that to the more eager of you out there :P :P
xoxiwe wrote:
Sun Feb 23, 2025 6:09 pm
So what I've learnt is, array thing is only possible when all the lines have the equal number of items, and subkeying with negative numbers isn't possible to start column counting from the last what is possible only in repeat loop.
This is not the correct lesson to take away. Arrays will work the same as any other programmatic variable/method will work, the correct lesson to take away from this small diversion is that your data needs to be consistent no matter what method you are using to extract it. The generic term that says it all is "...garbage in, garbage out".
In other words, if you have a line with 3 items, lets say index number, Name, and phone number, then every line that follows needs to have them in the same order, every time. Your example listed earlier as I pointed out has all the elements apparently jumbled around, you can not program in any way I know of to take that into an account. If line 1 looks like:
[indent]1, Joe James, 916-864-5789 [/indent]
... then line 2 can not alter that order, even as little as ...
[indent]James John, 2, 916-864-5789 [/indent]
or the results will look like the mess you are working with. The lines do need the same amount of delimiters in them, but even more importantly is that each item, whether missing or not, is in its correct place in the line. Let's take line 1 thru 4 from your earlier example with my guess as to interpretation of the data:
GuessedItems.png
I hope you see the problem clearer now. If you were trying to get the last 2 items of each line, and assumed those would be the email address and name (my guessing) or whatever they are in this example, you would have results that are not as you expect them to be because lines 2 and 4 do not have the data in the format lines 1 and 3 do, regardless of which method you used in code, because your "data" is all mixed up, even in just 4 lines.
The array on the other hand, would still have 7 keys in it, in order from 1 to 8. You could (in code) simply request keys 7 and 8 (the last 2 keys in the array) and put them into a field just as I showed up there, but the data you would see would NOT ALWAYS be the data you are looking for because your SOURCE DOES not have each item in it's place in the line.
--Edit -- I am of course making the assumption that the example list CAME from your source list, if that is not the case, you certainly could try it with the code provided and see what you get :)
Image

SparkOut
Posts: 2943
Joined: Sun Sep 23, 2007 4:58 pm

Re: Get the last two columns

Post by SparkOut » Mon Feb 24, 2025 12:00 am

bogs wrote:
Sun Feb 23, 2025 10:07 pm
the correct lesson to take away from this small diversion is that your data needs to be consistent no matter what method you are using to extract it.
This ^

There are fudges and compromises, and thought experiments as an exercise in optimising the way to tackle your problem as originally stated. But these are no substitute at all for having valid data to start with.

If it truly is impossible to rely on the data, then you will have to work hard and make choices to use methods that might not be optimal. But that's not the fault of LiveCode, compromised data will compromise your options to handle it.

I am also certain that the size of the source data will be a major problem if you try to continue with the data in this format. Could you "manually" sanitise the data in a one off process and populate a database to be your application's data source, and use SQL queries to access it?

bogs
Posts: 5480
Joined: Sat Feb 25, 2017 10:45 pm

Re: Get the last two columns

Post by bogs » Tue Feb 25, 2025 7:48 pm

Since I was bored, I got a (relatively) large dataset (53+ meg.) I was pretty sure was *not* corrupted and ran it through some of the different ways to get the result you were looking for.

The dataset came from the government and had about 17 columns all told. The code for the array method used:

Code: Select all

on mouseUp
   put empty into field 1; put empty into field 2; put empty into field 3
   
   set the datProp of this stack to url("file:" & specialFolderPath("Desktop") & "/LiveCodeProjects/exampleProjects/largeCsvDatasetHandling/Electric_Vehicle_Population_Data.csv")
   put the milliseconds into tmpStart
   put the datProp of this stack into tData
   set the columndelimiter to comma
   split tData by column
   put tData[6] into field 2
   put tData[7] into field 3
   
   put "    Putting the data into a customProperty, then into an array, " & "and putting 2 of the keys into 2 seperate fields, took " & the milliseconds -tmpStart & "milliseconds" into field 1
end mouseUp
The above code run multiple times took between 400-700 milliseconds to complete.

Moving to the "repeat for each line x" loop which *should* be the fastest repeat method out there, it wasn't even close taking almost 20 minutes with the data in the same variable. The code for the repeat loop:

Code: Select all

on mouseUp
   put empty into field 1; put empty into field 2; put empty into field 3
   --lock screen
   put the seconds into tmpStart
   
   put url("file:/home/b/Desktop/LiveCodeProjects/exampleProjects/largeCsvDatasetHandling/Electric_Vehicle_Population_Data.csv") into tData
   set the itemDelimiter to comma
   put 1 into theLine
   repeat for each line x in tData
      put theLine & " of  " & the number of lines of tData into field 1
      put item 6 of x into line theline of field 2
      put item 7 of x into line theline of field 3
      add 1 to theLine
   end repeat
   
   --unlock screen
   put "    Putting item 6 & 7 of each line into fields 2 & 3 took " & the seconds -tmpStart & "seconds" into field 1
end mouseUp
I only ran this one time, it was more than obvious it was not going to even be close to the array in speed, which is logical since the array just dumps the entire key into a field at a time.

Having said that, you mentioned your machine is 'limited' in resources. I ran the above on a virtual machine (Linux, 4GB ram). If your machine is more limited than that, I would need specs. to simulate.

Even if your machine is more limited than that, though, you could find out how many lines are in the file (message box / put the number of lines of xxx).

The file I used that was 53 meg was about 23k lines, so I would divide the answer you get from the message box till it comes out around 23k lines, then again using the message box, break each multi-GB file down to files around that size (or whatever the max you feel will work).

Good luck with your project, but I would sure take the array approach over the repeat for one ANY day.
Image

SparkOut
Posts: 2943
Joined: Sun Sep 23, 2007 4:58 pm

Re: Get the last two columns

Post by SparkOut » Tue Feb 25, 2025 8:20 pm

Hey bogs, why would you update the field like that in the loop?
A) field uodates are a hefty overhead
B) you have two statements that put data into an indexed line, which means that twice within each loop the engine has to count through to the index to update each.
Surely a more comparative test would be to put the extracted data "after" or even "before" a variable, then drop that data into the field in one go.
Notwithstanding, the array split is a brilliant method, but relies on data being consistently sane.
Anyone producing insane data for others to manipulate surely is insane, or a sadist.

dunbarx
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 10305
Joined: Wed May 06, 2009 2:28 pm

Re: Get the last two columns

Post by dunbarx » Tue Feb 25, 2025 10:59 pm

Sparkout
Surely a more comparative test would be to put the extracted data "after" or even "before" a variable,
I had always heard that putting something "before" a variable was much slower than putting something "after"

But I just ran a test with a variable with ten million lines, and the two actions take the same time.

It occurs to me that this post adds nothing to the discussion.

Craig

richmond62
Livecode Opensource Backer
Livecode Opensource Backer
Posts: 10078
Joined: Fri Feb 19, 2010 10:17 am

Re: Get the last two columns

Post by richmond62 » Wed Feb 26, 2025 1:25 pm

Would I be right in thinking with datasets containing lines containing differing numbers of items one could do this:

Code: Select all

put tData[-2] into field 2
   put tData[-1] into field 3
to capture the last 2 items

?

Post Reply