CSV files

LiveCode is the premier environment for creating multi-platform solutions for all major operating systems - Windows, Mac OS X, Linux, the Web, Server environments and Mobile platforms. Brand new to LiveCode? Welcome!

Moderators: FourthWorld, heatherlaine, Klaus, kevinmiller, robinmiller

FourthWorld
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 10050
Joined: Sat Apr 08, 2006 7:05 am
Contact:

Re: CSV files

Post by FourthWorld » Mon Feb 23, 2015 7:46 pm

AxWald wrote:I think we're talking about semantics here. For sure, CSV isn't a standard. An RFC for it exists, but it doesn't seem to have ever reached approval.
Indeed, as I noted in the article I linked to above, in practice there are so many varieties of escaping conventions, and even in the specifics of delimiting itself, that we find ourselves in a needlessly complicated world of horribly formatted data.
Besides this, the term "Comma Separated Values" is grossly misleading, because this would make it unusable for large parts of the world, where the comma is used as decimal separator (Am I right that the comma is used in anglo-american notation as separator for thousands?). No fractional numbers possible, nor nicely formatted ones? This doesn't make sense at all.
Exactly. And this is compounded by the common frequency of commas in textual data, making commas an ever worse choice beyond hope of redemption as a generalized format.
So, even if Wikipedia claims this name to be correct, it isn't. Nobody uses commas as field delimiters! The current common use is to have semicolons as field delimiters, check in your office software!
SQL dumps use CSV, and many published data sets use it as well, as I noted in my article. I haven't used Microsoft Excel in years, but double-checking LibreOffice Calc this morning it uses comma as the default when I choose "Save As" with the "CSV" option.
As mentioned, IMHO we're discussing semantics here. I may have another view of CSV as you, and you may not agree to my view of it as "Character Separated Values".
We can use any terms we like to describe anything, but being explicit is sometimes helpful.

If I tell people I work for the "CIA" and provide no further clarification, I can hardly blame them if they belive I work for the Central Intelligence Agency when I might work for Catalina Island Aviation.

The Help file on CSV for LibreOffice explicitly defines "CSV" as "Comma-Separated Values":
https://help.libreoffice.org/Calc/Impor ... _CSV_Files

Microsoft Office's Help also defines "CSV" as "Comma Separated Values", using ".txt" for tab-separated values:
https://support.office.com/en-us/articl ... e8a008ba5c

When discussing delimited data in general, I find it's helpful to just use "delimited data", distinguishing from the decades-long history of the "C" in CSV" meaning "Comma".

To further distinguish tab-delimited data (arguably the sanest choice) specifically, in writing I describe it as "tab-delimited" and in file extensions I use "*.tab".
I'm (to be precise: the servers my LC standalones run on) sending literally thousands of CSV packets each hour, like this:

Code: Select all

ORD received:  212445, 84, 53a2a591deb6d
0154120,2.3,6
9141000,3.14,12
7008510,4.15,6
0856030,8.93,6
0856010,9.64,6
(This is a real-world example of the acknowledgment of an order, and it's actually REAL CSV! :shock:)

So, please abandon your evil desire to kill poor old CSV!
It hasn't done no harm to you, and it would break my heart would I be forced to replace this nice, cute, little piece of datagram with an XML equivalent, where alone the parentheses would be more than the actual data ...
Relax. It's all good. Thankfully for all of us, I have no control over what anyone other than myself does with a keyboard. :)

As I describe in the article I've referred to but is apparently not interesting enough to have been read, the problem with commas is no more and no less than what you've noted yourself: commas often appear in data.

So using commas as a generalized solution for data exports is inherently prone to inefficiency and complexity.

This widely-used convention has indeed caused all of us harm, in millions of lost hours among both developers and end-users alike. At best it's wasted clock cycles, and at worst the complexity has led to bugs resulting in invalid imports.

If my advocacy for choosing any other delimiter is "evil", please note your comments above where you do so yourself.

I don't think either of us is being evil. I think we're both just trying to solve real-world problems in an efficient way.

Your data set above is a very good example, as are the many functions in HyperCard and LiveCode that return values delimited by commas.

The important distinction here is that commas are used as delimiters in LiveCode values only when it can be known that commas won't also appear in the data being delimited. For example, getting the rect of an object will return a comma-delimited value.

But when LiveCode needs to return data in which commas may appear, such as the cardNames, it uses return as the delimiter rather than comma.

When writing data used by systems you control, use whatever works.

But when a client asks you to generate CSV exports, I recommend - as you do - using tab-delimited instead.

All of this is discussed in my article. It's a shame I don't have David Sedaris' gift for entertaining writing style, because if the article were read as frequently as its argued against there would likely be less argument, since in the end we all seem to agree with what's proposed there.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

AxWald
Posts: 578
Joined: Thu Mar 06, 2014 2:57 pm

Re: CSV files

Post by AxWald » Tue Feb 24, 2015 6:11 pm

Hi,

Be assured, Richard, that I read your article - more than once actually, and once more before writing this! :wink:

And I have to row back about my claim concerning LibreOffice, it actually uses "Comma" as the default delimiter. I have no Excel here, but my MS Access 2003 shows me this, after starting an CSV-Export (after a first screen asking if you don't want to use fixed length data [ARRGGHH!], maybe ...):
access_csv.png
MS Access 2K3 file types
access_csv.png (10.41 KiB) Viewed 7935 times
Access at least seems to do it right - after all it seems to be the only really awesome program MS ever bought: :wink:
access_typ.png
MS Access 2K3 CSV-Saving dialogue
access_typ.png (5.1 KiB) Viewed 7935 times
And it offers even .tab as file ending!

I fully agree with you that "Comma" is the worst possible field separator (besides for special cases like the one I have shown, and where it is enforced by "the other side") - but let me mention that "Tab" isn't this much better! Having started my "career" with Mac & HyperCard, "Tab-Return" was the native format for me, for a long time.
Until other ppl started to use my software, and pasted text from their word processors, including those dreaded tabs - you surely can imagine the time used for debugging those rare crashes, and the time used to further avoid them ...

Thus I may join your condemnation of "Comma Separated Values". While we may agree that "Character Separated Values" are a vital part of our IT life, I beg to consider that "Tab" isn't our rescue, either - "Semicolon" for instance has a much lower danger potential.


But we may stop this discussion here (At least I will! *1), and start to find better delimiters for those of us that are forced to use character separated values:
  • They should be chars that can be entered into an options dialog by keyboard.
  • They should not be used in every days text input.
  • (They should be printing chars that you can see when checking a file in a text editor.) if possible ...
Do we have such beasts in the common ASCII/ ANSI char space? We cannot rely on Unicode, since there's still tons of software out there that doesn't know it.

Suggestions?

Have a good time!

*1: After having read all again, I cannot help but to realize that we agree in nearly all of the core points. But we disagree just in the terms. Guess we can live with this, right?
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!

FourthWorld
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 10050
Joined: Sat Apr 08, 2006 7:05 am
Contact:

Re: CSV files

Post by FourthWorld » Tue Feb 24, 2015 7:44 pm

AxWald wrote:I fully agree with you that "Comma" is the worst possible field separator (besides for special cases like the one I have shown, and where it is enforced by "the other side") - but let me mention that "Tab" isn't this much better!
One strong advantage tab has over comma is that it occurs in text far less frequently.

This advantage is often amplified in the very types of apps that need exporters most: databases and spreadsheets and such, where tab is disallowed in most text input, used instead for navigating between fields.

In this regard it makes almost a natural choice for delimiter, since the user interaction uses it as a form of separator between fields, reinforced in the export format as well.

Until other ppl started to use my software, and pasted text from their word processors, including those dreaded tabs - you surely can imagine the time used for debugging those rare crashes, and the time used to further avoid them ...
That's one of the nice things about LiveCode: it handles them as gracefully as any spreadsheet or word processor table - just paste into a field that has its vGrid property set and you get beautiful column alignment automatically.

Thus I may join your condemnation of "Comma Separated Values". While we may agree that "Character Separated Values" are a vital part of our IT life, I beg to consider that "Tab" isn't our rescue, either - "Semicolon" for instance has a much lower danger potential.
Depends on which dangers we're avoiding. In terms of frequency of occurence in data, I use semi-colons often in writing but almost never find it even possible to use a tab, and when it does appear in text it's used as a separator for formatting columns, as in spreadsheets, word processor tables, or LiveCode fields.

But we may stop this discussion here (At least I will! *1), and start to find better delimiters for those of us that are forced to use character separated values:
  • They should be chars that can be entered into an options dialog by keyboard.
  • They should not be used in every days text input.
  • (They should be printing chars that you can see when checking a file in a text editor.) if possible ...
Do we have such beasts in the common ASCII/ ANSI char space? We cannot rely on Unicode, since there's still tons of software out there that doesn't know it.

Suggestions?
Choosing delimiters is a useful thing to think about for all the reasons we've covered:
- Delimited text is more efficient than alternatives like XML for tabular data
- It's simpler, and thus less error-prone
- Commas suck because they're extremely likely to appear in the text being delimited

The whole point with delimited data is to support interoperability, so any convention that's widely used is a good place to start.

Delimiting records with CR is nearly universal so that's a natural fit as a record delimiter.

Tab as a field delimiter is supported in nearly every app that also supports CSV, so it has the same scope of interoperability coverage but without the parsing overhead.

Those two are fairly safe choices, and will cover the vast majority of needs just as they are.

This still leaves us with secondary delimiters, or escapes, for those edge cases where a field's data may legitimately contain CRs or tabs.

There's far less consistency among implementers with escapes (even with CSV, as we see from the dizzying variety of escape methods), and as long as the escaping is char-for-char rather than completely restructuring the content such as CSV does by adding quotes around it (which themselves then need escaping), it barely matters much which pair of characters one chooses.

That said, keeping in mind that the goal of any delimited format is interoperability, it never hurts to choose a pair of characters already in use by others.

I tend to use 0x04 for escaping in-data tabs and 0x11 for escaping in-data CRs because if nothing else there are a few million FileMaker Pro users doing the same thing with that app's wonderfully concise Merge format.

These are also easy to remember because they somewhat follow their origins:

- 0x04 is EOT, a minor boundary for delimiting text before reaching EOF

- 0x11 is Vertical Tab, used in some programs as a minor form of line-ending for wrapping within cells, and even LiveCode now renders it accordingly in fields since v5.5 when dontWrap is false.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

rinzwind
Posts: 135
Joined: Tue May 01, 2012 10:44 am

Re: CSV files

Post by rinzwind » Sat Jul 23, 2016 1:02 pm

I know... old thread...

Too bad CSV files are so wildly different. I don't know how we got here... because even in the beginning of computing the ASCII group specified and used specific unprintable chars for delimiting tables, rows and columns which would be perfectly suitable for text files containing tabular data, aka CSV. Somewhere along came the comma and newline... probably because it renders visually in a text editor. But hey... those same text editors could be changed to handle the dedicated delimiters in a visual way. But it did not happen. Either way... these are imho the most useful delimiters because they never cause problems in any kind of (international) text and are specifically chosen for this purpose:

global gs, rs, us
put numToChar(29) into gs --group/table
put numToChar(30) into rs --record/row
put numToChar(31) into us --unit/field
(anyone got a solution to make these into constants?)

(ps Would be nice if these constants where added to the language like cr and tab already are. Also... why are the dgText delimiters in the datagrid hard coded to tab and cr? Better to have these configurable and easy to implement too!)

29 – GS – Group separator
Data storage was one of the main reasons for some control codes to get in the ASCII definition. Databases are most of the time setup with tables, containing records. All records in one table have the same type, but records of different tables can be different. The group separator GS is defined to separate tables in a serial data storage system. Note that the word table wasn't used at that moment and the ASCII people called it a group.
30 – RS – Record separator
Within a group (or table) the records are separated with RS or record separator.
31 – US – Unit separator
The smallest data items to be stored in a database are called units in the ASCII definition. We would call them field now. The unit separator separates these fields in a serial data storage environment. Most current database implementations require that fields of most types have a fixed length. Enough space in the record is allocated to store the largest possible member of each field, even if this is not necessary in most cases. This costs a large amount of space in many situations. The US control code allows all fields to have a variable length. If data storage space is limited—as in the sixties—this is a good way to preserve valuable space. On the other hand is serial storage far less efficient than the table driven RAM and disk implementations of modern times. I can't imagine a situation where modern SQL databases are run with the data stored on paper tape or magnetic reels...

http://www.lammertbies.nl/comm/info/asc ... cters.html

Mikey
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 755
Joined: Fri Jun 27, 2008 9:00 pm

Re: CSV files

Post by Mikey » Sat Jul 23, 2016 3:47 pm

We have an open-sourced csv parser, all in LC, at https://github.com/macMikey/csvToText

AxWald
Posts: 578
Joined: Thu Mar 06, 2014 2:57 pm

Re: CSV files

Post by AxWald » Mon Jul 25, 2016 1:09 pm

Hi,
rinzwind wrote:I know... old thread...
Doesn't matter - useful information is always welcome! ;-)
rinzwind wrote:global gs, rs, us
put numToChar(29) into gs --group/table
put numToChar(30) into rs --record/row
put numToChar(31) into us --unit/field
(anyone got a solution to make these into constants?)

Code: Select all

constant RecSep=numToChar(30), FldSep=numToChar(31), CstSep=numToChar(29)
on whatever
...
Maybe like this?
Since we rarely need a table separator I'd use the "GS" as a wildcard - sometimes there may be a use for it ;-)

On the other hand, is there really a need to have "acknowledged separators" anymore?

I remember that, back in the good old days of HC, I used such when I wrote my book-keeping software, that stored it's data in plain text files.
But: Today I just set up a SQLite db, rhyme some SQL & and let the driver care for the details ...

So, for me at least, there's no real need to use my own separators at all - and I work with table data day in day out. But these nearly always come as a result of an SQL query (and as such with the delimiters I want), or from the output of a third party program that I cannot control anyways.
In both cases it's rather easy to work with the data, with limited use of "the itemDel". And if the incoming data received don't suit my needs at all, they are made suiting by brute force ;-)
Thx a lot for this link! This site is full of useful information! :)

Have fun!
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!

richmond62
Livecode Opensource Backer
Livecode Opensource Backer
Posts: 10099
Joined: Fri Feb 19, 2010 10:17 am

Re: CSV files

Post by richmond62 » Mon Jul 25, 2016 6:19 pm

A bit late to the party, but . . .

My experience importing text files with both single and double quotes is distinctly negative,
so I tend to strip them out before doing anything else.

rinzwind
Posts: 135
Joined: Tue May 01, 2012 10:44 am

Re: CSV files

Post by rinzwind » Tue Jul 26, 2016 10:29 am

constant RecSep=numToChar(30), FldSep=numToChar(31), CstSep=numToChar(29)
Neh, thats not working unfortunately. The constant keyword only works with numbers and strings, not one time calculations. It would be nice if it did. Constant in the definition of 'assign one time and cannot be changed afterwards'. How its assigned should not matter. And going on offtopic ;).., would also be nice if there was something like stack local and card local variables (available anywhere in objects on the specific stack or card). Now its all or nothing. Global, local. I know there are properties, but they are soooo verbose, too verbose for their own good and hurts readability.

AxWald
Posts: 578
Joined: Thu Mar 06, 2014 2:57 pm

Re: CSV files

Post by AxWald » Wed Jul 27, 2016 1:05 pm

Hi,
rinzwind wrote:
constant RecSep=numToChar(30), FldSep=numToChar(31), CstSep=numToChar(29)
Neh, thats not working unfortunately. The constant keyword only works with numbers and strings, not one time calculations. It would be nice if it did.
Bummer! I knew there was a reason I rarely use constants, but was too lazy to remember why - Sorry!
rinzwind wrote:I know there are properties, but they are soooo verbose, too verbose for their own good and hurts readability.
Dunno, after looking at them with distrust for a while I'm now using them regularly.
In our case here:

Code: Select all

on openstack
   set the RecSep of this stack to numToChar(30)
   set the FldSep of this stack to numToChar(31)
   set the CstSep of this stack to numToChar(29)
end openstack
and then somewhere in the code:

Code: Select all

  put "this " & the FldSep of this stack & "is a " & the FldSep of this stack & \
         "record" & the RecSep of this stack after MyTableToBe
This comes quite close to a constant implemented in a way you'd expect ;-)

It's verbose, right. Yet I still prefer this to the parenthesial orgies of other languages ...
And it has to be verbose - after all, you can add custom props to any object, on the fly. This comes really close to the different local variables you mentioned.
Btw., AutoIt can be helpful replacing "#ts " with " of this stack " (and more) on the fly, btw. ;-)

Have fun!
All code published by me here was created with Community Editions of LC (thus is GPLv3).
If you use it in closed source projects, or for the Apple AppStore, or with XCode
you'll violate some license terms - read your relevant EULAs & Licenses!

FourthWorld
VIP Livecode Opensource Backer
VIP Livecode Opensource Backer
Posts: 10050
Joined: Sat Apr 08, 2006 7:05 am
Contact:

Re: CSV files

Post by FourthWorld » Wed Jul 27, 2016 5:00 pm

Another option comes at the cost of two characters, but works inline and requires no per-script declarations:

Code: Select all

set the itemDel to us()
set the lineDel to rs()
...where a library or backscript could include:

Code: Select all

function rs
   return numtochar(30)
end rs

function us
   return numtochar(31)
end us
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

Post Reply