Why CSV is still king
CSV remains a dominant file format in data processing thanks to its simplicity and widespread adoption, and it stays relevant despite challenges like the lack of standardization and text-encoding issues.
CSV remains a dominant file format in data processing due to its simplicity, resilience, and widespread adoption. Originating in the early days of computing, CSV emerged as a practical solution for storing tabular data, gaining traction in various business applications by the 1970s. Its integration into spreadsheet software like VisiCalc, Lotus 1-2-3, and Microsoft Excel further solidified its role as a universal data exchange format. Despite the rise of more advanced formats like Parquet, which offer better efficiency for data analysis, CSV's accessibility and ease of use keep it relevant.
However, CSV is not without its challenges, including the lack of an official standard, issues with text encoding, and difficulties in handling commas within data fields. These limitations can complicate the use of CSV, especially with complex datasets. Nevertheless, its human-readability and compatibility with a wide range of tools ensure its continued popularity.
Looking forward, CSV may see improvements through standardization efforts and new tools to address its shortcomings. Its enduring presence in published datasets and data processing tools suggests that CSV will remain a staple in the data landscape, demonstrating that sometimes the simplest solutions are the most effective.
Related
At 50 Years Old, Is SQL Becoming a Niche Skill?
SQL, a foundational technology, faces scrutiny in today's IT world. Evolving roles like data scientists challenge its centrality. Debates persist on SQL's relevance against newer technologies like JSON queries, impacting its future role.
What spreadsheets need? LLMs, says Microsoft
Microsoft developed SpreadsheetLLM to enhance large language models' efficiency in analyzing spreadsheet data. The tool addresses challenges like homogeneous rows/columns by serializing and compressing data, reducing token usage. Despite limitations, it aims to reduce computational costs and improve user interactions, potentially transforming data analysis tasks.
The Elegance of the ASCII Table
The article explores the elegance and historical importance of the ASCII table in computing. It discusses design choices, historical context, compatibility with Unicode, practical applications, and enduring relevance in technology.
Basic–The Most Consequential Programming Language in the History of Computing
BASIC, created in 1964, made programming accessible to students and hobbyists, fostering a culture of experimentation. Its legacy persists in education and among enthusiasts despite declining professional use.
Basic – The Most Consequential Programming Language in the History of Computing
BASIC, created in 1964, made programming accessible to students and hobbyists, fostering interest in coding. Its legacy influences modern languages, despite its decline in popularity among professional developers.
- Many users express frustration with CSV's limitations, particularly regarding the use of commas as delimiters, which complicates data handling.
- Several commenters advocate for using alternative formats like TSV (Tab-Separated Values) to avoid issues with escaping characters.
- There is a call for better standardization and tooling to address the common problems associated with CSV files.
- Some users share personal experiences and tools they've developed to manage CSV data more effectively.
- Overall, while CSV is widely used, there is a consensus that improvements and alternatives are needed to enhance data integrity and usability.
During the '90s I was anal about using them, pissing off my teammates and users by forcing them to use these 'standard compliant' files.
Had to give up.
- To escape the delimiter, we should enclose the value in double quotes. Ok, makes sense.
- To escape double quotes within the enclosing double quotes, we need to use two double quotes.
Many tools get this wrong. Meanwhile, some tools like pgAdmin justifiably let you configure the escaping character to be a double quote or a single quote, because the CSV standard is often not respected.
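As a rough sketch of those two rules in practice, here's what Python's standard csv module produces (the sample values are invented for illustration):

import csv, io

# Values containing the delimiter or double quotes get enclosed in double
# quotes; embedded double quotes are doubled, per RFC 4180.
rows = [["id", "comment"],
        ["1", 'She said "hello, world"'],
        ["2", "no special characters"]]

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
print(buf.getvalue())
# id,comment
# 1,"She said ""hello, world"""
# 2,no special characters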
Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.
Of course the cat emoji is escaped by the puppy emoji if it occurs in a value. The puppy emoji escapes itself when needed.
TSV doesn’t have this problem. It can represent any string that doesn’t have either a tab or a newline, which is many more than CSV can.
It seems like half the problems with CSV were solved back in the 70s with ASCII codes.
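For reference, ASCII reserves 0x1F (unit separator) for fields and 0x1E (record separator) for records; a minimal sketch with made-up data:

US = "\x1f"  # ASCII unit separator: delimits fields
RS = "\x1e"  # ASCII record separator: delimits records

rows = [["name", "note"], ["Ada", "likes, commas"], ["Bob", 'and "quotes"']]

# No quoting or escaping is needed because these control characters
# never appear in ordinary text.
encoded = RS.join(US.join(fields) for fields in rows)
decoded = [record.split(US) for record in encoded.split(RS)]
assert decoded == rows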
There is still plenty of this kind of data exchange happening, and CSV is perfectly fine for it.
If I'm consuming data produced by some giant tech company or mega bank or whatever, there is no chance I'll be able to get them to fix some issue I have processing it. From these kinds of folks, I'd like something other than CSV.
Tab makes far more sense here, because you are very likely able to just convert non-delimiter tabs to spaces without losing semantics.
Even considering how editors tend to mess with the tab character, there are still better choices based on frequency in typical text: |, ~, or even ;.
All IMHO, again.
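A hedged sketch of that tab-delimited approach in Python (sample data invented; embedded tabs are flattened to spaces before writing):

import csv, sys

rows = [["name", "note"], ["Ada", "contains\ta tab"], ["Bob", "plain text"]]

writer = csv.writer(sys.stdout, delimiter="\t", lineterminator="\n")
for fields in rows:
    # Replace tabs inside values with spaces so the delimiter stays unambiguous.
    writer.writerow([f.replace("\t", " ") for f in fields])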
I made ScrollSets, a language that compiles to CSVs! (https://scroll.pub/blog/scrollsets.html)
Here's a simple tool to turn your CSV into ScrollSet (https://scroll.pub/blog/csvToScrollSet.html)
This is what powers the CSV download on PLDB.io and how so many people collaborate on building a single CSV (https://pldb.io/csv.html)
I actually just finished a library to add proper typed parsing that works with existing CSV files. It's designed to be as compatible as possible with existing spreadsheets, while allowing for perfect escaping and infinite nesting of complex data structures and strings. I think it's an ideal compromise, as most CSV files won't change at all.
I'm not bitter, I just hate working with ETL 'teams' that struggle to output the data in a specified format - even when you specify it in the way they want you to.
it'll only remain king as long as we let it.
Move to using SQLite db files as your interchange format.
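A minimal sketch of that workflow using Python's built-in csv and sqlite3 modules (file and table names are placeholders):

import csv, sqlite3

# Load a CSV into a single SQLite file that can be shipped around as the
# interchange artifact; assumes every row has the same number of columns.
with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    header, data = next(reader), list(reader)

con = sqlite3.connect("exchange.db")
cols = ", ".join(f'"{c}"' for c in header)
placeholders = ", ".join("?" for _ in header)
con.execute(f"CREATE TABLE IF NOT EXISTS data ({cols})")
con.executemany(f"INSERT INTO data VALUES ({placeholders})", data)
con.commit()
con.close()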
I help clients deal with them frequently. For many cases they are sufficient, for other cases moving to something like parquet makes a lot of sense.
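For the Parquet route, a one-off conversion is a few lines with pandas plus a Parquet engine such as pyarrow (file names are placeholders; types are inferred, so they're worth checking on real data):

import pandas as pd

df = pd.read_csv("input.csv")
df.to_parquet("output.parquet")  # needs pyarrow or fastparquet installed
print(df.dtypes)                 # sanity-check the inferred column types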
It's just much easier to keep using it, since you're already doing it.
In the meantime, how about XML? /awaits the pack of raving mad HNers
echo foo | jq -rR 'split("") | @csv'
What we need is:
- A standard (yeah, link xkcd 927, it's mentioned enough that I can recall its ID) to be announced **after** the rest of things are ready.
- Libraries to work with it in major languages. One in Rust + wrappers in common languages might get good traction these days. Having support for dataframe libraries right away might be necessary too.
- Good tooling. I'm guessing one of the reasons CSV took off is that regular unix tools are able to deal with CSVs mostly fine (there are edge cases with field delimiters/commas, but it's not that bad).
The new format would ideally have types, the files would be sharded and carry metadata so they can be scanned quickly, and the tooling should be able to do simple joins, ideally automatically based on the metadata, since most of the time there's a single reasonable way to join tables. This seems like too much work to get right from the very beginning, so maybe building on top of Apache Arrow might help reduce the solution space.
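Apache Arrow already covers a fair bit of that wish list; as a hedged sketch (column names, paths, and the partition key are hypothetical), a typed read plus a sharded on-disk layout might look like:

import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Read a CSV with an explicit schema instead of guessing the types.
table = pacsv.read_csv(
    "events.csv",
    convert_options=pacsv.ConvertOptions(
        column_types={"day": pa.string(), "user_id": pa.int64(), "amount": pa.float64()}
    ),
)

# Shard the rows by a key column so readers can scan only the partitions they need.
pq.write_to_dataset(table, root_path="events_parquet", partition_cols=["day"])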
Having so many formats is confusing, inefficient and leads to data loss. This article is right, CSV is king simply because it's essentially the lowest common denominator and I, like most of us, use it for that reason—at least that's so for data that can be stored in database type formats.
But take other data such as images, sound and AVI, and even text. There are dozens of sound, image and other formats. It's all a first-class mess.
For example, we fall back to the antiquated, horrible JPG format because we can't agree on better ones such as, say, JPEG 2000; there are always excuses for why we can't: speed, data size, inefficient algorithms, etc.
Take word processing, for instance: why is it so hard to convert Microsoft's confounded, nasty DOC format to, say, the OpenDocument ODT format without errors? It's almost impossible to get the layout in one format converted accurately into another. Similarly, information is lost converting from lossless TIF to, say, JPG, or from WAV to MP3, etc. What's worse is that so few seem to care about such things.
Every time a conversion is done between lossless formats and lossy ones, entropy increases. That's not to say it shouldn't happen; it's just that, in isolation, one has little or no idea about the quality of the original material. Even with ever-increasing speeds and more and more storage space, so many still have an obsession, in fact a fetish, for compressing data into smaller and smaller sizes using lossy formats with little regard for what's actually lost.
It's not only in sound and image formats that data integrity suffers for convenience; take the case of converting data fields from one format to another. How often has one experienced a field being truncated during conversion, where, say, 128 characters suddenly becomes 64 or so and there's no indication from the converter that data has actually been truncated? Many times, I'd suggest.
Another instance is where fields in the original data don't exist in the converted format. For example, data is often lost from one's phone contacts when moving from an old phone to a new one because the new phone doesn't accommodate all the fields of the old one.
Programmers really have a damn hide for not only allowing this to occur but for not even warning the poor hapless user that some of his/her data has been lost.
That programmers have so little regard and consideration for data integrity is, I reckon, a terrible situation and a blight on the whole IT industry.
Why doesn't computer science take these issues more seriously?