Why CSV is still king
CSV remains a dominant file format in data processing thanks to its simplicity and widespread adoption, and it stays relevant despite challenges like the lack of standardization and text-encoding issues.
CSV remains a dominant file format in data processing due to its simplicity, resilience, and widespread adoption. Originating in the early days of computing, CSV emerged as a practical solution for storing tabular data, gaining traction in various business applications by the 1970s. Its integration into spreadsheet software like VisiCalc, Lotus 1-2-3, and Microsoft Excel further solidified its role as a universal data exchange format. Despite the rise of more advanced formats like Parquet, which offer better efficiency for data analysis, CSV's accessibility and ease of use keep it relevant.
However, CSV is not without its challenges, including the lack of an official standard, issues with text encoding, and difficulties in handling commas within data fields. These limitations can complicate the use of CSV, especially with complex datasets. Nevertheless, its human-readability and compatibility with a wide range of tools ensure its continued popularity.
Looking forward, CSV may see improvements through standardization efforts and new tools to address its shortcomings. Its enduring presence in published datasets and data processing tools suggests that CSV will remain a staple in the data landscape, demonstrating that sometimes the simplest solutions are the most effective.
Related
At 50 Years Old, Is SQL Becoming a Niche Skill?
SQL, a foundational technology, faces scrutiny in today's IT world. Evolving roles like data scientists challenge its centrality. Debates persist on SQL's relevance against newer technologies like JSON queries, impacting its future role.
What spreadsheets need? LLMs, says Microsoft
Microsoft developed SpreadsheetLLM to enhance large language models' efficiency in analyzing spreadsheet data. The tool addresses challenges like homogeneous rows/columns by serializing and compressing data, reducing token usage. Despite limitations, it aims to reduce computational costs and improve user interactions, potentially transforming data analysis tasks.
The Elegance of the ASCII Table
The article explores the elegance and historical importance of the ASCII table in computing. It discusses design choices, historical context, compatibility with Unicode, practical applications, and enduring relevance in technology.
Basic–The Most Consequential Programming Language in the History of Computing
BASIC, created in 1964, made programming accessible to students and hobbyists, fostering a culture of experimentation. Its legacy persists in education and among enthusiasts despite declining professional use.
Basic – The Most Consequential Programming Language in the History of Computing
BASIC, created in 1964, made programming accessible to students and hobbyists, fostering interest in coding. Its legacy influences modern languages, despite its decline in popularity among professional developers.
- Many users express frustration with CSV's limitations, particularly regarding the use of commas as delimiters, which complicates data handling.
- Several commenters advocate for using alternative formats like TSV (Tab-Separated Values) to avoid issues with escaping characters.
- There is a call for better standardization and tooling to address the common problems associated with CSV files.
- Some users share personal experiences and tools they've developed to manage CSV data more effectively.
- Overall, while CSV is widely used, there is a consensus that improvements and alternatives are needed to enhance data integrity and usability.
During the '90s I was anal about using them, pissing off my teammates and users by forcing them to use these 'standard compliant' files.
Had to give up.
- To escape the delimiter, we should enclose the value in double quotes. Ok, makes sense.
- To escape double quotes within the enclosing double quotes, we need to use two double quotes.
Many tools get this wrong. Meanwhile, some tools like pgAdmin justifiably let you configure the escaping character to be a double quote or a single quote, because the CSV standard is often not respected.
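As a rough sketch of those two rules in practice, here's what Python's standard csv module produces (the sample values are invented for illustration):

import csv, io

# Values containing the delimiter or double quotes get enclosed in double
# quotes; embedded double quotes are doubled, per RFC 4180.
rows = [["id", "comment"],
        ["1", 'She said "hello, world"'],
        ["2", "no special characters"]]

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
print(buf.getvalue())
# id,comment
# 1,"She said ""hello, world"""
# 2,no special characters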
Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.
Of course the cat emoji is escaped by the puppy emoji if it occurs in a value. The puppy emoji escapes itself when needed.
TSV doesn’t have this problem. It can represent any string that doesn’t have either a tab or a newline, which is many more than CSV can.
It seems like half the problems with CSV were solved back in the 70s with ASCII codes.
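For reference, ASCII reserves 0x1F (unit separator) for fields and 0x1E (record separator) for records; a minimal sketch with made-up data:

US = "\x1f"  # ASCII unit separator: delimits fields
RS = "\x1e"  # ASCII record separator: delimits records

rows = [["name", "note"], ["Ada", "likes, commas"], ["Bob", 'and "quotes"']]

# No quoting or escaping is needed because these control characters
# never appear in ordinary text.
encoded = RS.join(US.join(fields) for fields in rows)
decoded = [record.split(US) for record in encoded.split(RS)]
assert decoded == rows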
There is still plenty of this kind of data exchange happening, and CSV is perfectly fine for it.
If I'm consuming data produced by some giant tech company or mega bank or whatever, there is no chance I'll be able to get them to fix some issue I have processing it. From these kinds of folks, I'd like something other than CSV.
Tab makes far more sense here, because you are very likely able to just convert non-delimiter tabs to spaces without losing semantics.
Even considering how editors tend to mess with the tab character, there are still better choices based on frequency in typical text: |, ~, or even ;.
All IMHO, again.
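A hedged sketch of that tab-delimited approach in Python (sample data invented; embedded tabs are flattened to spaces before writing):

import csv, sys

rows = [["name", "note"], ["Ada", "contains\ta tab"], ["Bob", "plain text"]]

writer = csv.writer(sys.stdout, delimiter="\t", lineterminator="\n")
for fields in rows:
    # Replace tabs inside values with spaces so the delimiter stays unambiguous.
    writer.writerow([f.replace("\t", " ") for f in fields])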
I made ScrollSets, a language that compiles to CSVs! (https://scroll.pub/blog/scrollsets.html)
Here's a simple tool to turn your CSV into ScrollSet (https://scroll.pub/blog/csvToScrollSet.html)
This is what powers the CSV download on PLDB.io and how so many people collaborate on building a single CSV (https://pldb.io/csv.html)
I actually just finished a library to add proper typed parsing that works with existing CSV files. It's designed to be as compatible as possible with existing spreadsheets, while allowing for perfect escaping and infinite nesting of complex data structures and strings. I think it's an ideal compromise, as most CSV files won't change at all.
I'm not bitter, I just hate working with ETL 'teams' that struggle to output the data in a specified format - even when you specify it in the way they want you to.
it'll only remain king as long as we let it.
Move to using SQLite db files as your interchange format.
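A minimal sketch of that workflow using Python's built-in csv and sqlite3 modules (file and table names are placeholders):

import csv, sqlite3

# Load a CSV into a single SQLite file that can be shipped around as the
# interchange artifact; assumes every row has the same number of columns.
with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    header, data = next(reader), list(reader)

con = sqlite3.connect("exchange.db")
cols = ", ".join(f'"{c}"' for c in header)
placeholders = ", ".join("?" for _ in header)
con.execute(f"CREATE TABLE IF NOT EXISTS data ({cols})")
con.executemany(f"INSERT INTO data VALUES ({placeholders})", data)
con.commit()
con.close()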
I help clients deal with them frequently. For many cases they are sufficient, for other cases moving to something like parquet makes a lot of sense.
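For the Parquet route, a one-off conversion is a few lines with pandas plus a Parquet engine such as pyarrow (file names are placeholders; types are inferred, so they're worth checking on real data):

import pandas as pd

df = pd.read_csv("input.csv")
df.to_parquet("output.parquet")  # needs pyarrow or fastparquet installed
print(df.dtypes)                 # sanity-check the inferred column types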
It's just much easier to keep using it, since you're already doing it.
In the meantime, how about XML? /awaits the pack of raving mad HNers
echo foo | jq -rR 'split("") | @csv'
What we need is:
- A standard (yeah, link xkcd 927, it's mentioned enough that I can recall its ID) to be announced **after** the rest of things are ready.
- Libraries to work with it in major languages. One in Rust + wrappers in common languages might get good traction these days. Having support for dataframe libraries right away might be necessary too.
- Good tooling. I'm guessing one of the reasons CSV took off is that regular unix tools are able to deal with CSVs mostly fine (there are edge cases with field delimiters/commas, but it's not that bad).
The new format would ideally have types, the files would be sharded and carry metadata so they can be scanned quickly, and the tooling should be able to do simple joins, ideally automatically based on the metadata, since most of the time there's a single reasonable way to join tables. This seems like too much work to get right from the very beginning, so maybe building on top of Apache Arrow might help reduce the solution space.
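Apache Arrow already covers a fair bit of that wish list; as a hedged sketch (column names, paths, and the partition key are hypothetical), a typed read plus a sharded on-disk layout might look like:

import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Read a CSV with an explicit schema instead of guessing the types.
table = pacsv.read_csv(
    "events.csv",
    convert_options=pacsv.ConvertOptions(
        column_types={"day": pa.string(), "user_id": pa.int64(), "amount": pa.float64()}
    ),
)

# Shard the rows by a key column so readers can scan only the partitions they need.
pq.write_to_dataset(table, root_path="events_parquet", partition_cols=["day"])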
Having so many formats is confusing, inefficient and leads to data loss. This article is right, CSV is king simply because it's essentially the lowest common denominator and I, like most of us, use it for that reason—at least that's so for data that can be stored in database type formats.
But take other data such as images, sound and AVI, and even text. There are dozens of sound, image and other formats. It's all a first-class mess.
For example, we fall back to the antiquated, horrible JPG format because we can't agree on better ones such as, say, JPEG 2000; there are always excuses for why we can't: speed, data size, inefficient algorithms, etc.
Take word processing, for instance: why is it so hard to convert Microsoft's confounded, nasty DOC format to, say, the OpenDocument ODT format without errors? It's almost impossible to get the layout in one format converted accurately into another. Similarly, information is lost converting from lossless TIF to, say, JPG, or from WAV to MP3, etc. What's worse is that so few seem to care about such things.
Every time a conversion is done between lossless formats and lossy ones, entropy increases. That's not to say it shouldn't happen; it's just that, in isolation, one has little or no idea about the quality of the original material. Even with ever-increasing speeds and more and more storage space, so many still have an obsession, in fact a fetish, for compressing data into smaller and smaller sizes using lossy formats with little regard for what's actually lost.
It's not only in sound and image formats that data integrity suffers for convenience; take the case of converting data fields from one format to another. How often has one experienced a field being truncated during conversion, where, say, 128 characters suddenly becomes 64 or so and there's no indication from the converter that data has actually been truncated? Many times, I'd suggest.
Another instance is where fields in the original data don't exist in the converted format. For example, data is often lost from one's phone contacts when moving from an old phone to a new one because the new phone doesn't accommodate all the fields of the old one.
Programmers really have a damn hide for not only allowing this to occur but for not even warning the poor hapless user that some of his/her data has been lost.
That programmers have so little regard and consideration for data integrity is, I reckon, a terrible situation and a blight on the whole IT industry.
Why doesn't computer science take these issues more seriously?