November 1st, 2024

DuckDB over Pandas/Polars

Paul Gross prefers DuckDB for data analysis over Polars and Pandas, citing its intuitive SQL syntax, ease of use for data manipulation, and automatic date parsing as significant advantages.

Read original article

Paul Gross discusses his experience using DuckDB for data analysis compared to Polars and Pandas. He initially attempted to analyze financial CSV files using Polars but found its syntax confusing and cumbersome, particularly when it came to selecting and transforming columns, parsing date formats, and using lambda functions. In contrast, he found DuckDB more intuitive due to his familiarity with SQL, allowing him to write queries that were easier to understand and execute. He highlights the simplicity of using SQL for tasks such as summing amounts and joining multiple CSV files, noting that DuckDB automatically handles date parsing. Gross concludes that DuckDB is a powerful and enjoyable tool for data analysis, especially for users who are more accustomed to SQL than to the syntax of libraries like Polars.

- DuckDB is preferred by users familiar with SQL for its intuitive syntax.

- Polars can be complex for casual users due to its syntax and transformation requirements.

- DuckDB simplifies data parsing and manipulation tasks compared to other libraries.

- The ability to join multiple CSVs and apply complex conditions is a significant advantage of DuckDB.

- Users transitioning from SQL to DuckDB may find it easier to perform data analysis tasks.

Announcing Polars 1.0 (Blog Post)

Polars releases Python version 1.0 after 4 years, gaining popularity with 27.5K GitHub stars and 7M monthly downloads. Plans include improving performance, GPU acceleration, Polars Cloud, and new features.

Memory Management in DuckDB

DuckDB optimizes query processing with effective memory management, using a streaming execution engine and disk spilling for large datasets. Its buffer manager enhances performance by caching frequently accessed data.

pg_duckdb: Splicing Duck and Elephant DNA

MotherDuck launched pg_duckdb, an open-source extension integrating DuckDB with Postgres to enhance analytical capabilities while maintaining transactional efficiency, supported by a consortium of companies and community contributions.

DuckDB 1.1.0 Released

DuckDB 1.1.0, codenamed "Eatoni," introduces significant updates including new SQL functionalities, improved community extensions, and performance enhancements, aiming to enhance user experience and efficiency in data analysis.

GPU Acceleration with Polars

Polars has introduced GPU acceleration with NVIDIA RAPIDS, offering up to 13 times faster performance for compute-bound queries in Python, while maintaining existing API semantics and fallback to CPU execution.

9 comments

By @lopatin - 6 months

I think the competition for the future is between DuckDB and Polars. Will we stick with the DataFrame model, made feasible by Polars's lazy execution, or will we go with in-process SQL a la DuckDB? Personally I've been using DuckDB because I already know SQL (and DuckDB provides persistence if I need it) and don't want to learn a new DataFrame DSL but I'd love to hear other the experience of other people.

By @ramraj07 - 6 months

I am just using duckdb on a 3TB dataset in a beefy ec2, and am pleasantly surprised at its performance on such a large table. I had to do some sharding to be sure but am able to match performance of snowflake or other cluster based systems using this single machine instance.

To clarify Clickhouse will likely match this performance as well, but doing things on a single machines look sexier to me than it ever did in decades.

By @minimaxir - 6 months

The test case of a simple aggregation is a good example of an important data science skill knowing when and here to use a given tool, and that there is no one right answer for all cases. Although it's worth noting that DuckDB and polars are comparable performance-wise for aggregation (DuckDB slightly faster: https://duckdblabs.github.io/db-benchmark/ ).

For my cases with polars and function piping, certain aspects of that workflow are hard to represent in SQL, and additionally it's easier for iteration/testing on a given aggregation to add/remove a given function pipe, and to relate to existing tables (e.g. filter a table to only IDs present in a different table, which is more algorithmically efficient than a join-then-filter). To do the ETL I tend to do for my data science workin pandas/polars in SQL/DuckDB, it would require chains of CTEs or other shenanigans, which eliminates similicity and efficincy.

By @wodenokoto - 6 months

> Note that DuckDB automatically figured out how to parse the date column.

It kinda did and it kinda didn't. Author got lucky that Transaction.csv contained a date where the day was after the 12th in a given month. Had there not been such a date, DuckDB would have gotten the dates wrong and read it as dd/mm/yyyy.

I think a warning from DuckDB would have been in order.

By @knowsuchagency - 6 months

Why not both? https://ibis-project.org/

By @wanderingmind - 6 months

My biggest issue with DuckDB is its not willing to implement edits to blob storages which allow edits (Azure). Having common object/blob storages that can be interacted and operated by multiple process will make it much more amenable to many data science driven workflows.

By @jgalt212 - 6 months

At what database size does it make sense to move from SQLite to DuckDB? My use case is off-line data analysis, not query / response web app.

By @pietz - 6 months

I don't understand the purpose of this post. "I write a lot of X so I prefer using X over Y." Great.

By @xiaodai - 6 months

lack of UDF is an issue

DuckDB over Pandas/Polars