December 14th, 2024

Should you ditch Spark for DuckDB or Polars?

DuckDB and Polars are emerging as alternatives to Apache Spark for small workloads, outperforming it in smaller configurations, while Spark excels in larger setups, where it maintains strong performance and cost-effectiveness.

The article discusses the growing interest in single-machine compute engines like DuckDB and Polars as alternatives to Apache Spark for small and medium workloads. The author evaluates these engines through a benchmark suite that assesses performance, execution cost, development cost, and engine maturity. The benchmarks include various operations typical in data engineering, such as reading Parquet files, creating fact tables, and performing ad-hoc queries. Results indicate that DuckDB often outperforms Spark in smaller configurations, while Spark excels in larger setups, particularly at 100GB scale. Polars struggles with memory management, leading to out-of-memory errors in some tests. The analysis also highlights the cost-effectiveness of DuckDB and Polars compared to Spark, especially at lower scales. However, Spark's performance gains at higher core allocations suggest it remains a strong contender for larger workloads. The article concludes that while DuckDB and Polars offer compelling advantages, particularly in development agility and cost, Spark's established capabilities and performance in larger environments cannot be overlooked.
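
To make the comparison concrete, here is a minimal sketch of the kind of ad-hoc query the benchmarks exercise, once in PySpark and once in DuckDB; the file and column names are invented for illustration and are not taken from the benchmark suite.

    # PySpark version of a simple aggregation over a Parquet file.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("benchmark-sketch").getOrCreate()
    sales = spark.read.parquet("sales.parquet")   # hypothetical file
    (sales.groupBy("region")
          .agg(F.sum("amount").alias("total_amount"))
          .show())

    # DuckDB version: the same query as plain SQL directly over the file.
    import duckdb

    duckdb.sql(
        "SELECT region, SUM(amount) AS total_amount "
        "FROM 'sales.parquet' GROUP BY region"
    ).show()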

- DuckDB and Polars are gaining popularity as alternatives to Spark for smaller workloads.

- DuckDB generally outperforms Spark in smaller configurations, while Spark excels in larger setups.

- Polars faces memory management issues, leading to out-of-memory errors in benchmarks.

- DuckDB and Polars are more cost-effective than Spark, especially at lower scales.

- Spark remains a strong option for larger workloads due to its performance gains with higher core allocations.

25 comments
By @buremba - 4 months
Great post, but it seems like you still rely on Fabric to run Spark NEE. If you're on AWS or GCP, you should probably not ditch Spark but combine both. DuckDB's gotcha is that it can't scale horizontally (multi-node), unlike Databricks. A single node can get you far: you can rent 2TB of memory + 20TB of NVMe in AWS. If you use PySpark, you can run DuckDB through its Spark integration (https://duckdb.org/docs/api/python/spark_api.html) until it no longer scales, then switch to Databricks if you need to scale out. That way, you get the best of both worlds.
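
A minimal sketch of the swap-in approach this comment describes, using DuckDB's experimental PySpark-compatible API from the linked page; the data is made up, and the supported API surface is experimental and still evolving.

    import pandas as pd

    # DuckDB's experimental Spark-compatible API: the code shape matches
    # PySpark, so moving back to a real Spark cluster is largely an import
    # change.
    from duckdb.experimental.spark.sql import SparkSession
    from duckdb.experimental.spark.sql.functions import col, lit

    spark = SparkSession.builder.getOrCreate()

    # Toy data; in practice this would be your real tables.
    pdf = pd.DataFrame({"name": ["Joan", "Peter"], "age": [34, 45]})
    df = spark.createDataFrame(pdf)

    rows = (df.withColumn("location", lit("Seattle"))
              .select(col("name"), col("location"))
              .collect())
    print(rows)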

DuckDB on AWS EC2 offers roughly 10x the price-performance of Databricks and Snowflake when using its native file format, so it's a better deal if you're not processing petabyte-level data. That's unsurprising, given that DuckDB operates on a single node (no need for distributed shuffles) and works primarily with NVMe (no use of object stores such as S3 for intermediate data). Thus, it can optimize the workloads much better than the other data warehouses.

If you use SQL, another gotcha is that DuckDB doesn't have the advanced catalog features of cloud data warehouses. Still, it's possible to combine DuckDB compute with Snowflake Horizon / Databricks Unity Catalog thanks to Apache Iceberg, which enables multi-engine support on the same catalog. I'm experimenting with this multi-stack idea with DuckDB <> Snowflake, and it works well so far: https://github.com/buremba/universql
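
A rough sketch of the multi-engine idea: reading a shared Iceberg table from DuckDB via its iceberg extension. The table location is a placeholder, and the S3/httpfs credential setup is omitted.

    import duckdb

    con = duckdb.connect()
    # The iceberg extension lets DuckDB read Iceberg tables written by other
    # engines registered in a shared catalog.
    con.execute("INSTALL iceberg")
    con.execute("LOAD iceberg")

    con.sql("""
        SELECT count(*) AS row_count
        FROM iceberg_scan('s3://my-bucket/warehouse/orders')
    """).show()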

By @kianN - 4 months
I went through this trade-off at my last job. I started off migrating my ad-hoc queries to DuckDB directly from Delta tables. Over time, I used DuckDB enough to do some performance tuning. I found that migrating from Delta to DuckDB's native file format provided substantial speed wins.

The author focuses on read/write performance on Delta (which makes sense for the scope of the comparison). I think if an engineer is considering switching from Spark to DuckDB/Polars for their data warehouse, they would likely be open to data formats other than Delta, which is tightly coupled to Spark (and even more so to the closed-source Databricks implementation). In my use case, we saw enough speed wins and cost savings that it made sense to fully migrate our data warehouse to a self-managed DuckDB warehouse using DuckDB's native file format.
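
A sketch of what such a one-off migration could look like, assuming DuckDB's delta extension and placeholder paths; this is illustrative, not the commenter's actual pipeline.

    import duckdb

    # Copy a Delta table into a persistent DuckDB database file (DuckDB's
    # native format). Paths and table names are placeholders.
    con = duckdb.connect("warehouse.duckdb")
    con.execute("INSTALL delta")
    con.execute("LOAD delta")
    con.execute("""
        CREATE TABLE events AS
        SELECT * FROM delta_scan('/data/delta/events')
    """)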

By @serjester - 4 months
Polars is much more useful if you’re doing complex transformations instead of basic ETL.

Something underappreciated about Polars is how easy it is to build a plugin. I recently took a Rust crate that reimplements the H3 geospatial coordinate system, exposed it as a Polars plugin, and achieved performance 5x faster than the DuckDB version. [1]

Knowing zero Rust, and with some help from AI, it only took me 2-ish days; I can't imagine doing this in C++ (DuckDB).

[1] https://github.com/Filimoa/polars-h3
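
For illustration, here is roughly what the Python half of a Polars expression plugin looks like. The function name and signature are hypothetical and are not the actual polars-h3 API; the compiled Rust extension (built with pyo3-polars) is assumed to ship alongside this module.

    from pathlib import Path

    import polars as pl
    from polars.plugins import register_plugin_function

    # The heavy lifting lives in a compiled Rust extension next to this
    # module; 'latlng_to_cell' is a hypothetical exported function name.
    PLUGIN_PATH = Path(__file__).parent


    def latlng_to_cell(lat: pl.Expr, lng: pl.Expr, resolution: int) -> pl.Expr:
        return register_plugin_function(
            plugin_path=PLUGIN_PATH,
            function_name="latlng_to_cell",
            args=[lat, lng],
            kwargs={"resolution": resolution},
            is_elementwise=True,
        )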

By @moandcompany - 4 months
My opinion: the high prevalence of implementations using Spark, Pandas, etc. is mostly driven by (1) people's tendency to work with tools that use APIs they are already familiar with, (2) resume-driven development, and/or, to a much lesser degree, (3) sustainability with regard to future maintainers, versus what may be technically sensible with regard to performance. A decade ago we saw similar articles referencing misapplications of Hadoop/MapReduce, and today it is Spark as its successor.

Pandas' use of the dataframe concept and APIs was informed by R and a desire to provide something familiar and accessible to R users (i.e. ease of user adoption).

Likewise, when the Spark development community somewhere around the version 0.11 days began implementing the dataframe abstraction over its original native RDD abstractions, it understood the need to provide a robust Python API similar to the Pandas APIs for accessibility (i.e. ease of user adoption).

At some point those familiar APIs also became a burden, or were not-great to begin with, in several ways and we see new tools emerge like DuckDB and Polars.

However, we now have a non-unique issue where people are learning and applying specific tools versus general problem-solving skills and tradecraft in the related domain (i.e. the common pattern of people with hammers seeing everything as nails). Note all of the "learn these -n- tools/packages to become a great ____ engineer and make xyz dollars" type tutorials and starter-packs on the internet today.

By @pnut - 4 months
Maybe this says more about the company I work at, but it is just incomprehensible to me to casually contemplate dumping a generally comparable, installed production capability.

All I can think of when I read this is standing up new environments, observability, dev/QA training, change control, data migration, mitigating risks to business continuity, integrating with data sources and sinks, and on and on...

I've got enough headaches already without another one of those projects.

By @RobinL - 4 months
I submitted this because I thought it was a good, high-effort post, but I must admit I was surprised by the conclusion. In my experience, admittedly on different workloads, DuckDB is both faster and easier to use than Spark, and requires significantly less tuning and less complex infrastructure. I've been trying to transition as much as possible over to DuckDB.

There are also some interesting points in the following podcast about ease of use and transactional capabilities of duckdb which are easy to overlook (you can skip the first 10 mins): https://open.spotify.com/episode/7zBdJurLfWBilCi6DQ2eYb

Of course, if you have truly massive data, you probably still need Spark.

By @rapatel0 - 4 months
The author disparages Ibis, but I really think this is shortsighted. Ibis does a great job of mixing SQL with dataframes to perform complex queries; it abstracts away a lot of the underlying logic and allows for query optimization.

Example:

    df = (
        df.mutate(new_column=df.old_column.dosomething())
          .alias('temp_table')
          .sql('SELECT db_only_function(new_column) AS newer_column FROM temp_table')
          .mutate(other_new_column=newer_column.do_other_stuff())
    )

It's super flexible, and DuckDB makes it very performant. The general vice I run into is creating overly complex transforms, but otherwise it's super useful and really easy to mix dataframes and SQL. Finally, it supports pretty much every backend, including PySpark and Polars.
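
A small, hypothetical sketch of this mix in practice, running Ibis expressions on the DuckDB backend; the file and column names are invented.

    import ibis
    from ibis import _

    con = ibis.duckdb.connect()                 # in-memory DuckDB backend
    orders = con.read_parquet("orders.parquet") # placeholder file

    summary = (
        orders.mutate(net=_.amount - _.discount)
              .group_by("region")
              .aggregate(total_net=_.net.sum())
    )
    # The expression compiles to SQL and executes inside DuckDB.
    print(summary.execute())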

By @memhole - 4 months
Nice write-up. I don't think the comments about DuckDB spilling to disk are correct: I believe if you create a temp or persistent database, DuckDB will spill to disk.

I might have missed it, but the integration of DuckDB and the Arrow library makes mixing and matching dataframes and SQL syntax fairly seamless.

I’m convinced the simplicity of duckdb is worth a performance penalty compared to spark for most workloads. Ime, people struggle with fully utilizing spark.

By @pm90 - 4 months
> My name is Miles, I’m a Principal Program Manager at Microsoft. While a Spark specialist by role

JFYI. I think the article itself is pretty unbiased, but I feel like it's worth putting this disclaimer for the author.

By @steveBK123 - 4 months
I do wonder if some new tech adoption will actually be slowed by the prevalence of LLM-assisted coding.

That is, all these code assistants are going to be 10x as useful on Spark/Pandas as they would be on DuckDB/Polars, due to the age of the former and the continued rate of change of the latter.

By @sroerick - 4 months
I like Spark a lot. When I was doing a lot of data work, Databricks was a great tool for a team that wasn't elbow-deep into data engineering to be able to get a lot of stuff done.

DuckDB is fantastic, though. I've never really built big data streaming setups, so I can accomplish pretty much anything I've needed with DuckDB. I'm not sure about building full data pipelines with it. Any time I've tried, it feels a little "duck-tapey", but the ecosystem has matured tremendously in the past couple of years.

Polars never got a lot of love from me, though I love what they’re doing. I used to do a lot of work in pandas and python but I kind of moved onto greener pastures. I really just prefer any kind of ETL work in SQL.

Compute was kind of always secondary to developer experience for me. It kills me to say this but my favorite tool for data exploration is still PowerBI. If I have a strange CSV it’s tough to beat dragging it into a BI tool and exploring it that way. I’d love something like DuckDB Harlequin but for BI / Data Visualization. I don’t really love all the SaaS BI platforms I’ve explored. I did really like Plotly.

Totally open to hearing other folks’ experiences or suggestions. Nothing here is an indictment of any particular tool, just my own ADD and the pressure of needing to ship.

By @nicornk - 4 months
The blog post echoes my experience that DuckDB just works (due to superior disk-spilling capabilities) and Polars OOMs a lot.

By @ibgeek - 4 months
Good write up. The only real bias I can detect is that the author seems to conflate their (lack of) familiarity with ease of use. I bet if they spent a few months using DuckDB and Polars on a daily basis, they might find some of the tasks just as easy or easier to implement.
By @ZeroCool2u - 4 months
Would love to see Lake Sail overtake Spark, so we could generally dodge tuning the JVM for big Spark jobs.

https://docs.lakesail.com/sail/latest/

By @shcheklein - 4 months
Another alternative to consider is https://www.getdaft.io/ . AFAIU it is a more direct competitor to Spark (distributed mode).
By @Epa095 - 4 months
One thing that's not clear to me about the "use DuckDB instead" proposals is how to orchestrate the batches. In Databricks/Spark there are two components:

- Auto Loader / cloudFiles: it can attach to blob storage of e.g. CSV or JSON files and deliver them as batches. As new files come in, you get batches containing only the new files.

- Structured Streaming and its checkpoints: it keeps track across runs of how far into the source it has read (including cloudFiles sources), and it's easy to either continue the job with only the new data or delete the checkpoint and rebuild everything.

How can you do something similar with DuckDB if you have, e.g., a growing blob store of CSV/Avro/JSON files? Just read everything every day? Create some homegrown setup?

I guess what I describe above is independent of the actual compute library; you could use any transformation library to process the actual batches (and with foreachBatch you can actually use DuckDB in Spark like this).
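
One possible homegrown setup, sketched with placeholder paths and an assumed 'events' table whose schema matches the incoming files: keep a small bookkeeping table of already-ingested files and only load the new ones.

    import glob

    import duckdb

    con = duckdb.connect("pipeline.duckdb")
    # Bookkeeping table plays the role of a Structured Streaming checkpoint.
    con.execute("CREATE TABLE IF NOT EXISTS _ingested_files (path VARCHAR PRIMARY KEY)")
    # Placeholder target schema; adjust to match the JSON files.
    con.execute("CREATE TABLE IF NOT EXISTS events (event_id BIGINT, payload VARCHAR)")

    seen = {row[0] for row in con.execute("SELECT path FROM _ingested_files").fetchall()}

    for path in sorted(glob.glob("/landing/events/*.json")):
        if path in seen:
            continue
        # Sketch only: real code should handle quoting/sanitizing the path.
        con.execute(f"INSERT INTO events SELECT * FROM read_json_auto('{path}')")
        con.execute("INSERT INTO _ingested_files VALUES (?)", [path])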

By @adsharma - 4 months
I'm a bit confused by the claim that DuckDB doesn't support dataframes.

This blog post suggests dataframe support has been there since 2021, which matches my experience.

https://duckdb.org/2021/05/14/sql-on-pandas.html
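
For reference, a minimal example of the dataframe interoperability that post describes: DuckDB querying an in-scope pandas DataFrame by name and handing the result back as a DataFrame.

    import duckdb
    import pandas as pd

    # DuckDB resolves 'df' via a replacement scan over the local variable.
    df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "temp": [3.1, 22.4, 4.0]})

    avg = duckdb.sql("SELECT city, avg(temp) AS avg_temp FROM df GROUP BY city").df()
    print(avg)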

By @craydandy - 4 months
Interesting and well-written article; thanks to the author for writing it. Replacing Spark with these single-machine tools seems to be the current hype, and Spark is not en vogue anymore.

The author ran Spark in Fabric, which has V-Order write enabled by default. DuckDB and Polars don't have this, as it's a Microsoft-proprietary algorithm. V-Order adds about 15% overhead on writes, so it does change the result a bit.

The data sizes were a bit on the large side, at least for the data volumes I see daily. There definitely are tables in the 10GB, 100GB, and even 1TB range, but most tables traveling through data pipelines are much smaller.

By @ritchie46 - 4 months
I do think code should be shared when you are benchmarking. He could be using Polars' eager API, for instance, which would not be an apples-to-apples comparison.
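
For readers unfamiliar with the distinction, a small sketch of the two APIs with invented file and column names; only the lazy path gets Polars' query optimizations.

    import polars as pl

    # Eager: reads the whole file into memory up front, then filters/aggregates.
    eager = (
        pl.read_parquet("sales.parquet")
          .filter(pl.col("amount") > 100)
          .group_by("region")
          .agg(pl.col("amount").sum())
    )

    # Lazy: builds a query plan first, so optimizations such as predicate and
    # projection pushdown apply; benchmarking only the eager path would
    # understate Polars.
    lazy = (
        pl.scan_parquet("sales.parquet")
          .filter(pl.col("amount") > 100)
          .group_by("region")
          .agg(pl.col("amount").sum())
          .collect()
    )
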
By @OutOfHere - 4 months
Isn't Spark extremely memory inefficient due to the use of Java?
By @tessierashpool9 - 4 months
What is the current ballpark maximum amount of data one can realistically handle on a single-machine setup on AWS / GCP / Azure?

Realistically means keeping in mind that the processing itself also requires memory, as do prerequisites like indexes, which also need to be kept in memory.

Maximum memory at AWS would be 1.5TB using r8g.metal-48xl. So, assuming 50% is usable for the raw data, about 750GB is realistic.

By @jimmyl02 - 4 months
Pretty new to the large-scale data processing space, so not sure if this is a known question, but isn't the purpose of Spark that it can be distributed across many workers and parallelized?

I guess the scale of data here (~100GB) is manageable with something like DuckDB, but once data gets past a certain scale, wouldn't single-machine performance have no way of matching a distributed Spark cluster?

By @code51 - 4 months
Why is Apache DataFusion not there as an alternative?
By @downrightmike - 4 months
Or use https://lancedb.com/. LanceDB is a developer-friendly, open-source database for AI. From hyper-scalable vector search and advanced retrieval for RAG, to streaming training data and interactive exploration of large-scale AI datasets, LanceDB is the best foundation for your AI application.

Recent podcast https://talkpython.fm/episodes/show/488/multimodal-data-with...