Should you ditch Spark for DuckDB or Polars?
DuckDB and Polars are emerging as alternatives to Apache Spark for small workloads, outperforming it in smaller configurations, while Spark excels in larger setups, where it maintains strong performance and cost-effectiveness.
The article discusses the growing interest in single-machine compute engines like DuckDB and Polars as alternatives to Apache Spark for small and medium workloads. The author evaluates these engines through a benchmark suite that assesses performance, execution cost, development cost, and engine maturity. The benchmarks include various operations typical in data engineering, such as reading Parquet files, creating fact tables, and performing ad-hoc queries. Results indicate that DuckDB often outperforms Spark in smaller configurations, while Spark excels in larger setups, particularly at 100GB scale. Polars struggles with memory management, leading to out-of-memory errors in some tests. The analysis also highlights the cost-effectiveness of DuckDB and Polars compared to Spark, especially at lower scales. However, Spark's performance gains at higher core allocations suggest it remains a strong contender for larger workloads. The article concludes that while DuckDB and Polars offer compelling advantages, particularly in development agility and cost, Spark's established capabilities and performance in larger environments cannot be overlooked.
- DuckDB and Polars are gaining popularity as alternatives to Spark for smaller workloads.
- DuckDB generally outperforms Spark in smaller configurations, while Spark excels in larger setups.
- Polars faces memory management issues, leading to out-of-memory errors in benchmarks.
- DuckDB and Polars are more cost-effective than Spark, especially at lower scales.
- Spark remains a strong option for larger workloads due to its performance gains with higher core allocations.
Related
Rust for the small things? but what about Python?
The article compares Rust and Python for data engineering, highlighting Python's integration with LLMs and tools like Polars, while noting Rust's speed and safety but greater complexity.
6 Powerful Databricks Alternatives for Data Lakes and Lakehouses
Organizations are exploring alternatives to Databricks due to cost and complexity concerns. Definite is highlighted for its user-friendly, all-in-one data management platform, alongside options like Google BigQuery and Snowflake.
The sorry state of Java deserialization
The article examines Java deserialization challenges in reading large datasets, highlighting performance issues with various methods. Benchmark tests show modern tools outperform traditional ones, emphasizing the need for optimization and custom serialization.
DuckDB over Pandas/Polars
Paul Gross prefers DuckDB for data analysis over Polars and Pandas, citing its intuitive SQL syntax, ease of use for data manipulation, and automatic date parsing as significant advantages.
FireDucks: Pandas but 100x Faster
FireDucks, launched by NEC Corporation in October 2023, enhances data manipulation in Python, claiming to be 50 times faster than Pandas and outperforming Polars, requiring no code changes for integration.
DuckDB on AWS EC2 has roughly 10x the price/performance of Databricks and Snowflake when using its native file format, so it's a better deal if you're not processing petabyte-level data. That's unsurprising, given that DuckDB operates on a single node (no need for distributed shuffles) and works primarily with NVMe (no use of object stores such as S3 for intermediate data). Thus, it can optimize the workloads much better than the other data warehouses.
If you use SQL, another gotcha is that DuckDB doesn't have the advanced catalog features of cloud data warehouses. Still, it's possible to combine DuckDB compute with Snowflake Horizon / Databricks Unity Catalog thanks to Apache Iceberg, which enables multi-engine support on the same catalog. I'm experimenting with this multi-stack idea with DuckDB <> Snowflake, and it works well so far: https://github.com/buremba/universql
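For what it's worth, the DuckDB side of that setup is just the iceberg extension. A minimal sketch in Python, assuming you already know the table's storage location (the bucket path is hypothetical, and resolving it from Horizon / Unity Catalog plus S3 credentials is left out):

import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg;")
con.execute("INSTALL httpfs; LOAD httpfs;")  # needed for s3:// paths; credentials not shown

# iceberg_scan reads the table's metadata and data files directly from object storage.
print(con.sql("""
    SELECT count(*) AS n
    FROM iceberg_scan('s3://my-bucket/warehouse/db/orders')
""").fetchall())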
The author focuses on read/write performance on Delta (which makes sense for the scope of the comparison). I think if an engineer is considering switching from Spark to duckdb/polars for their data warehouse, they would likely be open to data formats other than Delta, which is tightly coupled to Spark (and even more so to the closed-source Databricks implementation). In my use case, we saw enough speed wins and cost savings that it made sense to fully migrate our data warehouse to a self-managed DuckDB warehouse using DuckDB's native file format.
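To give a sense of what that migration looks like mechanically, here is a minimal sketch, assuming DuckDB's delta extension and a hypothetical source path; a real migration would also carry over partitioning, constraints, and incremental loads:

import duckdb

# A single .duckdb file is DuckDB's native format -- tables, stats and all.
con = duckdb.connect("warehouse.duckdb")
con.execute("INSTALL delta; LOAD delta;")
con.execute("INSTALL httpfs; LOAD httpfs;")  # for s3:// paths; credentials not shown

# Materialize a Delta table into a native DuckDB table.
con.execute("""
    CREATE TABLE sales AS
    SELECT * FROM delta_scan('s3://old-lake/sales')
""")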
Something underappreciated about Polars is how easy it is to build a plugin. I recently took a Rust crate that reimplements the H3 geospatial coordinate system, exposed it as a Polars plugin, and got performance 5x faster than the DuckDB version.
Knowing zero Rust, and with some help from AI, it only took me around two days. I can't imagine doing this in C++ (DuckDB).
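For context, the Python side of a Polars expression plugin is a thin shim over the compiled Rust crate. A rough sketch, assuming the register_plugin_function helper from recent Polars releases and a hypothetical Rust function named latlng_to_cell (the Rust side, built with pyo3-polars, is omitted):

from pathlib import Path

import polars as pl
from polars.plugins import register_plugin_function

def latlng_to_cell(lat: pl.Expr, lng: pl.Expr, resolution: int) -> pl.Expr:
    # Hypothetical wrapper: the compiled shared library is expected to sit next to
    # this file after the crate is built (e.g. with maturin develop).
    return register_plugin_function(
        plugin_path=Path(__file__).parent,   # where the compiled plugin lives
        function_name="latlng_to_cell",      # name of the #[polars_expr] Rust function
        args=[lat, lng],
        kwargs={"resolution": resolution},
        is_elementwise=True,                 # lets Polars parallelize it per chunk
    )

# Usage: df.with_columns(cell=latlng_to_cell(pl.col("lat"), pl.col("lng"), 9))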
Pandas' use of dataframe concepts and APIs was informed by R and a desire to provide something familiar and accessible to R users (i.e. ease of user adoption).
Likewise, when the Spark development community, somewhere around the version 0.11 days, began implementing the dataframe abstraction over its original native RDD abstraction, it understood the need to provide a robust Python API similar to the Pandas APIs for accessibility (i.e. ease of user adoption).
At some point those familiar APIs also became a burden, or were not-great to begin with, in several ways and we see new tools emerge like DuckDB and Polars.
However, we now have a non-unique issue where people are learning and applying specific tools versus general problem-solving skills and tradecraft in the related domain (i.e. the common pattern of people with hammers seeing everything as nails). Note all of the "learn these -n- tools/packages to become a great ____ engineer and make xyz dollars" type tutorials and starter-packs on the internet today.
All I think when I read this is, standing up new environments, observability, dev/QA training, change control, data migration, mitigating risks to business continuity, integrating with data sources and sinks, and on and on...
I've got enough headaches already without another one of those projects.
There are also some interesting points in the following podcast about ease of use and transactional capabilities of duckdb which are easy to overlook (you can skip the first 10 mins): https://open.spotify.com/episode/7zBdJurLfWBilCi6DQ2eYb
Of course, if you have truly massive data, you probably still need spark
Example:
df = (
    df.mutate(new_column=df.old_column.dosomething())
      .alias('temp_table')
      .sql('SELECT db_only_function(new_column) AS newer_column FROM temp_table')
      .mutate(other_new_column=newer_column.do_other_stuff())
)

It's super flexible, and duckdb makes it very performant. The general vice I run into is creating overly complex transforms, but otherwise it's super useful and makes it really easy to mix dataframes and SQL. Finally, it supports pretty much every backend, including pyspark and polars.
I might have missed it, but the integration of duckdb and the arrow library makes mixing and matching dataframes and sql syntax fairly seamless.
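As a concrete illustration of that seamlessness, here is a minimal sketch: DuckDB's replacement scans let SQL refer to an in-scope, Arrow-backed Polars dataframe by its Python variable name, and the result can come back as Polars again (the table and column names are made up):

import duckdb
import polars as pl

orders = pl.DataFrame({
    "customer": ["a", "b", "a", "c"],
    "amount": [10.0, 20.0, 5.0, 7.5],
})

# SQL over the dataframe above: DuckDB resolves `orders` via its replacement scans
# and exchanges data with Polars through Arrow rather than copying row by row.
top_customers = duckdb.sql("""
    SELECT customer, sum(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
""").pl()

print(top_customers)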
I'm convinced the simplicity of duckdb is worth a performance penalty compared to Spark for most workloads. In my experience, people struggle to fully utilize Spark.
JFYI: I think the article itself is pretty unbiased, but I feel like it's worth adding this disclaimer for the author. That is, all these code assistants are going to be 10x as useful on spark/pandas as they would be on duckdb/polars, due to the age of the former and the continued rate of change in the latter.
DuckDB is fantastic, though. I've never really built big-data streaming systems, so I can accomplish pretty much everything I've needed with DuckDB. I'm not sure about building full data pipelines with it. Any time I've tried, it feels a little "duck-tapey", but the ecosystem has matured tremendously in the past couple of years.
Polars never got a lot of love from me, though I love what they’re doing. I used to do a lot of work in pandas and python but I kind of moved onto greener pastures. I really just prefer any kind of ETL work in SQL.
Compute was kind of always secondary to developer experience for me. It kills me to say this but my favorite tool for data exploration is still PowerBI. If I have a strange CSV it’s tough to beat dragging it into a BI tool and exploring it that way. I’d love something like DuckDB Harlequin but for BI / Data Visualization. I don’t really love all the SaaS BI platforms I’ve explored. I did really like Plotly.
Totally open to hearing other folks’ experiences or suggestions. Nothing here is an indictment of any particular tool, just my own ADD and the pressure of needing to ship.
- Auto Loader / cloud files: it can attach to a blob store of e.g. CSV or JSON files and hand them to you as batches. As new files come in, you get batches containing only the new files.
- Structured streaming and its checkpoints: it keeps track across runs of how far into the source it has read (including cloud files sources), and it's easy to either continue the job with only the new data, or delete the checkpoint and rebuild everything.
How can you do something similar with duckdb? If you have, e.g., a growing blob store of CSV/Avro/JSON files, do you just read everything every day, or create some homegrown setup (sketched below)?
I guess what I describe above is independent of the actual compute library; you could use any transformation library to do the actual batches (and with foreachBatch you can actually use duckdb in Spark like this).
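For the homegrown route, a minimal sketch of a file-level checkpoint kept in DuckDB itself (table and path names are made up; a real setup would list S3 objects via boto3/fsspec instead of a local glob and deal with schema drift):

import glob
import duckdb

con = duckdb.connect("warehouse.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS _ingested_files (path VARCHAR PRIMARY KEY)")

candidates = set(glob.glob("landing/*.csv"))
seen = {row[0] for row in con.execute("SELECT path FROM _ingested_files").fetchall()}

for path in sorted(candidates - seen):
    target_exists = con.execute(
        "SELECT count(*) FROM information_schema.tables WHERE table_name = 'events'"
    ).fetchone()[0]
    if target_exists:
        # Append only this file's rows to the target table.
        con.execute(f"INSERT INTO events SELECT * FROM read_csv_auto('{path}')")
    else:
        # First run: create the target table from the first new file.
        con.execute(f"CREATE TABLE events AS SELECT * FROM read_csv_auto('{path}')")
    # Record the file so the next run skips it -- a poor man's checkpoint.
    con.execute("INSERT INTO _ingested_files VALUES (?)", [path])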
This blog post suggests that it has been supported since 2021 and matches my experience.
The author ran Spark in Fabric, which has V-Order write enabled by default. DuckDB and Polars don't have this, as it's an MS-proprietary algorithm. V-Order adds about 15% write overhead, so it does change the results a bit.
The data sizes were a bit on the large side, at least compared to the data volumes I see daily. There definitely are tables in the 10GB, 100GB, and even 1TB range, but most tables traveling through data pipelines are much smaller.
Realistically, that means keeping in mind that the processing itself also requires memory, as do prerequisites like indexes, which also need to be kept in memory.
The maximum memory at AWS would be 1.5TB using r8g.metal-48xl, so assuming 50% is usable for the raw data, about 750GB is realistic.
I guess the scale of data here (~100GB) is manageable with something like DuckDB, but once data gets past a certain scale, wouldn't single-machine performance have no way of matching a distributed Spark cluster?
Recent podcast https://talkpython.fm/episodes/show/488/multimodal-data-with...