FireDucks: Pandas but 100x Faster
FireDucks, launched by NEC Corporation in October 2023, enhances data manipulation in Python, claiming to be 50 times faster than Pandas and outperforming Polars, requiring no code changes for integration.
FireDucks is a new library launched in October 2023 by a team from NEC Corporation, designed to enhance the performance of data manipulation in Python, particularly for users familiar with the Pandas library. The library claims to be significantly faster than both Pandas and Polars, with benchmarks indicating it is, on average, 50 times faster than Pandas and even outperforms Polars in certain tests. FireDucks allows users to integrate it into their existing Pandas code without any modifications, providing a seamless transition to improved performance. The author, who has extensive experience in finance data analysis, highlights the challenges of rewriting a large codebase in Polars but finds FireDucks to be a compelling solution due to its speed and compatibility. The benchmarks conducted by the author show impressive results, with FireDucks achieving speed improvements of 130x and 200x in specific operations compared to Pandas. The library aims to address common criticisms of Python's performance by leveraging its C engine, demonstrating that optimized Python can be efficient for serious workloads.
- FireDucks was launched by NEC Corporation and claims to be 50x faster than Pandas.
- It requires no changes to existing Pandas code for integration.
- Benchmarks show FireDucks outperforming both Pandas and Polars in various operations.
- The library is designed for users who need high performance in data manipulation tasks.
- It emphasizes the potential of optimized Python for handling large datasets efficiently.
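Speedup figures like the ones summarized above are ratios of wall-clock times, so they can be checked on one's own workload. The following is a minimal timing sketch using only the standard library; `slow_sum` and `fast_sum` are hypothetical stand-ins for "the same operation under two libraries" and are not taken from the article's benchmark. When timing a library that evaluates lazily, the timed function would need to force the result to be materialized, otherwise only the cost of building the query is measured.

```python
import statistics
import time


def median_seconds(func, repeats=5):
    """Run func several times and return the median wall-clock time."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline, candidate, repeats=5):
    """Ratio of baseline time to candidate time (above 1 means candidate is faster)."""
    return median_seconds(baseline, repeats) / median_seconds(candidate, repeats)


# Hypothetical stand-ins for one operation implemented two ways.
def slow_sum():
    total = 0
    for i in range(200_000):
        total += i
    return total


def fast_sum():
    return sum(range(200_000))


ratio = speedup(slow_sum, fast_sum)
print(f"speedup: {ratio:.1f}x")
```

Using the median over several repeats makes the ratio less sensitive to one-off pauses than a single run would be.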
Related
Announcing Polars 1.0 (Blog Post)
Polars releases Python version 1.0 after 4 years, gaining popularity with 27.5K GitHub stars and 7M monthly downloads. Plans include improving performance, GPU acceleration, Polars Cloud, and new features.
Farewell Pandas, and thanks for all the fish
Ibis will remove its pandas and Dask backends in version 10.0, favoring DuckDB for better performance and ease of use, while still allowing pandas DataFrames for data transfer.
GPU Acceleration with Polars
Polars has introduced GPU acceleration with NVIDIA RAPIDS, offering up to 13 times faster performance for compute-bound queries in Python, while maintaining existing API semantics and fallback to CPU execution.
DuckDB over Pandas/Polars
Paul Gross prefers DuckDB for data analysis over Polars and Pandas, citing its intuitive SQL syntax, ease of use for data manipulation, and automatic date parsing as significant advantages.
Non-elementary group-by aggregations in Polars vs pandas
The article compares Polars and pandas, highlighting Polars' advanced non-elementary group-by aggregations and efficient API for complex operations, while pandas requires more complicated methods, leading to inefficiencies.
- Many users express concerns about the compatibility and potential limitations of FireDucks compared to existing libraries like Pandas and Polars.
- Some commenters appreciate the promise of speed improvements but are wary of the closed-source nature of FireDucks.
- There is a recurring theme regarding the need for a more intuitive API, with users lamenting the complexity of Pandas and the verbosity of Polars.
- Several users highlight the importance of open-source options and extensibility in data manipulation tools.
- Discussions also touch on the performance of FireDucks in comparison to other libraries, with mixed opinions on its claimed speed advantages.
> By providing the beta version of FireDucks free of charge and enabling data scientists to actually use it, NEC will work to improve its functionality while verifying its effectiveness, with the aim of commercializing it within FY2024.
In other words, it's free only to trap you.
So many foot guns, poorly thought through functions, 10s of keyword arguments instead of good abstractions, 1d and 2d structures being totally different objects (and no higher-order structures). I'd take 50% of the speed for a better API.
I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).
To be clear, this library might be great, it's just a shame for me that there seems no effort to make a Pandas-like thing with better API. Maybe time to roll up my sleeves...
Polars rocked my world by having a sane API, not by being fast. I can see the value in this approach if, like the author, you have a large amount of pandas code you don't want to rewrite, but personally I'm extremely glad to be leaving the pandas API behind.
https://fireducks-dev.github.io/files/20241003_PyConZA.pdf
The main reasons are
* multithreading
* rewriting base pandas functions like dropna in C++
* in-built compiler to remove unused code
Pretty impressive, especially given that you just write `import fireducks.pandas as pd` instead of `import pandas as pd` and you are good to go.
However, I think if you are using a pandas function that wasn't rewritten, you might not see the speedups.
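The import swap described above can be written so that the same script still runs where FireDucks is not installed. This is a small sketch, not something from the FireDucks documentation: `load_first_available` is a hypothetical helper, and the only library-specific detail it relies on is the `fireducks.pandas` module name quoted in the comment.

```python
import importlib


def load_first_available(module_names):
    """Return the first module in module_names that imports, or None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Prefer the pandas-compatible FireDucks module; fall back to stock pandas.
pd = load_first_available(("fireducks.pandas", "pandas"))

if pd is not None:
    # Ordinary pandas-style code, unchanged regardless of which backend loaded.
    df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
    totals = df.groupby("key")["value"].sum()
    print(totals)
```

Keeping the backend choice in one place also makes it easy to switch back to plain pandas when comparing results between the two.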
> Future Plans By providing the beta version of FireDucks free of charge and enabling data scientists to actually use it, NEC will work to improve its functionality while verifying its effectiveness, with the aim of commercializing it within FY2024.
Where can I find the code? I don't see it on GitHub.
> contact@fireducks.jp.nec.com
So it's from NEC (a major Japanese computer company), presumably a research artifact?
> https://fireducks-dev.github.io/docs/about-us/ Looks like so.
I've had the chance to play with it on some of my code; queries that ran in 8+ minutes came down to 20 seconds.
Re-writing in Polars involves more code changes.
However, with Pandas 2.2+ and arrow, you can use .pipe to move data to Polars, run the slow computation there, and then zero copy back to Pandas. Like so...
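```
(df
 # slow part
 .groupby(...)
 .agg(...)
)
```
to:
```
import polars as pl

def polars_agg(df):
    # convert to Polars, aggregate there, then hand back a pandas DataFrame
    return (pl.from_pandas(df)
            .group_by(...)
            .agg(...)
            .to_pandas()
    )

(df
 .pipe(polars_agg)
)
```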
I wonder how much of this is fundamental to the common approach of writing libraries in Python with the processing-heavy parts delegated to C/C++ -- that the expressive parts cannot be fast and the fast parts cannot be expressive. Also, whether Rust (for polars, and other newer generation of libraries) changes this tradeoff substantially enough.
Is it actually? Do people see that level of compatibility in practice?
We found `numpy` and `jax` to be a good trade-off between "too high level to optimize" and "too low level to understand". Therefore in our hedge fund we just build data structures and helper functions on top of them. The downside of the above combination is on sparse data, for which we call wrapped c++/rust code in python.
My easy guess is that compared to pandas, it's multi-threaded by default, which makes for an easy perf win. But even then, 130-200x feels extreme for a simple sum/mean benchmark. I see they are also doing lazy evaluation and some MLIR/LLVM based JIT work, which is probably enough to get an edge over polars; though its wins over DuckDB _and_ Clickhouse are also surprising out of nowhere.
Also, I thought one of the reasons for Polars's API was that Pandas API is way harder to retrofit lazy evaluation to, so I'm curious how they did that.
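To make the lazy-evaluation idea in this comment concrete, here is a toy sketch in plain Python. It is not how FireDucks or Polars work internally; `LazyColumns` and its methods are hypothetical names for illustration. It only shows the general principle: operations are recorded instead of executed, and work whose result is never requested is never run.

```python
class LazyColumns:
    """Toy lazy table: records column definitions and computes only on demand."""

    def __init__(self, data):
        self._data = data      # column name -> list of values
        self._derived = {}     # column name -> (function, input column names)
        self.computed = []     # log of derived columns that were actually evaluated

    def assign(self, name, func, *inputs):
        # Record the operation instead of running it.
        self._derived[name] = (func, inputs)
        return self

    def collect(self, name):
        # Evaluate only the columns that `name` depends on.
        if name in self._derived:
            func, inputs = self._derived[name]
            args = [self.collect(column) for column in inputs]
            self.computed.append(name)
            return [func(*row) for row in zip(*args)]
        return self._data[name]


table = (
    LazyColumns({"price": [10, 20, 30], "qty": [1, 2, 3]})
    .assign("revenue", lambda p, q: p * q, "price", "qty")
    .assign("unused", lambda p: p ** 2, "price")
)

result = table.collect("revenue")
print(result)          # [10, 40, 90]
print(table.computed)  # ['revenue']  (the "unused" column was never computed)
```

An eager API runs each statement as it is written, so the hard part of retrofitting this onto a pandas-style interface is deciding when a recorded result must finally be materialized.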
I wrote a nice article about chaining for Ponder. (Sadly, it looks like the Snowflake acquisition has removed that. My book, Effective Pandas 2, goes deep into my best practices.)
EDIT: I've found some benchmarks https://fireducks-dev.github.io/docs/benchmarks/
It would be nice to know what the internals of FireDucks are.
```
>>> df['year'].dtype == np.dtype('int32')
True
```
The promise of a 100x speedup with 0 changes to your codebase is pretty huge, but even a few correctness / incompatibility issues would probably make it a no-go for a bunch of potential users.
I haven’t seen that in other systems like Polars, but maybe I’m wrong.
edit: I know pandas uses numpy under the hood, but "raw" numpy is typically faster (and more flexible), so curious as to why it's not mentioned
Q: Why do ducks have big flat feet?
A: So they can stomp out forest fires.
Q: Why do elephants have big flat feet?
A: So they can stomp out flaming ducks.