Non-elementary group-by aggregations in Polars vs pandas
The article compares Polars and pandas, highlighting Polars' advanced non-elementary group-by aggregations and efficient API for complex operations, while pandas requires more complicated methods, leading to inefficiencies.
The article compares Polars and pandas, focusing on group-by operations and non-elementary aggregations. Polars has gained attention for features like lazy execution, multithreading, and efficient handling of null values, but the author highlights a less-discussed innovation: the ability to perform non-elementary group-by aggregations. The article explains how both libraries handle group-by operations, with pandas requiring more complex and less efficient methods for certain aggregations. In contrast, Polars lets users express complex aggregations cleanly through its API, which enables more efficient computation. The author argues that while pandas is a valuable tool, libraries that mimic its API may limit their own potential by ruling out more advanced operations. The conclusion emphasizes the importance of innovative API design in enabling new possibilities for data manipulation.
- Polars offers advanced non-elementary group-by aggregations not easily achievable in pandas.
- The Polars API allows for clean expression of complex aggregations, enhancing efficiency.
- Pandas requires more complicated methods for certain operations, often leading to inefficiencies.
- The article advocates for innovative API designs in data manipulation libraries.
- Users appreciate Polars for both its speed and its intuitive syntax.
Related
Announcing Polars 1.0 (Blog Post)
Polars releases Python version 1.0 after 4 years, gaining popularity with 27.5K GitHub stars and 7M monthly downloads. Plans include improving performance, GPU acceleration, Polars Cloud, and new features.
Why Polars rewrote its Arrow string data type
Polars has refactored its string data structure for improved performance, implementing a new storage method inspired by Hyper/Umbra, allowing inline storage of small strings and enhancing filter operation efficiency.
Rust for the small things? but what about Python?
The article compares Rust and Python for data engineering, highlighting Python's integration with LLMs and tools like Polars, while noting Rust's speed and safety but greater complexity.
GPU Acceleration with Polars
Polars has introduced GPU acceleration with NVIDIA RAPIDS, offering up to 13 times faster performance for compute-bound queries in Python, while maintaining existing API semantics and fallback to CPU execution.
DuckDB over Pandas/Polars
Paul Gross prefers DuckDB for data analysis over Polars and Pandas, citing its intuitive SQL syntax, ease of use for data manipulation, and automatic date parsing as significant advantages.
- Many users appreciate Polars for its efficiency and user-friendly API, especially compared to pandas.
- Some users express frustration with pandas after using other frameworks like Apache Spark, finding Polars more intuitive.
- There is a recognition of the importance of competition in the data manipulation space, with Polars pushing pandas to improve.
- Users are considering transitioning from pandas to Polars, citing performance benefits, particularly for large datasets.
- Concerns are raised about the appropriateness of using dataframes versus simpler Python classes for certain tasks.
DuckDB even goes so far as to include a clone of the PySpark DataFrame API, so somebody there must like it too.
And I say this as someone who makes much of their living from Pandas.
On a tangent, I wonder what this op would look like in SQL? It would probably need support for filtering in a window function, which I'm not sure is standardized.
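One possible SQL formulation, sketched here with Python's built-in `sqlite3` module, sidesteps filtering inside the window function: a CTE attaches each group's mean with `AVG(...) OVER (PARTITION BY ...)`, and an ordinary `WHERE` plus `GROUP BY` does the rest. The table name `df`, the column names, and the sample rows are assumptions for illustration, and the query needs an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

rows = [(1, 4, 3), (1, 1, 1), (1, 2, 2), (2, 7, 8), (2, 6, 6), (2, 7, 7)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (id INTEGER, sales INTEGER, views INTEGER)")
conn.executemany("INSERT INTO df VALUES (?, ?, ?)", rows)

query = """
WITH with_mean AS (
    -- attach the per-group mean of sales to every row
    SELECT id, sales, views,
           AVG(sales) OVER (PARTITION BY id) AS mean_sales
    FROM df
)
SELECT id, MAX(views) AS views
FROM with_mean
WHERE sales > mean_sales
GROUP BY id
ORDER BY id
"""
result = conn.execute(query).fetchall()
print(result)  # [(1, 3), (2, 8)]
conn.close()
```

Note that a group with no qualifying rows disappears from this result entirely, so the query is not a drop-in equivalent of an aggregation that returns a null for such groups.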
Awesome! Didn't expect such a vast difference in usability at first.
import numpy as np

def scatter_mean(index, value):
    # Per-group mean: accumulate sums and counts at each group index.
    sums = np.zeros(max(index)+1)
    counts = np.zeros(max(index)+1)
    for i in range(len(index)):
        j = index[i]
        sums[j] += value[i]
        counts[j] += 1
    return sums / counts

def scatter_max(index, value):
    # Per-group maximum, starting from -inf for every group.
    maxs = -np.inf * np.ones(max(index)+1)
    for i in range(len(index)):
        j = index[i]
        maxs[j] = max(maxs[j], value[i])
    return maxs

def scatter_count(index):
    # Number of rows in each group.
    counts = np.zeros(max(index)+1, dtype=np.int32)
    for i in range(len(index)):
        counts[index[i]] += 1
    return counts

id = np.array([1, 1, 1, 2, 2, 2]) - 1
sales = np.array([4, 1, 2, 7, 6, 7])
views = np.array([3, 1, 2, 8, 6, 7])

# repeat() expands the group means back to row length; this relies on rows being sorted by id.
means = scatter_mean(id, sales).repeat(scatter_count(id))
# Prints one maximum over all qualifying rows, not one per group.
print(views[sales > means].max())
Obviously you'd need good implementations of the scatter operations, not these naive Python for-loops. But once you have them, the solution is a pretty readable two-liner.
Even better is using tools like Narwhals and Ibis, which can convert back and forth to any frames you want.
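For the scatter operations in the NumPy comment above, a loop-free version might look like the sketch below. It uses `np.bincount` for sums and counts and `np.maximum.at` for the per-group maximum; the `n_groups` parameter and the variable name `group` are introduced here for illustration. Indexing the means with `group` (rather than `repeat`) is a swapped-in choice so that rows do not need to be sorted by group.

```python
import numpy as np

def scatter_mean(index, value):
    # bincount with weights sums `value` per group; without weights it counts rows.
    # A group id with no rows would give 0/0 here, so ids are assumed to be dense.
    return np.bincount(index, weights=value) / np.bincount(index)

def scatter_max(index, value, n_groups):
    # Start every group at -inf, then take an unbuffered elementwise maximum.
    out = np.full(n_groups, -np.inf)
    np.maximum.at(out, index, value)
    return out

group = np.array([1, 1, 1, 2, 2, 2]) - 1  # zero-based group ids
sales = np.array([4, 1, 2, 7, 6, 7])
views = np.array([3, 1, 2, 8, 6, 7])

means = scatter_mean(group, sales)[group]  # broadcast group means back to rows
mask = sales > means
per_group_max = scatter_max(group[mask], views[mask], n_groups=group.max() + 1)
print(per_group_max)  # [3. 8.]
```

Unlike the single `.max()` in the comment's last line, this returns one value per group, which matches the group-by result discussed in the article summary.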
Reason: I can speed things up fairly easily with Cython functions, and do multithreading using the Python module. With Polars I would have to learn Rust for that.
I took some data science classes in grad school, but basically haven't had any reason to touch pandas since I graduated. I did like the ecosystem of tools, learning materials, and other libraries surrounding it when I was working with it, though. I recently started a new project and am quickly going through my old notes to refamiliarize myself with pandas, but maybe I should just go and learn Polars?