November 15th, 2024

Non-elementary group-by aggregations in Polars vs pandas

The article compares Polars and pandas, highlighting Polars' advanced non-elementary group-by aggregations and efficient API for complex operations, while pandas requires more complicated methods, leading to inefficiencies.

The article discusses the differences between Polars and pandas, particularly focusing on group-by operations and non-elementary aggregations. While Polars has gained attention for its features like lazy execution, multithreading, and efficient handling of null values, the author highlights a less-discussed innovation: the ability to perform non-elementary group-by aggregations. The article explains how both libraries handle group-by operations, with pandas requiring more complex and less efficient methods for certain aggregations. In contrast, Polars allows users to express complex aggregations cleanly using its API, enabling more efficient computations. The author argues that while pandas is a valuable tool, libraries that mimic its API may limit their potential by not allowing for more advanced operations. The conclusion emphasizes the importance of innovative API design in enabling new possibilities for data manipulation.
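
As a rough illustration of the kind of aggregation the article is describing, the sketch below uses toy columns (id, sales, views, echoing the numpy example in the comments) and asks, per group, for the maximum of views among rows whose sales exceed that group's mean. The code is illustrative rather than taken from the article: Polars can express the whole thing as a single expression, while pandas typically drops down to groupby(...).apply.

  import polars as pl

  df = pl.DataFrame({
      "id":    [1, 1, 1, 2, 2, 2],
      "sales": [4, 1, 2, 7, 6, 7],
      "views": [3, 1, 2, 8, 6, 7],
  })

  # Non-elementary aggregation: per group, max of views restricted to
  # rows where sales is above that group's mean.
  out_polars = df.group_by("id").agg(
      pl.col("views").filter(pl.col("sales") > pl.col("sales").mean()).max()
  )

  # A pandas counterpart usually falls back to a Python-level apply.
  out_pandas = (
      df.to_pandas()
        .groupby("id")
        .apply(lambda g: g.loc[g["sales"] > g["sales"].mean(), "views"].max())
  )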

- Polars offers advanced non-elementary group-by aggregations not easily achievable in pandas.

- The Polars API allows for clean expression of complex aggregations, enhancing efficiency.

- Pandas requires more complicated methods for certain operations, often leading to inefficiencies.

- The article advocates for innovative API designs in data manipulation libraries.

- Users appreciate Polars for both its speed and its intuitive syntax.

AI: What people are saying
The comments reflect a diverse range of opinions on the use of Polars versus pandas for data manipulation and analysis.
  • Many users appreciate Polars for its efficiency and user-friendly API, especially compared to pandas.
  • Some users express frustration with pandas after using other frameworks like Apache Spark, finding Polars more intuitive.
  • There is a recognition of the importance of competition in the data manipulation space, with Polars pushing pandas to improve.
  • Users are considering transitioning from pandas to Polars, citing performance benefits, particularly for large datasets.
  • Concerns are raised about the appropriateness of using dataframes versus simpler Python classes for certain tasks.
14 comments
By @Nihilartikel - 6 days
I did non-trivial work with Apache Spark dataframes and came to appreciate them before ever being exposed to Pandas. After Spark, pandas just seemed frustrating and incomprehensible. Polars is much more like Spark and I am very happy about that.

DuckDB even goes so far as to include a clone of the PySpark dataframe API, so somebody there must like it too.

By @__mharrison__ - 6 days
Pandas sat alone in the Python ecosphere for a long time. Lack of competition is generally not a good thing. I'm thrilled to have Polars around to innovate on the API end (and push Pandas to be better).

And I say this as someone who makes much of their living from Pandas.

By @akdor1154 - 6 days
The difference is a sanely and presciently designed expression API, which is a bit more verbose in some common cases, but is more predictable and much more expressive in more complex situations like this.

On a tangent, I wonder what this op would look like in SQL? It would probably need support for filtering in a window function, which I'm not sure is standardized.
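
For what it's worth, one way this could look in SQL is to compute the per-group mean with a window function and then attach a FILTER clause to the aggregate (FILTER on plain aggregates is widely supported; whether a window-level filter is standardized is a separate question). A hypothetical DuckDB sketch over the same toy data as the numpy comment below:

  import duckdb
  import pandas as pd

  df = pd.DataFrame({
      "id":    [1, 1, 1, 2, 2, 2],
      "sales": [4, 1, 2, 7, 6, 7],
      "views": [3, 1, 2, 8, 6, 7],
  })

  # Window function for the per-group mean, then FILTER on the aggregate;
  # DuckDB can query the in-scope DataFrame df by name.
  print(duckdb.sql("""
      SELECT id,
             max(views) FILTER (WHERE sales > mean_sales) AS max_views
      FROM (SELECT *, avg(sales) OVER (PARTITION BY id) AS mean_sales FROM df)
      GROUP BY id
      ORDER BY id
  """).df())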

By @lend000 - 6 days
I've wanted to convert a massive Pandas codebase to Polars for a long time. Probably 90% of the compute time is Pandas operations, especially creating new columns / resizing dataframes (which I understand to involve less of a speed difference compared to the grouping operations mentioned in the post, but still substantial). Anyone had success doing this and found it to be worth the effort?
By @winwang - 6 days
The power of having an API that allows usage of the Free monad. And in less-funny-FP-speak, the power of allowing the user to write a program (expressions) that the sufficiently-smart backend later compiles/interprets.

Awesome! Didn't expect such a vast difference in usability at first.
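
A minimal sketch of that idea using Polars' lazy API (the file name and columns here are made up): the user only builds an expression graph, and the engine is free to optimize it before anything executes.

  import polars as pl

  # Build the program: an expression graph, nothing runs yet.
  query = (
      pl.scan_csv("sales.csv")              # hypothetical input file
        .filter(pl.col("sales") > 0)
        .group_by("id")
        .agg(pl.col("views").mean())
  )

  print(query.explain())    # the optimized plan the backend derived
  result = query.collect()  # only now is anything computed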

By @combocosmo - 5 days
I've always liked scatter solutions for this kind of problem:

  import numpy as np
  
  def scatter_mean(index, value):
      sums = np.zeros(max(index)+1)
      counts = np.zeros(max(index)+1)
      for i in range(len(index)):
          j = index[i]
          sums[j] += value[i]
          counts[j] += 1
      return sums / counts
  
  def scatter_max(index, value):
      maxs = -np.inf * np.ones(max(index)+1)
      for i in range(len(index)):
          j = index[i]
          maxs[j] = max(maxs[j], value[i])
      return maxs
  
  def scatter_count(index):
      counts = np.zeros(max(index)+1, dtype=np.int32)
      for i in range(len(index)):
          counts[index[i]] += 1
      return counts
  
  id = np.array([1, 1, 1, 2, 2, 2]) - 1
  sales = np.array([4, 1, 2, 7, 6, 7])
  views = np.array([3, 1, 2, 8, 6, 7])
  means = scatter_mean(id, sales).repeat(scatter_count(id))
  print(views[sales > means].max())
Obviously you'd need good implementations of the scatter operations, not these naive Python for-loops. But once you have them, the solution is a pretty readable two-liner.
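
As a sketch of what reasonably fast implementations might look like (this is not the commenter's code), NumPy's bincount and the unbuffered ufunc .at methods do the same scatters without Python-level loops:

  import numpy as np

  def scatter_mean(index, value):
      # grouped sums and counts are computed in C by bincount
      sums = np.bincount(index, weights=value)
      counts = np.bincount(index)
      return sums / counts

  def scatter_max(index, value):
      maxs = np.full(index.max() + 1, -np.inf)
      np.maximum.at(maxs, index, value)  # unbuffered scatter-max
      return maxs

  def scatter_count(index):
      return np.bincount(index)
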
By @Vaslo - 6 days
I've moved mostly to Polars. I still have some frameworks that demand pandas, and pandas is still a very solid dataframe library, but when I need to interpolate months in millions of lines of quarterly data, Polars just blows it away.

Even better is using tools like Narwhals and Ibis which can convert back and forth to any frames you want.

By @Def_Os - 5 days
Data point: I have a medium-complexity data transformation use case that I still prefer pandas for.

Reason: I can speed things up fairly easily with Cython functions and do multithreading using the Python module. With Polars I would have to learn Rust for that.

By @Larrikin - 6 days
If I'm doing some data science just for fun and personal projects, is there any reason to not go with Polars?

I took some data science classes in grad school, but basically haven't had any reason to touch pandas since I graduated. I did like the ecosystem of tools, learning materials, and other libraries surrounding it when I was working with it, though. I recently started a new project and am quickly going through my old notes to refamiliarize myself with pandas, but maybe I should just go and learn Polars?

By @kolja005 - 6 days
Does anyone have a good heuristic for when a dataframe library is a good tool choice? I work on a team that has a lot of data scientists and a few engineers (including myself), and I often see the data scientists using dataframes when simple Python classes would be much more appropriate, so that you have a better sense of the object you're working with. I've been having a hard time getting this idea across to people, though.
By @wismwasm - 5 days
I'm just using Ibis: https://ibis-project.org/ They provide a nice backend-agnostic API. For most backends it will just compile to SQL and act as a query builder. SQL has basically solved the problem of providing a declarative data transformation syntax, so why reinvent the wheel?
By @xgdgsc - 6 days
I'm tired of remembering all these library-invented concepts and prefer doing brainless for loops to process data in Julia.