August 29th, 2024

Farewell Pandas, and thanks for all the fish

Ibis will remove its pandas and Dask backends in version 10.0, favoring DuckDB for better performance and ease of use, while still allowing pandas DataFrames for data transfer.

Ibis has announced the deprecation of its pandas and Dask backends, with plans to remove them in version 10.0. The decision rests on the claim that there is no feature gap between the pandas backend and the default DuckDB backend, while DuckDB performs considerably better. Pandas DataFrames will still be usable for moving data into and out of Ibis, but executing queries with pandas will no longer be supported. The pandas backend was originally included so that users could adopt Ibis without a complex installation, but its eager execution model is a poor match for Ibis's deferred execution approach. This mismatch has led to slower performance and unnecessary complexity in the codebase. In addition, pandas represents missing values with NaN, whereas Ibis uses NULL, which has been an ongoing source of problems. The DuckDB backend can query pandas DataFrames directly, supports a variety of data formats, and is easy to install and fast, which makes it the recommended choice for users.

- Ibis will remove the pandas and Dask backends in version 10.0.

- DuckDB, the default backend, is the recommended replacement because of its superior performance and ease of use.

- Users can still use pandas DataFrames for data transfer, but not for executing queries.

- The pandas backend's eager execution model has led to inefficiencies compared to Ibis's deferred model.

- The decision aims to streamline Ibis's functionality and improve user experience.
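
The eager versus deferred distinction in the summary can be illustrated with a small sketch. This is a toy, not Ibis's implementation: the `Deferred` class, its `filter` and `sum_sql` methods, and the table `t` are hypothetical names invented for illustration, and the standard library's `sqlite3` stands in for a real engine such as DuckDB. The eager path computes each step immediately in Python; the deferred path only builds a query description and hands the whole thing to the engine at the end.

```python
import sqlite3

rows = [("a", 1), ("b", 2), ("a", 3)]

# Eager style: each step runs immediately and materializes an intermediate result.
filtered = [r for r in rows if r[1] > 1]      # computed right away
eager_total = sum(v for _, v in filtered)     # computed right away


class Deferred:
    """Hypothetical toy expression; it records steps instead of running them."""

    def __init__(self, table, where=None):
        self.table = table
        self.where = where

    def filter(self, condition):
        # Nothing is executed here; we only remember the condition.
        return Deferred(self.table, condition)

    def sum_sql(self, column):
        # Compile the recorded steps into a single SQL statement.
        sql = f"SELECT SUM({column}) FROM {self.table}"
        if self.where:
            sql += f" WHERE {self.where}"
        return sql


# Deferred style: build the whole query first, then let the engine run it once.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (k TEXT, v INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", rows)

query = Deferred("t").filter("v > 1").sum_sql("v")
deferred_total = con.execute(query).fetchone()[0]

print(query)
print(eager_total, deferred_total)
```

Because the deferred version sees the full query before anything runs, the engine is free to plan and optimize it as a whole, which an eager library cannot do step by step.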

Db2 is a story worth telling, even if IBM won't

Db2, IBM's renowned database, with a history since the 1980s, faces uncertainty as IBM remains quiet about its future. Recent updates include AI features and cloud integration, but lack of communication raises concerns about its competitiveness against growing alternatives like PostgreSQL.

Db2, IBM's longstanding relational database, faces uncertainty as IBM remains tight-lipped about its future. Recent AI enhancements and a move towards a cloud-first approach contrast with IBM's vague roadmap, sparking speculation.

Memory Management in DuckDB

DuckDB optimizes query processing with effective memory management, using a streaming execution engine and disk spilling for large datasets. Its buffer manager enhances performance by caching frequently accessed data.

The Future of Kdb+

The article examines kdb+'s future in financial services, noting competition from newer technologies and suggesting KX should enhance its product and consider strategic changes to maintain relevance.

pg_duckdb: Splicing Duck and Elephant DNA

MotherDuck launched pg_duckdb, an open-source extension integrating DuckDB with Postgres to enhance analytical capabilities while maintaining transactional efficiency, supported by a consortium of companies and community contributions.

24 comments

By @mananaysiempre - 8 months

> NULL indicates a missing value, and NaN is Not a Number.

That’s actually less true than it sounds. One of the primary functions of NaN is to be the result of 0/0, so there it means that there could be a value but we don’t know what it is because we didn’t take the limit properly. One of the primary functions of NULL is to say that a tuple satisfies a predicate except we don’t know what this one position is; it’s certainly something out in the real world, we just don’t know what. These ideas are what motivate the comparison shenanigans both NaN and NULL are known for.

There’s certainly an argument to be made that the actual implementation of both of these ideas is half-baked in the respective standards, and that they are half-baked differently so we shouldn’t confuse them. But I don’t think it’s fair to say that they are just completely unrelated. If anything, it’s Python’s None that doesn’t belong.
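
The comparison behavior mentioned above can be seen with a short standard-library sketch. It assumes SQLite (via `sqlite3`) as a stand-in for SQL NULL semantics; other engines may differ in details, and nothing here is specific to Ibis or pandas.

```python
import math
import sqlite3

nan = math.nan

# IEEE 754 NaN: comparing it with itself is False.
nan_equals_itself = (nan == nan)

# Python's None: an ordinary singleton that compares equal to itself.
none_equals_itself = (None == None)

con = sqlite3.connect(":memory:")

# SQL NULL: "NULL = NULL" is neither true nor false, it is NULL (unknown),
# which sqlite3 hands back to Python as None.
null_equals_null = con.execute("SELECT NULL = NULL").fetchone()[0]

# The dedicated IS NULL test is how SQL checks for missing values.
null_is_null = con.execute("SELECT NULL IS NULL").fetchone()[0]

# Aggregation differs too: SQL SUM skips NULLs, while a float sum
# that includes NaN is itself NaN.
sql_sum = con.execute(
    "SELECT SUM(v) FROM (SELECT 1 AS v UNION ALL SELECT NULL UNION ALL SELECT 2)"
).fetchone()[0]
nan_sum = sum([1.0, nan, 2.0])

print(nan_equals_itself, none_equals_itself, null_equals_null, null_is_null)
print(sql_sum, math.isnan(nan_sum))
```

NaN and NULL both refuse ordinary equality, but they diverge in aggregation, which is one concrete way a NaN-based and a NULL-based backend can disagree on the same query.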

By @__mharrison__ - 8 months

Not surprising. There are much better compute engines than pandas.

Folks then ask why not jump from pandas to [insert favorite tool]?

- Existing codebases. Lots of legacy pandas floating about.

- Third party integration. Everyone supports pandas. Lots of libraries work with tools like Polars, but everything works with pandas.

- YAGNI - For lots of small data tasks, pandas is perfectly fine.

By @riezebos - 8 months

Nice to see, over the past months I've replaced pandas with ibis in all new projects and I am a huge fan!

- Syntax in general feels more fluid than pandas

- Chaining operations with deferred expressions makes code snippets very portable

- Duckdb backend is super fast

- Community is very active, friendly and responsive

I'm trying to promote it to all my peers but it's not a very well known project in my circles. (Unlike Polars which seems to be the subject of 10% of the talks at all Python conferences)

By @kremi - 8 months

Pandas has been working fine for me. The most powerful feature that makes me stick to it is the multi-index (hierarchical indexes) [1]. Can be used for columns too. Not sure how the cool new kids like polars or ibis would fare in that category.

[1] https://pandas.pydata.org/docs/user_guide/advanced.html#adva...

By @frakt0x90 - 8 months

Curious if you considered Polars. That's become the de facto standard in my group as we all dislike pandas.

By @softwaredoug - 8 months

One thing I do like about pandas is it’s pretty extensible to columns of new types. Maybe I’m missing something, but does Polars allow this? Last time I checked there wasn’t a clear path forward.

By @spywaregorilla - 8 months

What is this exactly?

The front page doesn't make it clear what ibis is besides being an alternative to pandas/polars.

By @joelschw - 8 months

Huge fan of Ibis, the value isn't that you can now use DuckDB... it's that your syntax will work when the next cool thing arrives too

By @dammaj - 8 months

Personally, I tend to use Pandas because it is integrated everywhere, because of the ecosystem that uses it. Let's say I want to read data from a json file (csv file, python dict, etc.) and I want to plot it using plotly. If Ibis is compatible with whatever a Pandas dataframe is compatible with, then for most of my usage I don't really care much about the "backend".

By @thsgtu - 8 months

About time. It always surprises me how long pandas has been able to hold on. Wes McKinney talked about pandas' limitations back in 2017 https://wesmckinney.com/blog/apache-arrow-pandas-internals/

By @glial - 8 months

In my experience, the best thing about Pandas is how much it made me appreciate using dplyr and the tidyverse. If it wasn't for Pandas, I may not be the avid R user I am today.

By @infecto - 8 months

I will try it out but honestly the pain-point for me is the library api to pandas is not always intuitive/natural to how things should be in Python. The NaN/None bit is annoying but I find that to be a minor annoyance.

By @Kalanos - 8 months

I've only heard about Ibis maybe three times in the past two years and I pay pretty close attention to the space. If Ibis moves away from pandas, then it just means that I am less likely to try Ibis because there is no bridge.

Sure, hip new frameworks are moving away from pandas/numpy, but I'll wait 5 years for the dust to settle here while the compatibility and edge cases sort themselves out. The pydata/numfocus ecosystem is extensive.

It's just tabular data. So what if I have to wait a few more milliseconds to get my result.

By @mrguyorama - 8 months

Pandas made me think I hated python.

By @log4shell - 8 months

It's great to have a single entrypoint for multiple backends. What I am trying to understand, and couldn't find much information on: how does the use of multiple engines in Ibis impact the consistency of results for the same input and query, particularly in relation to semantic differences among the engines?

By @whimsicalism - 8 months

Frankly, pandas/dask/polars - all are trying to recreate something that has existed for years (ie. sql, spark) and with a terrible API.

I truly thought I was terrible at python for a long time because of pandas - turns out it just has an absolutely terrible API surface

By @zhangyt26 - 8 months

How does ibis manage ordering of the data frame for SQL-based engines?

One big gap that I noticed is that pandas retains ordering between operations, but most SQL-based engines don’t give you that guarantee. Looking at duckdb, it looks like they only do that for csv.
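
One common workaround for the ordering gap described above is to carry an explicit position column and sort on it. The sketch below is an assumption-laden illustration, not a description of what Ibis does: it uses the standard library's `sqlite3` as a stand-in SQL engine, and the `events` table and `pos` column are hypothetical names.

```python
import sqlite3

# An in-memory sequence whose order matters, as it would in a pandas DataFrame.
events = ["load", "clean", "join", "plot"]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (pos INTEGER, name TEXT)")
# Store the original position explicitly so the order can be recovered later.
con.executemany("INSERT INTO events VALUES (?, ?)", list(enumerate(events)))

# Without ORDER BY, SQL makes no promise about the order of returned rows,
# even if a particular engine happens to return them in insertion order.
unordered = [
    name
    for (name,) in con.execute("SELECT name FROM events WHERE name != 'clean'")
]

# With ORDER BY on the position column, the result order is guaranteed.
ordered = [
    name
    for (name,) in con.execute(
        "SELECT name FROM events WHERE name != 'clean' ORDER BY pos"
    )
]

print(ordered)
```

Only the `ordered` result is safe to rely on; the `unordered` one contains the same rows but in whatever order the engine chooses.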

By @maxdo - 8 months

curious to know what they use instead of Dask ?

By @mrbluecoat - 8 months

Dask was really cool back in the day. Farewell, friend.

By @ironmagma - 8 months

Man, these frontend people are constantly churning through dependencies, a new framework every year. Don't reinvent the wheel, just be like backend folks and pick a dependency and stick with it. Oh wait...

By @antman - 8 months

"TL; DR: we are deprecating the pandas and dask backends and will be removing them in version 10.0.

There is no feature gap between the pandas backend and our default DuckDB backend"

There is an important feature gap: pandas has all the users and no one knows you guys. Was trying to pitch this to my company but I guess it's goodbye Ibis and thanks for all the fish.

By @rldjbpin - 8 months

you could completely respect the decision to deprecate some support, but the marketing speak comes across distasteful.

from a user's perspective, selecting a tool should be about what's "best" for you and not what is the "best" out there (from some arbitrary measure nonetheless).

pandas made sense and might still make sense for many, especially who are migrating from R and are already comfortable with dataframe paradigm. using the same paradigm and just saying "we run faster" does not solve the ergonomics or the reason why people adopted this way of processing data.

there are plenty players in the space going about this from the stack. would be nice to see more discussion on alternative generic data processing paradigms instead.

By @UncleOxidant - 8 months

Related to this pandas?: https://pandas.pydata.org/
