Farewell Pandas, and thanks for all the fish
Ibis will remove its pandas and Dask backends in version 10.0, favoring DuckDB for better performance and ease of use, while still allowing pandas DataFrames for data transfer.
Ibis has announced the deprecation of its pandas and Dask backends, with plans to remove them in version 10.0. The reasoning is that there is no feature gap between the pandas backend and the default DuckDB backend, and DuckDB offers superior performance. While pandas DataFrames will still be usable for data transfer to and from Ibis, executing queries with pandas will no longer be supported. The pandas backend was originally included so that users could adopt Ibis without complex installations, but it has proven less efficient because its eager execution model conflicts with Ibis's deferred execution approach. This mismatch has led to slower performance and unnecessary complexity in the codebase. Additionally, pandas uses NaN for missing values while Ibis uses NULL, which has created ongoing challenges. The DuckDB backend, which can query pandas DataFrames and supports various data formats, is now favored for its ease of installation and speed, making it the recommended choice for users.
- Ibis will remove the pandas and Dask backends in version 10.0.
- DuckDB, the default backend, is recommended for its superior performance and ease of use.
- Users can still use pandas DataFrames for data transfer, but not for executing queries.
- The pandas backend's eager execution model has led to inefficiencies compared to Ibis's deferred model.
- The decision aims to streamline Ibis's functionality and improve user experience.
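The eager versus deferred distinction in the summary can be illustrated with a toy sketch. This is not Ibis's API: the `Deferred` class, the `penguins` table, and the use of SQLite as the engine are all assumptions made for illustration. The point is only that a deferred expression records what to compute and hands one query to an engine at the end, whereas eager code materializes each intermediate step immediately.

```python
import sqlite3

data = [("Adelie", 3700.0), ("Gentoo", 5000.0), ("Adelie", 3800.0)]

# Eager style: the filter runs immediately and materializes a new list.
eager = [row for row in data if row[1] > 3750]


class Deferred:
    """Hypothetical toy expression: records steps, runs nothing until execute()."""

    def __init__(self, table, filters=(), columns=("*",)):
        self.table = table
        self.filters = tuple(filters)
        self.columns = tuple(columns)

    def filter(self, condition):
        # Return a new expression; no data is touched here.
        return Deferred(self.table, self.filters + (condition,), self.columns)

    def select(self, *columns):
        return Deferred(self.table, self.filters, columns)

    def compile(self):
        # Turn the recorded steps into a single SQL string.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({f})" for f in self.filters)
        return sql

    def execute(self, con):
        # Only now does the engine see the whole query and do any work.
        return con.execute(self.compile()).fetchall()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE penguins (species TEXT, mass REAL)")
con.executemany("INSERT INTO penguins VALUES (?, ?)", data)

expr = Deferred("penguins").filter("mass > 3750").select("species", "mass")
sql = expr.compile()  # SELECT species, mass FROM penguins WHERE (mass > 3750)
rows = expr.execute(con)
```

Because the engine receives the whole query at once, it is free to optimize it. An eager backend has to execute every intermediate operation as it is called, which is the mismatch the summary describes.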
Related
Db2 is a story worth telling, even if IBM won't
Db2, IBM's renowned database, with a history since the 1980s, faces uncertainty as IBM remains quiet about its future. Recent updates include AI features and cloud integration, but lack of communication raises concerns about its competitiveness against growing alternatives like PostgreSQL.
Db2 is a story worth telling, even if IBM won't
Db2, IBM's longstanding relational database, faces uncertainty as IBM remains tight-lipped about its future. Recent AI enhancements and a move towards a cloud-first approach contrast with IBM's vague roadmap, sparking speculation.
Memory Management in DuckDB
DuckDB optimizes query processing with effective memory management, using a streaming execution engine and disk spilling for large datasets. Its buffer manager enhances performance by caching frequently accessed data.
The Future of Kdb+
The article examines kdb+'s future in financial services, noting competition from newer technologies and suggesting KX should enhance its product and consider strategic changes to maintain relevance.
pg_duckdb: Splicing Duck and Elephant DNA
MotherDuck launched pg_duckdb, an open-source extension integrating DuckDB with Postgres to enhance analytical capabilities while maintaining transactional efficiency, supported by a consortium of companies and community contributions.
That’s actually less true than it sounds. One of the primary functions of NaN is to be the result of 0/0, so there it means that there could be a value but we don’t know what it is because we didn’t take the limit properly. One of the primary functions of NULL is to say that a tuple satisfies a predicate except we don’t know what this one position is; it’s certainly something out in the real world, we just don’t know what. These ideas are what motivate the comparison shenanigans both NaN and NULL are known for.
There’s certainly an argument to be made that the actual implementation of both of these ideas is half-baked in the respective standards, and that they are half-baked differently so we shouldn’t confuse them. But I don’t think it’s fair to say that they are just completely unrelated. If anything, it’s Python’s None that doesn’t belong.
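The comparison behavior mentioned above is easy to check. The sketch below uses Python's float NaN and, as a stand-in SQL engine, the standard library's SQLite (an assumption; other engines may differ in details). It shows that NaN and NULL both refuse to compare equal to themselves, that Python's None does not behave this way, and that the two "unknowns" still diverge in aggregation.

```python
import math
import sqlite3

nan = float("nan")

# IEEE 754: NaN is not equal to anything, including itself.
nan_equals_itself = nan == nan          # False
nan_detected = math.isnan(nan)          # True

# Python's None is an ordinary singleton and compares equal to itself.
none_equals_itself = None == None       # True

con = sqlite3.connect(":memory:")

# SQL three-valued logic: NULL = NULL is unknown, which comes back as None.
null_equals_null = con.execute("SELECT NULL = NULL").fetchone()[0]
null_is_null = con.execute("SELECT NULL IS NULL").fetchone()[0]

# The two diverge in aggregation: NaN propagates, NULL is skipped.
python_sum = sum([1.0, nan, 2.0])
con.execute("CREATE TABLE t (x REAL)")
con.executemany("INSERT INTO t VALUES (?)", [(1.0,), (None,), (2.0,)])
sql_sum = con.execute("SELECT SUM(x) FROM t").fetchone()[0]
```

The self-comparison results line up for NaN and NULL but not for None, which matches the comment's point. The aggregation results are one concrete way the two are "half-baked differently".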
Folks then ask why not jump from pandas to [insert favorite tool]?
- Existing codebases. Lots of legacy pandas floating about.
- Third party integration. Everyone supports pandas. Lots of libraries work with tools like Polars, but everything works with pandas.
- YAGNI - For lots of small data tasks, pandas is perfectly fine.
- Syntax in general feels more fluid than pandas
- Chaining operations with deferred expressions makes code snippets very portable
- Duckdb backend is super fast
- Community is very active, friendly and responsive
I'm trying to promote it to all my peers but it's not a very well known project in my circles. (Unlike Polars which seems to be the subject of 10% of the talks at all Python conferences)
[1] https://pandas.pydata.org/docs/user_guide/advanced.html#adva...
The front page doesn't make it clear what Ibis is, besides being an alternative to pandas/polars.
Sure, hip new frameworks are moving away from pandas/numpy, but I'll wait 5 years for the dust to settle here while the compatibility and edge cases sort themselves out. The pydata/numfocus ecosystem is extensive.
It's just tabular data. So what if I have to wait a few more milliseconds to get my result.
I truly thought I was terrible at python for a long time because of pandas - turns out it just has an absolutely terrible API surface
One big gap that I noticed is that pandas retains ordering between operations, but most SQL-based engines don't give you that guarantee. Looking at DuckDB, it looks like they only do that for CSV.
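One common workaround for the ordering concern is to carry an explicit row-position column and sort on it at the end. The sketch below assumes SQLite as a stand-in engine and a hypothetical `events` table with a `row_id` column. It makes no claim about what DuckDB or Ibis guarantee by default.

```python
import sqlite3

# Input rows in the order the user cares about.
rows = [("c", 3), ("a", 1), ("b", 2)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (row_id INTEGER, label TEXT, value INTEGER)")

# Store the original position alongside each row.
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i, label, value) for i, (label, value) in enumerate(rows)],
)

# Without ORDER BY, SQL leaves result order unspecified.
# Sorting on row_id makes the original order an explicit part of the query.
ordered = con.execute(
    "SELECT label, value FROM events WHERE value >= 2 ORDER BY row_id"
).fetchall()
```

Here `ordered` keeps the input order of the surviving rows, `("c", 3)` and then `("b", 2)`, because that order is requested in the query rather than assumed.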
"There is no feature gap between the pandas backend and our default DuckDB backend"
There is an important feature gap: pandas has all the users and no one knows you guys. I was trying to pitch this to my company, but I guess it's goodbye Ibis and thanks for all the fish.
from a user's perspective, selecting a tool should be about what's "best" for you and not what is the "best" out there (from some arbitrary measure nonetheless).
pandas made sense and might still make sense for many, especially for those who are migrating from R and are already comfortable with the dataframe paradigm. using the same paradigm and just saying "we run faster" does not solve the ergonomics or address the reason why people adopted this way of processing data.
there are plenty of players in the space going about this from the stack. would be nice to see more discussion on alternative generic data processing paradigms instead.