Databases in 2024: A Year in Review
In 2024, Redis Ltd. and Elastic N.V. changed their licensing models amid competition, while Databricks and Snowflake intensified rivalry, focusing on ecosystem integration. DuckDB gained popularity for analytics.
In 2024, the database landscape experienced significant changes, particularly in licensing and competition among major players. Redis Ltd. moved from the permissive BSD-3 license to a dual-license model, prompting backlash and the emergence of forks such as Valkey and Redict. This shift was part of Redis Ltd.'s strategy to consolidate control and prepare for an IPO. Elastic N.V. moved in the opposite direction: having earlier replaced Elasticsearch's open-source license with a restrictive dual-license model amid competition with Amazon, which forked Elasticsearch into OpenSearch, it added AGPL as a licensing option in 2024. The ongoing rivalry between Databricks and Snowflake intensified, with both companies investing heavily in open-source large language models (LLMs) and competing for dominance in data management ecosystems. Databricks acquired Tabular for $2 billion, while Snowflake announced its Polaris catalog service. The competition has shifted from raw performance to the broader ecosystem surrounding databases, emphasizing compatibility and user experience. Additionally, DuckDB emerged as a popular choice for analytical queries, with various extensions developed to integrate it into existing systems like Postgres. Overall, the year highlighted the challenges open-source database vendors face against cloud giants and the evolving dynamics of database technology.
- Redis Ltd. faced backlash after changing its license, leading to the creation of forks.
- Elastic N.V. added AGPL as a license option for Elasticsearch, returning it to open source amid competition with Amazon.
- Databricks and Snowflake intensified their rivalry, focusing on ecosystem integration and open-source LLMs.
- DuckDB gained popularity for analytical queries, with extensions for Postgres released.
- The database market is increasingly influenced by cloud vendors, challenging open-source ISVs.
Related
Why Did Databricks Open-Source Unity Catalog?
Databricks has open-sourced Unity Catalog and acquired Tabular, signaling a shift towards open-source solutions in lakehouse architecture, with support from major companies and potential impacts on Apache Iceberg.
Elastic's Return to Open Source
Elastic is reintroducing open source licenses, including AGPL, following AWS's fork of Elasticsearch into OpenSearch. This shift has improved Elastic's partnership with AWS and may influence other companies' open source strategies.
Valkey 8 sets a new bar for open-source in-memory NoSQL data storage
Valkey 8.0 has been released, enhancing performance and reliability as a competitor to Redis, with features like multi-core utilization and automatic failover, receiving strong support from major tech companies.
Redis users considering alternatives after licensing move
Around 70% of Redis users are seeking alternatives after the shift to restrictive licenses. Valkey, a new open-source option backed by major companies, has emerged in response to these changes.
The architecture is quite ancient at this point, but I'm not sure it's completely outdated. It's single-master shared-nothing, with shards distributed among replicas, similar to Citus. But the GPORCA query planner is probably the most advanced distributed query planner in the open source world at this point. From what I know, Greenplum/Cloudberry can be significantly faster than Citus thanks to the planner being smarter about splitting the work across shards.
DuckDB is a great tool. In April 2020, the creator of DuckDB gave a talk at CMU. At the beginning he makes a convincing argument (in 5 minutes) for why data scientists don't use RDBMSs and how this was the genesis of DuckDB. Here is a video that starts 3 minutes into the talk (where the argument starts): https://youtu.be/PFUZlNQIndo?si=ql9n2QuBlAEuGIqo&t=204
Of course it's not to be used as a general-purpose DB; it's keys and values, used for caches and things like that. In my experience, in real-world scenarios and loads, vanilla single-threaded Redis is stable, fast, and nigh bulletproof.
When the original license is as restrictive as AGPL, it is unlikely there is much embedded use... so fewer people are impacted in a truly catastrophic way.
Also, if there is no contributor community to speak of... who is going to do the fork?
I put some thoughts about it in my post about ScyllaDB https://peterzaitsev.com/thoughts-on-scylladb-license-change...
I have! It's a pretty good no-code/minimal-code graphical ELT+Analytics in one tool. It's one of those alternate-universe tools that has its own way of doing things, different from everything else in the industry, but it's pragmatic and the people who use it tend to love it.
The one thing that makes it viable is that it has/had (pre-acquisition) very aggressive compatibility with anything else that can hold data, so you can use it as a bolt-on to whatever other databases or files your company has.
Despite what the PE press release about the acquisition says, it has virtually nothing to do with AI, at least in the modern big NN sense.
If you're looking to fix your giant pile of alteryx workbooks or migrate them to something else, hmu
> OtterTune. Dana, Bohan, and I worked on this research project and startup for almost a decade. And now it is dead. I am disappointed at how a particular company treated us at the end, so they are forever banned from recruiting CMU-DB students. They know who they are and what they did.
Ouch.
> Lastly, I want to give a shout-out to ByteBase for their article Database Tools in 2024: A Year in Review. In previous years, they emailed me asking for permission to translate my end-of-year database articles into Chinese for their blog. This year, they could not wait for me to finish writing this one, so they jocked my flow and wrote their own off-brand article with the same title and premise.
Also sounds like he's preparing a new company:
> I hope to announce our next start-up soon (hint: it’s about databases).
If anything this shows how insanely difficult it must be to succeed as a database startup (when was the most recent startup success in this space?), as the founding team is stellar.
On the other hand, I am surprised it died this quickly, and I'd be interested to know whether they did a proper postmortem. Not only did they raise far more than is needed to survive for three years, but the idea is about utilising AI to improve DB performance, and I find it hard to imagine they couldn't find more investors to throw them a lifeline with all the AI hype.
AFAIK people didn't take MongoDB seriously from the start, especially with the "web scale database" joke circulating. The Neo4j Community version has been under GPLv3 for quite some time, while the Enterprise version has always been somewhat closed, regardless of whether the source code was available on GitHub (the mentioned license change affected the Enterprise version).
Regarding CockroachDB, I must admit that I've only heard about it on HN and don't know anyone who seriously uses it. As for Kafka, there are two versions: Apache Kafka, the open-source version that almost everyone uses (under the Apache license), and Confluent Kafka, which is Apache Kafka enhanced with many additional features from Confluent, and the license change affected Confluent Kafka. In short, maybe the majority simply didn't care about these projects very much, so there is no major fork.
> It cannot be because the Redis and Elasticsearch install base is so much larger than these other systems, and therefore, there were more people upset by the change since the number of MongoDB and Kafka installations was equally as large when they switched their licenses.
I can’t speak for MongoDB, but the Confluent Kafka install base is significantly smaller than that of Apache Kafka, Redis and ES.
> Dana, Bohan, and I worked on this research project and startup for almost a decade. And now it is dead. I am disappointed at how a particular company treated us at the end, so they are forever banned from recruiting CMU-DB students. They know who they are and what they did.
Call me a skeptic, but I can't see this as a fair approach. If your company fails for whatever reason, you should not enlist the university department/group/students against your peers (I can't find that CMU-DB was one of the founders of OtterTune).
Wrt Andy, here are some somewhat interesting views [1] from (presumably) previous employees.
[1] https://www.reddit.com/r/Database/comments/1dgaazw/comment/l...
I worked at a company for a while that used QLDB as the primary system of record. The idea is great, but the problem is that, due to performance and other QLDB limitations, all data had to be mirrored to an RDBMS via a streaming/queuing system, and there were always programmatic errors in interpreting data arriving for import into the RDBMS: text field too long for the RDBMS field; wrong data type or overflowing integer; invalid text encoding; etc. These errors had to be noticed, debugged, and fixed, and the data had to be re-streamed. In the meantime, official transactions were missing from the RDBMS side, which was used for reporting, driving the UI, deriving monetary obligations, etc. It was not worth the trouble. (I was lucky not to be involved in that design or implementation.)
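The failure classes listed above (oversized text, wrong type, integer overflow, bad encoding) can be caught before the insert rather than after it. What follows is only a minimal sketch of that idea, not the design the company used: the `Column` schema, the 32-bit integer limit, and the `partition` helper are all hypothetical names and assumptions introduced for illustration.

```python
from dataclasses import dataclass

# Assumption: the target RDBMS column is a signed 32-bit integer.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


@dataclass(frozen=True)
class Column:
    """Hypothetical description of one target RDBMS column."""
    name: str
    kind: str         # "text" or "int"
    max_len: int = 0  # only used for text columns


def validate(record: dict, schema: list[Column]) -> list[str]:
    """Return a list of problems; an empty list means the record is importable."""
    problems = []
    for col in schema:
        value = record.get(col.name)
        if value is None:
            problems.append(f"{col.name}: missing")
            continue
        if col.kind == "text":
            if isinstance(value, bytes):
                # Raw bytes from the stream must decode cleanly as UTF-8.
                try:
                    value = value.decode("utf-8")
                except UnicodeDecodeError:
                    problems.append(f"{col.name}: invalid text encoding")
                    continue
            if not isinstance(value, str):
                problems.append(f"{col.name}: wrong data type")
            elif len(value) > col.max_len:
                problems.append(f"{col.name}: text too long")
        elif col.kind == "int":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                problems.append(f"{col.name}: wrong data type")
            elif not INT32_MIN <= value <= INT32_MAX:
                problems.append(f"{col.name}: integer overflow")
    return problems


def partition(records, schema):
    """Split records into importable ones and a dead-letter list with reasons."""
    ok, dead_letter = [], []
    for record in records:
        problems = validate(record, schema)
        if problems:
            dead_letter.append((record, problems))
        else:
            ok.append(record)
    return ok, dead_letter
```

Rejected records land in the dead-letter list with a reason attached, so they can be fixed and re-streamed without silently going missing from the reporting side. It does not remove the underlying problem of keeping two systems in sync.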
Yes, this can happen. But a lot of people don't want an AWS-managed service. They're like 30% cheaper for 30% less value. They can develop a bad reputation and feel like weird forks (Kinesis vs Kafka) that have weird undocumented gotchas and edge cases that never get fixed. Many teams want to host on k8s anyway, and you'll probably have better k8s support from the main project. Another example is the success of Flink over hosted Google Dataflow. It seems the teams I know eventually trend to the most mainstream OSS implementation over time, maybe after early prototyping on a managed system.
IMO it might not be the highest growth market anymore. Those who want to pay for a managed service will. But many are just figuring out a k8s based solution to their infra needs as k8s knowledge becomes more ubiquitous.
One factual issue: "The university had previously announced that this player was transferring from Louisiana State to Michigan." This is not true. Underwood had committed to LSU but then switched his commitment to Michigan. He was still in high school at the time, and has never attended LSU.
But, do you really expect a funny database prof to know much about football?
Oracle actually released 9.1 already in 2024. [1] Expect another release this month, and one every quarter. So I think MySQL continues to get new features, bug fixes, and support like it used to, contrary to what most people think, which is that it is all going to HeatWave. I just hope vector support will later be open-sourced as an official part of MySQL rather than kept behind HeatWave.
[1] https://dev.mysql.com/doc/relnotes/mysql/9.1/en/news-9-1-0.h...
> Postgres' support for extensions and plugins is impressive. One of the original design goals of Postgres from the 1980s was to be extensible. The intention was to easily support new access methods and new data types and operations on those data types (i.e., object-relational). Since 2006, Postgres' "hook" API. Our research shows that Postgres has the most expansive and diverse extension ecosystem compared to every other DBMS.
Greenhorn developers don't even know that there are non-Postgres databases which have extensions too - such is the gap! I wouldn't be surprised if Postgres had as many as all others combined.
It appears that a lot of attention is now directed at the folks doing 100 MB queries, and the high end has moved past everybody's radar. My idea of an exciting product is Ocient, who have skipped over Cloud and gone for hyperscale on-prem hardware. Yellowbrick is also a contender here.
I have a lot of experience with Vertica, and they seem to have gotten stuck in this niche as well, with sales tilted towards big accounts, but less traction in smaller shops, and a difficult road to get a SaaS or similar easy-start offering.
There's a crossover point where self-managed is cheaper than cloud, but nobody seems to have any idea where it is. Snowflake will gladly tell you that your sub-$1M Vertica cluster should be replaced by $10M of sluggish SaaS, and that you are saving money by doing so. These decisions seem more in the realm of psychology or political science.
DHH's cloud exit was a refreshing take on the expense issue, even if it wasn't strictly in the database space -- the cost per VCPU and so forth that he documented is a good start for estimating savings, and he debunked a lot of the "hidden costs" that cloud maximalists claim.
In the business/financial space the biggest news to me was the correction in Snowflake's stock price, which seemed to indicate that investors were finally noticing metrics like price-performance, but they added a little more AI and went back into irrationality.
I'm heavily in favor of DuckDB, Hudi, Iceberg, S3 tables, and the like. Mixing high-end and low-end tools seems like the best strategy (although settling on one high-end DWH has also worked IME), and the low end is getting better and cheaper, squeezing out the mid-range SaaS vendors.
In research I found Goetz Graefe's work in offset-value coding exciting -- he's wired it into query operators in a way that saves a lot of CPU on sorting and joins/aggregation. This is a technique that I've applied favorably in string sorting, and it was discovered in the DB community decades ago but largely forgotten. (This work precedes 2024, but I'm a slow study.)
The link for "MariaDB corporation" points to an empty image with white colour background. Can anyone explain the context here?
I don’t care about the billion-dollar drama behind a piece of tech, but Redis defined the key-value query API for many similar databases. Trashing it just because it isn’t SQL-like feels unjustified.
Maybe algorithms review or TCS review or some specific math topic review next?
for a moment i got reminded of the rap music in your courses
im glad that tigerbeetle got here, really impressive team they have.
there are a lot of other missing alien technologies i've discovered recently too like quickwit which is like elasticsearch but s3-compatible, and typesense which is like elasticsearch but memory-based
A little sad Andy didn't share more of his thoughts on the intersection between Data and AI, and how that's going to evolve.
guys, what are we doing here. this is ridiculous. andy pavlo cannot get an article on wikipedia? have you seen his work?