What If We Could Rebuild Kafka from Scratch?
Gunnar Morling proposes "Kafka.next," a cloud-optimized version of Kafka, featuring key-centric access, topic hierarchies, concurrency control, extensibility, synchronous commit callbacks, snapshotting, and multi-tenancy for improved data management.
Gunnar Morling discusses the potential for a new version of Kafka, referred to as "Kafka.next," which would be designed from the ground up to better suit cloud environments. He highlights recent developments like KIP-1150 ("Diskless Kafka") and AutoMQ’s Kafka fork, which aim to enhance Kafka's functionality in cloud settings. Morling outlines several desirable features for this new system, including the elimination of partitions in favor of key-centric access, which would allow for more efficient data retrieval and processing. He suggests implementing topic hierarchies for better subscription management, concurrency control to prevent outdated data writes, and broker-side schema support to improve data integrity. Additionally, he emphasizes the importance of extensibility and pluggability, enabling users to customize the system without altering its core. Other proposed features include synchronous commit callbacks for stronger consistency, snapshotting for better state management, and built-in multi-tenancy to support diverse workloads. Morling notes that while some of these features exist in other systems, no single open-source platform currently combines all of them. He invites feedback from others who have experience with Kafka or similar platforms to contribute their ideas.
- Morling proposes a new version of Kafka designed for cloud environments.
- Key features include key-centric access, topic hierarchies, and concurrency control.
- Emphasis on extensibility and pluggability for user customization.
- Synchronous commit callbacks and snapshotting are suggested for improved data management.
- Multi-tenancy is highlighted as essential for modern data systems.
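Two of the proposed features, key-centric access and concurrency control, can be illustrated with a toy in-memory sketch. This is purely illustrative: the class and method names are invented here, not part of Kafka or any proposed API. Events are addressed by key rather than partition, and each append may carry an expected version so that stale writers are rejected (optimistic concurrency).

```python
from collections import defaultdict

class KeyCentricLog:
    """Toy sketch of key-centric streams with optimistic concurrency.
    Hypothetical API -- not Kafka's and not Morling's proposal verbatim."""

    def __init__(self):
        self._streams = defaultdict(list)  # key -> append-only event list

    def append(self, key, event, expected_version=None):
        stream = self._streams[key]
        # Concurrency control: reject writes based on outdated state.
        if expected_version is not None and expected_version != len(stream):
            raise ValueError(
                f"version conflict on {key!r}: "
                f"expected {expected_version}, stream is at {len(stream)}")
        stream.append(event)
        return len(stream)  # new version of this key's stream

    def read(self, key, from_version=0):
        # Consumers address a key directly, not a partition.
        return self._streams[key][from_version:]

log = KeyCentricLog()
log.append("order-42", {"status": "created"}, expected_version=0)
log.append("order-42", {"status": "paid"}, expected_version=1)
print(log.read("order-42"))  # both events for order-42, in order
```

A second writer that also passed `expected_version=0` for `"order-42"` would get a `ValueError` instead of silently overwriting newer state, which is the "prevent outdated data writes" behavior the summary describes.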
Related
The Essence of Apache Kafka
Apache Kafka is a distributed event-driven architecture that enables efficient real-time data streaming, ensuring fault tolerance and scalability through an append-only log structure and partitioned topics across multiple nodes.
Kafka at the low end: how bad can it get?
The blog post outlines Kafka's challenges as a job queue in low-volume scenarios, highlighting unfair job distribution, increased latency, and recommending caution until improvements from KIP-932 are implemented.
Apache Kafka 4.0 Released
Apache Kafka 4.0.0 has been released, eliminating ZooKeeper, enhancing scalability, and introducing new features like improved consumer performance and queue semantics support, while requiring updated Java versions and removing deprecated APIs.
KIP-1150: Diskless Kafka Topics
KIP-1150 proposes Diskless Topics in Apache Kafka to optimize storage and reduce costs by using object storage, enabling multi-region active-active topics and automatic failover, enhancing Kafka's market competitiveness.
- Many users express frustration with Kafka's complexity and limitations, suggesting that it may not be the best solution for all use cases.
- Alternatives to Kafka, such as NATS and Apache Pulsar, are mentioned as potentially simpler and more effective options.
- Concerns are raised about Kafka's design choices, particularly regarding partitioning and latency, with some suggesting that it may not be suitable for certain applications.
- There is a call for reflection on the purpose of engineering improvements, with some commenters questioning the need for a cloud-optimized Kafka.
- Several users mention ongoing developments in the space, including LinkedIn's Northguard and other projects that aim to address Kafka's shortcomings.
But today, all streaming systems (or workarounds) with per message key acknowledgements incur O(n^2) costs in either computation, bandwidth, or storage per n messages. This applies to Pulsar for example, which is often used for this feature.
Now, now, this degenerate time/space complexity might not show up every day, but when it does, you’re toast, and you have to wait it out.
My colleagues and I have studied this problem in depth for years, and our conclusion is that a fundamental architectural change is needed to support scalable per-message-key acknowledgements. Furthermore, the architecture will fundamentally require a sorted index, meaning that any such queuing / streaming system will process n messages in O(n log n).
We’ve wanted to blog about this for a while, but never found the time. I hope this comment helps out if you’re thinking of relying on per message key acknowledgments; you should expect sporadic outages / delays.
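The commenter's complexity claim can be made concrete with a toy sketch (the class and its structure are mine, not from any real system). When messages are acknowledged out of order, the broker must track the committable low-water mark. Rescanning a flat list of in-flight messages on every ack is O(n) per message, hence O(n²) overall; keeping outstanding offsets in a sorted structure, here a min-heap, brings each operation to O(log n), matching the O(n log n) bound the comment argues is fundamental.

```python
import heapq

class AckTracker:
    """Sketch: low-water-mark tracking for out-of-order acknowledgements.
    A min-heap keeps the oldest outstanding offset reachable in O(log n)."""

    def __init__(self):
        self._outstanding = []  # min-heap of in-flight offsets
        self._acked = set()     # acked offsets not yet contiguous
        self.committed = 0      # all offsets below this are fully done

    def deliver(self, offset):
        heapq.heappush(self._outstanding, offset)

    def ack(self, offset):
        self._acked.add(offset)
        # Advance the low-water mark past contiguously acked offsets.
        while self._outstanding and self._outstanding[0] in self._acked:
            done = heapq.heappop(self._outstanding)
            self._acked.discard(done)
            self.committed = done + 1

t = AckTracker()
for o in range(4):
    t.deliver(o)
t.ack(2)
t.ack(0)
print(t.committed)  # 1 -- offset 1 is still in flight, so 2 cannot commit
```

Note the residual cost the commenter points at: the acked-but-not-committable set (`_acked`) can grow to O(n) when one early message stays unacked, which is the "degenerate" case where naive implementations fall over.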
Just don't use Kafka.
Write to the downstream datastore directly. Then you know your data is committed and you have a database to query.
I feel like Kafka is a victim of its own success: it's excellent for what it was designed for, but since the design is simple and elegant, people have been using it for all sorts of things it was not designed for. And, of course, it's not perfect for those use cases.
Anyone here have some real world experience with it?
> "Key-level streams (... of events)"
When you are leaning on the storage backend for physical partitioning (as per the cloud example, where they would literally partition based on keys), doesn't this effectively just boil down to renaming partitions to keys, and keys to events?
I skipped learning Kafka, and jumped right into Pulsar. It works great for our use case. No complaints. But I wonder why so few use it?
This is exactly what I take from these kinds of articles: engineering for engineering's sake. I am not saying we should not investigate how to improve our engineered artifacts, or that we should not improve them. But I see a generalized lack of reflection on why we should do it, and I think it is related to a detachment from the domains we create software for. The article suggests uses of the technology that come from such different ways of using it that it loses coherence as a technical item.
While it would be easy to just blame Kafka for being bad technology, it seems many other people get it wrong, too.
But don’t public cloud providers already all have cloud-native event sourcing? If that’s what you need, just use that instead of Kafka.
Fast-forward to 2025: there are many performant, efficient, and less complex alternatives to Kafka that save you money, instead of burning millions in operational costs "to scale".
Unless you are at a hundred-million-dollar-revenue company, choosing Kafka in 2025 doesn't make sense anymore.
The backing db in this wishlist would be something in the vein of Aurora to achieve the storage compute split.
I'm currently building a full workload scheduler/orchestrator. I'm sick of Kubernetes. The world needs better -> https://x.com/GeoffreyHuntley/status/1915677858867105862
>I cannot make you understand. I cannot make anyone understand what is happening inside me. I cannot even explain it to myself. -Franz Kafka, The Metamorphosis
Feels like there is another squeeze in that idea if someone “just” took all their docs and replicated the feature set. But maybe that’s what S2 is already aiming at.
Wonder how long warpstream docs, marketing materials and useful blogs will stay up.