February 18th, 2025

Kafka at the low end: how bad can it get?

The blog post outlines Kafka's challenges as a job queue in low-volume scenarios, highlighting unfair job distribution, increased latency, and recommending caution until improvements from KIP-932 are implemented.

Read original article

Kafka at the low end: how bad can it get?

The blog post discusses the challenges of using Kafka as a job queue, particularly in low-volume scenarios. The author highlights that Kafka can lead to unfair job distribution among workers, where one worker may be overloaded while others remain idle. This issue arises from the way Kafka assigns jobs based on partitions and consumers, which can result in a single consumer handling multiple jobs consecutively. The author provides a formula to calculate the worst-case scenario for job assignments and illustrates it with an example involving a web application with multiple workers and producers. The analysis suggests that if the number of jobs in-flight during peak periods exceeds a certain threshold, workers will be more evenly utilized. However, if the job volume is low, some workers may not contribute effectively, leading to increased latency and user dissatisfaction. The author emphasizes that Kafka was not designed for low-volume job processing and that its performance benefits come at the cost of losing features found in traditional message brokers. The post concludes that Kafka is not an ideal choice for job queuing, especially until the implementation of Queues for Kafka (KIP-932) addresses these concerns.

- Kafka can lead to unfair job distribution among workers in low-volume scenarios.

- The worst-case job assignment can result in one worker handling multiple jobs while others remain idle.

- A formula is provided to calculate the worst-case scenario for job assignments.

- Kafka is not designed for low-volume job processing and sacrifices features of traditional brokers for speed.

- The author recommends caution in using Kafka as a job queue until improvements are made with KIP-932.

The Essence of Apache Kafka

Apache Kafka is a distributed event-driven architecture that enables efficient real-time data streaming, ensuring fault tolerance and scalability through an append-only log structure and partitioned topics across multiple nodes.

9 comments

By @NovemberWhiskey - 2 months

Kafka for small message volumes is one of those distinct resume-padding architectural vibes.

By @xyst - 2 months

> Each of these Web workers puts those 4 records onto 4 of the topic’s partitions in a round-robin fashion. And, because they do not coordinate this, they might choose the same 4 partitions, which happen to all land on a single consumer

Then choose a different partitioning strategy. Often key based partitioning can solve this issue. Worst case scenario, you use a custom partitioning strategy.

Additionally , why can’t you match the number of consumers in consumer group to number of partitions?

The KIP mentioned seems interesting though. Kafka folks trying to make a play towards replacing all of the distributed messaging systems out there. But does seem a bit complex on the consumer side, and probably a few foot guns here for newbies to Kafka. [1]

[1] https://cwiki.apache.org/confluence/plugins/servlet/mobile?c...

By @rockwotj - 2 months

The kafka protocol is a distributed write ahead log. If you want a job queue you need to build something on top of that, it’s a pretty low level primative.

By @jszymborski - 2 months

What do people recommend?

Especially for low levels of load, that doesn't require that the dispatcher and consumer are written in the same language.

By @voodooEntity - 2 months

We build an Infrastructure with about 6 microservices and Kafka as main message queue (job queue).

The problem the author describes is 100% true and if you are scaled with enaugh workers this can turn out really bad.

While not beeing the only issue we faced (others are more environment/project-language specific) we got to a point where we decided to switch from kafka to rabbitmq.

By @enether - 2 months

thankfully early access for KIP-932 is coming in 1-3 weeks as the 4.0.0 release gets published

By @brunoborges - 2 months

For a small load queueing system, I had great success with Apache ActiveMQ back in the days. I designed and implemented a system with the goal of triggering SMS for paid content. This was in 2012.

Ultimately, the system was fast enough that the telco company emailed us and asked to slow down our requests because their API was not keeping up.

In short: we had two Apache Camel based apps: one to look at the database for paid content schedule, and queue up the messages (phone number and content). Then, another for triggering the telco company API.

By @araes - 2 months

Having never actually used this platform before, does anybody know why they named it Kafka, with all the horrible meanings?

Per Wiktionary, Kafkaesque: [1]

1. "Marked by a senseless, disorienting, often menacing complexity."

2. "Marked by surreal distortion and often a sense of looming danger."

3. "In the manner of something written by Franz Kafka." (like the software language was written by Franz Kafka)

Example: Metamorphosis Intro: "One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections. The bedding was hardly able to cover it and seemed ready to slide off any moment. His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked." [2]

[1] Wiktionary, Kafkaesque: https://en.wiktionary.org/wiki/Kafkaesque

[2] Gutenberg, Metamorphosis: https://www.gutenberg.org/cache/epub/5200/pg5200.txt

By @techcode - 2 months

What that post describes (all work going to one/few workers) in practice doesn't really happen if you properly randomize (e.g. just use random UUID) ID of the item/task when inserting it into Kafka.

With that (and sharding based on that ID/value) - all your consumers/workers will get equal amount of messages/tasks.

Both post and seemingly general theme of comments here is trashing choice of Kafka for low volume.

Interestingly both are ignoring other valid reasons/requirements making Kafka perfectly good choice despite low volume - e.g.:

- multiple different consumers/workers consuming same messages at their own pace

- needing to rewind/replay messages

- guarantee that all messages related to specific user (think bank transactions in book example of CQRS) will be handled by one pod/consumer, and in consistent order

- needing to chain async processing

And I'm probably forgetting bunch of other use cases.

And yes, even with good sharding - if you have some tasks/work being small/quick while others being big/long can still lead to non-optimal situations where small/quick is waiting for bigger one to be done.

However - if you have other valid reasons to use Kafka, and it's just this mix of small and big tasks that's making you hesitant... IMHO it's still worth trying Kafka.

Between using bigger buckets (so instead of 1 fetch more items/messages and handle work async/threads/etc), and Kafka automatically redistributing shards/partitions if some workers are slow ... You might be surprised it just works.

And sure - you might need to create more than one topic (e.g. light, medium, heavy) so your light work doesn't need to wait for heavier one.

Finally - I still didn't see anyone mention actual real deal breakers for Kafka.

From the top of my head I recall a big one is no guarantee of item/message being processed only once - even without you manually rewinding/reprocessing it.

It's possible/common to have situations where worker picks up a message from Kafka, processes (wrote/materialized/updated) it and when it's about to commit the kafka offset (effectively mark it as really done) it realizes Kafka already re-partitioned shards and now another pod owns particular partition.

So if you can't model items/messages or the rest of system in a way that can handle such things ... Say with versioning you might be able to just ignore/skip work if you know underlying materialized data/storage already incorporates it, or maybe whole thing is fine with INSERT ON DUPLICATE KEY UPDATE) - then Kafka is probably not the right solution.

Kafka at the low end: how bad can it get?

Related

The Essence of Apache Kafka

Related

The Essence of Apache Kafka