July 25th, 2024

The CAP theorem Is Irrelevant for Cloud Systems

Marc Brooker argues that the CAP theorem is less relevant for cloud-based applications, emphasizing that engineers should focus on practical trade-offs like durability versus latency and consistency versus throughput.


The CAP theorem, which addresses trade-offs in distributed systems, is often considered foundational for engineers. However, Marc Brooker argues that it is largely irrelevant for those developing cloud-based applications. While CAP is more applicable to intermittently connected systems like IoT and mobile applications, cloud architectures typically manage network partitions effectively through redundancy and routing mechanisms. This allows for strong consistency and high availability, even during failures. Brooker emphasizes that the real challenges for cloud engineers lie in other trade-offs, such as durability versus latency and consistency versus throughput, which are more critical than the CAP theorem. He suggests that educators should focus on these practical trade-offs rather than starting with CAP when teaching new engineers. The post concludes with a call to relegate the CAP theorem to a lesser status in discussions about distributed systems, advocating for a shift towards more relevant and practical considerations in the field.

AI: What people are saying
The discussion surrounding the relevance of the CAP theorem in cloud-based applications reveals several key points of contention among commenters.
  • Many argue that the CAP theorem remains crucial for understanding trade-offs in distributed systems, despite cloud providers offering solutions that may obscure these complexities.
  • Commenters emphasize the importance of designing systems to handle network partitions and the potential consequences of ignoring CAP principles.
  • There is skepticism about the notion that cloud systems eliminate the need for careful consideration of consistency, availability, and partition tolerance.
  • Some highlight real-world experiences where neglecting CAP led to significant issues, reinforcing the theorem's relevance.
  • Others suggest that while cloud technologies may mitigate some challenges, they do not eliminate the fundamental trade-offs outlined by the CAP theorem.
40 comments
By @kstrauser - 7 months
So you’re setting up a multi-region RDS. If region A goes down, do you continue to accept writes to region B?

A bank: No! If region A goes down, do not process updates in B until A is back up! We’d rather be down than wrong!

A web forum: Yes! We can reconcile later when A comes back up. Until then keep serving traffic!

CAP theorem doesn’t let you treat the cloud as a magic infinite availability box. You still have to design your system to pick the appropriate behavior when something breaks. No one without deep insight into your business needs can decide for you, either. You’re on the hook for choosing.
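The two answers above are really two settings of the same switch. A toy sketch of that choice (the names `PartitionPolicy` and `accept_write` are made up for illustration, not any real RDS API):

```python
from enum import Enum

class PartitionPolicy(Enum):
    REFUSE_WRITES = "CP-ish: reject writes until the peer region returns"
    ACCEPT_AND_RECONCILE = "AP-ish: keep serving, reconcile later"

def accept_write(peer_region_up: bool, policy: PartitionPolicy) -> bool:
    """Decide whether region B accepts a write while region A is unreachable."""
    if peer_region_up:
        return True  # no partition: everyone accepts writes
    return policy is PartitionPolicy.ACCEPT_AND_RECONCILE

# Same failure, opposite business-driven choices:
assert accept_write(False, PartitionPolicy.REFUSE_WRITES) is False        # the bank
assert accept_write(False, PartitionPolicy.ACCEPT_AND_RECONCILE) is True  # the forum
```

No library can pick the policy for you; it is a business decision encoded in code.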

By @mordae - 7 months
You wish.

> DNS, multi-cast, or some other mechanism directs them towards a healthy load balancer on the healthy side of the partition

Incidentally, that's where CAP makes its appearance and bites your ass.

No amount of VRRP or UCARP wishful thinking can guarantee a conclusion about which partition is "correct" in the presence of a network partition between load balancer nodes.

Also, who determines where to point the DNS? A single point of failure VPS? Or perhaps a group of distributed machines voting? Yeah.

You still need to perform the analysis. It's just that some cloud providers offer the distributed voting clusters as a service and take care of the DNS and load balancer switchover for you.

And that's still not enough, because you might not want to allow stragglers to write to orphan databases before the network fencing kicks in.
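The "distributed voting clusters" mentioned above still obey CAP: a side of a partition may act (e.g. repoint DNS) only if it sees a strict majority of voters. A minimal sketch of that rule (illustrative, not any real coordination service's API):

```python
def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """Strict majority: two disjoint partitions can never both pass this check."""
    return reachable_voters > total_voters // 2

# 5 voters split 3/2 by a partition: only one side may repoint DNS.
assert has_quorum(3, 5) is True
assert has_quorum(2, 5) is False
# An even 2/2 split of 4 voters leaves *nobody* able to act: availability lost.
assert has_quorum(2, 4) is False
```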

By @bunderbunder - 7 months
I once lost an entire Christmas vacation to fixing up the damage caused when an Elasticsearch cluster running in AWS responded poorly to a network partition event and started producing results that ruined our users' day (and business records) in a "costing millions of dollars" kind of way.

It was a very old version of ES, and the specific behavior that led to the problem has been fixed for a long time now. But still, the fact that something like this can happen in a cloud deployment demonstrates that this article's advice rests on an egregiously simplistic perspective on the possible failure modes of distributed systems.

In particular, the major premise that intermittent connectivity is only a problem on internetworks is just plain wrong. Hubs and switches flake out. Loose wires get jiggled. Subnetworks get congested.

And if you're on the cloud, nobody even tries to pretend that they'll tell you when server and equipment maintenance is going to happen.

By @throwaway71271 - 7 months
When I design systems I just think about tiny traitor generals and their sneaky traitor messengers racing in the war, their clocks are broken, and some of them are deaf, blind or both.

CAP or no CAP, chaos will reign.

I think FLP (https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf) is a better way to think about systems.

I think CAP is not as relevant in the cloud because the complexity is so high that nobody even knows what is going on, so even the C part alone, regardless of the other letters, is ridiculously difficult even on a single computer. A book could be written just to explain write(2)'s surprise attacks.

So you think you have whatever guarantees the designers claimed, AP or CP, and yet... the impossible will happen twice a day (and three times at night when it's your on-call).

By @killjoywashere - 7 months
The military lives in this world and will likely encourage people to continue thinking about it. Think about wearables on a submarine, as an example. Does the captain want to know his crew is fatigued, about to get sick, getting less exercise than they did on their last deployment? Yes. Can you talk to a cloud? No. Does the Admiral in Hawaii want to know those same answers about that boat, and every boat in the Group, eventually? Yes. For this situation, datacenter-aware databases are great. There are other solutions for other problems.
By @rdtsc - 7 months
> The CAP Theorem is Irrelevant

Just sprinkle the magic "cloud" powder on your system and ignore all the theory.

https://ferd.ca/beating-the-cap-theorem-checklist.html

Let's see, let's pick some checkboxes.

(x) you pushed the actual problem to another layer of the system

(x) you're actually building an AP system

By @xnorswap - 7 months
There's a better rebuttal(*) of CAP in Kleppmann's DDIA, under the title, "The unhelpful CAP theorem".

I won't plagiarize his text; instead, the chapter references his blog post, "Please stop calling databases CP or AP": https://martin.kleppmann.com/2015/05/11/please-stop-calling-...

(*): rebuttal I think is the wrong word, but I couldn't think of better.

By @vmaurin - 7 months
Plot twist: in the article drawings, replica one and two are split by network, and it could fail.

The author seems to not understand what the meaning of the P in CAP

By @pyrale - 7 months
Someone else took ownership of the problem for you and sells you their solution: "The theoretical issue is irrelevant to me."

Sure. Also, there's a long list of other things that are probably irrelevant to you. That is, until your provider fails and you need to understand the situation in order to provide a workaround.

And slapping "load-balancers" everywhere on your diagram is not really a solution, because load balancers are themselves a distributed system with state, and are subject to CAP, as presented in the diagram.

> DNS, multi-cast, or some other mechanism directs them towards a healthy load balancer on the healthy side of the partition.

"Somehow, something somewhere will fix my shit hopefully". Also, as a sidenote, a few friends would angrily shake their "it's always DNS" cup reading this.

edit: reading the rest of the blog and author's bio, I'm unsure whether the author is genuinely mistaken, or whether they're advertising their employer's product.

By @justinsaccount - 7 months
> None of the clients need to be aware that a network partition exists (except a small number who may see their connection to the bad side drop, and be replaced by a connection to the good side).

What a convenient world where the client is not affected by the network partition.

By @tristor - 7 months
As someone who's worked extensively on distributed systems, including at a cloud provider, after reading this I think the author doesn't actually understand the CAP theorem or the Two Generals' Problem. Their conclusions are utterly incorrect.
By @kristjansson - 7 months
Many things can be solved by the SEP Field[0]

[0]: https://en.wikipedia.org/wiki/Somebody_else's_problem#Dougla...

By @ivan_gammel - 7 months
The CAP theorem is the quantum mechanics of software, with C·A = O(1) in theory, similar to the uncertainty principle; but in many use cases the effect is so small that "classical" expectations of both C and A are fine.
By @mrkeen - 7 months
> In practice, the redundant nature of connectivity and ability to use routing mechanisms to send clients to the healthy side of partitions

In other words: you can have CAP as long as you can communicate across "partitions".

By @PaulHoule - 7 months
So glad to see that the CAP "theorem" is being recognized as a harmful selfish meme like Fielding's REST paper with a deadly seductive power against the overly pedantic.
By @senorrib - 7 months
I think every couple of months there's yet another article saying the CAP theorem is irrelevant. The problem with these is that they ignore the fact that CAP theorem isn't a guide, a framework or anything else.

It's simply the formalization of a fact, and whether or not that fact is *important* (although still a fact) depends on the actual use case. Hell, it applies even to services within the same memory space, although obviously the probability of losing any of the three is orders of magnitude less than on a network.

Can we please move on?

By @fractalic - 6 months
Hmm this article seems misleading. I suppose it's trying to make the point that application designers usually don't need to think too hard about it, because it's already being addressed by a quorum consensus protocol implemented by someone else. This is a bit of a tautology though; the author seems to be saying 'assume you have a solution to CAP theorem -- now isn't it silly to worry about CAP theorem?'.

One of the fundamental assumptions of CAP theorem is that you can't tell whether or not you have a partition. If you have an oracle that can instantaneously tell you the state of every subsystem, then yeah, CAP is pointless.

But if one of your DBs is connected, reporting itself as alive, and throwing all its writes into /dev/null, you won't be able to route traffic to a quorum of healthy instances because it's not possible to be certain that they're all healthy.

This is what CAP theorem is about: managing data in a distributed system where the status of any given system is fundamentally unknowable because of the Two Generals' Problem (https://en.wikipedia.org/wiki/Two_Generals'_Problem)

In many cases in Cloud though, we can skip that technical stuff and design systems as if we really _did_ have an oracle that could instantaneously and perfectly tell us the state of the system, and things will typically work fine.
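The "throwing writes into /dev/null" failure mode above can be shown in a few lines. This is a toy illustration (the class names are made up), not any real health-check API:

```python
class HealthyNode:
    """A node that answers health checks and actually stores writes."""
    def __init__(self):
        self.data = {}
    def is_alive(self) -> bool:
        return True
    def put(self, key, value):
        self.data[key] = value

class LyingNode(HealthyNode):
    """Same health check, but writes are silently dropped."""
    def put(self, key, value):
        pass  # /dev/null

good, bad = HealthyNode(), LyingNode()
# The routing layer sees two identical, "healthy" nodes...
assert good.is_alive() and bad.is_alive()
good.put("k", 1)
bad.put("k", 1)
# ...but from the outside, no oracle distinguished them before the write was lost.
assert good.data == {"k": 1}
assert bad.data == {}
```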

By @rubiquity - 6 months
The point trying to be made is that with nimble infrastructure, the A in CAP can be designed around to the point where you may as well be a CP system, unless you have a really good reason to go after that 0.005% of availability. Not being CP means sacrificing the wonderful benefits that being consistent (linearizability, sequential consistency, strict serializability) makes possible. It's hard to disagree with that sentiment, and it's likely why the Local First ideology is centered on data ownership rather than that extra 0.0005 ounces of availability. Once availability is no longer the center of attention, the design space can be focused on durability or latency: how many copies to read/write before acking.

Unfortunately the point is lost because of the usage of the word "cloud", a somewhat contrived example of solving problems by reconfiguring load balancers (in the real world, certain outages might not let you reconfigure!), and a lack of empathy: you can't tell people not to care about the semantics that thinking, or not thinking, about availability imposes on the correctness of their applications.

As for the usage of the word cloud: I don't know when a set of machines becomes a cloud. Is it the APIs for management? Or when you have two or more implementations of consensus running on the set of machines?
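The "how many copies to read/write before acking" knob mentioned above is the classic quorum condition: with N replicas, acknowledging writes at W copies and reading from R copies guarantees read/write overlap whenever R + W > N. A sketch (illustrative, not any particular database's API):

```python
def quorum_overlap(n: int, w: int, r: int) -> bool:
    """True if every read set must intersect every acknowledged write set."""
    return r + w > n

# Pay latency for 2 acks out of 3, and reads of 2 always see the latest write:
assert quorum_overlap(n=3, w=2, r=2) is True
# Ack fast at 1 copy, and a 1-copy read may miss the write entirely:
assert quorum_overlap(n=3, w=1, r=1) is False
```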

By @lupire - 7 months
He's saying that you don't need Partition Tolerance because the network is never actually Partitioned. This is exactly why the Internet and the US Interstate Highway system were invented in the first place.

Or he's saying you don't need Consistency because your system isn't actually distributed; it's just a centralized system with hot backups.

It's unclear what he's trying to say.

No idea why he wrote the blog post. It doesn't increase my confidence in the engineering quality of his employer, AWS.

By @cryptonector - 7 months
> If the partition extended to the whole big internet that clients are on, this wouldn’t work. But they typically don’t.

This is the key, that network partitions either keep some clients from accessing any servers, or they keep some servers from talking to each other. The former case is uninteresting because nothing can be done server-side about it. The latter is interesting and we can fix it with load balancers.

This conflicts with the picture painted earlier in TFA where the unhappy client is somehow stuck with the unhappy server, but let's consider that just didactic.

We can also not use load balancers but have the clients talk to all the servers they can reach, when we trust the clients to behave correctly. Some architectures do this, like Lustre, which is why I mention it.

I see several comments here that seem to take TFA as saying that distributed consensus algorithms/protocols are not needed, but TFA does not say that. TFA says you can have consistency, availability, and partition tolerance because network partitions between servers typically don't extend to clients, and you can have enough servers to maintain quorum for all clients (if a quorum is not available it's as if the whole cloud is down, then it's not available to any clients). That is a very reasonable assertion, IMO.

By @linuxhansl - 6 months
So basically this is saying that the CAP theorem is irrelevant because a partition is not really a partition (since the load balancer can still reach everybody). Hmm...

I agree that in modern data centers the CAP theorem is essentially irrelevant for intra-DC services, due to the uptime and redundancy of networking H/W (making a partition less likely than other systemic failures).

Across DCs I'll claim it is still absolutely relevant.

By @hot_gril - 7 months
The only concrete solution the article proposes that I can think of: Spanner uses quorum to maintain availability and consistency. Your "master" is TrueTime, which is considered reliable enough. You have replicated app backends. If this isn't too generous, let's also say the cloud handles load balancing well enough. CAP isn't violated, but you might say the user no longer worries about it.

Most databases don't work like Spanner, and Spanner has its downsides, two of them being cost and performance. So most of the time, you're using a traditional DB with maybe a RW replica, which will sacrifice significant consistency or availability depending on whether you choose sync or async mode. And you're back to worrying about CAP.
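The sync/async trade-off described above can be sketched in a few lines. This is a toy model of a primary with one replica (the function and its flags are made up for illustration, not a real replication API):

```python
def write(value, replica_up: bool, sync: bool, primary: list, replica: list) -> bool:
    """Return True if the write is acknowledged to the client."""
    if sync:
        if not replica_up:
            return False      # sync mode: replica down => writes unavailable (CP-ish)
        replica.append(value)
    elif replica_up:
        replica.append(value)  # async mode: replicate on a best-effort basis
    primary.append(value)
    return True                # async mode acks even if the replica missed it (AP-ish)

p, r = [], []
assert write("x", replica_up=False, sync=True, primary=p, replica=r) is False
assert p == [] and r == []     # sync sacrificed availability
assert write("x", replica_up=False, sync=False, primary=p, replica=r) is True
assert p == ["x"] and r == []  # async sacrificed consistency: replicas now disagree
```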

By @skywhopper - 7 months
Weird article. Different users have different priorities and that’s what the CAP theorem expresses. The article also pretends that there’s a magic “load balancer” in the cloud that always works and also knows which segment of a partitioned network is the “correct” one (one of the points of CAP is that there’s not necessarily a “correct” side), and that no users will ever be on the “wrong” side of the partition. And not only that but all replicas see the exact same network partition. None of this is reality.

But the gist, I guess, is that for most applications it’s not actually that important, and that’s probably true. But when it is important, “the cloud” is not going to save you.

By @jorblumesea - 7 months
CAP was never designed as an end-all template you blindly apply to large-scale systems. Think of it more as a mental starting point: systems have these trade-offs you need to consider. Each system you integrate has complex and nuanced requirements that don't fall neatly into clean buckets.

As always Kleppmann has a great and deep answer for this.

https://martin.kleppmann.com/2015/05/11/please-stop-calling-...

By @bjornsing - 7 months
I suspect the CAP theorem factored into the design of these cloud architectures, in such a way that it now seems irrelevant. But it probably was relevant in preventing a lot of other more complex designs.
By @KaiserPro - 7 months
I kinda see what the author is getting at, but I don't buy the argument.

However, in the example with the network partition, it relies on proper monitoring to work out whether the DB it's attached to is currently partitioned.

Managing reads is a piece of piss, mostly. It's when you need to propagate writes to the rest of the DB system that stuff gets hairy.

Now, most places can run from a single DB, especially as disks are fucking fast now, so CAP is never really that much of a problem. However, when you go multi-region, that's when it gets interesting.

By @thayne - 6 months
This only addresses one kind of partition.

What if your servers can't talk to each other, but clients can?

What if clients can't connect to any of your servers?

What if there are multiple partitions, and none of them has a quorum?

Also, changing the routing isn't instantaneous, so you will have some period of unavailability between when the partition happens, and when the client is redirected to the partition with the quorum.

By @ibash - 7 months
> if a quorum of replicas is available to the client, they can still get both strong consistency, and uncompromised availability.

Then it’s not a partition.

By @api - 7 months
This is just saying because the cloud system hides the implications of the theorem from you, it's not relevant.

I suppose it's kinda true in the sense that how to operate a power plant is not relevant when I turn on my lights.

By @remram - 6 months
This article assumes P(artitions) don't happen, and then concludes you can have both C and A. Congrats, that's the CAP theorem.
By @hinkley - 7 months
> The formalized CAP theorem would call this system unavailable, based on their definition of availability:

Umm, no? That's a picture of a partition. The partition is not able to make progress because the system is not partition tolerant. If it did, it wouldn't be consistent. It's still available.

By @throw0101c - 7 months
See also:

> In database theory, the PACELC theorem is an extension to the CAP theorem. It states that in case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C) (as per the CAP theorem), but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and loss of consistency (C).

* https://en.wikipedia.org/wiki/PACELC_theorem
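As a toy restatement, the PACELC decision tree fits in one function (the flag names are made up for illustration):

```python
def pacelc_choice(partitioned: bool, favor_availability: bool, favor_latency: bool) -> str:
    """Return which property the system preserves under PACELC."""
    if partitioned:
        # P: choose availability or consistency (the CAP theorem's trade-off)
        return "A" if favor_availability else "C"
    # E(lse): even with no partition, choose latency or consistency
    return "L" if favor_latency else "C"

assert pacelc_choice(partitioned=True, favor_availability=True, favor_latency=True) == "A"
assert pacelc_choice(partitioned=False, favor_availability=True, favor_latency=True) == "L"
assert pacelc_choice(partitioned=False, favor_availability=True, favor_latency=False) == "C"
```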

By @kwillets - 7 months
This seems like what I've noticed on MPP systems (a little before cloud): data replicas give a lot more availability than the number of partition events would suggest.

I likely need to read the paper linked, but it's common to have an MPP database lose a node but maintain data availability. CAP applies at various levels, but the notion of availability differs:

1. all nodes available
2. all data available

Redundancy can make #2 a lot more common than #1.

By @sir-dingleberry - 7 months
The CAP theorem is irrelevant if your acceptable response time is greater than the time it takes your partitions to sync.

At that point you get all three: consistency, availability, and partition tolerance.

In my opinion it should be the CAPR theorem.

By @motbus3 - 7 months
If you don't care about costs...
By @mcbrit - 6 months
All models are wrong, some are useful. CAP is probably at least as useful as Newtonian mechanics WHEN you are explaining why you just did a bunch of… extra stuff.

I would like to violate CAP, please. I would like to be nearish to c, please.

Here is my passport. I have done the work.

By @jumploops - 7 months
> The point of this post isn’t merely to be the ten billionth blog post on the CAP theorem. It’s to issue a challenge. A request. Please, if you’re an experienced distributed systems person who’s teaching some new folks about trade-offs in your space, don’t start with CAP.

Yeah… no. Just because the cloud offers primitives that allow you to skip many of the challenges that the CAP theorem outlines, doesn’t mean it’s not a critical step to learning about and building novel distributed systems.

I think the author is confusing systems practitioners with distributed systems researchers.

I agree in some part, the former rarely needs to think about CAP for the majority of B2B cloud SaaS. For the latter, it seems entirely incorrect to skip CAP theorem fundamentals in one’s education.

tl;dr — just because Kubernetes (et al.) make building distributed systems easier, it doesn’t mean you should avoid the CAP theorem in teaching or disregard it altogether.

By @hot_gril - 7 months
Every time someone tries to deprecate the nice and simple CAP theorem, it grows stronger. It's an unstoppable freight train at this point, like the concept of relational DBs after the NoSQL fad.