The CAP Theorem Is Irrelevant for Cloud Systems
Marc Brooker argues that the CAP theorem is less relevant for cloud-based applications, emphasizing that engineers should focus on practical trade-offs like durability versus latency and consistency versus throughput.
The CAP theorem, which addresses trade-offs in distributed systems, is often considered foundational for engineers. However, Marc Brooker argues that it is largely irrelevant for those developing cloud-based applications. While CAP is more applicable to intermittently connected systems like IoT and mobile applications, cloud architectures typically manage network partitions effectively through redundancy and routing mechanisms. This allows for strong consistency and high availability, even during failures. Brooker emphasizes that the real challenges for cloud engineers lie in other trade-offs, such as durability versus latency and consistency versus throughput, which are more critical than the CAP theorem. He suggests that educators should focus on these practical trade-offs rather than starting with CAP when teaching new engineers. The post concludes with a call to relegate the CAP theorem to a lesser status in discussions about distributed systems, advocating for a shift towards more relevant and practical considerations in the field.
Related
Are rainy days ahead for cloud computing?
Some companies are moving away from cloud computing due to cost and security concerns, opting for shared data centers instead. Despite this trend, cloud computing remains significant for global presence and innovation.
DevOps: The Funeral
The article explores the evolution of DevOps, emphasizing reproducibility in system administration. It critiques mislabeling cloud sysadmins as DevOps practitioners and questions the industry's shift toward new approaches like Platform Engineering. It warns against neglecting automation and reproducibility principles.
Are rainy days ahead for cloud computing?
Some companies are moving away from cloud computing due to cost concerns. Cloud repatriation trend emerges citing security, costs, and performance issues. Debate continues on cloud's suitability, despite its industry significance.
Are rainy days ahead for cloud computing?
Some companies are moving away from cloud computing due to cost and other concerns. 37signals saved $1m by hosting data in a shared center. Businesses are reevaluating cloud strategies for cost-effective solutions.
On Building Systems That Will Fail (1991)
The Turing Lecture Paper by Fernando J. Corbató discusses the inevitability of failures in ambitious systems, citing examples and challenges in handling mistakes. It highlights the impact of continuous change in the computer field.
- Many argue that the CAP theorem remains crucial for understanding trade-offs in distributed systems, despite cloud providers offering solutions that may obscure these complexities.
- Commenters emphasize the importance of designing systems to handle network partitions and the potential consequences of ignoring CAP principles.
- There is skepticism about the notion that cloud systems eliminate the need for careful consideration of consistency, availability, and partition tolerance.
- Some highlight real-world experiences where neglecting CAP led to significant issues, reinforcing the theorem's relevance.
- Others suggest that while cloud technologies may mitigate some challenges, they do not eliminate the fundamental trade-offs outlined by the CAP theorem.
A bank: No! If region A goes down, do not process updates in B until A is back up! We’d rather be down than wrong!
A web forum: Yes! We can reconcile later when A comes back up. Until then keep serving traffic!
CAP theorem doesn’t let you treat the cloud as a magic infinite availability box. You still have to design your system to pick the appropriate behavior when something breaks. No one without deep insight into your business needs can decide for you, either. You’re on the hook for choosing.
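The bank-versus-forum split above is really a one-bit policy decision that only the business can make. A minimal sketch of that choice (the policy names and return strings are illustrative, not from the original post):

```python
from enum import Enum

class PartitionPolicy(Enum):
    REFUSE_WRITES = "refuse"          # the bank: prefer consistency (CP)
    ACCEPT_AND_RECONCILE = "accept"   # the forum: prefer availability (AP)

def handle_write(partitioned: bool, policy: PartitionPolicy) -> str:
    """Decide what happens to a write when a partition is (or isn't) detected."""
    if not partitioned:
        return "committed"
    if policy is PartitionPolicy.REFUSE_WRITES:
        # "We'd rather be down than wrong!"
        return "rejected: waiting for partition to heal"
    # "Keep serving traffic, reconcile later!"
    return "accepted: queued for reconciliation"

# The same network event, two different business answers:
print(handle_write(True, PartitionPolicy.REFUSE_WRITES))
print(handle_write(True, PartitionPolicy.ACCEPT_AND_RECONCILE))
```

No cloud primitive can pick the branch for you; it only changes how often the `partitioned=True` path runs.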
> DNS, multi-cast, or some other mechanism directs them towards a healthy load balancer on the healthy side of the partition
Incidentally, that's where CAP makes its appearance and bites your ass.
No amount of VRRP or UCARP wishful thinking can guarantee agreement on which partition is "correct" in the presence of a network partition between load-balancer nodes.
Also, who determines where to point the DNS? A single point of failure VPS? Or perhaps a group of distributed machines voting? Yeah.
You still need to perform the analysis. It's just that some cloud providers offer the distributed voting clusters as a service and take care of the DNS and load balancer switchover for you.
And that's still not enough, because you might not want to allow stragglers to write to orphaned databases before network fencing kicks in.
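The straggler-writing-to-an-orphaned-database hazard is commonly mitigated with fencing tokens: the lock service hands out a monotonically increasing token on each leadership change, and storage rejects writes carrying a stale one. A toy illustration (class and method names are hypothetical):

```python
class FencedStore:
    """Storage that refuses writes carrying a stale fencing token.

    A straggler that paused mid-write wakes up holding an old token
    and is fenced off, instead of clobbering the new leader's data.
    """
    def __init__(self) -> None:
        self.highest_token = 0
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False                  # stale leader: rejected
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
store.write(33, "x", "from old leader")
store.write(34, "x", "from new leader")   # leadership changed, higher token
ok = store.write(33, "x", "straggler wakes up")
print(ok, store.data["x"])                # False from new leader
```

The point being: "the cloud handles partitions" still leaves you to wire up mechanisms like this, or to pick a managed service that does.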
It was a very old version of ES, and the specific behavior that led to the problem has been fixed for a long time now. But still, the fact that something like this can happen in a cloud deployment demonstrates that this article's advice rests on an egregiously simplistic perspective on the possible failure modes of distributed systems.
In particular, the major premise that intermittent connectivity is only a problem on internetworks is just plain wrong. Hubs and switches flake out. Loose wires get jiggled. Subnetworks get congested.
And if you're on the cloud, nobody even tries to pretend that they'll tell you when server and equipment maintenance is going to happen.
CAP or no CAP, chaos will reign.
I think FLP (https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf) is a better way to think about systems.
I think CAP is not as relevant in the cloud because the complexity is so high that nobody even knows what is going on, so just the C part, regardless of the other letters, is ridiculously difficult even on a single computer. A book could be written just to explain write(2)'s surprise attacks.
So you think you have whatever guarantees the designers said (AP or CP), and yet… the impossible will happen twice a day (and three times at night, when it's your on-call).
Just sprinkle the magic "cloud" powder on your system and ignore all the theory.
https://ferd.ca/beating-the-cap-theorem-checklist.html
Let's see, let's pick some checkboxes.
(x) you pushed the actual problem to another layer of the system
(x) you're actually building an AP system
I won't plagiarize his text, instead the chapter references his blogpost, "Please stop calling databases CP or AP": https://martin.kleppmann.com/2015/05/11/please-stop-calling-...
(*): rebuttal I think is the wrong word, but I couldn't think of better.
The author seems not to understand the meaning of the P in CAP.
Sure. Also, there's a long list of other things that are probably irrelevant to you. That is, until your provider fails and you need to understand the situation in order to provide a workaround.
And slapping "load balancers" everywhere on your diagram is not really a solution, because load balancers are themselves a distributed system with state, and subject to CAP, as shown in that same diagram.
> DNS, multi-cast, or some other mechanism directs them towards a healthy load balancer on the healthy side of the partition.
"Somehow, something somewhere will fix my shit hopefully". Also, as a sidenote, a few friends would angrily shake their "it's always DNS" cup reading this.
edit: reading the rest of the blog and author's bio, I'm unsure whether the author is genuinely mistaken, or whether they're advertising their employer's product.
What a convenient world where the client is not affected by the network partition.
[0]: https://en.wikipedia.org/wiki/Somebody_else's_problem#Dougla...
In other words: you can have CAP as long as you can communicate across "partitions".
It's simply the formalization of a fact, and whether or not that fact is *important* (although still a fact) depends on the actual use case. Hell, it applies even to services within the same memory space, although obviously the probability of losing any of the three is orders of magnitude less than on a network.
Can we please move on?
One of the fundamental assumptions of the CAP theorem is that you can't tell whether or not you have a partition. If you have an oracle that can instantaneously tell you the state of every subsystem, then yeah, CAP is pointless.
But if one of your DBs is connected, reporting itself as alive, and throwing all its writes into /dev/null, you won't be able to route traffic to a quorum of healthy instances because it's not possible to be certain that they're all healthy.
This is what CAP theorem is about: managing data in a distributed system where the status of any given system is fundamentally unknowable because of the Two Generals' Problem (https://en.wikipedia.org/wiki/Two_Generals'_Problem)
In many cases in Cloud though, we can skip that technical stuff and design systems as if we really _did_ have an oracle that could instantaneously and perfectly tell us the state of the system, and things will typically work fine.
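The quorum idea the comments above lean on can be stated in a few lines: no oracle is required, because any two majorities of the same replica set must intersect, so a later quorum read is guaranteed to touch at least one replica that saw the write. A hedged sketch, not any particular database's implementation:

```python
def quorum_write(acks: int, total_nodes: int) -> bool:
    """A write succeeds only if a strict majority of replicas acknowledge it.

    Any two strict majorities of the same set overlap in at least one
    node, which is what makes quorum reads see committed writes without
    anyone knowing the global state.
    """
    return acks > total_nodes // 2

# 5 replicas; a partition leaves 3 reachable: still a majority, proceed.
print(quorum_write(acks=3, total_nodes=5))   # True
# Only 2 reachable: the minority side must refuse, or risk split-brain.
print(quorum_write(acks=2, total_nodes=5))   # False
```

Note that exactly half (e.g. 2 of 4) is not a majority; even splits must also refuse.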
Unfortunately the point is lost because of the usage of the word "cloud", a somewhat contrived example of solving problems by reconfiguring load balancers (in the real world, certain outages might not let you reconfigure!), and a lack of empathy: you can't tell people not to care about the semantics that thinking, or not thinking, about availability imposes on the correctness of their applications.
As for the usage of the word cloud: I don't know when a set of machines becomes a cloud. Is it the APIs for management? Or when you have two or more implementations of consensus running on the set of machines?
Or he's saying you don't need Consistency because your system isn't actually distributed; it's just a centralized system with hot backups.
It's unclear what he's trying to say.
No idea why he wrote the blog post. It doesn't increase my confidence in the engineering quality of his employer, AWS.
This is the key, that network partitions either keep some clients from accessing any servers, or they keep some servers from talking to each other. The former case is uninteresting because nothing can be done server-side about it. The latter is interesting and we can fix it with load balancers.
This conflicts with the picture painted earlier in TFA where the unhappy client is somehow stuck with the unhappy server, but let's consider that just didactic.
We can also not use load balancers but have the clients talk to all the servers they can reach, when we trust the clients to behave correctly. Some architectures do this, like Lustre, which is why I mention it.
I see several comments here that seem to take TFA as saying that distributed consensus algorithms/protocols are not needed, but TFA does not say that. TFA says you can have consistency, availability, and partition tolerance because network partitions between servers typically don't extend to clients, and you can have enough servers to maintain quorum for all clients (if a quorum is not available it's as if the whole cloud is down, then it's not available to any clients). That is a very reasonable assertion, IMO.
I agree that in modern data centers the CAP theorem is essentially irrelevant for intra-DC services, due to the uptime and redundancy of networking H/W (making a partition less likely than other systemic failures).
Across DCs I'll claim it is still absolutely relevant.
Most databases don't work like Spanner, and Spanner has its downsides, two of them being cost and performance. So most of the time, you're using a traditional DB with maybe a RW replica, which will sacrifice significant consistency or availability depending on whether you choose sync or async mode. And you're back to worrying about CAP.
But the gist, I guess, is that for most applications it’s not actually that important, and that’s probably true. But when it is important, “the cloud” is not going to save you.
As always Kleppmann has a great and deep answer for this.
https://martin.kleppmann.com/2015/05/11/please-stop-calling-...
However, in the example with the network partition, it relies on proper monitoring to work out whether the DB it's attached to is currently partitioned.
Managing reads is a piece of piss, mostly. It's when you need to propagate writes to the rest of the DB system that stuff gets hairy.
Now, most places can run from a single DB, especially as disks are fucking fast now, so CAP is never really that much of a problem. However, when you go multi-region, that's when it gets interesting.
What if your servers can't talk to each other, but clients can?
What if clients can't connect to any of your servers?
What if there are multiple partitions, and none of them has a quorum?
Also, changing the routing isn't instantaneous, so you will have some period of unavailability between when the partition happens, and when the client is redirected to the partition with the quorum.
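That non-instantaneous window can be put in rough numbers: the client keeps hitting the wrong side until the failure is detected, its cached DNS answer expires, and the new record propagates. A back-of-the-envelope sketch (all three figures below are made-up assumptions):

```python
def failover_window(detect_s: float, dns_ttl_s: float, propagation_s: float) -> float:
    """Worst-case seconds a client may be routed to the unhealthy side:
    time to detect the partition, plus the DNS TTL (clients cache the
    old answer), plus time for the new record to propagate."""
    return detect_s + dns_ttl_s + propagation_s

# Health checks every 30s, a 60s TTL, ~10s propagation: up to ~100s of
# unavailability for some clients, even though "the cloud" handled it.
print(failover_window(30, 60, 10))  # 100
```

During that window the system is, for those clients, choosing unavailability; CAP didn't go anywhere.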
Then it’s not a partition.
I suppose it's kinda true in the sense that how to operate a power plant is not relevant when I turn on my lights.
Umm, no? That’s a picture of a partition. The partition is not able to make progress because the system is not partition tolerant. If it did it wouldn’t be consistent. It’s still available.
> In database theory, the PACELC theorem is an extension to the CAP theorem. It states that in case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C) (as per the CAP theorem), but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and loss of consistency (C).
I likely need to read the paper linked, but it's common to have an MPP database lose a node but maintain data availability. CAP applies at various levels, but the notion of availability differs:
1. all nodes available
2. all data available
Redundancy can make #2 a lot more common than #1.
At that point you get all three: consistency, availability, partition tolerance.
In my opinion it should be the CAPR theorem.
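The gap between "all nodes up" and "all data reachable" is easy to quantify under an independent-failures assumption. A quick sketch (the 99% per-node availability and 3-way replication figures are hypothetical):

```python
def p_all_nodes_up(p: float, n: int) -> float:
    """Probability that every one of n independent nodes is up (sense #1)."""
    return p ** n

def p_replica_set_up(p: float, r: int) -> float:
    """Probability that at least one of r independent replicas is up (sense #2)."""
    return 1 - (1 - p) ** r

# 10 nodes at 99% availability each, data replicated 3 ways:
print(round(p_all_nodes_up(0.99, 10), 4))   # 0.9044 -- sense #1 is fragile
print(round(p_replica_set_up(0.99, 3), 6))  # 0.999999 -- sense #2 survives
```

Which is the commenter's point: redundancy makes data availability (#2) far more common than full-cluster availability (#1), at the cost of now having a replication protocol to reason about.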
I would like to violate CAP, please. I would like to be nearish to c, please.
Here is my passport. I have done the work.
Yeah… no. Just because the cloud offers primitives that allow you to skip many of the challenges that the CAP theorem outlines, doesn’t mean it’s not a critical step to learning about and building novel distributed systems.
I think the author is confusing systems practitioners with distributed systems researchers.
I agree in some part, the former rarely needs to think about CAP for the majority of B2B cloud SaaS. For the latter, it seems entirely incorrect to skip CAP theorem fundamentals in one’s education.
tl;dr — just because Kubernetes (et al.) make building distributed systems easier, it doesn’t mean you should avoid the CAP theorem in teaching or disregard it altogether.