The CAP Theorem Is Irrelevant for Cloud Systems
Marc Brooker argues that the CAP theorem is less relevant for cloud-based applications, emphasizing that engineers should focus on practical trade-offs like durability versus latency and consistency versus throughput.
The CAP theorem, which addresses trade-offs in distributed systems, is often considered foundational for engineers. However, Marc Brooker argues that it is largely irrelevant for those developing cloud-based applications. While CAP is more applicable to intermittently connected systems like IoT and mobile applications, cloud architectures typically manage network partitions effectively through redundancy and routing mechanisms. This allows for strong consistency and high availability, even during failures. Brooker emphasizes that the real challenges for cloud engineers lie in other trade-offs, such as durability versus latency and consistency versus throughput, which are more critical than the CAP theorem. He suggests that educators should focus on these practical trade-offs rather than starting with CAP when teaching new engineers. The post concludes with a call to relegate the CAP theorem to a lesser status in discussions about distributed systems, advocating for a shift towards more relevant and practical considerations in the field.
Related
Are rainy days ahead for cloud computing?
Some companies are moving away from cloud computing due to cost and security concerns, opting for shared data centers instead. Despite this trend, cloud computing remains significant for global presence and innovation.
DevOps: The Funeral
The article explores the evolution of DevOps, emphasizing reproducibility in system administration. It critiques mislabeling cloud sysadmins as DevOps practitioners and questions the industry's shift toward new approaches like Platform Engineering. It warns against neglecting automation and reproducibility principles.
Are rainy days ahead for cloud computing?
Some companies are moving away from cloud computing due to cost concerns. Cloud repatriation trend emerges citing security, costs, and performance issues. Debate continues on cloud's suitability, despite its industry significance.
Are rainy days ahead for cloud computing?
Some companies are moving away from cloud computing due to cost and other concerns. 37signals saved $1m by hosting data in a shared center. Businesses are reevaluating cloud strategies for cost-effective solutions.
On Building Systems That Will Fail (1991)
The Turing Lecture Paper by Fernando J. Corbató discusses the inevitability of failures in ambitious systems, citing examples and challenges in handling mistakes. It highlights the impact of continuous change in the computer field.
- Many argue that the CAP theorem remains crucial for understanding trade-offs in distributed systems, despite cloud providers offering solutions that may obscure these complexities.
- Commenters emphasize the importance of designing systems to handle network partitions and the potential consequences of ignoring CAP principles.
- There is skepticism about the notion that cloud systems eliminate the need for careful consideration of consistency, availability, and partition tolerance.
- Some highlight real-world experiences where neglecting CAP led to significant issues, reinforcing the theorem's relevance.
- Others suggest that while cloud technologies may mitigate some challenges, they do not eliminate the fundamental trade-offs outlined by the CAP theorem.
A bank: No! If region A goes down, do not process updates in B until A is back up! We’d rather be down than wrong!
A web forum: Yes! We can reconcile later when A comes back up. Until then keep serving traffic!
CAP theorem doesn’t let you treat the cloud as a magic infinite availability box. You still have to design your system to pick the appropriate behavior when something breaks. No one without deep insight into your business needs can decide for you, either. You’re on the hook for choosing.
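The bank-versus-forum split above is really a one-bit policy decision that only the business can make. A minimal sketch of that choice (the policy names and return strings are illustrative, not from the original post):

```python
from enum import Enum

class PartitionPolicy(Enum):
    REFUSE_WRITES = "refuse"          # the bank: prefer consistency (CP)
    ACCEPT_AND_RECONCILE = "accept"   # the forum: prefer availability (AP)

def handle_write(partitioned: bool, policy: PartitionPolicy) -> str:
    """Decide what happens to a write when a partition is (or isn't) detected."""
    if not partitioned:
        return "committed"
    if policy is PartitionPolicy.REFUSE_WRITES:
        # "We'd rather be down than wrong!"
        return "rejected: waiting for partition to heal"
    # "Keep serving traffic, reconcile later!"
    return "accepted: queued for reconciliation"

# The same network event, two different business answers:
print(handle_write(True, PartitionPolicy.REFUSE_WRITES))
print(handle_write(True, PartitionPolicy.ACCEPT_AND_RECONCILE))
```

No cloud primitive can pick the branch for you; it only changes how often the `partitioned=True` path runs.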
> DNS, multi-cast, or some other mechanism directs them towards a healthy load balancer on the healthy side of the partition
Incidentally, that's where CAP makes its appearance and bites your ass.
No amount of VRRP or UCARP wishful thinking can guarantee agreement on which partition is "correct" in the presence of a network partition between load-balancer nodes.
Also, who determines where to point the DNS? A single point of failure VPS? Or perhaps a group of distributed machines voting? Yeah.
You still need to perform the analysis. It's just that some cloud providers offer the distributed voting clusters as a service and take care of the DNS and load balancer switchover for you.
And that's still not enough, because you might not want to allow stragglers to write to orphaned databases before network fencing kicks in.
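The straggler-writing-to-an-orphaned-database hazard is commonly mitigated with fencing tokens: the lock service hands out a monotonically increasing token on each leadership change, and storage rejects writes carrying a stale one. A toy illustration (class and method names are hypothetical):

```python
class FencedStore:
    """Storage that refuses writes carrying a stale fencing token.

    A straggler that paused mid-write wakes up holding an old token
    and is fenced off, instead of clobbering the new leader's data.
    """
    def __init__(self) -> None:
        self.highest_token = 0
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False                  # stale leader: rejected
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
store.write(33, "x", "from old leader")
store.write(34, "x", "from new leader")   # leadership changed, higher token
ok = store.write(33, "x", "straggler wakes up")
print(ok, store.data["x"])                # False from new leader
```

The point being: "the cloud handles partitions" still leaves you to wire up mechanisms like this, or to pick a managed service that does.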
It was a very old version of ES, and the specific behavior that led to the problem has been fixed for a long time now. But still, the fact that something like this can happen in a cloud deployment demonstrates that this article's advice rests on an egregiously simplistic perspective on the possible failure modes of distributed systems.
In particular, the major premise that intermittent connectivity is only a problem on internetworks is just plain wrong. Hubs and switches flake out. Loose wires get jiggled. Subnetworks get congested.
And if you're on the cloud, nobody even tries to pretend that they'll tell you when server and equipment maintenance is going to happen.
CAP or no CAP, chaos will reign.
I think FLP (https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf) is a better way to think about systems.
I think CAP is not as relevant in the cloud because the complexity is so high that nobody even knows what is going on, so just the C part, regardless of the other letters, is ridiculously difficult even on a single computer. A book could be written just to explain write(2)'s surprise attacks.
So you think you have whatever guarantees the designers said (AP or CP), and yet… the impossible will happen twice a day (and three times at night, when it's your on-call).
Just sprinkle the magic "cloud" powder on your system and ignore all the theory.
https://ferd.ca/beating-the-cap-theorem-checklist.html
Let's see, let's pick some checkboxes.
(x) you pushed the actual problem to another layer of the system
(x) you're actually building an AP system
I won't plagiarize his text, instead the chapter references his blogpost, "Please stop calling databases CP or AP": https://martin.kleppmann.com/2015/05/11/please-stop-calling-...
(*): rebuttal I think is the wrong word, but I couldn't think of better.
The author seems not to understand the meaning of the P in CAP.
Sure. Also, there's a long list of other things that are probably irrelevant to you. That is, until your provider fails and you need to understand the situation in order to provide a workaround.
And slapping "load balancers" everywhere on your diagram is not really a solution, because load balancers are themselves a distributed system with state, and subject to CAP, as shown in that same diagram.
> DNS, multi-cast, or some other mechanism directs them towards a healthy load balancer on the healthy side of the partition.
"Somehow, something somewhere will fix my shit hopefully". Also, as a sidenote, a few friends would angrily shake their "it's always DNS" cup reading this.
edit: reading the rest of the blog and author's bio, I'm unsure whether the author is genuinely mistaken, or whether they're advertising their employer's product.
What a convenient world where the client is not affected by the network partition.
[0]: https://en.wikipedia.org/wiki/Somebody_else's_problem#Dougla...
In other words: you can have CAP as long as you can communicate across "partitions".
It's simply the formalization of a fact, and whether or not that fact is *important* (although still a fact) depends on the actual use case. Hell, it applies even to services within the same memory space, although obviously the probability of losing any of the three is orders of magnitude less than on a network.
Can we please move on?
One of the fundamental assumptions of the CAP theorem is that you can't tell whether or not you have a partition. If you have an oracle that can instantaneously tell you the state of every subsystem, then yeah, CAP is pointless.
But if one of your DBs is connected, reporting itself as alive, and throwing all its writes into /dev/null, you won't be able to route traffic to a quorum of healthy instances because it's not possible to be certain that they're all healthy.
This is what CAP theorem is about: managing data in a distributed system where the status of any given system is fundamentally unknowable because of the Two Generals' Problem (https://en.wikipedia.org/wiki/Two_Generals'_Problem)
In many cases in Cloud though, we can skip that technical stuff and design systems as if we really _did_ have an oracle that could instantaneously and perfectly tell us the state of the system, and things will typically work fine.
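The quorum idea the comments above lean on can be stated in a few lines: no oracle is required, because any two majorities of the same replica set must intersect, so a later quorum read is guaranteed to touch at least one replica that saw the write. A hedged sketch, not any particular database's implementation:

```python
def quorum_write(acks: int, total_nodes: int) -> bool:
    """A write succeeds only if a strict majority of replicas acknowledge it.

    Any two strict majorities of the same set overlap in at least one
    node, which is what makes quorum reads see committed writes without
    anyone knowing the global state.
    """
    return acks > total_nodes // 2

# 5 replicas; a partition leaves 3 reachable: still a majority, proceed.
print(quorum_write(acks=3, total_nodes=5))   # True
# Only 2 reachable: the minority side must refuse, or risk split-brain.
print(quorum_write(acks=2, total_nodes=5))   # False
```

Note that exactly half (e.g. 2 of 4) is not a majority; even splits must also refuse.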
Unfortunately the point is lost because of the usage of the word "cloud", a somewhat contrived example of solving problems by reconfiguring load balancers (in the real world, certain outages might not let you reconfigure!), and a lack of empathy: you can't tell people not to care about the semantics that thinking, or not thinking, about availability imposes on the correctness of their applications.
As for the usage of the word cloud: I don't know when a set of machines becomes a cloud. Is it the APIs for management? Or when you have two or more implementations of consensus running on the set of machines?
Or he's saying you don't need Consistency because your system isn't actually distributed; it's just a centralized system with hot backups.
It's unclear what he's trying to say.
No idea why he wrote the blog post. It doesn't increase my confidence in the engineering quality of his employer, AWS.
This is the key, that network partitions either keep some clients from accessing any servers, or they keep some servers from talking to each other. The former case is uninteresting because nothing can be done server-side about it. The latter is interesting and we can fix it with load balancers.
This conflicts with the picture painted earlier in TFA where the unhappy client is somehow stuck with the unhappy server, but let's consider that just didactic.
We can also not use load balancers but have the clients talk to all the servers they can reach, when we trust the clients to behave correctly. Some architectures do this, like Lustre, which is why I mention it.
I see several comments here that seem to take TFA as saying that distributed consensus algorithms/protocols are not needed, but TFA does not say that. TFA says you can have consistency, availability, and partition tolerance because network partitions between servers typically don't extend to clients, and you can have enough servers to maintain quorum for all clients (if a quorum is not available it's as if the whole cloud is down, then it's not available to any clients). That is a very reasonable assertion, IMO.
I agree that in modern data centers the CAP theorem is essentially irrelevant for intra-DC services, due to the uptime and redundancy of networking H/W (making a partition less likely than other systemic failures).
Across DCs I'll claim it is still absolutely relevant.
Most databases don't work like Spanner, and Spanner has its downsides, two of them being cost and performance. So most of the time, you're using a traditional DB with maybe a RW replica, which will sacrifice significant consistency or availability depending on whether you choose sync or async mode. And you're back to worrying about CAP.
But the gist, I guess, is that for most applications it’s not actually that important, and that’s probably true. But when it is important, “the cloud” is not going to save you.
As always Kleppmann has a great and deep answer for this.
https://martin.kleppmann.com/2015/05/11/please-stop-calling-...
However, in the example with the network partition, it relies on proper monitoring to work out whether the DB it's attached to is currently partitioned.
Managing reads is a piece of piss, mostly. It's when you need to propagate writes to the rest of the DB system that stuff gets hairy.
Now, most places can run from a single DB, especially as disks are fucking fast now, so CAP is never really that much of a problem. However, when you go multi-region, that's when it gets interesting.
What if your servers can't talk to each other, but clients can?
What if clients can't connect to any of your servers?
What if there are multiple partitions, and none of them has a quorum?
Also, changing the routing isn't instantaneous, so you will have some period of unavailability between when the partition happens, and when the client is redirected to the partition with the quorum.
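That non-instantaneous window can be put in rough numbers: the client keeps hitting the wrong side until the failure is detected, its cached DNS answer expires, and the new record propagates. A back-of-the-envelope sketch (all three figures below are made-up assumptions):

```python
def failover_window(detect_s: float, dns_ttl_s: float, propagation_s: float) -> float:
    """Worst-case seconds a client may be routed to the unhealthy side:
    time to detect the partition, plus the DNS TTL (clients cache the
    old answer), plus time for the new record to propagate."""
    return detect_s + dns_ttl_s + propagation_s

# Health checks every 30s, a 60s TTL, ~10s propagation: up to ~100s of
# unavailability for some clients, even though "the cloud" handled it.
print(failover_window(30, 60, 10))  # 100
```

During that window the system is, for those clients, choosing unavailability; CAP didn't go anywhere.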
Then it’s not a partition.
I suppose it's kinda true in the sense that how to operate a power plant is not relevant when I turn on my lights.
Umm, no? That’s a picture of a partition. The partition is not able to make progress because the system is not partition tolerant. If it did it wouldn’t be consistent. It’s still available.
> In database theory, the PACELC theorem is an extension to the CAP theorem. It states that in case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C) (as per the CAP theorem), but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and loss of consistency (C).
I likely need to read the paper linked, but it's common to have an MPP database lose a node but maintain data availability. CAP applies at various levels, but the notion of availability differs:
1. all nodes available
2. all data available
Redundancy can make #2 a lot more common than #1.
At that point you get all three: consistency, availability, partition tolerance.
In my opinion it should be the CAPR theorem.
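The gap between "all nodes up" and "all data reachable" is easy to quantify under an independent-failures assumption. A quick sketch (the 99% per-node availability and 3-way replication figures are hypothetical):

```python
def p_all_nodes_up(p: float, n: int) -> float:
    """Probability that every one of n independent nodes is up (sense #1)."""
    return p ** n

def p_replica_set_up(p: float, r: int) -> float:
    """Probability that at least one of r independent replicas is up (sense #2)."""
    return 1 - (1 - p) ** r

# 10 nodes at 99% availability each, data replicated 3 ways:
print(round(p_all_nodes_up(0.99, 10), 4))   # 0.9044 -- sense #1 is fragile
print(round(p_replica_set_up(0.99, 3), 6))  # 0.999999 -- sense #2 survives
```

Which is the commenter's point: redundancy makes data availability (#2) far more common than full-cluster availability (#1), at the cost of now having a replication protocol to reason about.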
I would like to violate CAP, please. I would like to be nearish to c, please.
Here is my passport. I have done the work.
Yeah… no. Just because the cloud offers primitives that allow you to skip many of the challenges that the CAP theorem outlines, doesn’t mean it’s not a critical step to learning about and building novel distributed systems.
I think the author is confusing systems practitioners with distributed systems researchers.
I agree in some part, the former rarely needs to think about CAP for the majority of B2B cloud SaaS. For the latter, it seems entirely incorrect to skip CAP theorem fundamentals in one’s education.
tl;dr — just because Kubernetes (et al.) make building distributed systems easier, it doesn’t mean you should avoid the CAP theorem in teaching or disregard it altogether.