Continuous reinvention: A brief history of block storage at AWS
Marc Olson discusses the evolution of Amazon Web Services' Elastic Block Store (EBS) from basic storage to a system handling over 140 trillion operations daily, emphasizing the need for continuous optimization and innovation.
Marc Olson reflects on the evolution of Amazon Web Services' Elastic Block Store (EBS), which has transformed from a basic block storage service into a robust network storage system capable of handling over 140 trillion operations daily. Launched in 2008, EBS initially relied on shared hard disk drives (HDDs) and faced significant challenges, including performance limitations and issues with "noisy neighbors," where one workload negatively impacted another. The introduction of solid-state drives (SSDs) in 2012 marked a significant turning point, improving performance and reducing latency. However, the transition to SSDs revealed that other system components, such as the network and software, also required optimization to fully leverage the benefits of the new storage technology. Olson emphasizes the importance of incremental improvements and comprehensive measurement in enhancing system performance. He shares insights on queueing theory and the necessity of understanding system interactions to achieve high performance and reliability. The journey of EBS illustrates the complexities of scaling storage solutions and the continuous need for innovation to meet customer demands.
- EBS has evolved significantly since its launch in 2008, now handling over 140 trillion operations daily.
- The transition from HDDs to SSDs in 2012 greatly improved performance but highlighted the need for further system optimizations.
- Addressing "noisy neighbor" issues was crucial for maintaining a high-quality customer experience.
- Incremental improvements and thorough measurement are essential for effective system performance management.
- The evolution of EBS reflects broader challenges in scaling storage solutions within large distributed systems.
Related
Using S3 as a Container Registry
Adolfo Ochagavía discusses using Amazon S3 as a container registry, noting its speed advantages over ECR. S3's parallel layer uploads enhance performance, despite lacking standard registry features. The unconventional approach offers optimization potential.
The end of the Everything Cloud
AWS is deprecating several lesser-used services under new leadership, focusing on profitability and core offerings. This shift raises concerns about the longevity of new services and customer uncertainty.
How HashiCorp evolved its cloud infrastructure
Michael Galloway discusses HashiCorp's cloud infrastructure evolution, emphasizing the need for clear objectives, deadlines, and executive buy-in to successfully redesign and expand their services amid growing demands.
Building a highly-available web service without a database
A new architecture enables web services to use RAM as a primary data store, enhancing availability with the Raft Consensus algorithm, periodic snapshots, and sharding for efficient scaling.
AWS powered Prime Day 2024
Amazon Prime Day 2024, on July 17-18, set sales records with millions of deals. AWS infrastructure supported the event, deploying numerous AI and Graviton chips, ensuring operational readiness and security.
- Many commenters share personal experiences and challenges faced while using EBS, highlighting its inconsistent performance and the complexities of managing storage systems.
- There is a recognition of the technical innovations and lessons learned from past outages and performance issues, emphasizing the importance of continuous optimization.
- Several comments discuss the shift from traditional hardware to specialized solutions, noting the evolution of storage technology over the years.
- Some commenters express a desire for more insights into the operational challenges and business needs that drive technical decisions in cloud services.
- There is a common theme of nostalgia for early experiences in cloud computing and the learning curve associated with building scalable systems.
> Compounding this latency, hard drive performance is also variable depending on the other transactions in the queue. Smaller requests that are scattered randomly on the media take longer to find and access than several large requests that are all next to each other. This random performance led to wildly inconsistent behavior.
The effect of this can be huge! Given a reasonably sequential workload, modern magnetic drives can do >100MB/s of reads or writes. Given an entirely random 4kB workload, they can be limited to as little as 400kB/s of reads or writes. Queuing and scheduling can help avoid the truly bad end of this, but real-world performance still varies by over 100x depending on workload. That's really hard for a multi-tenant system to deal with (especially with reads, where you can't do the "just write it somewhere else" trick).
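To make the arithmetic concrete, here's a rough back-of-the-envelope sketch. The seek, rotation, and transfer numbers are assumed, ballpark figures for a 7,200 RPM drive, not measurements from the article; the point is just how quickly per-request mechanical latency dominates a small random workload.

```python
# Back-of-the-envelope sketch of why random 4 kB I/O is so much slower than
# sequential streaming on a spinning disk. All numbers are assumed ballpark
# figures for a 7,200 RPM drive, not measurements from the article.

AVG_SEEK_MS = 8.0        # assumed average seek time
AVG_ROTATIONAL_MS = 4.2  # roughly half a rotation at 7,200 RPM
REQUEST_KB = 4           # small random request size
SEQUENTIAL_MBPS = 120    # assumed streaming transfer rate

service_time_ms = AVG_SEEK_MS + AVG_ROTATIONAL_MS      # ~12.2 ms per random I/O
random_iops = 1000 / service_time_ms                   # ~82 IOPS
random_throughput_kbps = random_iops * REQUEST_KB      # ~330 kB/s

print(f"random 4 kB workload: ~{random_iops:.0f} IOPS, ~{random_throughput_kbps:.0f} kB/s")
print(f"sequential workload:  ~{SEQUENTIAL_MBPS} MB/s")
print(f"ratio: ~{SEQUENTIAL_MBPS * 1000 / random_throughput_kbps:.0f}x")
```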
> To know what to fix, we had to know what was broken, and then prioritize those fixes based on effort and rewards.
This was the biggest thing I learned from Marc in my career (so far). He'd spend time working on visualizations of latency (like the histogram time series in this post) which were much richer than any of the telemetry we had, then tell a story using those visualizations, and completely change the team's perspective on the work that needed to be done. Each peak in the histogram came with its own story and its own work to optimize. Really diving into performance data - and looking at that data in multiple ways - unlocks efficiencies and opportunities that are invisible without that work and investment.
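For anyone curious what that kind of view looks like in practice, here is a minimal sketch of turning raw (timestamp, latency) samples into a histogram time series. The bucketing and window size are arbitrary choices here, not the actual EBS telemetry pipeline; the idea is that a bimodal latency distribution shows up as two distinct peaks you can chase down separately.

```python
# Minimal sketch of building a histogram time series from latency samples.
# Bucket edges and window size are arbitrary choices, not EBS's real telemetry.
from collections import defaultdict
import math

def histogram_time_series(samples, window_s=60):
    """samples: iterable of (unix_ts, latency_ms) pairs. Returns
    {window_start: {bucket_index: count}} with log2-spaced latency buckets."""
    series = defaultdict(lambda: defaultdict(int))
    for ts, latency_ms in samples:
        window = int(ts // window_s) * window_s
        # bucket 0 is <1 ms, bucket 1 is 1-2 ms, bucket 2 is 2-4 ms, and so on
        bucket = 0 if latency_ms < 1 else int(math.log2(latency_ms)) + 1
        series[window][bucket] += 1
    return series

# Example: fast requests plus a slower mode show up as two separate peaks.
samples = [(0, 0.5)] * 500 + [(5, 9.0)] * 120 + [(61, 0.6)] * 480 + [(65, 35.0)] * 40
for window, buckets in sorted(histogram_time_series(samples).items()):
    print(window, dict(sorted(buckets.items())))
```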
> Armed with this knowledge, and a lot of human effort, over the course of a few months in 2013, EBS was able to put a single SSD into each and every one of those thousands of servers.
This retrofit project is one of my favorite AWS stories.
> The thing that made this possible is that we designed our system from the start with non-disruptive maintenance events in mind. We could retarget EBS volumes to new storage servers, and update software or rebuild the empty servers as needed.
This is a great reminder that building distributed systems isn't just for scale. Here, we see how building the system in a way that can seamlessly tolerate the failure of a server, and move data around without loss, makes large-scale operations possible (everything from day-to-day software upgrades to a massive hardware retrofit project) that just wouldn't be feasible otherwise. A "simpler" architecture would make these operations much harder, to the point of being impossible, and would ultimately make the end-to-end problem we're trying to solve for the customer harder.
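As a purely hypothetical illustration of the pattern in that quote (drain a server by retargeting its volumes to healthy peers, then rebuild the now-empty server), something like the sketch below. Every name in it is invented for illustration; the real EBS control plane is certainly far more involved.

```python
# Hypothetical sketch of a drain-and-rebuild maintenance loop. All names and
# the placement logic are invented; this is not the real EBS control plane.
from dataclasses import dataclass, field

@dataclass
class Server:
    name: str
    volumes: list = field(default_factory=list)
    healthy: bool = True

def pick_target(fleet, exclude):
    # Simplest possible placement: least-loaded healthy server that isn't us.
    candidates = [s for s in fleet if s is not exclude and s.healthy]
    return min(candidates, key=lambda s: len(s.volumes))

def drain_and_rebuild(server, fleet):
    # Move volumes one at a time so a problem stops the drain, not the customer.
    for volume in list(server.volumes):
        target = pick_target(fleet, exclude=server)
        server.volumes.remove(volume)
        target.volumes.append(volume)  # stands in for a live, non-disruptive retarget
        print(f"retargeted {volume}: {server.name} -> {target.name}")
    print(f"{server.name} is empty; safe to rebuild or retrofit hardware")

fleet = [Server("srv-a", ["vol-1", "vol-2"]), Server("srv-b", ["vol-3"]), Server("srv-c")]
drain_and_rebuild(fleet[0], fleet)
```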
At the time each volume had very inconsistent performance, so I would launch seven or eight, and then run some write and read loads against each. I'd take the five best performers and then put them into a Linux software raid.
In the good case, I got the desired effect -- I did in fact get more than 5x the IOPS of a single volume. But in the bad case, oh boy was it bad.
What I didn't realize was that with a software raid, if one volume is slow, the entire array moves at the speed of the slowest volume. So this would manifest as a database going bad. It took a while to figure out it was the RAID that was the problem. And even then, removing the bad volume was hard -- the software raid really didn't want to let go of it until it could finish writing out to it, which of course was super slow.
And then I would put in a new EBS volume and have to rebuild the array, which of course was also painfully slow because the rebuild was bottlenecked on the IOPS of the new volume.
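A toy model of that failure mode, assuming requests striped across every member so each one finishes at the pace of the slowest volume. The latencies are made-up numbers, not measurements of real EBS volumes.

```python
# Toy model: in a striped array where each I/O touches every member, the array
# completes at the pace of its slowest volume. Latencies are made-up numbers.
import random

def array_iops(member_latencies_ms, requests=10_000):
    """Each request is striped across all members and finishes when the
    slowest member finishes; returns effective requests per second."""
    total_ms = 0.0
    for _ in range(requests):
        # jitter each member's latency a bit, then take the max (the straggler)
        total_ms += max(random.gauss(l, l * 0.1) for l in member_latencies_ms)
    return requests / (total_ms / 1000.0)

healthy = [2.0, 2.1, 1.9, 2.0, 2.2]   # five well-behaved volumes (~2 ms)
one_bad = [2.0, 2.1, 1.9, 2.0, 40.0]  # the same array with one sick volume
print(f"healthy array:  ~{array_iops(healthy):.0f} IOPS")
print(f"one bad member: ~{array_iops(one_bad):.0f} IOPS")
```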
We moved off of those software raids after a while. We almost never used EBS at Netflix, in part because I would tell everyone who would listen about my folly at reddit, and because they had already standardized on using only local disk before I ever got there.
And an amusing side note, when AWS had that massive EBS outage, I still worked at reddit and I was actually watching Netflix while I was waiting for EBS to come back so I could fix all the databases. When I interviewed at Netflix one of the questions I asked them was "how were you still up during the EBS outage?", and they said, "Oh, we just don't use EBS".
One interesting tidbit is that during the period this author writes about, AWS had a roughly 4-day outage (impacted at least EC2, EBS, and RDS, iirc), caused by EBS, that really shook folks' confidence in AWS.
It resulted in a reorg and much deeper investment in EBS as a standalone service.
It also happened around the time Apple was becoming a customer, and AWS in general was going through hockey-stick growth thanks to startup adoption (Netflix, Zynga, Dropbox, etc).
It's fun to read about these technical and operational bits, but technical innovation in production is messy, and happens against a backdrop of Real Business Needs.
I wish more of THOSE stories were told as well.
Can anyone explain why?
https://www.allthingsdistributed.com/images/mo-manual-ssd.pn...
I think we got SSDs installed in blades from Dell well before that, but I may be misremembering.
I/O performance was a big thing in like 2010/2011/2012. We went from spinning HDs to Flash memory.
I remember experimenting with these raw Flash-based devices, no error correction or wear leveling at all. Insanity, but we were all desperate for that insane I/O performance bump from spinning rust to silicon.
Secondly, it reminds me of the time when it simply made sense to ninja-break and rebuild mdraids with SSDs in place of the spinning drives WHILE the servers were running (SATA kind of supported hotswapping the drives). Going from spinning to SSD gave us a 14x increase in IOPS in the most important system of the platform.
That's one of the reasons why I think we should have a professional license. By requiring an apprenticeship under a master engineer, somebody can pick up incredibly valuable knowledge and skills (that you only learn by experience) in a very short time frame, and then be released out into the world to be much more effective throughout their career. And as someone who also interviews candidates, some proof of their experience and a reference from their mentor would be invaluable.
> While the much celebrated ideal of a “full stack engineer” is valuable, in deep and complex systems it’s often even more valuable to create cohorts of experts who can collaborate and get really creative across the entire stack and all their individual areas of depth.
Otherwise, great article, illustrating that it's queues all the way down!
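In that spirit, the classic M/M/1 result is a handy reminder of why utilization is so dangerous: mean time in system is 1/(mu - lambda), so latency blows up as utilization approaches 1. A quick, purely textbook illustration follows; the service rate is an arbitrary assumption, nothing EBS-specific.

```python
# M/M/1 queue: mean time in system W = 1 / (mu - lambda), so latency explodes
# as utilization rho = lambda / mu approaches 1. Textbook math, not EBS data.

SERVICE_RATE = 1000.0  # mu: requests/second the device can serve (assumed)

for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    arrival_rate = rho * SERVICE_RATE                    # lambda
    latency_ms = 1000.0 / (SERVICE_RATE - arrival_rate)  # mean time in system
    print(f"utilization {rho:.0%}: mean latency ~{latency_ms:.1f} ms")
```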
"EBS is capable of delivering more IOPS to a single instance today than it could deliver to an entire Availability Zone (AZ) in the early years on top of HDDs."
Dang!
> In retrospect, if we knew at the time how much we didn’t know, we may not have even started the project!