April 1st, 2025

Systems Correctness Practices at AWS: Leveraging Formal and Semi-Formal Methods

AWS treats systems correctness as the basis for reliable services, using TLA+ and the P programming language for formal modeling. Lightweight formal methods, fault injection, and the Cedar policy language extend that work to testing efficiency and security.


AWS (Amazon Web Services) emphasizes systems correctness as a way to keep its services reliable for customers. This focus is supported by formal and semi-formal methods, particularly TLA+, a formal specification language. TLA+ helps engineers find subtle bugs early in development and lets them make performance optimizations while keeping confidence that the system remains correct. Over the years, AWS has evolved its software testing practices, integrating formal methods into its development processes to improve both correctness and performance.

The P programming language, now developed and maintained at AWS, has made formal methods more accessible to engineers by letting them model systems as communicating state machines. Tools like PObserve validate system behaviors in production, bridging the gap between design and implementation. AWS also employs lightweight formal methods, such as property-based testing and deterministic simulation, to improve testing efficiency.

The Fault Injection Service (FIS) lets customers test system resilience against simulated faults. AWS is also studying metastable failures, in which a system stays stuck in a degraded state after the overload that triggered it has passed, by using discrete-event simulations. For critical security aspects, AWS has developed the Cedar authorization policy language, which is designed to allow formal proofs of correctness. According to the article, this approach to systems correctness improves reliability while also accelerating development cycles and reducing costs for AWS customers.
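As a rough illustration of the lightweight methods mentioned above, the following Python sketch combines property-based testing with deterministic, seeded execution: random operation sequences are applied to a small implementation and to a trivially correct reference model, and the two must always agree. The `LogStore` class, its operations, and the dict used as the model are hypothetical stand-ins invented for this example; they are not AWS code or the API of any AWS tool.

```python
import random


class LogStore:
    """Toy append-only key-value store (hypothetical implementation under test)."""

    def __init__(self):
        self.log = []  # (key, value) records; a value of None marks a delete

    def put(self, key, value):
        self.log.append((key, value))

    def delete(self, key):
        self.log.append((key, None))

    def get(self, key):
        # The newest record wins, so scan from the end of the log.
        for k, v in reversed(self.log):
            if k == key:
                return v
        return None

    def compact(self):
        # Keep only the newest record per key and drop deleted keys.
        latest = {}
        for k, v in self.log:
            latest[k] = v
        self.log = [(k, v) for k, v in latest.items() if v is not None]


def check_against_model(seed, steps=200):
    """Apply one seeded random operation sequence; compare store and dict model."""
    rng = random.Random(seed)  # seeded, so any failing run can be replayed exactly
    store, model = LogStore(), {}
    for step in range(steps):
        op = rng.choice(["put", "delete", "compact"])
        key = rng.choice("abcd")
        if op == "put":
            value = rng.randrange(100)
            store.put(key, value)
            model[key] = value
        elif op == "delete":
            store.delete(key)
            model.pop(key, None)
        else:
            store.compact()
        # Property: after every step, every read agrees with the model.
        for k in "abcd":
            assert store.get(k) == model.get(k), (seed, step, op, k)
    return True


if __name__ == "__main__":
    for seed in range(100):
        check_against_model(seed)
    print("100 seeded runs agreed with the model")
```

Because each run is driven entirely by its seed, a failing assertion reports the seed and step needed to reproduce it, which is the main appeal of deterministic simulation for debugging.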

- AWS prioritizes systems correctness to ensure reliable services.

- TLA+ and the P programming language are key tools for formal methods.

- Lightweight formal methods and fault injection testing enhance testing efficiency.

- AWS addresses metastable failures through discrete-event simulations.

- Cedar language allows for formal proofs of correctness in security contexts.
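
To make the metastable-failure point concrete, here is a minimal sketch in Python. It is a simplified tick-based (discrete-time) simulation rather than a true discrete-event simulation, and it is not the model AWS uses: the server capacity, load levels, timeout, and retry policy are all assumptions chosen for illustration. Clients retry requests that time out, while the server keeps spending capacity on queued requests whose clients have already given up.

```python
from collections import deque


def simulate(max_retries, spike=range(10, 15), ticks=60,
             capacity=10, base_load=6, spike_load=30, timeout=2):
    """Return goodput (requests answered before their timeout) for each tick."""
    queue = deque()  # FIFO of [sent_tick, attempt, done] records
    sent_at = {}     # tick -> records sent on that tick
    goodput = []
    for t in range(ticks):
        # 1. Fresh client requests (higher load during the temporary spike).
        load = spike_load if t in spike else base_load
        arrivals = [[t, 0, False] for _ in range(load)]
        # 2. Requests sent `timeout` ticks ago and still unanswered are retried.
        for _, attempt, done in sent_at.pop(t - timeout, []):
            if not done and attempt < max_retries:
                arrivals.append([t, attempt + 1, False])
        sent_at[t] = arrivals
        queue.extend(arrivals)
        # 3. The server handles up to `capacity` requests, oldest first.
        ok = 0
        for _ in range(min(capacity, len(queue))):
            req = queue.popleft()
            if t - req[0] < timeout:  # the client is still waiting
                req[2] = True
                ok += 1
            # Otherwise the client already gave up, so this work is wasted.
        goodput.append(ok)
    return goodput


if __name__ == "__main__":
    for retries in (0, 3):
        print(f"max_retries={retries}: final goodput {simulate(retries)[-1]}")
```

With these assumed parameters, the run without retries drains its backlog and returns to a goodput of 6 requests per tick, while the run with three retries ends at zero goodput even though the load returned to normal long before. The spike is only the trigger; the retry traffic sustains the overload.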

AI: What people are saying
The comments reflect a mix of skepticism and interest regarding AWS's use of formal methods like TLA+ and the P programming language.
  • Some commenters highlight the importance of structured documentation alongside formal verification to enhance system reliability.
  • Concerns are raised about the accuracy of verification tools and the potential for bugs in formally verified systems.
  • There is skepticism about the practical application of formal methods within Amazon, with some questioning their relevance in hiring practices.
  • Several users express curiosity about which specific teams at AWS utilize these formal methods.
  • One commenter shares their positive experience using formal methods in software development, suggesting a broader interest in these techniques.
10 comments
By @jlcases - 2 days
I've noticed that the formalization of methods described by AWS parallels what we need in technical documentation. Complex systems require not just formal verification but also structured documentation following MECE principles (Mutually Exclusive, Collectively Exhaustive).

In my experience, the interfaces between components (where most errors occur) are exactly where fragmented documentation fails. I implemented a hierarchical documentation system for my team that organizes knowledge as a conceptual tree, and the accuracy of code generation with AI assistants improved notably.

Formal verification tools and structured documentation are complementary: verification ensures algorithmic correctness while MECE documentation guarantees conceptual and contextual correctness. I wonder if AWS has experimented with structured documentation systems specifically for AI, or if this remains an area to explore.

By @Cyphase - 2 days
Leslie Lamport gave the closing keynote at SCaLE 22x this year, talking about formal methods and TLA+. He mentioned some previous work Amazon has done in that area.

https://www.youtube.com/watch?v=tsSDvflzJbc

> Coding isn't Programming - Closing Keynote with Leslie Lamport - SCaLE 22x

By @pera - 2 days
> we also sought a language that would allow us to model check (and later prove) key aspects of systems designs while being more approachable to programmers.

I find it a bit surprising that TLA+ with PlusCal can be challenging to learn for your average software engineer. Could anyone with experience in P show an example of something that is difficult to express in TLA+ but significantly easier in P?

By @gqgs - 2 days
A key concern I've consistently had regarding formal verification systems is: how does one confirm the accuracy of the verifier itself?

This issue appears to present an intrinsically unsolvable problem, implying that a formally verified system could still contain bugs due to potential issues in the verification software.

While this perspective doesn't necessarily render formal verification impractical, it does introduce certain caveats that, in my experience, are not frequently addressed in discussions about these systems.

By @csbartus - 2 days
I've recently created a likely-correct piece of software based on these principles.

https://www.osequi.com/studies/list/list.html

The structure (ontology, taxonomy) is created with ologs, a formal method from category theory. The behavior (choreography) is created with a semi-formal implementation (XState) of a formal method (finite state machines).

The user-facing aspect of the software is designed with Concept Design, a semi-formal method from MIT CSAIL.

Creating software with these methods is refreshing and fun. Maybe one day we can reach Tonsky's "Diagrams are code" vision.

https://tonsky.me/blog/diagrams/

By @OhMeadhbh - 2 days
I find this highly unlikely. My first day at Amazon I encountered an engineer puzzling over a perfect sine wave in a graph. After looking at the scale I made the comment "oh. you're using 5 minute metrics." Their response was "how could you figure that out just by looking at the graph?" When I replied "Queuing theory and control theory," their response was "what's that?"

Since then, Amazon's hiring practices have only gotten worse. Can you invert a tree? Can you respond "tree" or "hash map" when you're asked what is the best data structure for a specific situation? Can you solve a riddle or code an ill-explained l33tcode problem? Are you sure you can parse HTML with regexes? You're Amazon material.

Did you pay attention to the lecture about formal proofs, TLA+, or Coq/Kami? That's great, but it won't help you get a job at Amazon.

The idea that formal proofs are used anywhere but the most obscure corners of AWS is laughable.

Although... it is a nice paper. Props to Amazon for supporting Ph.D.'s doing pure research that will never impact AWS' systems or processes.

By @nullorempty - 2 days
And what teams use these methods exactly?

By @neuroelectron - 2 days
Great April 1st gag. Seems to have gone over everyone's head.