Practices of Reliable Software Design
The article outlines eight practices for reliable software design, emphasizing off-the-shelf solutions, cost-effectiveness, quick production deployment, simple data structures, and performance monitoring to enhance efficiency and reliability.
The article discusses eight practices for reliable software design, particularly in the context of building an in-memory cache. The author emphasizes the importance of using off-the-shelf solutions when possible, prioritizing cost and reliability over unnecessary features, and quickly moving ideas into production to gather real-world feedback. Simple data structures are recommended to avoid misuse and performance issues, while early resource reservation is suggested to prevent runtime failures. Setting maximum limits on operations helps manage performance and resource usage effectively. The author also advocates for making testing straightforward by allowing command input for verification and embedding performance counters to monitor system behavior. These practices are derived from the author's experiences and aim to enhance software engineering efficiency and reliability.
- Use off-the-shelf solutions to simplify development.
- Prioritize cost and reliability over excessive features.
- Move ideas to production quickly to gather user feedback.
- Employ simple data structures to avoid complexity and bugs.
- Implement performance counters for better system monitoring.
Related
Why We Build Simple Software
Simplicity in software development, likened to a Toyota Corolla's reliability, is crucial. Emphasizing straightforward tools and reducing complexity enhances reliability. Prioritizing simplicity over unnecessary features offers better value and reliability.
We Build Simple Software
Simplicity in software development, likened to a Toyota Corolla's reliability, is crucial. Emphasizing straightforward tools, Pickcode aims for user-friendly experiences. Beware of complex software's pitfalls; prioritize simplicity for better value and reliability.
Fear of over-engineering has killed engineering altogether
The article critiques the tech industry's focus on speed over engineering rigor, advocating for "Napkin Math" and Fermi problems to improve decision-making and project outcomes through basic calculations.
Algorithms We Develop Software By
The article explores software development methodologies that improve coding efficiency, emphasizing daily feature work, code rewriting, the "gun to the head" heuristic, and effective navigation of problem spaces.
The Tool Cache Manifesto
The Tool Cache Manifesto emphasizes correctness in tool caches, advocating for seamless operation, proper management of dependencies, and caution against poor practices, suggesting that modern tools may not require caching.
- Redundancy is emphasized as a key principle for achieving reliability in software systems, with suggestions for multiple independent paths to success.
- There is a strong advocacy for using off-the-shelf solutions rather than custom code, highlighting the long-term benefits of maintainability and cost-effectiveness.
- Some commenters express concerns about the complexity introduced in software design, questioning the necessity of certain features.
- Suggestions for additional practices include throwing errors to catch issues early and considering the context when balancing simplicity and complexity in design.
- There is skepticism about whether adhering strictly to these principles would be sufficient to secure a job in a competitive environment.
The way to build reliable software systems is to have multiple independent paths to success.
This is the Erlang "let it crash" strategy restated, but I've also found it embodied in things like the architecture of Google Search, Tandem Computer, Ethereum, RAID 5, the Space Shuttle, etc. Basically, you achieve reliability through redundancy. For any given task, compute the answer multiple times in parallel, ideally in multiple independent ways. If the answers agree, great, you're done. If not, use some consensus mechanism to determine the true answer. If you can't compute the answer in parallel, or you still don't get one back, retry.
The reason for this is simply math. If you have n different events that must all go right to achieve success, the chance of this happening is x1 * x2 * ... * xn. This product goes to zero very quickly: if you have 20 components connected in series that are each 98% reliable, the chance of success is only about 2/3. If instead you have n different events where any one going right is enough for success, the chance of success is 1 - (1 - y1) * (1 - y2) * ... * (1 - yn). This quantity rises quickly as the number of alternate pathways to success goes up. If you have 3 alternatives, each with just an 80% chance of success, but any of the 3 will work, then doing them all in parallel gives better than a 99% chance of success.
This is why complex software systems that must stay up are built with redundancy, replicas, failover, retries, and other similar mechanisms in place. And the presence of those mechanisms usually trumps anything you can do to increase the reliability of individual components, simply because you get diminishing returns to carefulness. You might spend 100x more resources to go from 90% reliability to 99% reliability, but if you can identify a system boundary and correctness check, you can get that 99% reliability simply by having 2 teams each build a subsystem that is 90% reliable and checking that their answers agree.
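To make that arithmetic concrete, here is a small sketch of the series-versus-parallel calculation. It is plain Python; the 98% and 80% figures come from the comment above, and everything else is purely illustrative.

```python
# Reliability of components in series vs. independent parallel alternatives.

def series_reliability(reliabilities):
    """All components must succeed: multiply the individual probabilities."""
    p = 1.0
    for r in reliabilities:
        p *= r
    return p

def parallel_reliability(reliabilities):
    """Any one alternative succeeding is enough: 1 minus the product of failure probabilities."""
    q = 1.0
    for r in reliabilities:
        q *= (1.0 - r)
    return 1.0 - q

# 20 components in series, each 98% reliable -> only about 2/3 overall.
print(series_reliability([0.98] * 20))   # ~0.667

# 3 independent alternatives, each 80% reliable -> over 99% overall.
print(parallel_reliability([0.80] * 3))  # 0.992
```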
Of course, as a programmer, this is far from my first instinct. I am a programmer; my function is programming, not purchasing.
Of course buying something is always cheaper (compared to the cost of my time) and will be orders of magnitude cheaper once the cost of maintaining written-by-me code is added in.
Things that are bought -tend- to last longer too. If I leave my job I leave behind a bunch of custom code nobody wants to work on. If I leave Redis behind, well, the next guy just carries on running Redis.
I know all this. I advocate for all this. But I'm a programmer, and coders gotta code :) So it's not like we buy everything; I'm still here, still writing.
Hopefully, though, my emphasis is on adding value: build things that others will take over one day. Keep designs clean, and code cleaner.
And if I may add one 'practice' to the list: Don't Be Clever. Clever code is hard to read, hard to understand, hard to maintain. Keep all code as simple as it can be. Reliable software is software that mostly isn't trying to be too clever.
It sucks, because nobody likes the idea of the "squeaky wheel getting the grease." At the same time, nobody is surprised that the yard equipment that they haven't used in a year or so is going to need effort to get back to working. The longer it has been since it was relied on to work, the more likely that it won't work.
To that end, I'm not arguing that all things should be on the critical path. But the more code you have that isn't regularly exercised, the more likely it is to be broken when anything around it changes.
But why do we invest so much complexity in outputting html/js/css?
If you were to build an in-memory cache, how would you do it? It should have good performance and be able to hold many entries. Reads are more common than writes. I know how I would do it already, but I'm curious about your approach.
I wanted to add this requirement, since it comes up so often: let's assume that the keys accessed follow a power law, so some keys get accessed very frequently, and we would like those to have the fastest retrieval of all.
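Not the commenter's design, just one minimal sketch to anchor the discussion, assuming Python and a bounded LRU policy: `collections.OrderedDict` already gives you a small read-oriented cache where recently used keys are cheap to re-find and memory is capped by a maximum entry count.

```python
from collections import OrderedDict
from threading import Lock

class LRUCache:
    """Minimal bounded in-memory cache; recently used keys stay at the end."""

    def __init__(self, max_entries=100_000):
        self._data = OrderedDict()
        self._max = max_entries
        self._lock = Lock()  # coarse lock; fine for a sketch, shard it for a real read-heavy load

    def get(self, key, default=None):
        with self._lock:
            if key not in self._data:
                return default
            self._data.move_to_end(key)      # mark as recently used
            return self._data[key]

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self._max:  # evict the least recently used entry
                self._data.popitem(last=False)
```

For a truly read-heavy workload you would likely shard this across several instances or reach for a lock-free structure, but the shape of the interface stays the same.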
I'm not sure if there are any efficient tweaks to hash tables or B-trees that might help with this additional requirement. Obviously we could make a hash table take way more space than needed to reduce collisions, but with a decent load factor, is the answer just to swap frequently accessed keys to the beginning of their probe chain? And how do we know a key is frequently accessed? A Count-Min sketch?
Even with that tweak, the hottest keys will still be scattered around memory. Wouldn't it be best if their entries could fit into fewer pages? So, maybe a much smaller "hot" table containing, say, the 1,000 most accessed keys. We still want a high load factor to maximize the use of cache pages, so perhaps perfect hashing?
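To make the "how do we know it's frequently accessed?" step concrete, here is a toy Count-Min sketch feeding a tiny hot table. It is a hedged sketch of the idea rather than a tuned implementation; `HOT_THRESHOLD`, `record_access`, and the `hot` dict are illustrative names, not anything from the thread.

```python
import hashlib

class CountMinSketch:
    """Approximate per-key frequency counter in fixed memory; estimates never undercount."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # One independent hash per row, derived by salting blake2b with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(key.encode(), salt=row.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, key):
        for row, col in self._indexes(key):
            self.table[row][col] += 1

    def estimate(self, key):
        return min(self.table[row][col] for row, col in self._indexes(key))

# Promote keys into a small "hot" dict once they look popular; a small dense
# table is what lets the hottest entries share fewer cache pages.
HOT_THRESHOLD = 1000
sketch = CountMinSketch()
hot = {}

def record_access(key, value):
    sketch.add(key)
    if key not in hot and sketch.estimate(key) >= HOT_THRESHOLD:
        hot[key] = value
```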
When designing software, you first need to nail down the requirements, which I didn't really find in TFA.
1. Make or buy
2. Release a MVP
3. Keep it simple
4. Prepare for the worst
5. Make it easy to test
6. Benchmark, monitor, log...
Not sure about this, tbh. In a lot of cases, yeah, maybe. But when you are dealing with complicated business logic that requires a lot of bells and whistles, building a simple, reliable version can lead you into a naive implementation that might be reliable but very hard to extend, while building an unstable, complicated thing can help you understand the pitfalls, and you can work back from there toward something more reliable. So I think this depends very much on the context.
If someone posed this question to you in an interview and you used these principles, would you get the job?
Probably not.