October 8th, 2024

Practices of Reliable Software Design

The article outlines eight practices for reliable software design, emphasizing off-the-shelf solutions, cost-effectiveness, quick production deployment, simple data structures, and performance monitoring to enhance efficiency and reliability.


The article discusses eight practices for reliable software design, particularly in the context of building an in-memory cache. The author emphasizes the importance of using off-the-shelf solutions when possible, prioritizing cost and reliability over unnecessary features, and quickly moving ideas into production to gather real-world feedback. Simple data structures are recommended to avoid misuse and performance issues, while early resource reservation is suggested to prevent runtime failures. Setting maximum limits on operations helps manage performance and resource usage effectively. The author also advocates for making testing straightforward by allowing command input for verification and embedding performance counters to monitor system behavior. These practices are derived from the author's experiences and aim to enhance software engineering efficiency and reliability.

- Use off-the-shelf solutions to simplify development.

- Prioritize cost and reliability over excessive features.

- Move ideas to production quickly to gather user feedback.

- Employ simple data structures to avoid complexity and bugs.

- Implement performance counters for better system monitoring.
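As a rough illustration of several of the summarized practices (simple data structures, a hard capacity limit set up front, and embedded performance counters), here is a minimal Python sketch; the names and limits are hypothetical, not taken from the article:

    # Hypothetical sketch (names and limits are illustrative): a plain
    # dict-based cache with a hard entry limit and embedded hit/miss counters.
    import threading

    class SimpleCache:
        def __init__(self, max_entries: int = 10_000):
            self._max_entries = max_entries   # maximum limit decided up front
            self._data: dict = {}
            self._lock = threading.Lock()
            self.hits = 0                     # performance counters
            self.misses = 0
            self.rejected_writes = 0

        def get(self, key):
            with self._lock:
                if key in self._data:
                    self.hits += 1
                    return self._data[key]
                self.misses += 1
                return None

        def put(self, key, value) -> bool:
            with self._lock:
                if key not in self._data and len(self._data) >= self._max_entries:
                    self.rejected_writes += 1  # enforce the maximum limit, loudly
                    return False
                self._data[key] = value
                return True

        def stats(self) -> dict:
            # counters make the cache's production behavior observable
            return {"hits": self.hits, "misses": self.misses,
                    "rejected_writes": self.rejected_writes,
                    "size": len(self._data)}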

AI: What people are saying
The comments on the article reflect a range of perspectives on software design practices.
  • Redundancy is emphasized as a key principle for achieving reliability in software systems, with suggestions for multiple independent paths to success.
  • There is a strong advocacy for using off-the-shelf solutions rather than custom code, highlighting the long-term benefits of maintainability and cost-effectiveness.
  • Some commenters express concerns about the complexity introduced in software design, questioning the necessity of certain features.
  • Suggestions for additional practices include throwing errors to catch issues early and considering the context when balancing simplicity and complexity in design.
  • There is skepticism about whether adhering strictly to these principles would be sufficient to secure a job in a competitive environment.
15 comments
By @nostrademons - 7 months
There is a bunch of good advice here, but it misses the most useful principle in my experience, probably because the motivating example is too small in scope:

The way to build reliable software systems is to have multiple independent paths to success.

This is the Erlang "let it crash" strategy restated, but I've also found it embodied in things like the architecture of Google Search, Tandem Computers, Ethereum, RAID 5, the Space Shuttle, etc. Basically, you achieve reliability through redundancy. For any given task, compute the answer multiple times in parallel, ideally in multiple independent ways. If the answers agree, great, you're done. If not, have some consensus mechanism to determine the true answer. If you can't compute the answer in parallel, or you still don't get one back, retry.
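A minimal sketch of that pattern, with hypothetical names and a simple majority vote standing in for the consensus mechanism:

    # Hypothetical sketch: run several independent implementations in
    # parallel, take the majority answer, and retry if no consensus emerges.
    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def redundant_compute(task, implementations, retries=1, timeout=5):
        for _ in range(retries + 1):
            with ThreadPoolExecutor(max_workers=len(implementations)) as pool:
                futures = [pool.submit(impl, task) for impl in implementations]
                results = []
                for f in futures:
                    try:
                        results.append(f.result(timeout=timeout))
                    except Exception:
                        pass  # a failed path simply doesn't get a vote
            if results:
                answer, votes = Counter(results).most_common(1)[0]
                if votes > len(implementations) // 2:  # simple majority
                    return answer
        raise RuntimeError("no consensus from any independent path")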

The reason for this is simply math. If you have n different events that must all go right to achieve success, the chance of this happening is x1 * x2 * ... * xn. This product goes to zero very quickly: with 20 components connected in series that are each 98% reliable, the chance of success is only about 2/3. If instead you have n different events where any one going right achieves success, the chance of success is 1 - (1 - y1) * (1 - y2) * ... * (1 - yn), which climbs toward 1 quickly as the number of alternate pathways grows. If you have 3 alternatives, each of which has just an 80% chance of success, but any of the 3 will work, then doing them all in parallel gives about a 99% chance of success.
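Spelled out as a quick calculation, using the illustrative numbers from the paragraph above:

    # Series vs. parallel reliability, with the numbers from the comment.
    series = 0.98 ** 20                 # 20 components, all must work
    parallel = 1 - (1 - 0.80) ** 3      # 3 independent 80% paths, any one suffices

    print(f"series:   {series:.3f}")    # ~0.668, roughly 2/3
    print(f"parallel: {parallel:.3f}")  # ~0.992, about 99%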

This is why complex software systems that must stay up are built with redundancy, replicas, failover, retries, and other similar mechanisms in place. And the presence of those mechanisms usually trumps anything you can do to increase the reliability of individual components, simply because you get diminishing returns to carefulness. You might spend 100x more resources to go from 90% reliability to 99% reliability, but if you can identify a system boundary and correctness check, you can get that 99% reliability simply by having 2 teams each build a subsystem that is 90% reliable and checking that their answers agree.

By @bruce511 - 7 months
The first point is one that resonates strongly with me. Counter-intuitively, the first instinct of a programmer should be "buy that, don't write it".

Of course, as a programmer, this is far from my first instinct. I am a programmer; my function is programming, not purchasing.

Of course, buying something is always cheaper (compared to the cost of my time), and it becomes orders of magnitude cheaper once the costs of maintaining written-by-me code are added in.

Things that are bought tend to last longer too. If I leave my job I leave behind a bunch of custom code nobody wants to work on. If I leave Redis behind, well, the next guy just carries on running Redis.

I know all this. I advocate for all this. But I'm a programmer, and coders gotta code :) So it's not like we buy everything; I'm still here, still writing.

Hopefully, though, my emphasis is on adding value. Build things that others will take over one day. Keep designs clean, and code cleaner.

And if I may add one 'practice' to the list: Don't Be Clever. Clever code is hard to read, hard to understand, and hard to maintain. Keep all code as simple as it can be. Reliable software is software that mostly isn't trying to be too clever.

By @taeric - 7 months
This misses one of the key things I have seen that really drive reliable software: actually relying on the software.

It sucks, because nobody likes the idea of the "squeaky wheel getting the grease." At the same time, nobody is surprised that the yard equipment that they haven't used in a year or so is going to need effort to get back to working. The longer it has been since it was relied on to work, the more likely that it won't work.

To that end, I'm not arguing that all things should be on the critical path. But the more code you have that isn't regularly exercised, the more likely it will break when anything around it changes.

By @l5870uoo9y - 7 months
I would add a ninth practice: throw errors. That way you find and fix them, as opposed to errors that go silently unnoticed in the code base.
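A tiny hypothetical example of the idea: fail loudly at the point of the problem instead of letting a bad value propagate silently.

    # Hypothetical example: raise on bad input so the bug surfaces where it
    # happens, rather than silently falling back to a default.
    def load_ttl(config: dict) -> int:
        ttl = config.get("ttl_seconds")
        if not isinstance(ttl, int) or ttl <= 0:
            raise ValueError(f"invalid ttl_seconds: {ttl!r}")
        return ttl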
By @throwawayha - 7 months
Great points.

But why do we invest so much complexity in outputting HTML/JS/CSS?

By @SomewhatLikely - 7 months
My first thought upon seeing the prompt:

    If you would build an in-memory cache, how would you do it?

    It should have good performance and be able to hold many entries. 
    Reads are more common than writes. I know how I would do it already, 
    but I’m curious about your approach.
was to add this requirement, since it comes up so often:

    Let's assume that keys accessed follow a power law, so some keys get 
    accessed very frequently and we would like them to have the fastest 
    retrieval of all.
I'm not sure if there are any efficient tweaks to hash tables or B-trees that might help with this additional requirement. Obviously we could make a hash table take far more space than needed to reduce collisions, but with a decent load factor, is the answer just to swap frequently accessed keys to the beginning of their probe chain? And how do we know a key is frequently accessed? A Count-Min sketch?

Even with that tweak, the hottest keys will still be scattered around memory. Wouldn't it be best if their entries could fit into fewer pages? So, maybe a much smaller "hot" table containing, say, the 1,000 most accessed keys. We still want a high load factor to maximize the use of cache pages, so perhaps perfect hashing?
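One way to prototype that idea (a hypothetical sketch, using a plain per-key counter where a Count-Min sketch could bound memory): keep a small hot dict in front of the main table and promote keys once their observed access count crosses a threshold.

    # Hypothetical two-tier cache sketch: a small "hot" dict in front of the
    # main table; promotion is driven by a plain access counter, which a
    # Count-Min sketch could replace to bound memory.
    from collections import Counter

    class HotColdCache:
        def __init__(self, hot_capacity=1000, promote_after=32):
            self.hot = {}               # small, intended to stay cache-resident
            self.cold = {}              # main table
            self.counts = Counter()     # per-key access frequency
            self.hot_capacity = hot_capacity
            self.promote_after = promote_after

        def get(self, key):
            if key in self.hot:
                return self.hot[key]
            value = self.cold.get(key)
            if value is not None:
                self.counts[key] += 1
                if (self.counts[key] >= self.promote_after
                        and len(self.hot) < self.hot_capacity):
                    self.hot[key] = value   # promote a frequently read key
            return value

        def put(self, key, value):
            self.cold[key] = value
            if key in self.hot:
                self.hot[key] = value       # keep the hot copy consistent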

By @uzerfcwn - 7 months
It seems like the author had some very specific read and write pattern in mind when they designed for performance, but it's never explicitly stated. The problem setting only says that "reads are more common than writes", but that's not really saying much when discussing performance. For example, an HTML server commonly has a small set of items that are most frequently read, and successive reads are not strongly dependent. On the other hand, a PIM system may often get iterative reads correlated on some fuzzy search filter, which will be slow and thrash the cache pretty badly if the system is optimized for different access patterns.

When designing software, you first need to nail down the requirements, which I didn't really find in TFA.

By @hamdouni - 7 months
My takeaways, from a more general point of view:

1. Make or buy

2. Release a MVP

3. Keep it simple

4. Prepare for the worst

5. Make it easy to test

6. Benchmark, monitor, log...

By @BillLucky - 7 months
Simple but elegant design principles; recommended.
By @u8_friedrich - 7 months
> It is much easier to add features to reliable software, than it is to add reliability to featureful software.

Not sure about this, tbh. In a lot of cases, yeah, maybe. But when you are dealing with complicated business logic that requires a lot of bells and whistles, building a simple, reliable version can lead you into a naive implementation that might be reliable but very hard to extend, while building an unstable, complicated thing can help you understand the pitfalls so you can work back from there toward something more reliable. So I think this depends very much on the context.

By @ActionHank - 7 months
Quick mental exercise on this.

If someone posed this question to you in an interview and you used these principles, would you get the job?

Probably not.