August 26th, 2024

A Jenga tower about to collapse: Software erosion is happening all around us

Software erosion results from complex architectures and frequent changes, with developers spending 42% of their time on maintenance. A "shift left" approach is crucial for integrating quality assurance early in development.

Software erosion is increasingly prevalent in modern software development, characterized by complex architectures that resemble a precarious Jenga tower. Developers spend a significant portion of their time (42%) on maintenance rather than innovation, leading to frequent outages that disrupt services across various sectors.

The root cause of these outages is not a lack of testing but rather the hypercomplexity of software configurations, resulting from numerous changes made by different teams over time. This complexity is exacerbated by pressures to innovate quickly, often leading to shortcuts that introduce further complications. As developers patch issues, they inadvertently create a cycle of instability, which can lead to attrition among engineers and a decline in morale.

To combat software erosion, companies must adopt a "shift left" approach, integrating quality assurance early in the development process rather than as an afterthought. This involves thorough testing and understanding of the software architecture to prevent costly fixes later on. Ultimately, addressing software erosion requires a commitment to quality and a reevaluation of development practices to ensure sustainable growth and functionality.

- Software erosion is caused by complex architectures and frequent changes.

- Developers spend over 40% of their time on maintenance, limiting innovation.

- Outages are often due to the instability created by shortcuts and patching.

- A "shift left" approach is essential for integrating quality assurance early in development.

- Companies need to understand their software architecture to prevent future issues.

18 comments
By @gwynforthewyn - 3 months
I’m actually just cynical enough this morning to think this was written by a chat AI that was prompted with “write a few paragraphs of ragebait about software erosion”. The article asserts software erosion is happening for a series of vague reasons like “dependency hell” and that developers are asked to add new features to codebases.

Developers have been adding features to codebases for _decades_. It’s a demonstrably fine activity. The article doesn’t chain together “practice X causes bad effect Y”, it just says themed sentences one after the other that don’t follow a reasoned argument. There aren’t even any personal anecdotes.

There are so many people writing much better instructive content; it’s a little heartbreaking seeing nonsense like this elevated.

By @MrThoughtful - 3 months

    the average developer spends 42% of their
    work week on maintenance
Indeed, I see that happening all around me when I watch how my friends build their startups. The first few months they are productive, and then they sink deeper and deeper into the quicksand of catching up with changes in their stack.

So far, I have done a somewhat good job of avoiding that. And I have a keen eye on avoiding it for the future.

I think a good stack to use is:

    OS: Debian
    DB: SQLite
    Webserver: Apache
    Backend Language: Python
    Backend Library: Django
    Frontend: HTML + CSS + Javascript
    Frontend library: Handlebars
And no other dependencies.

I called Django a "library" instead of a framework, because I do not create projects via "django-admin startproject myproject" but rather just do "import django" in the files that make use of Django functionality.
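
Roughly what that can look like, as a minimal single-file sketch (the settings, names and view below are illustrative assumptions, not necessarily the commenter's exact setup):

    # app.py - Django used as a library: no startproject, just configure and import
    from django.conf import settings
    from django.core.wsgi import get_wsgi_application
    from django.http import HttpResponse
    from django.urls import path

    settings.configure(
        DEBUG=False,
        SECRET_KEY="change-me",      # placeholder, not for production
        ALLOWED_HOSTS=["*"],
        ROOT_URLCONF=__name__,       # this module doubles as the URLconf
    )

    def index(request):
        return HttpResponse("Hello from Django used as a library")

    urlpatterns = [path("", index)]

    # Hand this WSGI callable to Apache/mod_wsgi (or any WSGI server)
    application = get_wsgi_application()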

In the frontend, the only thing I use a library for is templating content that is dynamically updated on the client. Handlebars does it in a sane way.

This way, I expect that my stack is stable enough for me to keep all projects functional for decades to come. With less than a week of maintenance per year.

By @kkfx - 3 months
The root problem is simply the modern, old, OS++ concept, or the divide et impera commercial concept.

Classic systems were a single, fully integrated application. That design created a slow, incremental evolution and a plethora of small improvements; the commercial design of compartmentalized layers has created a Babel tower of dysfunctional crap, mostly busy punching holes between layers.

Some examples: a single NixOS home server can do on a very small system what a modern deploy with Docker and co. needs an equivalent starship for. A simple Emacs buffer, let's say a mail-compose one, lets you quickly solve an ODE via Maxima, while a modern office suite can't without manual cut and paste and gazillions more SLoC. A Plan 9 mail system does not need to implement complex storage handling and network protocols; all it needs is already in the system: mount someone else's remote mailbox, save a file there and you've sent a message, read a file from a mounted filesystem and you've read one, and the same goes for viewing a website. In Gnus a mail, an RSS article and an NNTP post are the same, because they are DAMN THE SAME, a damn text with optional extras consisting of a title and a body. That's the power of simplicity we lost to push walled gardens and keep users locked in and powerless.

Modern commercial IT is simply untenable:

- even Alphabet can't crawl the whole web; a distributed YaCy on a gazillion home servers could, though, and with MUCH LESS iron and cost for humanity as a whole;

- nobody can map the world alone the way the OSM model can, where anybody who maps shares everything with everyone else;

This is the power of FLOSS. It's time to admit, simply, that we can't afford a commerce/finance-managed nervous system for our societies.

By @EdwardDiego - 3 months
A very vague article about how our entire digging infrastructure is about to collapse, from someone looking to sell you a spade.
By @poikroequ - 3 months
The bigger problem I've seen is high turnover rates in the industry. The people who built and know the system leave. There wasn't a sufficient window for KT (knowledge transfer), so you're left with a bunch of new devs who only have a surface level understanding of the code and architecture. Productivity drops severely because every new feature requires several hours of reading code / reverse engineering. Then these new features often break other things because the devs don't know the intricacies of the system, so many more hours are spent fixing the bugs.
By @irjustin - 3 months
The article actually discredits its own conclusion early on:

> These outages didn’t happen because developers didn’t test software.

The conclusion being:

> How do you get quality code?...Don’t skimp on static code analysis and functional tests, which should be run as new code is written.

But even working backwards from the conclusion, which is that "specs + code analysis" will save you from the big scary things of "software erosion" and "complexity", thus sparing us all from outages, I disagree.

Specs + analysis are helpful, but they do not magically solve complexity at scale. CrowdStrike, sure, would've benefited from testing, I agree, but so many other large outages need more than that, which is the disconnect of the article for me.

At some point you need blackbox, chaos monkey level production tests. Bring down your central database, bring down us-east-1. What happens to the business?

I'm not sure if this is valid, but a lot of the savvier tech companies' outages feel like they're router configurations that lead to cascading traffic issues. But I have no data to back this thought up.

By @lewdev - 3 months
This article is capitalizing on the CrowdStrike incident. It was costly, but it was a mistake. As a software engineer, I just know that's all it is. I don't think there is an upward trend in these mistakes: people are always trying to be careful, and sometimes they also get careless. Some additional processes might be added to avoid it, but years later it may happen again somewhere else in another company. I don't think it's because of "software erosion." And the recovery was a costly day or two, but it was fixed and we all went back to normal.
By @sumuyuda - 3 months
> These outages didn’t happen because developers didn’t test software.

Funny how there is no mention of how modern tech companies offshored/outsourced and even fired manual QA testers. Developers aren’t testers. Do we expect a civil engineer to test the bridge they created before opening it to the public?

Also, with a move fast and break things mentality, stable and quality software went out the window for a continuous release of broken/buggy software.

By @josefrichter - 3 months
"average developer spends 42% of their work week on maintenance" – is that true (source)? Is that your personal experience too?
By @ornornor - 3 months
Mitigating and coming up with a plan to remedy these issues was my specialty over the 15ish years I wrote software professionally.

All these initiatives and plans always died when they reached an executive reacting with « it works now, why should we spend any money on not making new features? We’re not doing your gold plating, we don’t need it »

I eventually got tired of this, ran out of motivation, and quit software engineering.

MBAs who understand nothing about software treat software developers as code monkeys and then we are in this situation.

I’m still bitter about the whole thing and how it completely put me off writing software (which I used to love doing). Some days, I’m cheering for these failures and crashes, imagining some exec somewhere will eat a big shit sandwich for causing it. But I’m not kidding myself, I know it’s the software engineers getting blamed and working overtime for these outages…

By @eliasson - 3 months
I am not a fan of comparing bad software with erosion or organic decay. It feels like avoiding responsibility. Software is software; it is made worse by people, nothing else.
By @kennu - 3 months
In my view, this is why cloud exists. You can externalize as much of your software stack as possible to the cloud platform, and implement and maintain yourself only the parts that differentiate you. On AWS, this means using Lambda, Step Functions, AppSync, API Gateway, DynamoDB etc. and letting the cloud provider worry about maintaining most of the technology stack.
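
For what it's worth, a minimal sketch of that idea in Python (the handler and the "notes" table are hypothetical; API Gateway and the DynamoDB table are assumed to be wired up separately in AWS):

    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("notes")  # hypothetical table, managed by AWS

    def handler(event, context):
        # API Gateway proxy integration: persist the posted JSON body as an item
        body = json.loads(event.get("body") or "{}")
        table.put_item(Item={"id": body.get("id", "unknown"), "text": body.get("text", "")})
        return {"statusCode": 200, "body": json.dumps({"ok": True})}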
By @keyle - 3 months

    - And tonight at 11... 
    - Dooooooom!
By @TacticalCoder - 3 months
> the average developer spends 42% of their work week on maintenance

Don't worry: AI will do coding maintenance / bug fixes soon.

By @SebFender - 3 months
This has been the situation in computing in general for many decades.

When I started in the '90s, maintaining different Unix distributions was a continuous package game.

Now, looking at ops and devs at companies, this is still the ongoing work.

I think it's just an integral part of computing and one of its core challenges...

By @louwrentius - 3 months
Reads like a chat-gpt prompt
By @ungamedplayer - 3 months
Once again we blame developers for corporations not testing deployments and pushing changes live.

Nobody can test everything. Big deployments require big testing.

By @mysal - 3 months
In the current climate of cultural revolution experts are forced to be silent, hand over their authority to mediocre politicians and let everyone commit for the sake of "equity" (meaning: the politicians have an income).

No wonder that the whole system collapses.

To be fair, in the 1990s software wasn't great either, but many things were new and written under enormous time pressure like the Netscape browser.

Linux distributions were best around 2010. Google and Windows were best around that time, too.