August 6th, 2024

Why Heroism Is Bad and What We Can Do to Stop It

Heroism in site reliability engineering can obscure systemic issues, create unrealistic workload expectations, and lead to burnout. Encouraging discussions about service level objectives and allowing failures can improve system reliability.

Heroism in the context of site reliability engineering (SRE) refers to individuals stepping in to fill gaps in a system, often at the expense of their own well-being and the team's overall effectiveness. While acts of heroism can sometimes be necessary in emergencies, relying on them long-term is detrimental. Heroism obscures systemic issues, preventing teams from addressing underlying problems and leading to unrealistic expectations regarding workload and system performance. This culture can result in burnout for individuals who engage in excessive work without recognition or promotion. To mitigate heroism, teams should encourage open discussions about realistic service level objectives (SLOs), allow systems to fail to reveal issues, and focus on long-term solutions rather than short-term fixes. By fostering an environment where team members can identify and address systemic problems, organizations can reduce the reliance on heroics and improve overall system reliability.

- Heroism can mask systemic problems, preventing necessary fixes.

- It creates unrealistic expectations for workload and system performance.

- Individuals engaging in heroism may experience burnout and lack career advancement.

- Encouraging open discussions about SLOs can help mitigate heroism.

- Allowing systems to fail can provide valuable insights for improvement.

19 comments
By @fizx - 7 months
Whenever I see these Google SRE articles, I kinda reduce the message to "don't be a hero in a cost center in an org with nearly unlimited resources."

If you're in a profit center, you might get rewarded for your risk.

By @tithe - 7 months
I imagine a conversation with this individual as team lead would go something like this:

"So, you worked overtime to save systems across the planet from crashing due to a botched update?"

"Yes, sir. We're 'Site Reliability Engineering', after all."

"And people in airports don't have to sleep on the floors because airlines can actually schedule flights?"

"Yes, sir. Site Reliability Engineering, at its finest, sir!"

"No, you played the hero. That's bad for the team and normally for you, really. You should have let it break."

"But our team...is 'Site Reliability Engineering'?"

"You should have let it break."

"But, Site...Reliability?"

"You're fired."

By @JohnMakin - 7 months
There are people so addicted to this that they will literally create problems out of nowhere so they can pull some heroics and save the day, always with high visibility from management. I've seen a person advance pretty far in their career this way. Until management stops incentivizing this behavior, it won't stop. This is a management issue, which seems weird because this writing seems targeted towards ICs.

By @codingwagie - 7 months
You don't want heroes at large companies with top-down product management. You need heroes at small innovative startups. This write-up is more a documentation of the stagnant culture inside Google.

By @Carrok - 7 months
"No matter how many hours they need to work."

"No matter that they need to work evenings and weekends."

I don't call these people heroes, I call them idiots.

Also, not being able to copy/paste text from text slides is a pretty terrible design choice, but we shouldn't be surprised knowing what the source is.

By @Swizec - 7 months
In my experience being the hero is fantastic … until you want to go on vacation, have a sick day, change teams, or get promoted. Hard to promote someone irreplaceable.

Always code (or mentor) yourself out of the job and let others play with your legos. Even if they do it wrong.

By @chrchr - 7 months
The last slide says "let the system break".

I strongly suggest that after that slide, there needs to be a whole series of slides about how to make it so that it's ok to let the system break. If you haven't already done the hard work to make your stuff resilient, "let the system break" is a recipe for blowing up customers, damaging reputations, and hurting people.

By @cbarrick - 7 months
It's wild to me that ICANN allowed .google and other brands to be TLDs.

By @mgaunard - 7 months
Heroism is what you do until you manage to secure the headcount and hire the team that lets you run things smoothly.

In the real world, getting approval for headcount can take 6 months, hiring 3, training another 3.

So you need to sustain heroism for a year without burning out.

By @ralferoo - 7 months
I really dislike the way this slide deck is written. It's rewriting a failure of management (bad project planning, too few people for the workload) and presenting it as a failure by all the team members.

"The Hero decides that, despite this, ..."

"No matter what they're told about not doing this."

"The team doesn't realize..."

"Heroism is low risk, and easy to do."

"Help the Hero figure out what they should do instead."

"But the Hero won't let it go."

I suspect the scenario that prompted this document was something like this: a manager facing low morale on his team has just been asked to explain a catastrophic failure that he hadn't communicated upwards. Likely, he hadn't been doing his job properly and had no idea how much work his team was actually doing. The team was massively overloaded, worried about the job culls in other departments, and worried because their boss kept saying things like "this was due yesterday", and so had been doing everything possible to stop the proverbial hitting the fan... until one day it reached bursting point and they simply couldn't cope with all the work, despite already being forced to do overtime. Maybe some of them had even quit as a result and complained to HR about the work-life balance in the team.

But the team leader can't possibly be at fault. This is the management spin on it: it's all the team members' fault, and the poor manager had no idea what was going on, not because he was a terrible manager, but because the team had been deliberately hiding all the work they were doing from him. They didn't want to go home to their wives and kids, but were choosing to spend their evenings working on secret projects to stoke their own egos or deal with their own insecurities, and concealing all the extra work from their managers.

By @saddist0 - 7 months
This article highlights many pitfalls but fails to explain "how to practice heroism effectively".

For instance, a team member might notice a recurring pattern and repeatedly save the SLA by addressing it immediately. While this quick fix is heroic, it should also be escalated for a long-term solution. This way, the hero tackles the immediate issue, and the team ensures that such heroism isn't needed in the future, and so on.

By @xtiansimon - 7 months
What a bombastic title—heroism is bad.

I’m thinking this whole piece is slanted to correct some other toxic or difficult-to-manage culture issue.

Getting to examples quickly saves the piece. Sounds like there are some gung-ho youths happy to be working at Google and they need some mentoring.

By @nuancebydefault - 7 months
Heroism is a good thing I believe, as long as it is not applied systematically.

Example 1: A client has a deadline and a malfunction or unpredictable limitation of our product is in their critical path. A few people put in a collaborative effort, meaning working extra hours for a few days, to help them out. Later the customer is happy and the boss throws a celebration drink.

Example 2: An ICT member gets a message over the weekend that could indicate a security breach. He logs in and sees more suspicious activity. He takes first actions (disabling all logins/access matching certain criteria) and calls the head of ICT.

By @andrewla - 7 months
Did any of the commenters read the slideshow? Heroism is bad when it covers systemic problems.

Heroes are great -- SREs who rise to the occasion to prevent horrors are appropriately rewarded and congratulated for their work.

But when a product relies upon heroes to continue operating, you are in a dangerous situation. That's how major outages occur; the hero goes on vacation or decides to let it break this time and the cascade of failures causes huge amounts of damage, where letting the system break much earlier would have made it clear to the development team that there is a major gap in the intrinsic reliability of the system.

By @hulitu - 7 months
> Why Heroism Is Bad and What We Can Do to Stop It

Talk to Hollywood? /s

By @luqtas - 7 months
Could Google stop the "heroism syndrome" and give us the source code for their deactivated services? even if they aren't parsed to their heroic servers and it's about being self-host-able by non-heroes

By @lutarezj - 7 months
Not sure if heroism is bad.

[1] All teams should have a Jordan, a Kobe, a Shaquille, or a combination. One needs A players and a supporting cast. It is not the culture or the org that decides upon the evolution of the heroism. It is the hero who builds a team around him/her. [2] The Scrum or agile saga that promotes that all team members should be able to do what all team members do is just Excel-minded nonsense. You can't win championships with only goalkeepers, or only midfielders. You can't prep one person to be good at both in a lifetime, either.

Probably Google wants wheat crops that always look alike and are predictable?

By @george1384 - 7 months
Total junk. Don't blame the "hero" for their behavior, blame the management for not thinking ahead and making sure the problems didn't fester, blister, and boil over to the point at which babysitting the systems over the weekend became necessary.

No "hero" ever does this work without trying to plan for it ahead of time. "Heroics" are necessary when the system lets them down and stops letting long-term thinking and planning account for problems.