Why Heroism Is Bad and What We Can Do to Stop It
Heroism in site reliability engineering can obscure systemic issues, create unrealistic workload expectations, and lead to burnout. Encouraging discussions about service level objectives and allowing failures can improve system reliability.
Heroism in the context of site reliability engineering (SRE) refers to individuals stepping in to fill gaps in a system, often at the expense of their own well-being and the team's overall effectiveness. While acts of heroism can sometimes be necessary in emergencies, relying on them long-term is detrimental. Heroism obscures systemic issues, preventing teams from addressing underlying problems and leading to unrealistic expectations regarding workload and system performance. This culture can result in burnout for individuals who engage in excessive work without recognition or promotion. To mitigate heroism, teams should encourage open discussions about realistic service level objectives (SLOs), allow systems to fail to reveal issues, and focus on long-term solutions rather than short-term fixes. By fostering an environment where team members can identify and address systemic problems, organizations can reduce the reliance on heroics and improve overall system reliability.
- Heroism can mask systemic problems, preventing necessary fixes.
- It creates unrealistic expectations for workload and system performance.
- Individuals engaging in heroism may experience burnout and lack career advancement.
- Encouraging open discussions about SLOs can help mitigate heroism.
- Allowing systems to fail can provide valuable insights for improvement.
Related
Innovation heroes are a sign of a dysfunctional organization
The article discusses the reliance on "Innovation Heroes" in organizations, highlighting the need for a systematic approach to innovation. It emphasizes the importance of establishing an Innovation Doctrine for sustained competitiveness.
The Programmers' Identity Crisis: how do we use our powers for 'good'?
Reflection on ethical dilemmas faced by programmers, discussing challenges of working for companies with questionable practices. Emphasizes rationalizing involvement with conflicting values in tech industry and suggests navigating dilemmas collectively for positive change.
All metrics are scar tissue (unless they're Business Intelligence)
Managing metrics in Site Reliability Engineering involves emotional ties and past experiences, impacting incident management. Balancing operational goals and emotional attachment is crucial for refining metrics effectively.
If you're in a profit center, you might get rewarded for your risk.
"So, you worked overtime to save systems across the planet from crashing due to a botched update?"
"Yes, sir. We're 'Site Reliability Engineering', after all."
"And people in airports don't have to sleep on the floors because airlines can actually schedule flights?"
"Yes, sir. Site Reliability Engineering, at its finest, sir!"
"No, you played the hero. That's bad for the team and normally for you, really. You should have let it break."
"But our team...is 'Site Reliability Engineering'?"
"You should have let it break."
"But, Site...Reliability?"
"You're fired."
"No matter that they need to work evenings and weekends."
I don't call these people heroes, I call them idiots.
Also, not being able to copy/paste text from text slides is a pretty terrible design choice, but we shouldn't be surprised knowing what the source is.
Always code (or mentor) yourself out of the job and let others play with your legos. Even if they do it wrong.
I strongly suggest that after that slide, there needs to be a whole series of slides about how to make it so that it's ok to let the system break. If you haven't already done the hard work to make your stuff resilient, "let the system break" is a recipe for blowing up customers, damaging reputations, and hurting people.
In the real world, getting approval for headcount can take 6 months, hiring 3, training another 3.
So you need to sustain heroism for a year without burning out.
"The Hero decides that, despite this, ..."
"No matter what they're told about not doing this."
"The team doesn't realize..."
"Heroism is low risk, and easy to do."
"Help the Hero figure out what they should do instead."
"But the Hero won't let it go."
I suspect the likely scenario that prompted this document to be written was something like a manager who faces low morale on his team and has just been asked to explain why there was a catastrophic failure that he hadn't communicated upwards. Likely, he hadn't been doing his job properly and had no idea how much work his team was actually doing. The team was massively overloaded and worried about the job culls in other departments, worried because their boss kept saying things like "this was due yesterday", and so had been doing everything possible to stop the proverbial hitting the fan... and one day it reached bursting point, and they simply couldn't cope with all the work, despite already being forced to do overtime. Maybe some of them had even quit as a result, and complained to HR about the work-life balance in the team.
But the team leader can't possibly be at fault. This is the management spin on it: it's all the team members' fault, and the poor manager had no idea what was going on, not because he was a terrible manager, but because the team had been deliberately hiding all the work they were doing from him. They didn't want to go home to their wives and kids, but were choosing to spend their evenings working on secret projects to stoke their own egos or deal with their own insecurities, and concealing all the extra work from their managers.
For instance, a team member might notice a recurring pattern and repeatedly save the SLA by addressing it immediately. While this quick fix is heroic, it should also be escalated for a long-term solution. This way, the hero tackles the immediate issue, and the team ensures that such heroism isn't needed in the future.
I’m thinking this whole piece is slanted to correct some other toxic or difficult-to-manage culture issue.
Getting to examples quickly saves the piece. Sounds like there are some gung-ho youths happy to be working at Google and they need some mentoring.
Example 1: A client has a deadline, and a malfunction or unpredictable limitation of our product is in their critical path. A few people put in a collaborative effort, meaning working extra hours for a few days, to help them out. Later the customer is happy and the boss throws a celebration drink.
Example 2: An ICT member gets a message over the weekend that could indicate a security breach. He logs in and sees more suspicious activity. He takes first actions (disabling all logins/access matching certain criteria) and calls the head of ICT.
Heroes are great -- SREs who rise to the occasion to prevent horrors are appropriately rewarded and congratulated for their work.
But when a product relies upon heroes to continue operating, you are in a dangerous situation. That's how major outages occur; the hero goes on vacation or decides to let it break this time and the cascade of failures causes huge amounts of damage, where letting the system break much earlier would have made it clear to the development team that there is a major gap in the intrinsic reliability of the system.
Talk to Hollywood? /s
[1] All teams should have a Jordan, a Kobe, a Shaquille, or a combination. One needs A players and a supporting cast. It is not the culture or the org that decides upon the evolution of the heroism. It is the hero who builds a team around him/her. [2] The Scrum or agile saga that promotes the idea that all team members should be able to do what all other team members do is just Excel-minded nonsense. You can't win championships with only goalkeepers, or only midfielders. Nor can you prepare one person to be good at both in a lifetime.
Probably Google wants wheat crops that always look alike and are predictable?
No "hero" ever does this work without trying to plan for it ahead of time. "Heroics" are necessary when the system lets them down and stops letting long-term thinking and planning account for problems.