August 23rd, 2024

Leveraging AI for efficient incident response

Meta has developed an AI-assisted system for root cause analysis, achieving 42% accuracy by combining heuristic retrieval and LLM ranking, significantly improving investigation efficiency while addressing potential risks through feedback and explainability.

Meta has developed an AI-assisted root cause analysis system to enhance the efficiency of incident response and system reliability investigations. This system combines heuristic-based retrieval and large language model (LLM)-based ranking to expedite the identification of root causes during investigations. Testing has demonstrated a 42% accuracy rate in pinpointing root causes at the time of investigation initiation, particularly within their web monorepo. The system addresses the complexities of investigating issues in monolithic repositories, which can involve numerous changes across various teams. By narrowing down potential code changes from thousands to a few hundred using heuristics, and then further refining this list to the top five candidates through LLM ranking, the system significantly streamlines the investigation process.

The Llama 2 model was fine-tuned with historical investigation data to enhance its performance. While the integration of AI offers substantial benefits in reducing investigation time, it also poses risks, such as the potential for incorrect root cause suggestions. To mitigate these risks, Meta emphasizes the importance of feedback loops and explainability in their AI systems. Future developments may include automating workflows and proactively identifying potential incidents before code deployment.

- Meta's AI system achieves 42% accuracy in identifying root causes during investigations.

- The system combines heuristic retrieval and LLM ranking to streamline investigations.

- It narrows down potential code changes significantly, improving efficiency.

- Fine-tuning of the Llama 2 model was crucial for enhancing accuracy.

- Meta prioritizes feedback and explainability to mitigate risks associated with AI suggestions.
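
As a rough illustration of the two-stage funnel described above (heuristics cut thousands of candidate changes down to a few hundred, then an LLM ranks the survivors and surfaces the top five), here is a minimal sketch. It is not Meta's implementation; the heuristics, data fields, and the `llm.complete` interface are all hypothetical.

```python
# Hypothetical sketch of the two-stage funnel: heuristics shrink the candidate
# set, then an LLM ranks the survivors and returns the top five changes.
from dataclasses import dataclass


@dataclass
class CodeChange:
    change_id: str
    author: str
    files: list[str]
    summary: str
    landed_at: float  # unix timestamp


@dataclass
class Incident:
    description: str
    started_at: float          # unix timestamp
    suspect_paths: list[str]   # code paths implicated by alerts or ownership


def heuristic_filter(changes: list[CodeChange], incident: Incident) -> list[CodeChange]:
    """Keep changes that are plausibly related: recent and touching suspect paths.
    (Illustrative only; the actual heuristics are not public.)"""
    return [
        c for c in changes
        if c.landed_at >= incident.started_at - 24 * 3600
        and any(f.startswith(p) for f in c.files for p in incident.suspect_paths)
    ]


def rank_with_llm(candidates: list[CodeChange], incident: Incident, llm) -> list[CodeChange]:
    """Score each candidate against the incident description with an LLM and
    return the five highest-scoring changes. The model is assumed to reply
    with a bare number for this sketch."""
    scored = []
    for change in candidates:
        prompt = (
            f"Incident: {incident.description}\n"
            f"Change {change.change_id} by {change.author}: {change.summary}\n"
            "On a scale of 0-100, how likely is this change to be the root cause?"
        )
        scored.append((float(llm.complete(prompt)), change))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [change for _, change in scored[:5]]


def investigate(all_changes: list[CodeChange], incident: Incident, llm) -> list[CodeChange]:
    shortlist = heuristic_filter(all_changes, incident)  # thousands -> hundreds
    return rank_with_llm(shortlist, incident, llm)       # hundreds -> top 5
```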

18 comments
By @LASR - 6 months
We've shifted our oncall incident response over to mostly AI at this point. And it works quite well.

One of the main reasons why this works well is because we feed the models our incident playbooks and response knowledge bases.

These playbooks are very carefully written and maintained by people. The current generation of models are pretty much post-human in following them, performing reasoning and suggesting mitigations.

We tried indexing just a bunch of incident Slack channels, and the result was not great. But with explicit documentation, it works well.

Kind of proves what we already know: garbage in, garbage out. Other functions, e.g. PM and Design, have tried automating their own workflows, but it doesn't work as well.
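
The pattern this comment describes can be sketched very simply: retrieve the curated playbook for the affected service and put it in the prompt, rather than indexing raw chat history. The playbook store and the `llm.complete` helper below are placeholders, not any particular product's API.

```python
# Sketch of playbook-grounded incident response: ground the model in the
# human-maintained runbook for the failing service instead of Slack scrollback.
PLAYBOOKS = {
    "checkout-service": (
        "1. Check the error-rate dashboard for the checkout tier.\n"
        "2. If the spike started after a deploy, roll back the last release.\n"
        "3. Otherwise escalate to the payments on-call."
    ),
    # ... one carefully maintained runbook per service ...
}


def suggest_mitigation(alert: dict, llm) -> str:
    """Build a prompt from the relevant playbook plus the alert, and ask the
    model which step applies and what mitigation it suggests."""
    playbook = PLAYBOOKS.get(alert["service"], "")
    prompt = (
        "You are the on-call assistant. Follow the playbook exactly.\n\n"
        f"Playbook:\n{playbook}\n\n"
        f"Alert: {alert['title']}\n{alert['details']}\n\n"
        "Which playbook step applies, and what mitigation do you suggest?"
    )
    return llm.complete(prompt)
```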

By @donavanm - 6 months
I'm really interested in the implied restriction/focus on “code changes.”

IME a very large number of impacting incidents aren't strictly tied to “a” code change, if any at all. It _feels_ like there's an implied solution tying the running version back to the deployment rev, to deployment artifacts, and to the VCS.

Boundary conditions and state changes in the distributed system were the biggest bugbear I ran into at AWS. Below that were all of the “infra”-style failures like network faults, latency, API quota exhaustion, etc. And for all the cloudformation/cdk/terraform in the world, it's non-trivial to really discover those effects and tie them to a “code change.” Totally ignoring older tools that may be managed via CLI or the ol’ point and click.

By @pants2 - 6 months
> The biggest lever to achieving 42% accuracy was fine-tuning a Llama 2 (7B) model

42% accuracy on a tiny, outdated model - surely it would improve significantly by fine-tuning Llama 3.1 405B!
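
The post doesn't describe Meta's training stack, but for context, supervised fine-tuning of a 7B model on historical investigations is mechanically straightforward. Below is a rough sketch using Hugging Face's trl and peft as a stand-in; the dataset fields, checkpoint name, and hyperparameters are illustrative assumptions, not details from the post.

```python
# Illustrative SFT setup: turn (investigation context, confirmed root cause)
# pairs into text examples and fine-tune a 7B Llama model with LoRA adapters.
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

examples = [
    {
        "text": (
            "Incident: error spike in the web tier after a push.\n"
            "Candidate changes: D123, D456, D789.\n"
            "Root cause: D456"
        )
    },
    # ... thousands of historical investigations ...
]

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",
    train_dataset=Dataset.from_list(examples),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="rca-llama2-7b",
        dataset_text_field="text",
        num_train_epochs=3,
    ),
)
trainer.train()
```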

By @nyellin - 6 months
We've open sourced something with similar goals that you can use today: https://github.com/robusta-dev/holmesgpt/

We're taking a slightly different angle than what Facebook published, in that we're primarily using tool calling and observability data to run investigations.

What we've released really shines at surfacing relevant observability data automatically, and we're soon planning to add the change-tracking elements mentioned in the Facebook post.

If anyone is curious, I did a webinar with PagerDuty on this recently.
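
The general shape of a tool-calling investigation loop looks roughly like the sketch below: the model decides which observability data to pull next until it can explain the alert. The tool names and the `llm.next_action` interface are placeholders, not HolmesGPT's actual API.

```python
# Generic tool-calling loop for an alert investigation (placeholder interfaces).
def fetch_pod_logs(namespace: str, pod: str) -> str:
    """Placeholder: in practice this would hit the Kubernetes API or a log store."""
    return f"<logs for {namespace}/{pod}>"


def fetch_metrics(promql: str) -> str:
    """Placeholder: in practice this would run a PromQL query."""
    return f"<metric results for {promql}>"


TOOLS = {"fetch_pod_logs": fetch_pod_logs, "fetch_metrics": fetch_metrics}


def investigate(alert: str, llm, max_steps: int = 5) -> str:
    """Let the model alternate between calling tools and concluding, keeping a
    running transcript of what it has seen so far."""
    history = [f"Alert: {alert}"]
    for _ in range(max_steps):
        step = llm.next_action("\n".join(history), tools=list(TOOLS))
        if step.tool is None:  # model is ready to state its conclusion
            return step.answer
        result = TOOLS[step.tool](**step.arguments)
        history.append(f"{step.tool}({step.arguments}) -> {result}")
    return "\n".join(history)  # ran out of steps; return the evidence gathered
```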

By @mafribe - 6 months
The paper goes out of its way not to compare the 42% figure with anything. Is "42% within the top 5 suggestions" good or bad?

How would an experienced engineer score on the same task?

By @TheBengaluruGuy - 6 months
Interesting. Just a few weeks back, I was reading about their previous work https://atscaleconference.com/the-evolution-of-aiops-at-meta... -- didn't realise there's more work!

Also, some more research in a similar space from other enterprises:

Microsoft: https://yinfangchen.github.io/assets/pdf/rcacopilot_paper.pd...

Salesforce: https://blog.salesforceairesearch.com/pyrca/

Personal plug: I'm building a self-service AIOps platform for engineering teams (somewhat similar to this work by Meta). If you're looking to read more about it, visit -- https://docs.drdroid.io/docs/doctor-droid-aiops-platform

By @MOARDONGZPLZ - 6 months
I would love if they leveraged AI to detect AI on the regular Facebook feed. I visit occasionally and it’s just a wasteland of unbelievable AI content with tens of thousands of bot (I assume…) likes. Makes me sick to my stomach and I can’t even browse.
By @aray07 - 6 months
I do think AI will automate a lot of the grunt work involved with incidents and make the life of on-call engineers better.

We are currently working on this at: https://github.com/opslane/opslane

We are starting by adding enrichment to your alerts.

By @benreesman - 6 months
Way back in the day on FB Ads we trained a GBDT on a bunch of features extracted from the diff that had been (post-hoc) identified as the cause of a SEV.

Unlike a modern LLM (or most any non-trivial NN), a GBDT’s feature importance is defensibly rigorous.

After floating the results to a few folks up the chain we buried it and forgot where.
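
For readers unfamiliar with the approach, here is a rough sketch of what that looks like with scikit-learn; the feature set is made up for illustration and is not the one used on FB Ads.

```python
# Sketch: gradient-boosted model over hand-extracted diff features, with
# per-feature importances that can be inspected and defended directly.
from sklearn.ensemble import GradientBoostingClassifier

FEATURES = ["lines_changed", "files_touched", "touches_config", "author_recent_sevs"]


def extract_features(diff) -> list[float]:
    """Turn one diff into a fixed-length feature vector (placeholder logic)."""
    return [float(getattr(diff, name, 0.0)) for name in FEATURES]


def train(X, y):
    """X: feature vectors for historical diffs; y: 1 if the diff was later
    identified (post-hoc) as the root cause of a SEV, else 0."""
    model = GradientBoostingClassifier().fit(X, y)
    for name, importance in zip(FEATURES, model.feature_importances_):
        print(f"{name}: {importance:.3f}")  # transparent, per-feature attribution
    return model
```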

By @BurningFrog - 6 months
PSA:

9 times out of 10, you can and should write "using" instead of "leveraging".

By @AeZ1E - 6 months
nice to see meta investing in AI investigation tools! but 42% accuracy doesn't sound too impressive to me... maybe there's still some fine-tuning needed for better results? glad to hear about the progress though!
By @ketzo - 6 months
This is really cool. My optimistic take on GenAI, at least with regard to software engineering, is that it seems like we're gonna have a lot of the boring / tedious parts of our jobs get a lot easier!
By @coding123 - 6 months
AI 1: This user is suspicious, lock account

User: Ahh, got locked out, contact support and wait

AI 2: The user is not suspicious, unlock account

User: Great, thank you

AI 1: This account is suspicious, lock account

By @_pdp_ - 6 months
I'd be more interested to understand how they deal with injection attacks. Any alert where the attacker controls some part of the text that ends up in the model could be used to either evade it or, worse, to hack it. Slack had an issue like that recently.
By @devneelpatel - 6 months
This is exactly what we do at OneUptime.com. We show you AI-generated possible incident remediations based on your data + telemetry + code. All of this is 100% open-source.
By @minkles - 6 months
I'm going to point out the obvious problem here: 42% RC identification is shit.

That means the first person on the call doing the triage has a 58% chance of being fed misinformation and bias which they have to distinguish from reality.

Of course you can't say anything bad about an ML model that you are promoting for your business.