Leveraging AI for efficient incident response
Meta has developed an AI-assisted system for root cause analysis, achieving 42% accuracy by combining heuristic retrieval and LLM ranking, significantly improving investigation efficiency while addressing potential risks through feedback and explainability.
Meta has developed an AI-assisted root cause analysis system to enhance the efficiency of incident response and system reliability investigations. The system combines heuristic-based retrieval with large language model (LLM)-based ranking to speed up the identification of root causes during investigations. Testing demonstrated a 42% accuracy rate in pinpointing root causes at the time an investigation is started, particularly within Meta's web monorepo.
The system addresses the complexity of investigating issues in a monolithic repository, where an incident can involve thousands of changes across many teams. Heuristics first narrow the candidate code changes from thousands to a few hundred, and LLM-based ranking then refines that list to the top five candidates, significantly streamlining the investigation process. The Llama 2 model was fine-tuned on historical investigation data to improve its performance.
While the integration of AI offers substantial benefits in reducing investigation time, it also poses risks, such as incorrect root cause suggestions. To mitigate these risks, Meta emphasizes feedback loops and explainability in its AI systems. Future developments may include automating workflows and proactively identifying potential incidents before code is deployed.
- Meta's AI system achieves 42% accuracy in identifying root causes during investigations.
- The system combines heuristic retrieval and LLM ranking to streamline investigations.
- It narrows down potential code changes significantly, improving efficiency.
- Fine-tuning of the Llama 2 model was crucial for enhancing accuracy.
- Meta prioritizes feedback and explainability to mitigate risks associated with AI suggestions.
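To make the two-stage funnel concrete, here is a minimal sketch of that shape in Python. Everything in it (the class fields, the specific heuristics, the `score_with_llm` placeholder) is an illustrative assumption, not Meta's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class CodeChange:
    """A candidate change pulled from monorepo history (hypothetical shape)."""
    diff_id: str
    author_team: str
    files_touched: list[str]
    summary: str
    landed_minutes_before_incident: float

def heuristic_filter(changes: list[CodeChange], affected_paths: set[str],
                     window_minutes: float = 240) -> list[CodeChange]:
    """Stage 1: cheap heuristics cut thousands of changes down to a few hundred.

    Here we keep only changes that landed shortly before the incident and touch
    files under the paths implicated by the alert -- simple stand-ins for the
    code/ownership/runtime heuristics a real system would use.
    """
    return [
        c for c in changes
        if c.landed_minutes_before_incident <= window_minutes
        and any(f.startswith(tuple(affected_paths)) for f in c.files_touched)
    ]

def llm_rank(changes: list[CodeChange], incident_report: str,
             top_k: int = 5) -> list[CodeChange]:
    """Stage 2: score each surviving change against the incident report with an
    LLM and return the top-k candidates."""
    scored = [(score_with_llm(incident_report, c.summary), c) for c in changes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

def score_with_llm(incident_report: str, change_summary: str) -> float:
    """Placeholder: in a real pipeline this would prompt a fine-tuned ranking
    model to estimate how likely the change explains the incident."""
    raise NotImplementedError
```

The point of the split is cost: the heuristic pass is cheap enough to run over every change, and only the survivors are worth an LLM call each.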
Related
Llama 3.1: Our most capable models to date
Meta has launched Llama 3.1 405B, an advanced open-source AI model supporting diverse languages and extended context length. It introduces new features like Llama Guard 3 and aims to enhance AI applications with improved models and partnerships.
Big tech wants to make AI cost nothing
Meta has open-sourced its Llama 3.1 language model for organizations with fewer than 700 million users, aiming to enhance its public image and increase product demand amid rising AI infrastructure costs.
Bypassing Meta's Llama Classifier: A Simple Jailbreak
Robust Intelligence discovered a vulnerability in Meta's Prompt-Guard-86M model, allowing prompt injections to bypass safety measures. The exploit significantly reduced detection accuracy, prompting Meta to work on a fix.
Meta's AI safety system defeated by the space bar
Meta's AI safety system, Prompt-Guard-86M, designed to prevent prompt injection attacks, has been found vulnerable, allowing attackers to bypass safeguards, raising concerns about AI reliability in sensitive applications.
One of the main reasons this works well is that we feed the models our incident playbooks and response knowledge bases.
These playbooks are very carefully written and maintained by people. The current generation of models is pretty much post-human at following them, performing reasoning, and suggesting mitigations.
We tried indexing just a bunch of incident Slack channels and the result was not great. But with explicit documentation, it works well.
Kind of proves what we already know: garbage in, garbage out. But also, other functions, e.g. PM and Design, have tried automating their own workflows, and it doesn't work as well.
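A minimal sketch of what "feeding the models the playbooks" can look like as retrieval-augmented prompting, assuming a generic embedding function and vector index rather than any particular stack:

```python
# Sketch of retrieval-augmented prompting over curated incident playbooks.
# `embed` and `vector_index` are assumed components, not a specific product;
# the key point is that the retrieved context is curated docs, not raw chat logs.

def build_prompt(alert_text: str, vector_index, embed, k: int = 3) -> str:
    """Retrieve the k most relevant playbook sections and prepend them to the
    alert before asking the model for a mitigation suggestion."""
    query_vec = embed(alert_text)
    playbook_sections = vector_index.search(query_vec, top_k=k)
    context = "\n\n".join(section.text for section in playbook_sections)
    return (
        "You are assisting with an incident. Follow the playbook excerpts below.\n\n"
        f"Playbook excerpts:\n{context}\n\n"
        f"Alert:\n{alert_text}\n\n"
        "Suggest the next mitigation step and cite which excerpt it comes from."
    )
```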
IME a very, very large number of impacting incidents aren't strictly tied to “a” code change, if any at all. It _feels_ like there's an implied solution here of tying the running version back to the deployment rev, to the deployment artifacts, and to the VCS.
Boundary conditions and state changes in the distributed system were the biggest bugbear I ran into at AWS. Then below that were all of the “infra”-style failures like network faults, latency, API quota exhaustion, etc. And for all the CloudFormation/CDK/Terraform in the world, it's non-trivial to really discover those effects and tie them to a “code change.” That's totally ignoring older tools that may be managed via CLI or the ol’ point-and-click.
42% accuracy on a tiny, outdated model - surely it would improve significantly by fine-tuning Llama 3.1 405B!
We're taking a slightly different angle than what Facebook published, in that we're primarily using tool calling and observability data to run investigations.
What we've released really shines at surfacing up relevant observability data automatically, and we're soon planning to add the change-tracking elements mentioned in the Facebook post.
If anyone is curious, I did a webinar with PagerDuty on this recently.
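For context, a tool-calling setup like the one described usually exposes a small set of read-only observability tools that the model can decide to invoke during an investigation. The sketch below shows what such definitions might look like; the tool names and schemas are invented for illustration and are not any released product's API.

```python
# Illustrative tool definitions an investigation agent might expose to an LLM.
# The schema follows the common JSON-schema style used by chat-completion APIs;
# the backends behind these tools are assumptions, not a specific vendor's.
OBSERVABILITY_TOOLS = [
    {
        "name": "query_metrics",
        "description": "Fetch a metric time series around the incident window.",
        "parameters": {
            "type": "object",
            "properties": {
                "metric": {"type": "string"},
                "service": {"type": "string"},
                "minutes_back": {"type": "integer", "default": 60},
            },
            "required": ["metric", "service"],
        },
    },
    {
        "name": "search_logs",
        "description": "Search structured logs for error patterns in a service.",
        "parameters": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "query": {"type": "string"},
            },
            "required": ["service", "query"],
        },
    },
]
```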
How would an experienced engineer score on the same task?
Also, some more research in a similar space from other enterprises:
Microsoft: https://yinfangchen.github.io/assets/pdf/rcacopilot_paper.pd...
Salesforce: https://blog.salesforceairesearch.com/pyrca/
Personal plug: I'm building a self-service AIOps platform for engineering teams (somewhat similar to this work by Meta). If you're looking to read more about it, visit -- https://docs.drdroid.io/docs/doctor-droid-aiops-platform
We are currently working on this at: https://github.com/opslane/opslane
We are starting by tackling adding enrichment to your alerts.
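A rough sketch of what alert enrichment can mean in practice, with all field names and helper functions here being hypothetical:

```python
# Sketch of alert enrichment: attach context to a raw alert before paging a human.
def enrich_alert(alert: dict, get_recent_deploys, get_related_runbook) -> dict:
    """Return the alert plus recent deploys for the service and a runbook link,
    so the responder starts with context instead of a bare firing condition."""
    service = alert["service"]
    return {
        **alert,
        "recent_deploys": get_recent_deploys(service, minutes_back=120),
        "runbook": get_related_runbook(alert["alert_name"]),
    }
```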
Unlike a modern LLM (or most any non-trivial NN), a GBDT's feature importance is defensibly rigorous.
After floating the results to a few folks up the chain, we buried it and forgot where.
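For comparison, a GBDT's feature importances are a one-line readout after training; a minimal scikit-learn sketch with made-up incident features:

```python
# Minimal example of GBDT feature importances with scikit-learn.
# The features are synthetic stand-ins for incident signals.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                    # fake feature matrix
y = X[:, 0] + 0.2 * rng.normal(size=500) > 0     # label driven mostly by feature 0

model = GradientBoostingClassifier().fit(X, y)
for name, importance in zip(["lines_changed", "files_touched", "mins_since_deploy"],
                            model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```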
9 times out of 10, you can and should write "using" instead of "leveraging".
User: Ahh, got locked out, contact support and wait
AI 2: The user is not suspicious, unlock account
User: Great, thank you
AI 1: This account is suspicious, lock account
That means the first person on the call doing triage has a 58% chance of being fed misinformation and bias, which they then have to distinguish from reality.
Of course you can't say anything bad about an ML model that you're promoting for your business.