Looming Liability Machines (LLMs)
The use of Large Language Models for root cause analysis in cloud incidents raises concerns about undermining human expertise, leading to superficial analyses, systemic failures, and risks from unexpected automated behaviors.
The discussion revolves around the use of Large Language Models (LLMs) for root cause analysis (RCA) in cloud incidents. While LLMs can match incidents to handlers and predict root causes, the author expresses concern that relying on LLMs may undermine the development of human expertise in RCA. RCA is a critical process that requires a deep understanding of complex systems and the interplay of various factors, as highlighted by safety engineering expert Nancy Leveson. The author fears that companies might prioritize cost-cutting by substituting LLMs for skilled engineers, leading to superficial analyses and systemic failures. Additionally, the potential for "automation surprise," where automated systems behave unexpectedly, poses risks, especially if users lack a comprehensive understanding of LLM capabilities. The author cites AWS's integration of LLMs for software upgrades, noting the lack of critical feedback on this approach, which raises concerns about overlooking potential operational issues. The overarching message is a caution against over-reliance on LLMs, advocating for a balanced approach that maintains human expertise in safety and reliability.
- LLMs may undermine the development of human expertise in root cause analysis.
- Root cause analysis requires a deep understanding of complex systems and their interactions.
- Over-reliance on LLMs could lead to superficial analyses and systemic failures.
- Automation surprise can occur when automated systems behave unexpectedly.
- Critical feedback on LLM applications in industry is necessary to identify potential risks.
- Many believe LLMs lack the necessary understanding of complex systems, making them unsuitable for accurate RCA.
- There are concerns about the potential for LLMs to produce misleading or superficial analyses, which could undermine human expertise.
- Commenters emphasize the importance of human verification of LLM outputs, especially in critical processes.
- Some argue that the adoption of LLMs in RCA is premature and could lead to dangerous outcomes.
- There is a call for additional error control mechanisms to enhance the reliability of LLM outputs in complex scenarios.
LLMs are good at predicting the next word in written language. They are generative; they make new text given a prompt. LLMs do not have base sets of facts about how complex systems work, and do not attempt to reason over a corpus of evidence and facts. As a result, I would expect that an LLM might concoct an interesting story about why such a failure occurred, and it might even be a convincing story if it happened to weave bits of context accurately into the storyline. It might even, purely by chance, generate a story that correctly diagnosed the root cause of the failure, but that would be coincidental, based on the similarity of the prompt to the text of similar postmortem discussions that were part of its training set.
If you had an extremely detailed postmortem document, then I would expect LLMs to do a very good job of summarizing such a document.
But I don’t see why an LLM is an appropriate tool for analyzing failures in complex systems; just as I don’t see a hammer being a very effective tool for tightening bolts.
Right now, I am concerned that the relative ease that modern frameworks provide for authoring LLM-based applications is leading many people to optimistically include LLM technology in attempts to solve problems it doesn't seem particularly well suited to solve.
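As a rough, hypothetical illustration of that ease (the prompt, model, and log line below are invented, not taken from the article or the comment): a few lines with a mainstream LLM client are enough to produce a confident-sounding "root cause", grounded or not.

```python
# Hypothetical sketch: how little code it takes to bolt an LLM onto "RCA".
# Assumes the openai Python client (>=1.0) and an OPENAI_API_KEY in the
# environment; the prompt, model choice, and log line are illustrative.
from openai import OpenAI

client = OpenAI()

incident_log = "2024-03-02 14:07 UTC: p99 latency spike on checkout-service"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are an SRE performing root cause analysis."},
        {"role": "user", "content": f"Given this incident log, state the root cause:\n{incident_log}"},
    ],
)

# The model returns *a* confident narrative either way; nothing here checks
# it against the actual system, which is the commenter's point.
print(response.choices[0].message.content)
```

The fluency of the reply says nothing about whether the stated cause is real, which is exactly the hammer-for-bolts worry above.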
This means that in most cases, these RCAs are the output of a long and over-engineered incident review process that was designed to impress the higher echelons.
The problem is that in a decently sized corporation you have tens to hundreds of daily fuck-ups (also known as "incidents") that completely suck the free time out of the engineers who have to navigate the long game of the post-incident management process.
The utilisation of LLMs in these cases is just an engineered solution to the problem of organisational bureaucracy.
The rate at which LLMs and LLM-compound systems can produce output > the rate at which humans can verify that output.
I think it follows that we should not use LLMs for anything critical.
The gung-ho adoption and ham-fisted insertion of LLMs into critical processes, like an AWS migration to Java 17 or root cause analysis, is plainly premature, naive, and dangerous.
Even if this accuracy is 95%, in a complex system the probability of getting to the right answer diminishes with each new step added. This is also a key tenet of an agentic system.
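To make that arithmetic concrete (a minimal sketch, assuming independent per-step accuracy, which real agent pipelines won't exactly satisfy): at 95% per step, a ten-step chain lands on the right answer only about 60% of the time.

```python
# Minimal sketch: how per-step accuracy compounds across an agentic chain.
# Assumes each step succeeds independently with the same probability,
# which is a simplification of real pipelines.
per_step_accuracy = 0.95

for steps in (1, 5, 10, 20):
    end_to_end = per_step_accuracy ** steps
    print(f"{steps:2d} steps -> {end_to_end:.2f} chance the final answer is right")

# 1 step -> 0.95, 5 steps -> 0.77, 10 steps -> 0.60, 20 steps -> 0.36
```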
While the analysis in the blog is excellent, an answer still needs to be found: a layer on top of LLMs for error control and checking.
As an analogy, in the OSI network stack, an error-detection mechanism such as the frame check sequence (FCS) detects transmission errors at the data link layer.
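As a sketch of that analogy (illustrative only; nothing equivalent exists yet for checking the semantics of LLM output): an FCS is just a checksum the receiver recomputes independently, so corruption is caught by a check that does not trust the sender, and an "error control" layer over an LLM would similarly need a verifier independent of the generator.

```python
# Sketch of the FCS analogy: the sender appends a checksum (CRC-32 here,
# standing in for a frame check sequence) and the receiver recomputes it to
# *detect* corruption. The point is that detection comes from an independent
# check, not from asking the sender whether the frame looks right.
import zlib

def frame_with_fcs(payload: bytes) -> bytes:
    fcs = zlib.crc32(payload).to_bytes(4, "big")
    return payload + fcs

def receive(frame: bytes) -> bytes:
    payload, fcs = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != fcs:
        # Analogous to flagging a suspect LLM answer for human review.
        raise ValueError("transmission error detected")
    return payload

frame = frame_with_fcs(b"root cause: connection pool exhaustion")
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]   # flip one bit "in transit"

receive(frame)        # passes the independent check
try:
    receive(corrupted)
except ValueError as e:
    print(e)          # the check catches the error
```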
If, for example, you have an LLM agent that has effectively "solved" every security flaw your software may encounter for the next 50 years, then unless it can simultaneously impart 50 years of training to the people who rely on the software, it has done nothing but introduce us to more complex flaws that we would need approximately 49 more years of experience to tackle ourselves.
But you don't offload it in the sense that you expect the tool to completely take the wheel.
You ask it for suggestions to inform a human. If the suggestions turn out to only be a distraction in your environment then you abandon the tool.
For plenty of environments the suggestions will be hugely useful and save you valuable time during an ongoing outage.
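A minimal sketch of that suggestion-only workflow, with hypothetical names throughout (`suggest_causes` stands in for whatever LLM-backed call your tooling exposes): the model proposes candidates, and a human explicitly accepts or rejects each one before anything is recorded or acted on.

```python
# Hypothetical human-in-the-loop sketch: the LLM only proposes candidate
# causes; a human confirms or rejects each one before it goes anywhere.
from typing import List

def suggest_causes(incident_summary: str) -> List[str]:
    # Placeholder for an LLM call; it returns hypotheses, not conclusions.
    return [
        "Connection pool exhaustion after the 14:00 deploy",
        "Retry storm from the mobile client amplifying load",
    ]

def triage(incident_summary: str) -> List[str]:
    accepted = []
    for candidate in suggest_causes(incident_summary):
        answer = input(f"Consider '{candidate}'? [y/N] ").strip().lower()
        if answer == "y":
            accepted.append(candidate)   # the human, not the model, decides
    return accepted

if __name__ == "__main__":
    print(triage("p99 latency spike on checkout-service"))
```

If the suggestions prove to be noise in a given environment, the loop makes it cheap to drop the tool; if they are useful, the human still owns the conclusion.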