Looming Liability Machines (LLMs)
The use of Large Language Models for root cause analysis in cloud incidents raises concerns about undermining human expertise, leading to superficial analyses, systemic failures, and risks from unexpected automated behaviors.
The discussion revolves around the use of Large Language Models (LLMs) for root cause analysis (RCA) in cloud incidents. While LLMs can match incidents to handlers and predict root causes, the author expresses concern that relying on LLMs may undermine the development of human expertise in RCA. RCA is a critical process that requires a deep understanding of complex systems and the interplay of various factors, as highlighted by safety engineering expert Nancy Leveson. The author fears that companies might prioritize cost-cutting by substituting LLMs for skilled engineers, leading to superficial analyses and systemic failures. Additionally, the potential for "automation surprise," where automated systems behave unexpectedly, poses risks, especially if users lack a comprehensive understanding of LLM capabilities. The author cites AWS's integration of LLMs for software upgrades, noting the lack of critical feedback on this approach, which raises concerns about overlooking potential operational issues. The overarching message is a caution against over-reliance on LLMs, advocating for a balanced approach that maintains human expertise in safety and reliability.
- LLMs may undermine the development of human expertise in root cause analysis.
- Root cause analysis requires a deep understanding of complex systems and their interactions.
- Over-reliance on LLMs could lead to superficial analyses and systemic failures.
- Automation surprise can occur when automated systems behave unexpectedly.
- Critical feedback on LLM applications in industry is necessary to identify potential risks.
- Many believe LLMs lack the necessary understanding of complex systems, making them unsuitable for accurate RCA.
- There are concerns about the potential for LLMs to produce misleading or superficial analyses, which could undermine human expertise.
- Commenters emphasize the importance of human verification of LLM outputs, especially in critical processes.
- Some argue that the adoption of LLMs in RCA is premature and could lead to dangerous outcomes.
- There is a call for additional error control mechanisms to enhance the reliability of LLM outputs in complex scenarios.
LLMs are good at predicting the next word in written language. They are generative; they make new text given a prompt. LLMs do not have base sets of facts about how complex systems work, and do not attempt to reason over a corpus of evidence and facts. As a result, I would expect that an LLM might concoct an interesting story about why such a failure occurred, and it might even be a convincing story if it happened to weave bits of context accurately into the storyline. It might even, purely by chance, generate a story that correctly diagnosed the root cause of the failure, but that would be coincidental, based on the similarity of the prompt to the text of similar postmortem discussions that were part of its training set.
If you had an extremely detailed postmortem document, then I would expect LLMs to do a very good job of summarizing such a document.
But I don’t see why an LLM is an appropriate tool for analyzing failures in complex systems; just as I don’t see a hammer being a very effective tool for tightening bolts.
Right now, I am concerned that the relative ease that modern frameworks provide for authoring LLM-based applications is leading many people to optimistically include LLM technology in attempts to solve problems it doesn't seem particularly well suited to solve.
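As a rough, hypothetical illustration of that ease (the prompt, model, and log line below are invented, not taken from the article or the comment): a few lines with a mainstream LLM client are enough to produce a confident-sounding "root cause", grounded or not.

```python
# Hypothetical sketch: how little code it takes to bolt an LLM onto "RCA".
# Assumes the openai Python client (>=1.0) and an OPENAI_API_KEY in the
# environment; the prompt, model choice, and log line are illustrative.
from openai import OpenAI

client = OpenAI()

incident_log = "2024-03-02 14:07 UTC: p99 latency spike on checkout-service"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are an SRE performing root cause analysis."},
        {"role": "user", "content": f"Given this incident log, state the root cause:\n{incident_log}"},
    ],
)

# The model returns *a* confident narrative either way; nothing here checks
# it against the actual system, which is the commenter's point.
print(response.choices[0].message.content)
```

The fluency of the reply says nothing about whether the stated cause is real, which is exactly the hammer-for-bolts worry above.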
This means that in most cases, these RCAs are the output of a long and over-engineered incident review process that was designed to impress the higher echelons.
The problem is that in a decently sized corporation you have tens to hundreds of daily fuck-ups (also known as "incidents") that completely suck the free time out of the engineers who have to navigate the long game of the post-incident management process.
The utilisation of LLMs in these cases is just an engineered solution to the problem of organisational bureaucracy.
The rate at which LLMs and LLM-compound systems can produce output > the rate at which humans can verify that output.
I think it follows that we should not use LLMs for anything critical.
The gung-ho adoption and ham-fisted insertion of LLMs into critical processes, like an AWS migration to Java 17 or root cause analysis, is plainly premature, naive, and dangerous.
Even if this accuracy is 95%, in a complex system the probability of getting to the right answer diminishes with each new step added. This is also a key tenet of an agentic system.
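To make that arithmetic concrete (a minimal sketch, assuming independent per-step accuracy, which real agent pipelines won't exactly satisfy): at 95% per step, a ten-step chain lands on the right answer only about 60% of the time.

```python
# Minimal sketch: how per-step accuracy compounds across an agentic chain.
# Assumes each step succeeds independently with the same probability,
# which is a simplification of real pipelines.
per_step_accuracy = 0.95

for steps in (1, 5, 10, 20):
    end_to_end = per_step_accuracy ** steps
    print(f"{steps:2d} steps -> {end_to_end:.2f} chance the final answer is right")

# 1 step -> 0.95, 5 steps -> 0.77, 10 steps -> 0.60, 20 steps -> 0.36
```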
While the analysis in the blog is excellent, an answer still needs to be found: a layer on top of LLMs for error control and checking.
As an analogy, in the OSI network stack, an error-detection mechanism such as the frame check sequence (FCS) detects transmission errors at the data link layer.
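As a sketch of that analogy (illustrative only; nothing equivalent exists yet for checking the semantics of LLM output): an FCS is just a checksum the receiver recomputes independently, so corruption is caught by a check that does not trust the sender, and an "error control" layer over an LLM would similarly need a verifier independent of the generator.

```python
# Sketch of the FCS analogy: the sender appends a checksum (CRC-32 here,
# standing in for a frame check sequence) and the receiver recomputes it to
# *detect* corruption. The point is that detection comes from an independent
# check, not from asking the sender whether the frame looks right.
import zlib

def frame_with_fcs(payload: bytes) -> bytes:
    fcs = zlib.crc32(payload).to_bytes(4, "big")
    return payload + fcs

def receive(frame: bytes) -> bytes:
    payload, fcs = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != fcs:
        # Analogous to flagging a suspect LLM answer for human review.
        raise ValueError("transmission error detected")
    return payload

frame = frame_with_fcs(b"root cause: connection pool exhaustion")
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]   # flip one bit "in transit"

receive(frame)        # passes the independent check
try:
    receive(corrupted)
except ValueError as e:
    print(e)          # the check catches the error
```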
If, for example, you have an LLM agent that has effectively "solved" every security flaw your software may encounter for the next 50 years, then unless it can simultaneously impart 50 years of training to the people who rely on the software, it has done nothing but introduce us to more complex flaws that we would need approximately 49 more years of experience to tackle ourselves.
But you don't offload it in the sense that you expect the tool to completely take the wheel.
You ask it for suggestions to inform a human. If the suggestions turn out to only be a distraction in your environment then you abandon the tool.
For plenty of environments the suggestions will be hugely useful and save you valuable time during an ongoing outage.
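A minimal sketch of that suggestion-only workflow, with hypothetical names throughout (`suggest_causes` stands in for whatever LLM-backed call your tooling exposes): the model proposes candidates, and a human explicitly accepts or rejects each one before anything is recorded or acted on.

```python
# Hypothetical human-in-the-loop sketch: the LLM only proposes candidate
# causes; a human confirms or rejects each one before it goes anywhere.
from typing import List

def suggest_causes(incident_summary: str) -> List[str]:
    # Placeholder for an LLM call; it returns hypotheses, not conclusions.
    return [
        "Connection pool exhaustion after the 14:00 deploy",
        "Retry storm from the mobile client amplifying load",
    ]

def triage(incident_summary: str) -> List[str]:
    accepted = []
    for candidate in suggest_causes(incident_summary):
        answer = input(f"Consider '{candidate}'? [y/N] ").strip().lower()
        if answer == "y":
            accepted.append(candidate)   # the human, not the model, decides
    return accepted

if __name__ == "__main__":
    print(triage("p99 latency spike on checkout-service"))
```

If the suggestions prove to be noise in a given environment, the loop makes it cheap to drop the tool; if they are useful, the human still owns the conclusion.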