July 3rd, 2024

AI Agents That Matter

The article discusses the challenges in evaluating AI agents and proposes solutions to improve their development. AI agents are systems that use large language models (LLMs) to perform real-world tasks such as booking flights or fixing software bugs. The goal is to build assistants like Siri or Alexa that can handle complex tasks accurately and reliably. Current evaluation practices, however, have pitfalls that let agents score well on benchmarks while remaining of little use in practice. The paper recommends cost-controlled evaluations, jointly optimizing accuracy and cost, distinguishing between model evaluation and downstream evaluation, preventing shortcuts in agent benchmarks, and improving standardization and reproducibility. The authors stress that rigorous evaluation practices are needed to advance AI agent research and development. Despite these challenges, they are cautiously optimistic about the future of AI agents, highlighting the importance of addressing reliability issues and rethinking benchmarking practices to drive progress in the field.
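The joint accuracy–cost optimization the paper recommends amounts to comparing agents on a Pareto frontier rather than on accuracy alone. A minimal sketch of that idea (the agent names, scores, and costs here are invented for illustration):

```python
def pareto_frontier(agents):
    """Keep agents that no other agent dominates.

    Agent b dominates agent a if b is at least as accurate and at most
    as costly, and strictly better on at least one of the two axes.
    """
    frontier = []
    for a in agents:
        dominated = any(
            b["accuracy"] >= a["accuracy"]
            and b["cost"] <= a["cost"]
            and (b["accuracy"] > a["accuracy"] or b["cost"] < a["cost"])
            for b in agents
        )
        if not dominated:
            frontier.append(a)
    return frontier

# Hypothetical evaluation results: accuracy on some benchmark, cost in $ per task.
agents = [
    {"name": "big-agent", "accuracy": 0.90, "cost": 2.00},
    {"name": "small-agent", "accuracy": 0.85, "cost": 0.50},
    {"name": "mid-agent", "accuracy": 0.80, "cost": 1.00},  # dominated by small-agent
]
print([a["name"] for a in pareto_frontier(agents)])  # → ['big-agent', 'small-agent']
```

A leaderboard ranked only by accuracy would hide that "mid-agent" is strictly worse than a cheaper, more accurate alternative, which is exactly the failure mode cost-controlled evaluation is meant to expose.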

7 comments
By @crystal_revenge - 5 months
> The term agent has been used by AI researchers without a formal definition [1]

> [1] In traditional AI, agents are defined entities that perceive and act upon their environment, but that definition is less useful in the LLM era — even a thermostat would qualify as an agent under that definition.

I'm a huge believer in the power of agents, but this kind of complete ignorance of the history of AI gets frustrating. The statement betrays a gross misunderstanding of how simple agents have been viewed.

If you're serious about agents, then Minsky's The Society of Mind should be on your desk. From the opening chapter:

> We want to explain intelligence as a combination of simpler things. This means that we must be sure to check, at every step, that none of our agents is, itself, intelligent... Accordingly, whenever we find that an agent has to do anything complicated, we'll replace it with a subsociety of agents that do simpler things.

Instead, this write-up completely ignores the logic of one of the seminal writings on the topic (it's fine to disagree with Minsky, I certainly do, but you need to at least acknowledge him) and jumps straight to the assumption that the future of agents must be immensely complex.

Automatic thermostats existed in the early days of research on agents, and the key to a thermostat being an agent is its ability to communicate with other agents automatically and collectively perform complex actions.
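The traditional perceive-and-act definition the footnote dismisses is easy to state in code. A toy sketch (class and method names are invented for illustration, not from any framework):

```python
class Thermostat:
    """A classic minimal agent: it perceives its environment (temperature)
    and acts on it (switching a heater), exactly the definition from
    traditional AI that the footnote refers to."""

    def __init__(self, setpoint):
        self.setpoint = setpoint
        self.temp = None

    def perceive(self, temp):
        # Sense the environment.
        self.temp = temp

    def act(self):
        # Act on the environment based on the percept.
        return "heat_on" if self.temp < self.setpoint else "heat_off"


t = Thermostat(setpoint=20.0)
t.perceive(18.0)
print(t.act())  # → heat_on
```

Minsky's point is that intelligence can emerge from societies of such deliberately unintelligent agents; the simplicity is the feature, not a flaw in the definition.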

By @behnamoh - 5 months
Agentic behavior is instilled in RLHF'd models [0, 1]. In this paper, "agency" is defined in terms of the LLM thinking like it's a person (agent) with consistent thoughts (outputs). The problem is this also severely limits the LLM's ability to break out from the norm and explore unknown paths (thoughts).

Most (if not all) agent frameworks use GPT-4, Claude Opus, etc. models which are heavily RLHF'd.

[0]: https://arxiv.org/abs/2406.05587 [1]: https://news.ycombinator.com/item?id=40702617

By @dmezzetti - 5 months
Agents have been recast as "Agentic Workflows". This is the trendy new term. The problem is that it's a complex solution and not a place to start. Newcomers read an article/blog/post/paper and get fixated on this solution.

Non-technical stakeholders also get fixated on this idea of AI agents autonomously working together. Can we save money? Perhaps even replace some people? With a wide imagination and without a solid grounding in reality, it's easy to see how that conclusion gets drawn.

While agents may have a place, we in the AI space risk a credibility trap if this is pushed as the answer. There are plenty of wins available to organizations with no "AI" in place at all. Retrieval-Augmented Generation (RAG) is hard in its own right, but there is now a reasonable path to success with it.

Otherwise, expect disappointment. Then the whole space will be lumped together as a failure.
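The RAG pattern the comment points to reduces to retrieve-then-generate. A toy sketch using keyword overlap as the retriever (a real system would use embedding search and an actual LLM call; the documents and function names here are invented):

```python
def retrieve(query, docs, k=2):
    """Rank documents by word overlap with the query; return the top k."""
    q_words = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]


def build_prompt(query, context):
    """Stuff the retrieved passages into the generation prompt."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


docs = [
    "Agents can book flights and fix bugs.",
    "RAG retrieves documents before generation.",
    "Minsky wrote The Society of Mind.",
]
top = retrieve("how does rag retrieve documents", docs, k=1)
prompt = build_prompt("how does rag retrieve documents", top)
```

Compared with multi-agent workflows, this pipeline has one moving part per stage, which is why it offers the "reasonable path to success" the comment describes.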

By @htrp - 5 months
> The North Star of this field is to build assistants like Siri or Alexa and get them to actually work — handle complex tasks, accurately interpret users’ requests, and perform reliably.

A controversial opinion, especially given the hand-tuning that those two go through.