July 12th, 2024

Cradle: Empowering Foundation Agents Towards General Computer Control

The Cradle framework enables foundation agents to interact with software using human-like actions. It includes six modules for input understanding, action planning, and memory storage. Cradle demonstrates versatility in tasks but faces challenges in spatial perception.


The Cradle framework aims to empower foundation agents to perform diverse computer tasks using a unified interface that mimics human interactions with computers. By utilizing screenshots as input and keyboard and mouse actions as output, Cradle enables agents to interact with any software without relying on specific APIs. The framework consists of six key modules to facilitate understanding input, planning actions, and memory storage for past experiences. Experimental results demonstrate Cradle's ability to generalize across various tasks, including completing missions in games like Red Dead Redemption 2, managing cities in Cities: Skylines, and conducting profitable transactions in Dealer's Life 2. While Cradle shows impressive performance in many scenarios, challenges remain in tasks requiring precise spatial perception and interaction with complex software interfaces. Overall, Cradle represents a significant step towards developing generalist agents capable of mastering a wide range of computer tasks without the need for specialized APIs.
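The summary above describes a loop in which screenshots go in and executable keyboard/mouse code comes out, mediated by modules for input understanding, planning, and memory. A minimal sketch of that loop follows; the module bodies and the returned action snippet are illustrative stubs (not Cradle's actual prompts or code), assuming a Python agent in the spirit of the open-source repo:

```python
# Minimal sketch of a Cradle-style control loop: a screenshot is summarized,
# a (stubbed) multimodal model proposes executable low-level control code,
# and the step is appended to memory for future planning.

def information_gathering(screenshot):
    """Turn the current screenshot into a text observation (stubbed)."""
    return f"screen shows: {screenshot}"

def action_planning(observation, memory):
    """Ask the model for executable keyboard/mouse code (stubbed).

    A real agent would prompt a multimodal model with the observation and
    retrieved memories; here we return a canned action string.
    """
    return "click(100, 200)"

def run_step(screenshot, memory):
    obs = information_gathering(screenshot)
    code = action_planning(obs, memory)
    memory.append((obs, code))  # store the experience for later steps
    return code

memory = []
code = run_step("main menu", memory)
print(code)  # the agent's proposed low-level action, e.g. a mouse click
```

The key design point the paper emphasizes survives even in this stub: no application-specific API appears anywhere; the only outputs are generic keyboard and mouse actions.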

8 comments
By @eamag - 3 months
I'm looking through the code — does it mean that the authors wrote all the basic skills themselves and let the LLM choose from them? If so, this approach can't be generalised, can it?

https://github.com/BAAI-Agents/Cradle/pull/44/files#diff-3f3...

By @Art9681 - 3 months
Fantastic. This is why efforts to defeat web scrapers will ultimately prove futile unless human/computer interfaces require constant biometric authentication. I imagine in some dark timeline, content will not be displayed unless the finger touching the trackpad is a human finger. Or the keyboard keys won't register unless they detect a fingerprint or other bio signature. Same thing with online multiplayer games: only approved controllers with some future tech that constantly polls the fingerprint pressing the buttons to ensure it is a human.

Perhaps something like the eye tracking tech in modern vehicles to ensure you're paying attention if the lane assist is turned on.

Of course, that would be awful. But what other recourse is there?

By @Fripplebubby - 3 months
Very cool work. Note that this was done with a vanilla GPT-4o model; all the magic dust is in the prompting, the dataflow between stages (info gathering, self-reflection, task inference, skill curation, action planning), and some added tooling like object detection / bounding boxes / icon detection.
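The staged dataflow this comment describes can be sketched as a simple pipeline where each stage is a separate prompt to the same base model, with earlier outputs fed forward. The stage names come from the paper; the model call here is a stand-in stub, not the real prompts:

```python
# Hedged sketch of a staged prompting pipeline: five named stages, each
# receiving the accumulated context from the stages before it.

def call_model(prompt):
    # Stand-in for a real multimodal model call; for demonstration it
    # simply echoes the stage name at the front of the prompt.
    return prompt.split(":")[0]

def run_pipeline(observation):
    context = {"observation": observation}
    for stage in ["info_gathering", "self_reflection",
                  "task_inference", "skill_curation", "action_planning"]:
        # Each stage sees everything produced so far.
        context[stage] = call_model(f"{stage}: {context}")
    return context

result = run_pipeline("screenshot of an Outlook compose window")
print(result["action_planning"])  # final stage's (stubbed) output
```

The point of structuring it this way is that no stage needs a fine-tuned model: the separation of concerns is carried entirely by the prompts and the ordering.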

Also, neither here nor there, but I enjoyed the paper's discussion of how the model performed surprisingly poorly at sending an email in Outlook: it understood the task and how to send an email perfectly well, yet Outlook's UI still managed to confuse it - can relate.

By @lucianbr - 3 months
What happened to Robotic Process Automation? Wasn't that supposed to be this?
By @cs702 - 3 months
Wow, this looks amazing.

The authors have developed Cradle, a multimodal-LLM-powered agent framework with six modules: Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory. Once Cradle has processed high-level instructions, its inputs are sequences of computer screenshots. Its output is executable code for low-level keyboard and mouse control, enabling Cradle to interact with any software and complete long-horizon complex tasks without relying on any built-in APIs:

       Oversimplified Big-Picture Diagram

                 +------------+
                 |   Cradle   |    executable code
  screenshots -> |(high-level | -> for controlling
                 |  planning) |    keyboard & mouse
                 +------------+
The authors' experiments show what looks to me like impressive generalization and performance, both across everyday software like Chrome and Outlook and across commercial video games: it is able to follow the main storyline and complete 40-minute-long missions in Red Dead Redemption 2, create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and trade and bargain to make a profit in Dealer's Life 2.

There are of course many caveats - the technology is still in its infancy - but still, I'm impressed at how quickly things are progressing.

We sure live in interesting times!

By @yawnxyz - 3 months
weird nitpick — they keep mentioning LMM, but do they mean LLMs?
By @dinkblam - 3 months
i am waiting for version 2.0 "Enclave"