July 12th, 2024

Cradle: Empowering Foundation Agents Towards General Computer Control

The Cradle framework enables foundation agents to interact with software using human-like actions. It includes six modules for input understanding, action planning, and memory storage. Cradle demonstrates versatility in tasks but faces challenges in spatial perception.


The Cradle framework aims to empower foundation agents to perform diverse computer tasks using a unified interface that mimics human interactions with computers. By utilizing screenshots as input and keyboard and mouse actions as output, Cradle enables agents to interact with any software without relying on specific APIs. The framework consists of six key modules to facilitate understanding input, planning actions, and memory storage for past experiences. Experimental results demonstrate Cradle's ability to generalize across various tasks, including completing missions in games like Red Dead Redemption 2, managing cities in Cities: Skylines, and conducting profitable transactions in Dealer's Life 2. While Cradle shows impressive performance in many scenarios, challenges remain in tasks requiring precise spatial perception and interaction with complex software interfaces. Overall, Cradle represents a significant step towards developing generalist agents capable of mastering a wide range of computer tasks without the need for specialized APIs.
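The summary above describes a loop in which screenshots go in and executable keyboard/mouse code comes out, mediated by modules for input understanding, planning, and memory. A minimal sketch of that loop follows; the module bodies and the returned action snippet are illustrative stubs (not Cradle's actual prompts or code), assuming a Python agent in the spirit of the open-source repo:

```python
# Minimal sketch of a Cradle-style control loop: a screenshot is summarized,
# a (stubbed) multimodal model proposes executable low-level control code,
# and the step is appended to memory for future planning.

def information_gathering(screenshot):
    """Turn the current screenshot into a text observation (stubbed)."""
    return f"screen shows: {screenshot}"

def action_planning(observation, memory):
    """Ask the model for executable keyboard/mouse code (stubbed).

    A real agent would prompt a multimodal model with the observation and
    retrieved memories; here we return a canned action string.
    """
    return "click(100, 200)"

def run_step(screenshot, memory):
    obs = information_gathering(screenshot)
    code = action_planning(obs, memory)
    memory.append((obs, code))  # store the experience for later steps
    return code

memory = []
code = run_step("main menu", memory)
print(code)  # the agent's proposed low-level action, e.g. a mouse click
```

The key design point the paper emphasizes survives even in this stub: no application-specific API appears anywhere; the only outputs are generic keyboard and mouse actions.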

8 comments
By @eamag - 3 months
I'm looking through the code — does it mean that the authors wrote all the basic skills themselves and let the LLM choose from them? If so, this approach can't be generalised, can it?

https://github.com/BAAI-Agents/Cradle/pull/44/files#diff-3f3...

By @Art9681 - 3 months
Fantastic. This is why efforts to defeat web scrapers will ultimately prove futile unless human/computer interfaces require constant biometric authentication. I imagine in some dark timeline, content will not be displayed unless the finger touching the trackpad is a human finger. Or the keyboard keys won't register unless they detect a fingerprint or other bio signature. Same thing with online multiplayer games: only approved controllers with some future tech that constantly polls the fingerprint pressing the buttons to ensure it is a human.

Perhaps something like the eye tracking tech in modern vehicles to ensure you're paying attention if the lane assist is turned on.

Of course, that would be awful. But what other recourse is there?

By @Fripplebubby - 3 months
Very cool work. Note that this was done with a vanilla GPT-4o model; all the magic dust is in the prompting, the dataflow between stages (info gathering, self-reflection, task inference, skill curation, action planning), and some added tooling like object detection / bounding boxes / icon detection.
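The staged dataflow this comment describes can be sketched as a simple pipeline where each stage is a separate prompt to the same base model, with earlier outputs fed forward. The stage names come from the paper; the model call here is a stand-in stub, not the real prompts:

```python
# Hedged sketch of a staged prompting pipeline: five named stages, each
# receiving the accumulated context from the stages before it.

def call_model(prompt):
    # Stand-in for a real multimodal model call; for demonstration it
    # simply echoes the stage name at the front of the prompt.
    return prompt.split(":")[0]

def run_pipeline(observation):
    context = {"observation": observation}
    for stage in ["info_gathering", "self_reflection",
                  "task_inference", "skill_curation", "action_planning"]:
        # Each stage sees everything produced so far.
        context[stage] = call_model(f"{stage}: {context}")
    return context

result = run_pipeline("screenshot of an Outlook compose window")
print(result["action_planning"])  # final stage's (stubbed) output
```

The point of structuring it this way is that no stage needs a fine-tuned model: the separation of concerns is carried entirely by the prompts and the ordering.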

Also, neither here nor there, but I enjoyed the paper's discussion of how the model performed surprisingly poorly at sending an email in Outlook: it understood the task and how to send an email perfectly well, yet Outlook's UI still managed to confuse it - can relate.

By @lucianbr - 3 months
What happened to Robotic Process Automation? Wasn't that supposed to be this?
By @cs702 - 3 months
Wow, this looks amazing.

The authors have developed Cradle, a multimodal-LLM-powered agent framework with six modules: Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory. Once Cradle has processed high-level instructions, its inputs are sequences of computer screenshots. Its output is executable code for low-level keyboard and mouse control, enabling Cradle to interact with any software and complete long-horizon complex tasks without relying on any built-in APIs:

       Oversimplified Big-Picture Diagram

                 +------------+
                 |   Cradle   |    executable code
  screenshots -> |(high-level | -> for controlling
                 |  planning) |    keyboard & mouse
                 +------------+
The authors' experiments show what looks to me like impressive generalization and performance, both across everyday software like Chrome and Outlook and across commercial video games: it is able to follow the main storyline and complete 40-minute-long missions in Red Dead Redemption 2, create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and trade and bargain to make a profit in Dealer's Life 2.

There are of course many caveats - the technology is still in its infancy - but still, I'm impressed at how quickly things are progressing.

We sure live in interesting times!

By @yawnxyz - 3 months
weird nitpick — they keep mentioning LMM, but do they mean LLMs?
By @dinkblam - 3 months
i am waiting for version 2.0 "Enclave"