Cradle: Empowering Foundation Agents Towards General Computer Control
The Cradle framework enables foundation agents to interact with software using human-like actions. It includes six modules for input understanding, action planning, and memory storage. Cradle demonstrates versatility in tasks but faces challenges in spatial perception.
The Cradle framework aims to empower foundation agents to perform diverse computer tasks through a unified interface that mimics how humans interact with computers. By taking screenshots as input and producing keyboard and mouse actions as output, Cradle lets agents interact with any software without relying on specific APIs. The framework consists of six key modules that handle input understanding, action planning, and memory of past experiences. Experimental results demonstrate Cradle's ability to generalize across various tasks, including completing missions in games like Red Dead Redemption 2, managing cities in Cities: Skylines, and conducting profitable transactions in Dealer's Life 2. While Cradle performs well in many scenarios, challenges remain in tasks requiring precise spatial perception and interaction with complex software interfaces. Overall, Cradle represents a significant step towards generalist agents capable of mastering a wide range of computer tasks without the need for specialized APIs.
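To make the "screenshots in, keyboard and mouse actions out" idea concrete, here is a minimal sketch of what such a unified interface could look like. It is not Cradle's actual code: the names `KeyPress`, `MouseClick`, and `RecordingBackend` are hypothetical, and the backend only records actions instead of driving real input devices.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class KeyPress:
    key: str


@dataclass(frozen=True)
class MouseClick:
    x: int
    y: int
    button: str = "left"


# The whole action space: anything a human could do with keyboard or mouse.
Action = Union[KeyPress, MouseClick]


class RecordingBackend:
    """Hypothetical stand-in for a real input driver; it only logs requests."""

    def __init__(self) -> None:
        self.log = []

    def screenshot(self) -> bytes:
        # Placeholder pixels; a real backend would capture the screen here.
        return b"\x00" * 16

    def perform(self, action: Action) -> None:
        if isinstance(action, KeyPress):
            self.log.append(f"key:{action.key}")
        elif isinstance(action, MouseClick):
            self.log.append(f"click:{action.button}@{action.x},{action.y}")
        else:
            raise TypeError(f"unsupported action: {action!r}")


backend = RecordingBackend()
backend.perform(KeyPress("enter"))
backend.perform(MouseClick(10, 20))
print(backend.log)
```

Because the agent only ever sees `screenshot()` and `perform()`, nothing in this sketch depends on the application being controlled, which is the point of avoiding per-application APIs.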
Related
CRIU, a project to implement checkpoint/restore functionality for Linux
CRIU is a Linux tool for freezing and saving container/application states, enabling live migration and snapshots. Integrated into software like Docker, it offers CLI, RPC, and C API for checkpointing. Various resources and events showcase its capabilities and development progress.
SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code
SceneCraft is an advanced Large Language Model (LLM) Agent converting text to 3D scenes in Blender. It excels in spatial planning, asset arrangement, and scene refinement, surpassing other LLM agents in performance and human feedback.
The Abstraction and Reasoning Corpus
The GitHub repository for ARC-AGI provides task data and a testing interface for solving tasks involving input/output pairs within 3 trials. Users can access the tasks and detailed instructions on the repository.
Anthropic: Collaborate with Claude on Projects
Claude.ai introduces Projects feature for Pro and Team users to organize chats, enhance collaboration, and create artifacts like code snippets. North Highland reports productivity gains. Future updates prioritize user-friendly enhancements.
Show HN: Crawlee for Python – a web scraping and browser automation library
Crawlee for Python is a powerful web scraping and browser automation library with features like scaling, proxy management, and Playwright integration. It's open source, supports Python 3.9+, and aids in efficient web scraping.
https://github.com/BAAI-Agents/Cradle/pull/44/files#diff-3f3...
Perhaps something like the eye tracking tech in modern vehicles to ensure you're paying attention if the lane assist is turned on.
Of course, that would be awful. But what other recourse is there?
Also, neither here nor there, but I enjoyed the discussion in the paper about how the model performed surprisingly poorly at sending an email in Outlook: it understood the task and how to send an email well, yet Outlook's UI still managed to confuse it. Can relate.
The authors have developed Cradle, a multimodal-LLM-powered agent framework with six modules: Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory. Once Cradle has processed high-level instructions, its inputs are sequences of computer screenshots. Its output is executable code for low-level keyboard and mouse control, enabling Cradle to interact with any software and complete long-horizon complex tasks without relying on any built-in APIs:
  Oversimplified Big-Picture Diagram

                 +------------+
                 |   Cradle   |    executable code
  screenshots -> |(high-level | -> for controlling
                 |  planning) |    keyboard & mouse
                 +------------+
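A rough sketch of how the six named modules might chain together in one step of the loop follows. The ordering, the function names, and the `press_key`/`click` skill names are my assumptions for illustration, not taken from the paper or the Cradle repository, and each stage is a plain stub where the real system would call a multimodal LLM.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Keeps a record of past steps for later stages to consult."""
    episodes: list = field(default_factory=list)

    def store(self, record: dict) -> None:
        self.episodes.append(record)


# Every function below is a stub standing in for a multimodal-LLM call.
def gather_information(screenshot: bytes) -> str:
    return f"screenshot of {len(screenshot)} bytes"


def self_reflect(memory: Memory) -> str:
    if not memory.episodes:
        return "no previous action"
    return f"previous action was {memory.episodes[-1]['code']}"


def infer_task(instruction: str, observation: str, reflection: str) -> str:
    return f"next step toward: {instruction}"


def curate_skills(task: str) -> list:
    # Hypothetical skill names; a real system would select from learned skills.
    return ["press_key", "click"]


def plan_action(task: str, skills: list) -> str:
    # The output is code for low-level keyboard and mouse control.
    return f"{skills[0]}('enter')"


def run_step(instruction: str, screenshot: bytes, memory: Memory) -> str:
    observation = gather_information(screenshot)
    reflection = self_reflect(memory)
    task = infer_task(instruction, observation, reflection)
    skills = curate_skills(task)
    code = plan_action(task, skills)
    memory.store({"observation": observation, "task": task, "code": code})
    return code


memory = Memory()
first = run_step("send an email", b"\x00" * 8, memory)
second = run_step("send an email", b"\x00" * 8, memory)
print(first)
print(len(memory.episodes))
```

The only thing that leaves `run_step` is a string of code, which matches the diagram: the planning happens inside the box, and what comes out is something an executor can run against the keyboard and mouse.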
The authors' experiments show what to me looks like impressive generalization and performance across software applications, successfully operating daily software like Chrome and Outlook, and across commercial video games: It is able to follow the main storyline and complete 40-minute-long missions in Red Dead Redemption 2, create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and trade and bargain to make a profit in Dealer's Life 2.
There are of course many caveats -- the technology is still in its infant stage -- but still, I'm impressed at how quickly things are progressing.
We sure live in interesting times!