Notes on Anthropic's Computer Use Ability
Anthropic's Claude 3.5 Sonnet introduces a "Computer Use" feature for direct computer interaction, enabling tasks like internet searches and spreadsheet creation, though it relies on screenshots and incurs high costs.
Anthropic has introduced significant updates to its AI models, specifically Claude 3.5 Sonnet, which features a new capability called "Computer Use." This allows the model to interact with a computer by interpreting screenshots and determining the coordinates of on-screen elements, enabling it to move the cursor, click, and type. The model excels at tasks such as internet searches, creating spreadsheets, and filling out forms, but it still relies on screenshots and is not suited to real-time work. Using Computer Use requires access to Anthropic's API and involves running a demo container. Testing revealed that while the model can successfully execute commands, it can be expensive and slow, with costs reaching around $30 for minor tasks. Despite these limitations, the future of AI agents appears promising, with further advances in computer interaction expected from Anthropic and other AI labs.
- Anthropic's Claude 3.5 Sonnet introduces a new "Computer Use" feature that lets the model operate a computer directly.
- The model can perform tasks like internet searches and spreadsheet creation but relies on screenshots.
- Setting up the model requires access to Anthropic's API and involves running a demo container (see the API sketch below).
- Current limitations include high costs and slow performance, making it less practical for everyday use.
- Future developments in AI agents are anticipated, with potential advancements from other AI labs.
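To make the setup bullet above concrete, here is a rough sketch of a single call against the computer-use beta via the Python SDK, using the model string, tool type, and beta flag documented at launch (treat those identifiers as assumptions that may have changed since). The demo container essentially wraps an agent loop around calls like this: it performs each requested click, keystroke, or screenshot and feeds the result back to the model.

```python
# Rough sketch of one turn against the computer-use beta; model name, tool
# type, and beta flag are taken from the launch docs and may have changed.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",      # built-in "computer" tool (beta)
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
    }],
    messages=[{"role": "user", "content": "Open a browser and search for flights."}],
    betas=["computer-use-2024-10-22"],    # beta flag required for this tool
)

# stop_reason == "tool_use" means the model wants an action (click, type,
# screenshot); the demo container executes it and loops with the result.
print(response.stop_reason)
```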
Related
Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
Anthropic has released Claude 3.5 Sonnet, improving coding and tool use, and Claude 3.5 Haiku, matching previous performance. A new "computer use" feature is in beta, enhancing automation.
Computer Use API Documentation
Anthropic's Claude 3.5 Sonnet model now includes a beta "computer use" feature for desktop interaction; safety measures are advised because of the associated risks, especially when the model operates online.
Initial explorations of Anthropic's new Computer Use capability
Anthropic has launched the Claude 3.5 Sonnet model and a "computer use" API mode that drives the desktop via screen coordinates, alongside safety guidance and improved performance on coding tasks.
Claude Sonnet 3.5.1 and Haiku 3.5
Anthropic launched Claude Sonnet 3.5.1 and Haiku 3.5, featuring improved performance and a new computer use function in beta, while highlighting safety concerns and encouraging cautious use for low-risk tasks.
Claude Computer Use – Is Vision the Ultimate API?
The article reviews Anthropic's Claude Computer Use, noting its strengths in screen reading and navigation but highlighting challenges in knowing when to re-read the screen and in managing application state, areas that will require further advances.
On one hand, it has really helped me with prototyping incredibly fast.
On the other, it is prohibitively expensive today. Essentially you pay per click, and in some cases per keystroke. I tried to get it to find a flight for me. It opened the browser, navigated to Google Flights, entered the origin and destination, and so on. By the time it saw a price, there had already been more than a dozen LLM calls. Then it crashed due to a rate-limit issue. By the time I got a list of flight recommendations, it had already cost $5.
But I think this is intended to be an early demo of what will be possible in the future. And they were very explicit that it's a beta: all of this feedback above will help them make it better. Very quickly it will get more efficient, less expensive, more reliable.
So overall I'm optimistic to see where this goes. There are SO many applications for this once it's working really well.
There will be an enormous push toward steering these software agents into similarly shady practices instead of making them act in the true interest of the user. The ads will be built into the weights of the model or something.
I wonder why leveraging accessibility tooling wouldn't have been a better option. Browsers and operating systems both have pretty comprehensive accessibility APIs for tools like screen readers, and the whole point of those APIs is to act as a middleman that programmatically interprets and interacts with what's on screen.
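As a rough sketch of that alternative, here is what reading a page through the browser's accessibility tree looks like using Playwright's (legacy) accessibility snapshot instead of screenshots; the URL is only an example, and a real agent would need far richer handling of roles, states, and focus:

```python
# Sketch: read the accessibility tree (roles and names) rather than pixels.
from playwright.sync_api import sync_playwright

def dump(node, depth=0):
    """Recursively print role/name pairs from the accessibility snapshot."""
    if node is None:
        return
    print("  " * depth + f"{node['role']}: {node.get('name', '')}")
    for child in node.get("children", []):
        dump(child, depth + 1)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.google.com/travel/flights")  # example URL
    dump(page.accessibility.snapshot())  # structured view, no screenshots
    browser.close()
```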
Agents won't get anywhere, because any user process you want to automate is better done by creating APIs with a proper, guaranteed interface. Any automated "computer use" will always be a one-off, absurdly expensive, and completely impractical.
Ideally it would be given a persona and a list of use cases, try to accomplish each task, and save the state at the point where it failed.
Something like Chrome's Lighthouse, but for usability. Bonus points if it can highlight which parts of my documentation use mismatched terminology, making it difficult for newcomers to understand which button I am referring to.
1. Pixels and screenshots (video, really) plus keyboard/mouse events are definitely the purest and most proper way to get agents working in the long term, but it's not practical today. Cost and speed are big, obvious issues, but accuracy is also low. I found that GPT-4o (08-06) is just plain bad at coordinates and bounding boxes, and naively feeding it screenshots just doesn't work. As a practical example, another comment mentions trying to get a list of flight recommendations from Claude computer use and it costing $5; if my agent is up to that task (I haven't tested this), it would cost $0.10-$0.25.
2. "Feature engineering" helps a lot right now: explicitly highlighting things, giving the model extra context and instructions on how to use that context, augmenting the info it sees in screenshots, and so on (a sketch of the highlighting idea follows this comment). It's hard to infer things like hover text and show/hide buttons from pure pixels.
3. You have to heavily constrain and prompt the model to get it to do the right thing right now, but when it does it, it feels like magic.
4. It makes naive but quite understandable mistakes, the kinds of mistakes a novice user might make, and it seems really hard to get this working reliably. A mechanism to correct itself and learn is probably the better approach, rather than trying to make it work right from the get-go in every situation. Again, when you see the agent fail, try again, and succeed the second time based on the previous failure, it's pretty magical. The first time it achieved its objective, I just started laughing out loud. I don't know if I've ever laughed at a program I've written before.
It's been very interesting working on this. If traditional software is like building Legos, this is more like training a puppy. Different, but still fun. I also wonder how temporary this type of work is: I'm clearly doing a lot of manual work to compensate for the model's many weaknesses, but the models will also get substantially better. At the same time, useful, practical computer use driven purely by model improvements still looks 2-3 years away to me.
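A minimal sketch of the "explicit highlighting" idea from point 2 above: draw numbered markers on the screenshot (using bounding boxes you already have from elsewhere, e.g. the DOM), so the model can answer "click element 3" instead of guessing raw pixel coordinates. The function and file names here are hypothetical, not any particular agent's API.

```python
# Hypothetical sketch: overlay numbered markers on a screenshot so a vision
# model can refer to "element 3" instead of producing raw coordinates.
from PIL import Image, ImageDraw

def annotate(screenshot_path, boxes, out_path="annotated.png"):
    """boxes: list of (x, y, width, height) tuples, e.g. scraped from the DOM."""
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (x, y, w, h) in enumerate(boxes, start=1):
        draw.rectangle([x, y, x + w, y + h], outline="red", width=3)  # box
        draw.text((x + 4, y + 4), str(i), fill="red")                 # marker id
    img.save(out_path)
    return out_path
```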
Does this already exist? If not, would the benefits be lower than I think, or would the costs be higher than I think?
Why didn't this project start with https://huggingface.co/meta-llama/Llama-3.2-11B-Vision
For the folks who are more savvy on the Docker / Linux front...
1. Did Anthropic have to write its own "control" layer for the mouse and keyboard? I've tried using `xdotool` and related tools in the past and they were very unreliable (see the sketch after this list for the kind of calls involved).
2. I don't want to dismiss the power and innovation going into this model, but...
(a) Why didn't Adept or someone else focused on RPA build this?
(b) How much of this is standard image recognition and fine-tuning a vision model to a screen, versus something more fundamental?
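On question 1, a hedged sketch of the kind of thin wrapper such a control layer tends to be on Linux/X11: a Python function that shells out to `xdotool` for each action. The DISPLAY value is an assumption about the demo container's virtual display; reliability problems with this approach typically stem from window focus and timing rather than from xdotool itself.

```python
# Hypothetical sketch of a control layer that shells out to xdotool per action
# (roughly the style of approach the reference demo takes on Linux/X11).
import os
import subprocess

def xdotool(*args: str) -> str:
    """Run one xdotool command against the virtual display and return stdout."""
    result = subprocess.run(
        ["xdotool", *args],
        capture_output=True, text=True, check=True,
        env={**os.environ, "DISPLAY": ":1"},  # assumption: Xvfb runs on :1
    )
    return result.stdout

xdotool("mousemove", "640", "400", "click", "1")   # move to (640, 400), left-click
xdotool("type", "--delay", "50", "hello world")    # type with 50 ms per keystroke
xdotool("key", "ctrl+l")                           # send a key chord (focus URL bar)
```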
Our own personal digital squire.
Then eventually we become assistants to AI.
1) I tried using it for QA on my SaaS, but the agent failed multiple times to fill out a simple form, and then claimed the task had been completed successfully.
2) It couldn’t scrape contact information from a website where the details weren’t even that hidden.
3) I also tried sending a message on Discord, but it refused, saying it couldn’t do so on someone else's behalf.
I mean, I'm excited for what the future holds, but right now it feels more like an alpha than a beta.