October 25th, 2024

Notes on Anthropic's Computer Use Ability

Anthropic's Claude 3.5 Sonnet introduces a "Computer Use" capability that lets the model operate a computer directly, enabling tasks like internet searches and spreadsheet creation, though it relies on screenshots and incurs high costs.

Anthropic has introduced significant updates to its AI models, specifically Claude 3.5 Sonnet, which features a new capability called "Computer Use." This allows the model to interact with a computer by interpreting screenshots and determining the coordinates of on-screen components, enabling it to move the cursor, click, and type. The model performs well in applications such as internet searches, creating spreadsheets, and filling out forms, but it relies on screenshots and is not suited to real-time tasks. Using Computer Use requires access to Anthropic's API and involves running a demo container. Testing showed that while the model can successfully execute commands, it can be expensive and slow, with costs reaching around $30 for minor tasks. Despite these limitations, the future of AI agents appears promising, with further advances in computer interaction expected from Anthropic and other AI labs.
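For concreteness, here is a minimal sketch of what a single Computer Use request looks like through the Python SDK. The model name, tool type, and beta flag are the ones advertised for the October 2024 beta; treat them as assumptions that may have changed since.

```python
# Sketch of one Computer Use request via Anthropic's Python SDK.
# Identifiers below are assumed from the October 2024 beta; check current docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",    # assumed beta model name
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],     # assumed beta flag
    tools=[{
        "type": "computer_20241022",       # assumed tool type for the computer tool
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
    }],
    messages=[{"role": "user",
               "content": "Open the browser and check tomorrow's weather."}],
)

# The reply contains tool_use blocks (screenshot, mouse_move, left_click, type, ...)
# that the demo container executes before looping the results back to the model.
print(response.content)
```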

- Anthropic's Claude 3.5 Sonnet introduces a new "Computer Use" feature for direct interaction with computers.

- The model can perform tasks like internet searches and spreadsheet creation but relies on screenshots.

- Setting up the model requires access to Anthropic's API and involves running a demo container.

- Current limitations include high costs and slow performance, making it less practical for everyday use.

- Future developments in AI agents are anticipated, with potential advancements from other AI labs.

21 comments
By @acrooks - 4 months
I've built a couple of experiments using it so far and it has been really interesting.

On one hand, it has really helped me with prototyping incredibly fast.

On the other, it is prohibitively expensive today. Essentially you pay per click, in some cases per keystroke. I tried to get it to find a flight for me. So it opened the browser, navigated to Google Flights, entered the origin, destination etc. etc. By the time it saw a price, there had already been more than a dozen LLM calls. And then it crashed due to a rate limit issue. By the time I got a list of flight recommendations it was already $5.
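For a rough sense of how those per-step costs stack up, here is a back-of-envelope sketch; every number in it is an assumption for illustration, not a measurement.

```python
# Illustrative cost model for a multi-step browsing task (e.g. a flight search).
# All figures are assumptions: screenshot size, step count, and token prices.
screenshot_tokens = 1_600   # rough cost of one 1024x768 screenshot as input
text_tokens_per_step = 500  # instructions, tool results, and reasoning text
output_tokens_per_step = 300
steps = 15                  # clicks, keystrokes, scrolls for one search

input_price = 3.00 / 1_000_000    # assumed $/input token, Sonnet-class
output_price = 15.00 / 1_000_000  # assumed $/output token

total, history = 0.0, 0
for _ in range(steps):
    history += screenshot_tokens + text_tokens_per_step  # prior turns are re-sent
    total += history * input_price + output_tokens_per_step * output_price

print(f"~${total:.2f}")  # about $0.80 with these numbers; retries, crashes, and
                         # longer histories multiply it toward the figure above
```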

But I think this is intended to be an early demo of what will be possible in the future. And they were very explicit that it's a beta: all of this feedback above will help them make it better. Very quickly it will get more efficient, less expensive, more reliable.

So overall I'm optimistic to see where this goes. There are SO many applications for this once it's working really well.

By @bonoboTP - 4 months
This kind of stuff is an existential threat to ad-based business models and upselling. If users no longer browse the web themselves, you can't show them ads. It's a monumental, Earth-shattering problem for behemoths like Google, but also for normal websites. Lots of websites (such as booking.com) rely on shady practices to mislead users and upsell them. If you have a dispassionate, smart computer agent doing the transaction, it will only buy what's needed to accomplish the task.

There will be an enormous push towards steering these software agents towards similarly shady practices instead of making them act in the true interest of the user. The ads will be built into the weights of the model or something.

By @_heimdall - 4 months
I'm all for the MVP approach and shipping quickly, though I'm really surprised they went with image recognition and tooling for injecting mouse/keyboard events for automating human tasks.

I wonder why leveraging accessibility tools wouldn't have been a better option. Browsers and operating systems both have pretty comprehensive accessibility APIs for tools like screen readers, and the whole point of those tools is to act as a middleman that programmatically interprets and interacts with what's on screen.
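As a point of comparison, the browser side of that middleman already exists and is scriptable; a minimal sketch assuming Playwright's Python API, which can dump the page's accessibility tree as structured data:

```python
# Sketch: reading the browser's accessibility tree instead of raw pixels.
# Assumes Playwright for Python; the node shape is simplified here.
from playwright.sync_api import sync_playwright

def flatten(node, out):
    """Collect role/name pairs roughly the way a screen reader would announce them."""
    if not node:
        return out
    if node.get("name"):
        out.append((node.get("role"), node["name"]))
    for child in node.get("children", []):
        flatten(child, out)
    return out

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    tree = page.accessibility.snapshot()   # structured view of what's on screen
    for role, name in flatten(tree, []):
        print(role, name)                  # e.g. "heading Example Domain"
    browser.close()
```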

By @imranq - 4 months
This is basically RPA (robotic process automation) with LLMs. And RPA is basically the worst possible solution to any problem.

Agents won't get anywhere, because any user process you want to automate is better done by creating APIs with a proper, guaranteed interface. Any automated "computer use" will always be a one-off, absurdly expensive, and completely impractical.

By @VBprogrammer - 4 months
Well, this just opened up a new phase in the captcha wars.
By @belval - 4 months
The product I would like to see out of this is a way to automate UI QA.

Ideally it would be given a persona and a list of use cases, try to accomplish each task and save the state where you/it failed.

Something like Chrome Lighthouse, but for usability. Bonus points if it can highlight which part of my documentation uses mismatched terminology, making it difficult for newcomers to understand what button I am referring to.
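To make that concrete, a hypothetical shape for such a test run, with persona and use-case names invented purely for illustration:

```python
# Hypothetical spec for an agent-driven usability test run (names are invented).
from dataclasses import dataclass, field

@dataclass
class UseCase:
    name: str
    goal: str            # natural-language objective handed to the agent
    max_steps: int = 20  # budget before the run is recorded as a failure

@dataclass
class Persona:
    name: str
    description: str     # e.g. "first-time user who follows the docs literally"
    use_cases: list[UseCase] = field(default_factory=list)

newcomer = Persona(
    name="newcomer",
    description="Has never seen the product; reads labels, not source code.",
    use_cases=[
        UseCase("sign_up", "Create an account starting from the landing page."),
        UseCase("invite", "Invite a teammate from the settings screen."),
    ],
)

# A runner would hand each goal to the agent, record a screenshot per step,
# and persist the last state whenever the step budget runs out.
```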

By @martythemaniak - 4 months
I've been hacking on a web-browsing agent for the last few weeks, and it's given me some decent understanding of what it'd take to get this working. My approach has been to make it general-purpose enough that I describe the mechanics of surfing the web without building in specific knowledge about tasks or websites. Some things I've learned:

1. Pixels, screenshots (video, really), and keyboard/mouse events are definitely the purest and most proper way to get agents working in the long term, but it's not practical today. Cost and speed are big obvious issues, but accuracy is also low. I found that GPT-4o (08-06) is just plain bad at coordinates and bounding boxes, and naively feeding it screenshots just doesn't work. As a practical example, another comment mentions trying to get a list of flight recommendations from Claude computer use and it costing $5; if my agent is up for that task (haven't tested this), it would cost $0.10-$0.25.

2. "feature engineering" helps a lot right now. Explicitly highlighting things and giving the model extra context and instructions on how to use that context, how to augment the info it sees on screenshots etc. It's hard to understand things like hover text, show/hide buttons, etc from pure pixels.

3. You have to heavily constrain and prompt the model to get it to do the right thing now, but when it does it, it feels magic.

4. It makes naive but quite understandable mistakes, the kinds of mistakes a novice user might make, and it seems really hard to prevent them all up front. A mechanism to correct itself and learn is probably the better approach, rather than trying to make it work right from the get-go in every situation. Again, when you see the agent fail, try again, and succeed the second time based on the failure of the previous action, it's pretty magical. The first time it achieved its objective, I just started laughing out loud. I don't know if I've ever laughed at a program I've written before.
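A minimal sketch of the labeling mentioned in item 2, in the spirit of set-of-marks prompting: enumerate the interactive elements, build a text index the model can reference by number, and send it alongside the screenshot. Playwright and the selector list are assumptions here.

```python
# Sketch: index clickable elements so the model can answer "#7" instead of pixels.
# Assumes Playwright for Python; the selector list is illustrative.
from playwright.sync_api import sync_playwright

CLICKABLE = "a, button, input, select, textarea, [role='button']"

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com")

    index = []
    for i, el in enumerate(page.query_selector_all(CLICKABLE)):
        if not el.is_visible():
            continue
        tag = el.evaluate("node => node.tagName").lower()
        label = (el.inner_text() or el.get_attribute("aria-label") or "").strip()
        index.append(f"#{i}: <{tag}> {label[:60]}")

    page.screenshot(path="annotated.png")  # a fuller version would also draw the numbers
    print("\n".join(index))                # sent to the model next to the screenshot
```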

It's been very interesting working on this. If traditional software is like building Legos, this one is more like training a puppy. Different, but still fun. I also wonder how temporary this type of work is: I'm clearly doing a lot of manual work to augment the model's many weaknesses, but models will also get substantially better. At the same time, I can definitely see useful, practical computer use from model improvements being 2-3 years away.

By @nilstycho - 4 months
It seems like a cheaper intermediate capability would be to give Claude the ability to SSH to your computer or to a cloud container. That would unlock a lot of possibilities, without incurring the cost of the vision model or the difficulty of cursor manipulation.

Does this already exist? If not, would the benefits be lower than I think, or would the costs be higher than I think?
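Something close to this reportedly ships in the same beta: alongside the vision tool there is a text-only bash tool, which turns the problem into a command loop with no screenshots. A sketch follows, with the tool type, beta flag, and model name assumed from the October 2024 beta (and obviously dangerous to run unsandboxed):

```python
# Sketch: a text-only command loop built on the assumed bash tool from the beta.
import subprocess
import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user",
             "content": "List the ten largest files under my home directory."}]

while True:
    resp = client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",                 # assumed model name
        max_tokens=1024,
        betas=["computer-use-2024-10-22"],                  # assumed beta flag
        tools=[{"type": "bash_20241022", "name": "bash"}],  # assumed tool type
        messages=messages,
    )
    calls = [b for b in resp.content if b.type == "tool_use"]
    if not calls:
        break                                               # no more commands requested
    messages.append({"role": "assistant", "content": resp.content})
    results = []
    for call in calls:
        out = subprocess.run(call.input["command"], shell=True,
                             capture_output=True, text=True, timeout=60)
        results.append({"type": "tool_result", "tool_use_id": call.id,
                        "content": out.stdout + out.stderr})
    messages.append({"role": "user", "content": results})

print(resp.content[-1].text)                                # the model's final answer
```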

By @elif - 4 months
If its only downside is cost, and cost is prohibitively expensive for all practical uses,

why didn't this project start with https://huggingface.co/meta-llama/Llama-3.2-11B-Vision?

By @Jayakumark - 4 months
Any idea how Sonnet does this? Is the image annotated with bounding boxes around text boxes etc., along with their coordinates, before being sent to Sonnet, and does it respond with a box name or a coordinate? Or is SAM2 used to segment everything before sending it to Sonnet?
By @cl42 - 4 months
I really, really like this new product/API offering. Still crashes quite a bit for me and obviously makes mistakes, but shows what's possible.

For the folks who are more savvy on the Docker / Linux front...

1. Did Anthropic have to write its own "control" for the mouse and keyboard? I've tried using `xdotool` and related things in the past and they were very unreliable. (See the sketch after this comment.)

2. I don't want to dismiss the power and innovation going into this model, but...

(a) Why didn't Adept or someone else focused on RPA build this?

(b) How much of this is standard image recognition and fine-tuning a vision model to a screen, versus something more fundamental?
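On question 1: a basic sketch of driving an X display with xdotool from Python, roughly the control layer a demo container needs. The display number and delays are assumptions.

```python
# Sketch: mouse/keyboard control on an X display via xdotool.
import os
import subprocess

ENV = {**os.environ, "DISPLAY": ":1"}  # the demo runs a virtual display; :1 is assumed

def xdo(*args):
    subprocess.run(["xdotool", *args], env=ENV, check=True)

xdo("mousemove", "512", "384")         # move to absolute coordinates
xdo("click", "1")                      # left click
xdo("type", "--delay", "50", "hello")  # type with a small per-key delay
xdo("key", "Return")                   # press Enter
```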

By @ko_pivot - 4 months
At the end of the day, the fundamental dynamic here is human creativity. We are taking a tool, the LLM, and stretching it to its limit. That’s great, but that doesn’t mean we are close to AGI. It means we are AGI.
By @alt-glitch - 4 months
I wonder if I can hook up `scrcpy` with this and give it control over an Android. Can it drag the mouse? That'd be needed to navigate the phone at least.
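The usual lever for that is adb's shell input primitives rather than a mouse; they cover tap, drag, and typing even without scrcpy in the loop. Coordinates and timings below are arbitrary examples.

```python
# Sketch: Android input primitives over adb (tap, drag, type, key events).
import subprocess

def adb_input(*args):
    subprocess.run(["adb", "shell", "input", *args], check=True)

adb_input("tap", "540", "1200")                         # tap at x=540, y=1200
adb_input("swipe", "540", "1600", "540", "400", "300")  # drag/scroll up over 300 ms
adb_input("text", "hello%sworld")                       # %s encodes a space for `input text`
adb_input("keyevent", "66")                             # KEYCODE_ENTER
```
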
By @azinman2 - 4 months
I saw those demoed yesterday. The model was asked to create a cool visualization. It ultimately tried to install Streamlit and go to its page, only to find its own Claude software already running Streamlit, so as part of debugging it killed itself. Not ready to let that go wild on my own computer!
By @namanyayg - 4 months
What are some good use cases for this? Something that a business could be built around?
By @sys32768 - 4 months
So robotic process automation gains intelligence and we can train an AI intern to assist with tasks.

Our own personal digital squire.

Then eventually we become assistants to AI.

By @guzik - 4 months
I'm not sure if anyone else has really tried, but I've tested it a few times and never got meaningful results.

1) I tried using it for QA for my SaaS, but the agent failed multiple times to fill out a simple form, then ended by saying the task was successfully completed.

2) It couldn’t scrape contact information from a website where the details weren’t even that hidden.

3) I also tried sending a message on Discord, but it refused, saying it couldn’t do so on someone else's behalf.

I mean, I’m excited for what the future holds, but right now, it’s not even in beta.

By @sheepscreek - 4 months
Appreciate the TL;DR as I got what I was looking for. Burning $30 for just trying it out doesn’t make it sound so promising at the moment.
By @Karthikeya - 4 months
Didn't see anyone actually going into production with any of this stuff. Man, this hype cycle just continues.
By @utkarsh-dixit - 4 months
This is such an idiotic hype cycle; they just fine-tuned a model over a vision API. I really don't understand why everyone is losing their minds over this.