October 24th, 2024

Claude Computer Use – Is Vision the Ultimate API?

The article reviews Anthropic's Claude Computer Use, noting its strengths in reading the screen and navigating via screenshots, but highlighting challenges in recognizing when to read the screen and in tracking application state, which will require further advances.

The article discusses the author's experience with Anthropic's Computer Use API, highlighting its strengths and weaknesses. Claude Computer, built on Claude 3.5 Sonnet, is good at understanding computer interactions through screenshots, which makes it effective at screen reading and navigation. It performs well with function calls, often preferring them over manual clicks. However, it struggles to recognize when it needs to read the screen, which can lead to errors in task execution. It also has difficulty fetching data efficiently and remembering the state of applications, particularly when dealing with modals and popups. The author emphasizes the importance of giving Claude as much system-state information as possible to improve its performance, and notes that handling uncertainty remains a significant challenge in agent development. The author concludes that while Claude Computer represents a step toward true agent behavior, further advances are necessary to fully realize its potential.

- Claude Computer excels in screen reading and navigation through screenshots.

- It struggles with recognizing when to read the screen and remembering application states.

- Providing system state information can improve Claude's performance.

- Handling uncertainty remains a key challenge in AI agent development.

- Claude Computer is a step towards achieving true agent behavior, but more advancements are needed.
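
For reference, here is a hedged sketch of roughly what a Computer Use request looked like at the October 2024 launch, using Anthropic's Python SDK and the computer-use beta tools. The model string, tool types, and beta flag are as documented at the time and may have changed since; the user prompt is illustrative only.

```python
# Sketch of an Anthropic Computer Use request (October 2024 beta).
# Assumes: pip install anthropic, and ANTHROPIC_API_KEY set in the environment.
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20241022",   # screenshot / mouse / keyboard tool
            "name": "computer",
            "display_width_px": 1280,
            "display_height_px": 800,
            "display_number": 1,
        },
        {"type": "bash_20241022", "name": "bash"},
    ],
    messages=[{"role": "user", "content": "Open the settings app and enable dark mode."}],
    betas=["computer-use-2024-10-22"],
)

# The model replies with tool_use blocks (take a screenshot, click at x,y, ...);
# the client executes them and sends the results back in the next turn.
print(response.content)
```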

16 comments
By @CharlieDigital - 4 months
Vision is the ultimate API.

The historical progression from text to still images to audio to moving images will hold true for AI as well.

Just look at OpenAI's progression as well from LLM to multi-modal to the realtime API.

A co-worker almost 20 years ago said something interesting to me as we were discussing Al Gore's CurrentTV project: the history of information is constrained by "bandwidth". He mentioned how broadcast television went from 72 hours of "bandwidth" (3 channels x 24h) per day to now having so much bandwidth that we could have a channel with citizen journalists. Of course, this was also the same time that YouTube was taking off.

The pattern holds true for AI.

AI is going to create "infinite bandwidth".

By @viraptor - 4 months
Some time ago I made a prediction that accessibility is the ultimate API for UI agents, but unfortunately multimodal capabilities went the other way. We can still change course, though:

This is a great place for people to start caring about accessibility annotations. All serious UI toolkits allow you to tell the computer what's on the screen. This allows things like Windows Automation https://learn.microsoft.com/en-us/windows/win32/winauto/entr... to see a tree of controls with labels and descriptions without any vision/OCR. It can be inspected by apps like FlaUInspect https://github.com/FlaUI/FlaUInspect?tab=readme-ov-file#main... But see how the example shows a status bar with (Text "UIA3" "")? It could've been (Text "UIA3" "Current automation interface") instead, giving both a good tooltip and an accessibility label.

Now we can kill two birds with one stone - actually improve the accessibility of everything and make sure custom controls adhere to the framework as well, and provide the same data to the coming automation agents. The text description will be much cheaper than a screenshot to process. Also it will help my work with manually coded app automation, so that's a win-win-win.

As a side effect, it would also solve issues with UI weirdness. Have you ever had windows open something on a screen which is not connected anymore? Or under another window? Or minimised? Screenshots won't give enough information here to progress.
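
As an illustration of reading that control tree programmatically, here is a minimal sketch assuming the pywinauto package, whose "uia" backend wraps the Windows UI Automation API linked above; the package choice and exact calls are an assumption, not something from the comment.

```python
# Sketch: dump the same control tree that FlaUInspect shows, as plain text
# instead of pixels. Assumes Windows and: pip install pywinauto
from pywinauto import Desktop

desktop = Desktop(backend="uia")          # UI Automation backend
for window in desktop.windows():
    print(window.window_text())
    for ctrl in window.descendants():
        info = ctrl.element_info
        # Control type, automation id and accessible label: no vision or OCR needed.
        print(f"  {info.control_type:<15} {info.automation_id or '-':<20} {ctrl.window_text()}")
```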

By @simonw - 4 months
If you want to try out Computer Use (awful name) in a relatively safe environment, the Docker container Anthropic provide here is very easy to start running (provided you have Docker set up; I used it with Docker Desktop for Mac): https://github.com/anthropics/anthropic-quickstarts/tree/mai...
By @unglaublich - 4 months
Vision here means "2d pixel space".

The ultimate API is "all the raw data you can acquire from your environment".

By @pabe - 4 months
I don't think vision is the ultimate API. It wasn't with "traditional" RPA and it won't be with more advanced AI-RPA. It's inefficient. If you want something to be used by a bot, write an interface for a bot. I'd make an exception for end2end testing.
By @downWidOutaFite - 4 months
Vision is a crappy interface for computers but I think it could be a useful weapon against all the extremely "secure" platforms that refuse to give you access to your own data and refuse to interoperate with anything outside their militarized walled gardens.
By @tomatohs - 4 months
> It is very helpful to give it things like:

- A list of applications that are open
- Which application has active focus
- What is focused inside the application
- Function calls to specifically navigate those applications, as many as possible

We’ve found the same thing while building the client for testdriver.ai. This info is in every request.
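
As a rough illustration of that kind of per-request context, here is a minimal sketch; the field names and helper are hypothetical, not testdriver.ai's actual client.

```python
# Illustrative only: render the system state listed above into text that gets
# prepended to every model request. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class SystemState:
    open_apps: list = field(default_factory=list)
    focused_app: str = ""
    focused_element: str = ""
    nav_functions: list = field(default_factory=list)

def state_prompt(state: SystemState) -> str:
    """Render the current system state as a text block for the request."""
    return "\n".join([
        "Open applications: " + ", ".join(state.open_apps),
        f"Application with focus: {state.focused_app}",
        f"Focused element: {state.focused_element}",
        "Available navigation functions: " + ", ".join(state.nav_functions),
    ])

state = SystemState(
    open_apps=["Firefox", "Terminal"],
    focused_app="Firefox",
    focused_element="address bar",
    nav_functions=["open_url", "switch_tab", "focus_window"],
)
print(state_prompt(state))
```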

By @sharpshadow - 4 months
In this context, Windows Recall makes total sense now from an AI-learning perspective for them.

It's actually a super cool development and I'm already very excited to let my computer use any software like a pro in front of me. Paint me a canvas of a savanna sunset with animal silhouettes, produce me a track of UK garage house, etc., everything with all the layers and elements in the software, not just a finished output.

By @throwup238 - 4 months
Vision plus accessibility metadata is the ultimate API. I see little reason that poorly designed flat UIs are going to confuse LLMs any less than they confuse humans, especially when they're missing from the training data (like most internal apps) or the documentation on the web is out of date. Even a basic dump of ARIA attributes or the hierarchy from OS accessibility APIs can help a lot.
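
As a sketch of combining the two signals, the snippet below (assuming Playwright; the library choice and page are illustrative) captures a screenshot and an accessibility-tree dump for the same page, so both the pixels and the ARIA-style metadata could be handed to the model.

```python
# Sketch: pair a screenshot with an accessibility-tree dump of the same page.
# Assumes: pip install playwright && playwright install chromium
import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    screenshot_png = page.screenshot()        # bytes for the vision side
    ax_tree = page.accessibility.snapshot()   # dict of roles, names, values
    browser.close()

print(json.dumps(ax_tree, indent=2))          # the cheap-to-process metadata side
```
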
By @echoangle - 4 months
Am I the only one thinking this is an awful way for AI to do useful stuff for you? Why would I train an AI to use a GUI? Wouldn’t it be better to just have the AI learn API docs and use that? I don’t want the AI to open my browser, open google maps and search for Shawarma, I want the AI to call a google api and give me the result.
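
For contrast, here is a hedged sketch of the direct-API route using Anthropic's tool-use (function-calling) interface; the search_places tool and its fields are hypothetical, and fulfilling the call against a real maps API is left to the client code.

```python
# Sketch of "call an API instead of driving a GUI": define a tool the model can
# request directly. The search_places tool is hypothetical; the client code would
# execute it against whatever backend it likes and return the result to the model.
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "search_places",
    "description": "Search for nearby places matching a query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "What to look for, e.g. 'shawarma'."},
            "location": {"type": "string", "description": "City name or 'lat,lng'."},
        },
        "required": ["query"],
    },
}]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Find shawarma near me."}],
)
print(response.content)  # expect a tool_use block asking to call search_places
```
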
By @m3kw9 - 4 months
No. Vision in this case is a brute-force way for the AI to interact with our current world, because we designed the interface for human vision. In the future, AI will create the UI, and control will most likely sit at a lower level, probably the model level, since even the business logic and UI will be generated live.
By @PreInternet01 - 4 months
Counterpoint: no, it's just more hype.

Doing real-time OCR on 1280x1024 bitmaps has been possible for... the last decade or so? Sure, you can now do it on 4K or 8K bitmaps, but that's just an incremental improvement.

Fact is, full-screen OCR coupled with innovations like "Google" has not led to "ultimate" productivity improvements, and as impressive as OpenAI et al may appear right now, the impact of these technologies will end up roughly similar.

(Which is to say: the landscape will change, but not in a truly fundamental way. What you're seeing demonstrated right now is, roughly speaking, the next Clippy, which, believe it or not, was hyped to a similar extent around the time it was introduced...)

By @freediver - 4 months
And text is the ultimate API to the human brain! ;)

https://www.youtube.com/watch?v=Zctp972y_Eg

By @cheevly - 4 months
No, language is the ultimate API.
By @throwaway19972 - 4 months
I'd imagine you'd get higher quality leveraging accessibility integrations.