Claude Computer Use – Is Vision the Ultimate API?
The article reviews Anthropic's Claude Computer Use, noting its strengths in screen reading and navigation but highlighting its difficulty knowing when to re-read the screen and tracking application state, both of which need further work.
The article discusses the author's experience with Anthropic's Computer Use API, highlighting its strengths and weaknesses. Claude Computer Use, built on Claude 3.5, excels at understanding computer interactions through screenshots, making it effective at screen reading and navigation. It performs well with function calls, often preferring them over manual clicks. However, it struggles to recognize when to read the screen, which can lead to errors during task execution. It also has difficulty fetching data efficiently and remembering the state of applications, particularly when dealing with modals and popups. The author emphasizes the importance of providing Claude with as much system-state information as possible to improve its performance, and notes that handling uncertainty remains a significant challenge in agent development. The author concludes that while Claude Computer Use is a step toward true agent behavior, further advances are needed to fully realize its potential.
- Claude Computer excels in screen reading and navigation through screenshots.
- It struggles with recognizing when to read the screen and remembering application states.
- Providing system state information can improve Claude's performance.
- Handling uncertainty remains a key challenge in AI agent development.
- Claude Computer is a step towards achieving true agent behavior, but more advancements are needed.
Related
Claude by Anthropic Now on Google Play Store
Claude by Anthropic is an AI assistant app on Google Play, offering instant answers, task collaboration, and job delegation. Users can access a free basic version or upgrade to a paid Pro plan for more features. Reviews highlight its clean interface and suggest adding voice mode and app integration. Claude prioritizes data privacy and encryption, making it a top free productivity app for AI assistance.
Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
Anthropic has released Claude 3.5 Sonnet, improving coding and tool use, and Claude 3.5 Haiku, matching previous performance. A new "computer use" feature is in beta, enhancing automation.
Computer Use API Documentation
Anthropic's Claude 3.5 Sonnet model now includes a beta "computer use" feature for desktop interaction, allowing task execution with safety measures advised due to associated risks, especially online.
Initial explorations of Anthropic's new Computer Use capability
Anthropic has launched the Claude 3.5 Sonnet model and a "computer use" API mode, enhancing desktop interaction with coordinate support while addressing safety concerns and performance improvements in coding tasks.
Claude Sonnet 3.5.1 and Haiku 3.5
Anthropic launched Claude Sonnet 3.5.1 and Haiku 3.5, featuring improved performance and a new computer use function in beta, while highlighting safety concerns and encouraging cautious use for low-risk tasks.
The historical progression from text to still images to audio to moving images will hold true for AI as well.
Just look at OpenAI's progression as well from LLM to multi-modal to the realtime API.
A co-worker almost 20 years ago said something interesting to me as we were discussing Al Gore's Current TV project: the history of information is constrained by "bandwidth". He pointed out how broadcast television went from 72 hours of "bandwidth" (3 channels x 24h) per day to having so much bandwidth that we could have a channel of citizen journalists. Of course, this was also right when YouTube was taking off.
The pattern holds true for AI.
AI is going to create "infinite bandwidth".
This is a great place for people to start caring about accessibility annotations. All serious UI toolkits let you tell the computer what's on the screen. This enables things like Windows UI Automation https://learn.microsoft.com/en-us/windows/win32/winauto/entr... which exposes a tree of controls with labels and descriptions without any vision/OCR. It can be inspected with tools like FlaUInspect https://github.com/FlaUI/FlaUInspect?tab=readme-ov-file#main... But see how the example shows a status bar with (Text "UIA3" "")? It could have been (Text "UIA3" "Current automation interface") instead, giving both a good tooltip and an accessibility label.
Now we can kill two birds with one stone: actually improve the accessibility of everything (and make sure custom controls adhere to the framework), and provide the same data to the coming automation agents. A text description will be much cheaper to process than a screenshot. It will also help my work with manually coded app automation, so that's a win-win-win.
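To make the "cheaper than a screenshot" point concrete, here is a minimal sketch of flattening a control tree, like the one UI Automation exposes, into indented text an agent can consume. The `Control` class and its fields are illustrative stand-ins, not the real UI Automation API:

```python
from dataclasses import dataclass, field

@dataclass
class Control:
    """One node in a UI control tree, as an accessibility API might expose it.
    Field names are illustrative, not the actual UI Automation schema."""
    role: str                      # e.g. "Button", "Text", "StatusBar"
    name: str = ""                 # the accessibility label
    children: list["Control"] = field(default_factory=list)

def serialize(control: Control, depth: int = 0) -> str:
    """Flatten the tree into indented text: a few hundred bytes instead of
    a multi-megabyte screenshot for the model to process."""
    line = f"{'  ' * depth}{control.role} \"{control.name}\""
    return "\n".join([line] + [serialize(c, depth + 1) for c in control.children])

# The status bar example from above, with a real label instead of ""
tree = Control("Window", "FlaUInspect", [
    Control("StatusBar", "", [
        Control("Text", "Current automation interface"),
    ]),
])
print(serialize(tree))
```

A well-labeled tree like this doubles as the accessibility annotation and the agent-facing description, which is the whole point.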
As a side effect, it would also solve issues with UI weirdness. Have you ever had Windows open something on a screen that is no longer connected? Or under another window? Or minimized? Screenshots won't give enough information to make progress there.
The ultimate API is "all the raw data you can acquire from your environment".
- A list of applications that are open
- Which application has active focus
- What is focused inside the application
- Function calls to specifically navigate those applications, as many as possible
We’ve found the same thing while building the client for testdriver.ai. This info is in every request.
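The state snapshot described above can be sketched as a simple payload attached to every request. All field names here are illustrative assumptions, not testdriver.ai's actual schema:

```python
import json

def build_system_state(windows, focused_app, focused_control, actions):
    """Assemble a per-request system-state snapshot for an agent.
    Field names are hypothetical, chosen to mirror the list above."""
    return {
        "open_applications": windows,
        "active_focus": focused_app,
        "focused_element": focused_control,
        # Function calls the agent can prefer over raw clicks
        "available_actions": actions,
    }

state = build_system_state(
    windows=["Firefox", "Terminal", "Mail"],
    focused_app="Firefox",
    focused_control="address bar",
    actions=["open_url", "switch_app", "read_clipboard"],
)
# Serialized and sent alongside (or instead of) a screenshot in each request
payload = json.dumps(state, indent=2)
```

Keeping this snapshot in every request, rather than asking the model to infer it from pixels, sidesteps the state-tracking failures the article describes.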
It’s actually a super cool development and I’m already very excited to let my computer use any software like a pro in front of me. Paint me a canvas of a savanna sunset with animal silhouettes, produce me a track of UK garage house, etc., everything with all the layers and elements in the software, not just a finished output.
Doing real-time OCR on 1280x1024 bitmaps has been possible for... the last decade or so? Sure, you can now do it on 4K or 8K bitmaps, but that's just an incremental improvement.
Fact is, full-screen OCR coupled with innovations like "Google" has not led to "ultimate" productivity improvements, and as impressive as OpenAI et al. may appear right now, the impact of these technologies will end up roughly similar.
(Which is to say: the landscape will change, but not in a truly fundamental way. What you're seeing demonstrated right now is, roughly speaking, the next Clippy, which, believe it or not, was hyped to a similar extent around the time it was introduced...)