Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
Anthropic has released an upgraded Claude 3.5 Sonnet with improved coding and tool use, and a new Claude 3.5 Haiku that matches the performance of the previous flagship, Claude 3 Opus. A new "computer use" feature is in public beta, enabling automation of tasks on a computer.
Anthropic has announced the release of an upgraded Claude 3.5 Sonnet and a new model, Claude 3.5 Haiku. The Claude 3.5 Sonnet features significant improvements in coding capabilities, improving its score on the SWE-bench Verified benchmark from 33.4% to 49.0%. It also excels in agentic tool use tasks, with notable gains in both the retail and airline domains. The Claude 3.5 Haiku model matches the performance of the previous largest model, Claude 3 Opus, at similar cost and speed. A new capability, "computer use," is now in public beta, allowing Claude to interact with computer interfaces like a human, which developers can use for a variety of tasks. Early adopters, including Asana and Replit, are exploring this feature for complex workflows. While the computer use feature is experimental and may have limitations, it shows promise for automating repetitive tasks and enhancing software development processes. The upgraded Claude 3.5 Sonnet is available to all users, while Claude 3.5 Haiku will be released later this month. Anthropic emphasizes the importance of responsible deployment and has implemented safety measures to mitigate potential risks associated with this new capability.
- Claude 3.5 Sonnet shows significant improvements in coding and tool use tasks.
- Claude 3.5 Haiku offers competitive performance at a similar cost to previous models.
- The new "computer use" feature allows Claude to interact with computer interfaces.
- Early feedback from developers indicates strong potential for automating complex workflows.
- Anthropic is committed to responsible deployment and safety measures for new capabilities.
Related
Claude 3.5 Sonnet
Claude 3.5 Sonnet, the latest in the model family, excels in customer support, coding, and humor comprehension. It introduces Artifacts on Claude.ai for real-time interactions, prioritizing safety and privacy. Future plans include Claude 3.5 Haiku and Opus, emphasizing user feedback for continuous improvement.
Anthropic: Expanding Access to Claude for Government
Anthropic expands AI models Claude 3 Haiku and Sonnet for government users via AWS Marketplace, emphasizing responsible AI deployment and tailored service agreements to enhance citizen services and policymaking.
The new Claude Android app brings the power of Claude, including Claude 3.5
The Claude Android app, launched on July 16, 2024, features Claude 3.5 Sonnet model. It offers multi-platform support, image analysis, real-time translation, and advanced reasoning. Users can utilize Claude for tasks like contract analysis, market research, and language translation seamlessly.
Outage for Anthropic's Claude 3.5 Sonnet
Anthropic is experiencing a partial outage affecting Claude 3.5 Sonnet and Opus services, with elevated error rates. Traffic for free users has been rerouted, and improvements are noted.
Anthropic publishes the 'system prompts' that make Claude tick
Anthropic has published system prompts for its Claude AI models to enhance transparency, outlining their capabilities and limitations, and positioning itself as an ethical player in the AI industry.
- Many users express excitement about the enhanced coding capabilities and automation features, with some noting impressive performance benchmarks.
- Concerns about the implications of AI controlling computers, including privacy issues and potential for misuse, are frequently mentioned.
- There is confusion regarding the naming conventions of the models, with users questioning why updates are not labeled as new versions.
- Some users highlight the potential for AI to automate various tasks, while others express skepticism about its reliability and safety.
- Overall, there is a mix of enthusiasm for the advancements and caution regarding the ethical implications of such powerful AI capabilities.
As someone building AI SaaS products, I used to have the position that directly integrating with APIs was going to get us most of the way there in terms of complete AI automation.
I wanted to take a stab at this problem and started researching some everyday businesses and how they use software.
My brother-in-law (who is a doctor) showed me the bespoke software they use in his practice. Running on Windows. Using MFC forms.
My accountant showed me Cantax - a very powerful software package they use to prepare tax returns in Canada. Also on Windows.
I started to realize that pretty much most of the real world runs on software that directly interfaces with people, without clearly defined public APIs you can integrate into. Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.
Boy was I wrong.
I am glad they did this, since it is a powerful connector to these types of real-world business use cases that are super-hairy, and hence very worthwhile to automate.
Computer use API documentation: https://docs.anthropic.com/en/docs/build-with-claude/compute...
Computer Use Demo: https://github.com/anthropics/anthropic-quickstarts/tree/mai...
> I apologize, but it seems we're having some issues installing Chromium. As an alternative, since I can see Firefox is already installed on the system, would you like to use Firefox instead? Or would you prefer me to try a different approach to installing Chromium?
It seems unable to figure out how to install Chromium. It sees the Firefox icon out there and just wants to use it. You know, like a real lazy human being.
It can be insistent when it needs to be, though:
> I clicked on Firefox and it should start opening. Let me take another screenshot to see if the window has appeared:
> Firefox seems to be taking a moment to start up. Let me wait a few seconds and take another screenshot:
> Let me try clicking Firefox one more time to ensure it starts:
> Let me try one more time to take a screenshot to see if Firefox has opened:
> Let me try starting Firefox using the bash tool as an alternative method:
> Firefox has successfully opened! As per the instructions, I should ignore the setup wizard and instead click directly on the address bar. Would you like me to do anything specific with Firefox now that it's open?
I didn't instruct him to ignore the setup wizard. So my guess is that Anthropic has configured it to ignore stuff happening on the screen so it doesn't go off the rails.
And here he goes through my website, through my Hacker News account, and then finds this very comment:
> Looking at his first/most recent comment, it's about a discussion of Claude and computer use. Here's what he wrote:
"I like its lazy approach"
This appears to be a humorous response in a thread about "Computer use, a new Claude 3.5 Sonnet, and Claude..." where he's commenting on an AI's behavior in a situation. The comment is very recent (shown as "8 minutes ago" in the screenshot) and is referring to a situation where an AI seems to have taken a simpler or more straightforward approach to solving a problem.
And finally, in the table in the blog post, Opus isn't even included? It seems to me like Opus is the best model they have, but they don't want people to default to using it; maybe the ROI is lower on Opus or something?
When I tested it manually, I felt Opus gave slightly better replies than Sonnet, but I'm not 100% sure it isn't just placebo.
This is a lot more than an agent able to use your computer as a tool (and understanding how to do that) - it's basically an autonomous reasoning agent that you can give a goal to, and it will then use reasoning, as well as its access to your computer, to achieve that goal.
Take a look at their demo of using this for coding.
https://www.youtube.com/watch?v=vH2f7cjXjKI
This seems to be an OpenAI o1 killer - it may be using an agent to do reasoning (still not clear exactly what is under the hood) as opposed to o1 supposedly being a model (but still basically a loop around an LLM), but the reasoning it is able to achieve in pursuit of a real-world goal is very impressive. It'd be mind-boggling if we hadn't had the last few years to get used to this escalation of capabilities.
It's also interesting to consider this from the POV of Anthropic's focus on AI safety. On their website they have a bunch of advice on how to stay safe by sandboxing, limiting what it has access to, etc., but at the end of the day this is a very capable AI able to use your computer and browser to do whatever it deems necessary to achieve a requested goal. How far are we from paperclip optimization, or at least autonomous AI hacking?
Just as how expert systems didn't take off and tagging every website for the Semantic Web didn't happen either, we have to accept that the real world of humans is messy and unstructured.
I still advocate making new things more structured. A car on wheels on flattened ground will always be more efficient than skipping the landscaping part and just riding quadruped robots through the forest on uneven terrain. We should develop better information infrastructure but the long tail of existing use cases will require automation that can deal with unstructured mess too.
I do still occasionally pop over to ChatGPT to test their waters (or if Claude is just not getting it), but I've not felt any need to switch back or have both. Well done, Anthropic!
Internet Archive confirms that on the 8th of October that page listed 3.5 Opus as coming "Later this year" https://web.archive.org/web/20241008222204/https://docs.anth...
The fact that it's no longer listed suggests that its release has at least been delayed for an unpredictable amount of time, or maybe even cancelled.
"Even while recording these demos, we encountered some amusing moments. In one, Claude accidentally stopped a long-running screen recording, causing all footage to be lost.
Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park."
Claude using Claude on a computer for coding https://youtu.be/vH2f7cjXjKI?si=Tw7rBPGsavzb-LNo (3 mins)
True end-user programming and product manager programming are coming, probably pretty soon. Not the same thing, but Midjourney went from v.1 to v.6 in less than 2 years.
If something similar happens, most jobs that could be done remotely will be automatable in a few years.
I'm more interested in Claude 3.5 Haiku, particularly if it is indeed better than the current Claude 3.5 Sonnet at some tasks as claimed.
OpenAI's branding isn't exactly screaming in your face either, but for something that's generated as much public fear/scaremongering/outrage as LLMs have over the last couple of years, Anthropic's presentation has a much "cosier" veneer to my eyes.
This isn't the Skynet Terminator wipe-us-all-out AI, it's the adorable grandpa with a bag of Werther's wipe-us-all-out AI, and that means it's going to be OK.
I look forward to the brave new future where I can code a webapp without ever touching the code, just testing, giving feedback, and explaining discovered bugs to it and it can push code and tweak infrastructure to accomplish complex software engineering tasks all on its own.
It's going to be really wild when Claude (or another AI) can make a list of possible bugs and UX changes and just ask the user for approval to greenlight the change.
It makes me think. Perhaps the act of applying to jobs will go extinct. Maybe the endgame is that as soon as you join a website like Monster or LinkedIn, you immediately “apply” to every open position, and are simply ranked against every other candidate.
After paying for ChatGPT and OpenAI API credits for a year, I switched to Claude when they launched Artifacts and never looked back.
Claude Sonnet 3.5 is already so good, especially at coding. I'm looking forward to testing the new version if it is, indeed, even better.
Sonnet 3.5 was a major leap forward for me personally, similar to the GPT-3.5 to GPT-4 bump back in the day.
Can you imagine how simple the world would be if you'd just need to tell Claude: "user X needs to have access to feature Y, please give them the correct permissions", with no need to spend days in AAD documentation and the settings screens maze. I fear AAD is AI-proof, though :)
I don't actually care what the answer is. There's no answer that will make it make sense to me.
I don't want an llm to write all my code, regardless of if it works, I like to write code. What these models are capable of at the moment is perfect for my needs and I'd be 100% okay if they didn't improve at all going forward.
Edit: also I don't see how an llm controlled system can ever replace a deterministic system for critical applications.
On the other hand, as long as it's actually advancing the Pareto frontier of capability, re-using the same name means everyone gets an upgrade with no switching costs.
Though, all said, Claude still seems to be somewhat of an insider secret. "ChatGPT" has something like 20x the Google traffic of "Claude" or "Anthropic".
https://trends.google.com/trends/explore?date=now%201-d&geo=...
Like find me a list of things to do with a family, given today's weather and in the next 2 hours, quiet sit down with lots of comfy seating, good vegetarian food...
Not only is this kind of use getting around API restrictions, it is also a superior way to do search: Specify arbitrary preferences upfront instead of a search box and trawling different modalities of content to get better result. The possibilities for wellness use cases are endless, especially for end users that care about privacy and less screen use.
- "computer use" is basically using Claude's vision + tool use capability in a loop. There's a reference impl but there's no "claude desktop" app that just comes with this OOTB
- they're basically advertising that they bumped up Claude 3.5's screen vision capability. we discussed the importance of this general computer agent approach with David on our pod https://x.com/swyx/status/1771255525818397122
- @minimaxir points out questions on cost. Note that the vision use is very sparing - the loop is I/O constrained - it waits for the tool to run and then takes a screenshot, then loops. for a simple 10 loop task at max resolution, Haiku costs <1 cent, Sonnet 8 cents, Opus 41 cents.
- beating o1-preview on SWEbench Verified without extended reasoning and at 4x cheaper output per token (a lot cheaper in total tokens since no reasoning tokens) is ABSOLUTE mogging
- New 3.5 Haiku is 68% cheaper than Claude Instant haha
references i had to dig a bit to find
- https://www.anthropic.com/pricing#anthropic-api
- https://docs.anthropic.com/en/docs/build-with-claude/vision#...
- loop code https://github.com/anthropics/anthropic-quickstarts/blob/mai...
- some other screenshots https://x.com/swyx/status/1848751964588585319
- https://x.com/alexalbert__/status/1848743106063306826
- model card https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Cla...
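Picking up on the loop described above, here is a minimal sketch of the screenshot → model → tool-call cycle. It assumes the Anthropic Python SDK plus the beta tool type and flag named in the computer-use docs linked in this thread; `take_screenshot()` and `execute_action()` are hypothetical stand-ins for whatever desktop automation you wire up, and the exact SDK surface may differ from this:

```python
# Minimal computer-use loop sketch: ask the model what to do, execute the tool
# call, send back a fresh screenshot, repeat until the model stops asking.
# take_screenshot() / execute_action() are hypothetical helpers (e.g. built on
# pyautogui or xdotool); tool type / beta flag names follow the docs linked above.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TOOLS = [{
    "type": "computer_20241022",   # beta computer-use tool
    "name": "computer",
    "display_width_px": 1024,
    "display_height_px": 768,
}]

def run(goal: str, max_turns: int = 10):
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_turns):
        response = client.beta.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
            betas=["computer-use-2024-10-22"],
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            break  # model is done (or is asking the user something)
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                execute_action(block.input)  # hypothetical: move mouse, click, type...
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": take_screenshot(),  # hypothetical: base64 PNG string
                        },
                    }],
                })
        messages.append({"role": "user", "content": tool_results})
    return messages
```

Note how the loop blocks on the tool run and sends back only one screenshot per turn, which is why it is I/O-constrained and why the per-task vision cost stays in the cents range estimated above.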
https://www.youtube.com/watch?v=jqx18KgIzAE
shows Sonnet 3.5 using the Google web UI in an automated fashion. Do Google's terms really permit this? Will Google permit this when it is happening at scale?
Nice improvements in scores across the board, e.g.
> On coding, it [the new Sonnet 3.5] improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding.
I've been using Sonnet 3.5 for most of my AI-assisted coding and I'm already very happy (using it with the Zed editor, I love the "raw" UX of its AI assistant), so any improvements, especially seemingly large ones like this are very welcome!
I'm still extremely curious about how Sonnet 3.5 itself, and its new iteration are built and differ from the original Sonnet. I wonder if it's in any way based on their previous work[0] which they used to make golden-gate Claude.
[0]: https://transformer-circuits.pub/2024/scaling-monosemanticit...
I've often wondered what the combination of grammar-based speech recognition and an LLM could do for accessibility: general natural-language speech recognition for open-ended input, augmented by grammar-based recognition for domain-specific commands, improving efficiency and accuracy while reducing voice strain.
Computer use seems like it might be good for e2e tests.
Model | Global | Reasoning | Coding | Math | Data | Language | IF
------------------------------|---------|-----------|---------|---------|---------|----------|-------
o1-preview-2024-09-12 | 66.02 | 68.00 | 50.85 | 62.92 | 63.97 | 72.66 | 77.72
claude-3-5-sonnet-20241022 | 60.33 | 58.67 | 67.13 | 51.28 | 52.78 | 58.09 | 74.05
claude-3-5-sonnet-20240620 | 59.80 | 58.67 | 60.85 | 53.32 | 56.74 | 56.94 | 72.30
November 2024: AI is allowed to execute commands in a bash shell. What could possibly go wrong?
However, I've been using Opus as a writing companion for several months; especially when I have writer's block and ask it for alternative phrasings, it has been super creative. But in recent weeks I noticed a degradation in quality. My impression is that the model was degrading. Could this be technically possible? Might it be some kind of planned obsolescence to hype new models?
But I am very excited about this in the context of accessibility. Screen readers and screen control software is hard to develop and hard to learn to use. This sort of “computer use” with AI could open up so many possibilities for users with disabilities.
But what I really want to see is a CLI. Watching their software crank out Bash, vim, Emacs!, etc. - that would be fascinating!
Nice, but I wonder why they didn't use UI automation/accessibility libraries that have access to the semantic structure of apps/web pages, as well as accessing documents directly instead of having Excel display them for you.
So, this is how AI takes over the world.
- AI Labs will eat some of the wrappers on top of their APIs - even complex ones like this. There are whole startups that are trying to build computer use.
- AI is fitting _some_ scaling law - the best models are getting better and the "previously-state-of-the-art" models are fractions of what they cost a couple years ago. Though it remains to be seen if it's like Moore's Law or if incremental improvements get harder and harder to make.
In case you're wondering, I tried o1-preview, and while it did work, I was also initially perplexed why the result looked pixelated. Turns out, that's because many of the p5js examples online use a relatively simple approach where they just see which cell-center each pixel is closest to, more or less. I mean, it works, but it's a pretty crude approach.
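For reference, that "crude" approach is just a per-pixel nearest-seed lookup; a rough NumPy sketch (seed count and image size are made up for illustration):

```python
# Pixelated Voronoi: color each pixel by whichever seed point it is closest to.
import numpy as np

w, h, n_seeds = 400, 400, 12                      # illustrative sizes
rng = np.random.default_rng(0)
seeds = rng.uniform([0, 0], [w, h], size=(n_seeds, 2))
colors = rng.integers(0, 256, size=(n_seeds, 3), dtype=np.uint8)

ys, xs = np.mgrid[0:h, 0:w]
pixels = np.stack([xs, ys], axis=-1).astype(float)           # (h, w, 2)
# Squared distance from every pixel to every seed, then index of the nearest seed.
d2 = ((pixels[:, :, None, :] - seeds[None, None, :, :]) ** 2).sum(axis=-1)
image = colors[d2.argmin(axis=-1)]                            # (h, w, 3) flat-colored cells
```

A "proper" vector Voronoi (e.g. scipy.spatial.Voronoi) computes the actual cell polygons instead, which is what an SVG output needs.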
Now, granted, you're probably not doing creative coding at your job, so this may not matter that much, but to me it was an example of pretty poor generalization capabilities. Curiously, Claude has no problem whatsoever generating a voronoi diagram as an SVG, but writing a script to generate said diagrams using a particular library eluded it. It knows how to do one thing but generalizes poorly when attempting to do something similar.
Really hard to get a real sense of capabilities when you're faced with experiences like this, all the while somehow it's able to solve 46% of real-world python pull-requests from a certain dataset. In case you're wondering, one paper (https://cs.paperswithcode.com/paper/swe-bench-enhanced-codin...) found that 94% of the pull-requests on SWE-bench were created before the knowledge cutoff dates of the latest LLMs, so there's almost certainly a degree of data-leakage.
The marketing team vetoed Claude 3.6???
Anthropic doesn't offer an unlimited chatbot service, only plans that give you "more" usage, whatever that means. If you have an API key, you are "unlimited," so they have the capability. Why doesn't the chatbot allow one to use their API key in the Claude app to get unlimited usage? (Yes, I know there are third-party BYOK tools. That's not the question.)
Claude appears to be smart enough to make an Excel spreadsheet with simple formulae. However, it is apparently prevented from making any kind of file. Why? What principle underlies that guardrail that does not also apply to Computer Use?
Really want to make Claude my daily driver, but right now it often feels too much like a research project.
> Refactor the api folder with any recommended readability improvements or improvements that would help DRY up code without adding additional complexity.
Then I can just `git status` to see the changes?
"3.5 Sonnet (New)", WTAF? - just call it 3.6 Sonnet or something.
Is it "New" sonnet? is it "upgraded"? Is there a difference? How do I know which one I use?
I can understand claude-3-5-sonnet-20241022, but that's not what users see.
To me this is the most annoying grammatical error. I can't wait for AI to take over all prose writing so this egregious construction finally vanishes from public fora. There may be some downsides -- okay, many -- but at least I won't have to read endless repetitions of "similar speed to ..." when the correct form is obviously "speed similar to".
In fact, in time this correct grammar may betray the presence of AI, since lowly biologicals (meaning us) appear not to either understand or fix this annoying error without computer help.
Pretty cool for sure.
I wonder what optimizations could be made? Could a gold farmer have the directions from one AI control many accounts? Could the AI program simpler bots for each bit of the game?
I can imagine not being smart enough to play against computers, because I am flagged as a bot. I can imagine a message telling me I am banned because "nobody but a stupid bot would score so low."
---
Some thoughts:
* Will be interesting to see what we can build in terms of automatic development loops with the new computer use capabilities.
* I wonder if they are not releasing Opus because it's not done or because they don't have enough inference compute to go around, and Sonnet is close enough to state of the art?
It's a problem we used to work on, and one that many other people have wanted to solve for the past 10 years. So it remains to be seen how well it works outside a demo.
What was surprising was the slow, human speed of operations. It types into the text boxes at a human speed rather than just dumping the text there. Is it so the human can better monitor what's happening, or is it so it does not trigger CAPTCHAs?
It also seems like a privacy issue that screenshots of your device are sent back to their servers.
A comment about the video: Sam Runger talks wayyy too fast, in particular at the beginning.
Did I miss something? Did they have to make changes to the model for this?
Looking forward to seeing this in the coming few years. And hoping such a robot could be of help to many people, including the elderly.
The barrier to scraping youtube has increased a lot recently, I can barely use yt-dlp anymore
Amazon has really neglected ap-southeast-2 when it comes to LLMs.
The combination of the words "computer use" is highly confusing. It's also "Yoda speak". For example it's hard for humans to parse the sentences *"Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku"*, *"Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku "* (it literally relies on the comma to make any sense) and *"Computer use for automated interaction"* (in the youtube vid's title: this one is just broken english). Please suggest terms that are not confusing for a new ability allowing an AI to control a computer as if it was a human.
> I'm limited to what I know as of April 2024, which includes the initial Claude 3 family launch but not subsequent updates.
I was about to try to add a custom API. I’m impressed by the speed of that team.
Oh wow!
Claude 3.5 Haiku will be released later this month.
though I am looking forward to using the new one in cursor.ai
You can just use any IDE you want and it will work with it.
Their whole policing AI models stance is commendable but ultimately renders their tools useless.
It actually started arguing with me about whether it was allowed to help implement a GitHub repository's code because it might be copyrighted... it was MIT-licensed open source from Google :/
It seems that you can only send a single message, so you can't rely on its ability to "learn" from predefined documents.
Just another reason to use ONLY local LLM's.
On aider's code editing benchmark:
84% Claude 3.5 Sonnet 10/22
80% o1-preview
77% Claude 3.5 Sonnet 06/20
72% DeepSeek V2.5
72% GPT-4o 08/06
71% o1-mini
68% Claude 3 Opus
It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!
92% Sonnet 10/22
75% o1-preview
72% Opus
64% Sonnet 06/20
49% GPT-4o 08/06
45% o1-mini
https://aider.chat/docs/leaderboards/
Just tested Claude's new Computer Use feature and had to share this simple but powerful test:
My Basic Prompt: "Please:
1. Search Amazon for 3 wireless earbuds: find the price, rating, and brand name.
2. Make a simple Excel file 'earbuds.xlsx': put the information in a basic table, add colors to the headers, sort by price.
3. Show me the results."
What blew my mind:
- Claude actually looked at my screen
- Moved the mouse by itself
- Clicked buttons like a human
- Created reports automatically
It's like having a virtual assistant that can really use your computer! No coding needed - just simple English instructions.
For those interested: https://mergisi.medium.com/8f56f683e307
'Hey Claude 3.5 New, pretend I'm a CEO of a big company and need to lay off 20% people, make me a spreadsheet and send it to HR. Oh make sure to not fire the HR department'
cf. IBM, 1979.
"How many w's are in strawberry?"
gpt-4o: There are 2 "w's" in "strawberry."
Claude 3.5 Sonnet (new): Let me count the w's in "strawberry": 0 w's.
(same question with 'r' succeeds)
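For what it's worth, the ground truth here is a one-liner; "strawberry" contains one 'w' and three 'r's:

```python
# Trivial ground truth for the letter-counting question.
print("strawberry".count("w"))  # 1
print("strawberry".count("r"))  # 3
```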
What is artificial about current gen of "artificial intelligence" is the way training (predict next token) and benchmarking (overfitting) is done. Perhaps a fresh approach is needed to achieve a true next step.