October 22nd, 2024

Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku

Anthropic has released an upgraded Claude 3.5 Sonnet with improved coding and tool use, and announced Claude 3.5 Haiku, which matches the performance of Claude 3 Opus. A new "computer use" capability is in public beta.


Anthropic has announced the release of an upgraded Claude 3.5 Sonnet and a new model, Claude 3.5 Haiku. The upgraded Claude 3.5 Sonnet features significant improvements in coding, raising its score on the SWE-bench Verified benchmark from 33.4% to 49.0%. It also improves on agentic tool use tasks, with notable gains in both the retail and airline domains. Claude 3.5 Haiku matches the performance of Claude 3 Opus, Anthropic's previous largest model, at a cost and speed similar to the previous generation of Haiku. A new capability, "computer use," is now in public beta. It lets Claude interact with computer interfaces the way a person does, and developers can apply it to a range of tasks. Early adopters, including Asana and Replit, are exploring the feature for complex workflows. Although computer use is experimental and has limitations, it shows promise for automating repetitive tasks and supporting software development. The upgraded Claude 3.5 Sonnet is available to all users, while Claude 3.5 Haiku will be released later this month. Anthropic emphasizes responsible deployment and has implemented safety measures to mitigate the risks associated with the new capability.

- Claude 3.5 Sonnet shows significant improvements in coding and tool use tasks.

- Claude 3.5 Haiku matches Claude 3 Opus at a cost similar to the previous generation of Haiku.

- The new "computer use" feature allows Claude to interact with computer interfaces.

- Early feedback from developers indicates strong potential for automating complex workflows.

- Anthropic is committed to responsible deployment and safety measures for new capabilities.

AI: What people are saying
The release of Claude 3.5 Sonnet and the new "computer use" feature has generated significant discussion among users.
  • Many users express excitement about the enhanced coding capabilities and automation features, with some noting impressive performance benchmarks.
  • Concerns about the implications of AI controlling computers, including privacy issues and potential for misuse, are frequently mentioned.
  • There is confusion regarding the naming conventions of the models, with users questioning why updates are not labeled as new versions.
  • Some users highlight the potential for AI to automate various tasks, while others express skepticism about its reliability and safety.
  • Overall, there is a mix of enthusiasm for the advancements and caution regarding the ethical implications of such powerful AI capabilities.
132 comments
By @LASR - 4 months
This is actually a huge deal.

As someone building AI SaaS products, I used to have the position that directly integrating with APIs is going to get us most of the way there in terms of complete AI automation.

I wanted to take a stab at this problem and started researching some everyday businesses and how they use software.

My brother-in-law (who is a doctor) showed me the bespoke software they use in his practice. Running on Windows. Using MFC forms.

My accountant showed me Cantax - a very powerful software package they use to prepare tax returns in Canada. Also on Windows.

I started to realize that pretty much most of the real world runs on software that directly interfaces with people, without clearly defined public APIs you can integrate into. Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.

Boy was I wrong.

I am glad they did this, since it is a powerful connector to these types of real-world business use cases that are super-hairy, and hence very worthwhile in automating.

By @marsh_mellow - 4 months
By @csomar - 4 months
I like its lazy approach

> I apologize, but it seems we're having some issues installing Chromium. As an alternative, since I can see Firefox is already installed on the system, would you like to use Firefox instead? Or would you prefer me to try a different approach to installing Chromium?

It seems unable to figure out how to install Chromium. It sees the Firefox icon out there and just wants to use it. You know, like a real lazy human being.

It seems to be insistent when it needs to be, though

> I clicked on Firefox and it should start opening. Let me take another screenshot to see if the window has appeared:

> Firefox seems to be taking a moment to start up. Let me wait a few seconds and take another screenshot:

> Let me try clicking Firefox one more time to ensure it starts:

> Let me try one more time to take a screenshot to see if Firefox has opened:

> Let me try starting Firefox using the bash tool as an alternative method:

> Firefox has successfully opened! As per the instructions, I should ignore the setup wizard and instead click directly on the address bar. Would you like me to do anything specific with Firefox now that it's open?

I didn't instruct him to ignore the setup wizard. So my guess is that Anthropic has configured it to ignore stuff happening on the screen so it doesn't go loose.

And here he goes through my website, through my Hacker News account, and then finds this very comment

> Looking at his first/most recent comment, it's about a discussion of Claude and computer use. Here's what he wrote:

> "I like its lazy approach"

> This appears to be a humorous response in a thread about "Computer use, a new Claude 3.5 Sonnet, and Claude..." where he's commenting on an AI's behavior in a situation. The comment is very recent (shown as "8 minutes ago" in the screenshot) and is referring to a situation where an AI seems to have taken a simpler or more straightforward approach to solving a problem.

By @diggan - 4 months
I still feel like the difference between Sonnet and Opus is a bit unclear. Somewhere on Anthropic's website it says that Opus is the most advanced, but on other parts it says Sonnet is the most advanced and also the fastest. The UI doesn't make the distinction clear either. Then on Perplexity, Perplexity says that Opus is the most advanced, compared to Sonnet.

And finally, in the table in the blog post, Opus isn't even included? It seems to me like Opus is the best model they have, but they don't want people to default to using it; maybe the ROI is lower on Opus or something?

When I manually tested it, I felt like Opus gave slightly better replies compared to Sonnet, but I'm not 100% sure it isn't just placebo.

By @HarHarVeryFunny - 4 months
The "computer use" ability is extremely impressive!

This is a lot more than an agent able to use your computer as a tool (and understanding how to do that) - it's basically an autonomous reasoning agent that you can give a goal to, and it will then use reasoning, as well as its access to your computer, to achieve that goal.

Take a look at their demo of using this for coding.

https://www.youtube.com/watch?v=vH2f7cjXjKI

This seems to be an OpenAI GPT-o1 killer - it may be using an agent to do reasoning (still not clear exactly what is under the hood) as opposed to GPT-o1 supposedly being a model (but still basically a loop around an LLM), but the reasoning it is able to achieve in pursuit of a real world goal is very impressive. It'd be mind boggling if we hadn't had the last few years to get used to this escalation of capabilities.

It's also interesting to consider this from the POV of Anthropic's focus on AI safety. On their web site they have a bunch of advice on how to stay safe by sandboxing, limiting what it has access to, etc., but at the end of the day this is a very capable AI able to use your computer and browser to do whatever it deems necessary to achieve a requested goal. How far are we from paperclip optimization, or at least autonomous AI hacking?

By @bonoboTP - 4 months
I've been saying this is coming for a long time, but my really smart SWE friend who is nevertheless not in the AI/ML space dismissed it as a stupid roundabout way of doing things. That software should just talk via APIs. No matter how much I argued regarding legacy software/websites and how much functionality is really only available through GUI, it seems some people are really put off by this type of approach. To me, who is more embedded in the AI, computer vision, robotics world, the fuzziness of day-to-day life is more apparent.

Just as how expert systems didn't take off and tagging every website for the Semantic Web didn't happen either, we have to accept that the real world of humans is messy and unstructured.

I still advocate making new things more structured. A car on wheels on flattened ground will always be more efficient than skipping the landscaping part and just riding quadruped robots through the forest on uneven terrain. We should develop better information infrastructure but the long tail of existing use cases will require automation that can deal with unstructured mess too.

By @LVB - 4 months
Not specific to this update, but I wanted to chime in with just how useful Claude has been, and relatively better than ChatGPT and GitHub copilot for daily use. I've been pro for maybe 6 months. I'm not a power user leveraging their API or anything. Just the chat interface, though with ever more use of Projects, lately. I use it every day, whether for mundane answers or curiosities, to "write me this code", to general consultation on a topic. It has replaced search in a superior way and I feel hugely productive with it.

I do still occasionally pop over to ChatGPT to test their waters (or if Claude is just not getting it), but I've not felt any need to switch back or have both. Well done, Anthropic!

By @simonw - 4 months
Claude 3.5 Opus is no longer mentioned at all on https://docs.anthropic.com/en/docs/about-claude/models

Internet Archive confirms that on the 8th of October that page listed 3.5 Opus as coming "Later this year" https://web.archive.org/web/20241008222204/https://docs.anth...

The fact that it's no longer listed suggests that its release has at least been delayed for an unpredictable amount of time, or maybe even cancelled.

By @gzer0 - 4 months
One of the funnier things during training with the new API (which can control your computer) was this:

"Even while recording these demos, we encountered some amusing moments. In one, Claude accidentally stopped a long-running screen recording, causing all footage to be lost.

Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park."

[0] https://x.com/AnthropicAI/status/1848742761278611504

By @nopinsight - 4 months
This needs more discussion:

Claude using Claude on a computer for coding https://youtu.be/vH2f7cjXjKI?si=Tw7rBPGsavzb-LNo (3 mins)

True end-user programming and product manager programming are coming, probably pretty soon. Not the same thing, but Midjourney went from v.1 to v.6 in less than 2 years.

If something similar happens, most jobs that could be done remotely will be automatable in a few years.

By @simonw - 4 months
I wrote up some of my own notes on Computer Use here: https://simonwillison.net/2024/Oct/22/computer-use/
By @minimaxir - 4 months
From the computer use video demo, that's a lot of API calls. Even though Claude 3.5 Sonnet is relatively cheap for its performance, I suspect computer use won't be. It's a very good idea that Anthropic is upfront that it isn't perfect. And it's guaranteed that there will be a viral story where Claude accidentally deletes something important with it.

I'm more interested in Claude 3.5 Haiku, particularly if it is indeed better than the current Claude 3.5 Sonnet at some tasks as claimed.

By @highwaylights - 4 months
Completely irrelevant, and it might just be me, but I really like Anthropic's understated branding.

OpenAI's branding isn't exactly screaming in your face either, but for something that's generated as much public fear/scaremongering/outrage as LLMs have over the last couple of years, Anthropic's presentation has a much "cosier" veneer to my eyes.

This isn't the Skynet Terminator wipe-us-all-out AI, it's the adorable grandpa with a bag of werthers wipe-us-all-out AI, and that means it's going to be OK.

By @cwkoss - 4 months
Claude is amazing. The project documents functionality makes it a clear leader ahead of ChatGPT and I have found it to be the clear leader in coding assistance over the past few months. Web automation is really exciting.

I look forward to the brave new future where I can code a webapp without ever touching the code, just testing, giving feedback, and explaining discovered bugs to it and it can push code and tweak infrastructure to accomplish complex software engineering tasks all on its own.

It's going to be really wild when Claude (or other AI) can make a list of possible bugs and UX changes and just ask the user for approval to greenlight the change.

By @TaylorAlexander - 4 months
And today I realized that despite it being an extremely common activity, we don’t really have a word for “using the computer” which is distinct from “computing”. It’s funny because AI models are always “using a computer” but now they can “use your computer.”
By @janalsncm - 4 months
Reminds me of the rise in job application bots. People are applying to thousands of jobs using automated tools. It’s probably one of the inevitable use cases of this technology.

It makes me think. Perhaps the act of applying to jobs will go extinct. Maybe the endgame is that as soon as you join a website like Monster or LinkedIn, you immediately “apply” to every open position, and are simply ranked against every other candidate.

By @trzy - 4 months
Pretty cool! I use Claude 3.5 to control a robot (ARKit/iOS based) and it does surprisingly well in the real world: https://youtu.be/-iW3Vzzr3oU?si=yzu2SawugXMGKlW9
By @hugocbp - 4 months
Great work by Anthropic!

After paying for ChatGPT and OpenAI API credits for a year, I switched to Claude when they launched Artifacts and never looked back.

Claude Sonnet 3.5 is already so good, especially at coding. I'm looking forward to testing the new version if it is, indeed, even better.

Sonnet 3.5 was a major leap forward for me personally, similar to the GPT-3.5 to GPT-4 bump back in the day.

By @alentred - 4 months
If the "computer use" feature is able to find its way around Azure, AAD/Entra, SharePoint settings, etc. - it has a chance of becoming a better user interface for Microsoft products. :)

Can you imagine how simple the world would be if you'd just need to tell Claude: "user X needs to have access to feature Y, please give them the correct permissions", with no need to spend days in AAD documentation and the settings screens maze. I fear AAD is AI-proof, though :)

By @KingOfCoders - 4 months
I have been a paying ChatGPT customer for a long time (since the very beginning). Last week I compared ChatGPT to Claude, and Claude's results (to my eye) were better, the output better structured, and the canvas works better. I'm on the edge of jumping ship.
By @astrange - 4 months
I think this is good evidence that people's jobs are not being replaced by AI, because no AI would give the product a confusing name like "new Claude 3.5 Sonnet".
By @015a - 4 months
Why on god's green earth is it not just called Claude 3.6 Sonnet. Or Claude 4 Sonnet.

I don't actually care what the answer is. There's no answer that will make it make sense to me.

By @TechDebtDevin - 4 months
Not that I'm scared of this update but I'd probably be alright with pausing llm development today, at least in regard to producing code.

I don't want an llm to write all my code, regardless of if it works, I like to write code. What these models are capable of at the moment is perfect for my needs and I'd be 100% okay if they didn't improve at all going forward.

Edit: also I don't see how an llm controlled system can ever replace a deterministic system for critical applications.

By @pradn - 4 months
Great progress from Anthropic! They really shouldn't change models from under the hood, however. A name should refer to a specific set of model weights, more or less.

On the other hand, as long as it's actually advancing the Pareto frontier of capability, re-using the same name means everyone gets an upgrade with no switching costs.

Though, all said, Claude still seems to be somewhat of an insider secret. "ChatGPT" has something like 20x the Google traffic of "Claude" or "Anthropic".

https://trends.google.com/trends/explore?date=now%201-d&geo=...

By @devinprater - 4 months
Maybe LLMs helping blind people like me play video games that aren't normally accessible to us is getting closer!
By @lr1970 - 4 months
I am curious why "upgraded Claude 3.5 Sonnet" instead of simply Claude 3.6 Sonnet? A minor version increment is a standard way of versioning updates. Am I missing something, or is it just Anthropic marketing?
By @ramesh31 - 4 months
Claude is absurdly better at coding tasks than OpenAI. Like it's not even close. Particularly when it comes to hallucinations. Prompt for prompt, I see Claude being rock solid and returning fully executable code, with all the correct imports, while OpenAI struggles to even complete the task and will make up nonexistent libraries/APIs out of whole cloth.
By @itissid - 4 months
This can power one of my favorite use-cases.

Like find me a list of things to do with a family, given today's weather and in the next 2 hours, quiet sit down with lots of comfy seating, good vegetarian food...

Not only is this kind of use getting around API restrictions, it is also a superior way to do search: specify arbitrary preferences upfront instead of using a search box and trawling different modalities of content to get a better result. The possibilities for wellness use cases are endless, especially for end users that care about privacy and less screen use.

By @swyx - 4 months
my quick notes on Computer Use:

- "computer use" is basically using Claude's vision + tool use capability in a loop. There's a reference impl but there's no "claude desktop" app that just comes with this OOTB

- they're basically advertising that they bumped up Claude 3.5's screen vision capability. we discussed the importance of this general computer agent approach with David on our pod https://x.com/swyx/status/1771255525818397122

- @minimaxir points out questions on cost. Note that the vision use is very sparing - the loop is I/O constrained - it waits for the tool to run and then takes a screenshot, then loops. for a simple 10 loop task at max resolution, Haiku costs <1 cent, Sonnet 8 cents, Opus 41 cents.

- beating o1-preview on SWEbench Verified without extended reasoning and at 4x cheaper output per token (a lot cheaper in total tokens since no reasoning tokens) is ABSOLUTE mogging

- New 3.5 Haiku is 68% cheaper than Claude Instant haha

references i had to dig a bit to find

- https://www.anthropic.com/pricing#anthropic-api

- https://docs.anthropic.com/en/docs/build-with-claude/vision#...

- loop code https://github.com/anthropics/anthropic-quickstarts/blob/mai...

- some other screenshots https://x.com/swyx/status/1848751964588585319

- https://x.com/alexalbert__/status/1848743106063306826

- model card https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Cla...
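
The "vision + tool use in a loop" idea from the notes above can be sketched as follows. This is a minimal illustration, not Anthropic's reference implementation and not its API: `Action`, `FakeDesktop`, `run_loop`, and `scripted_model` are all hypothetical names, the "screenshot" is a text string, and the model is replaced by a scripted stub so the loop can run on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Action:
    """One step requested by the model (hypothetical shape, for illustration)."""
    kind: str            # "click", "type", or "done"
    argument: str = ""


@dataclass
class FakeDesktop:
    """Stand-in for a real screen: records actions instead of performing them."""
    log: List[Tuple[str, str]] = field(default_factory=list)

    def screenshot(self) -> str:
        # A real loop would capture pixels; here the "image" is a text summary.
        return f"screen after {len(self.log)} actions"

    def perform(self, action: Action) -> None:
        self.log.append((action.kind, action.argument))


def run_loop(goal: str, desktop: FakeDesktop,
             choose_action: Callable[[str, str], Action],
             max_steps: int = 10) -> int:
    """Observe, ask the model for one action, act, repeat.

    Returns the number of actions performed in this call. max_steps caps the
    loop so a model that never answers "done" cannot run forever.
    """
    performed = 0
    for _ in range(max_steps):
        observation = desktop.screenshot()           # vision input
        action = choose_action(goal, observation)    # model call (stubbed here)
        if action.kind == "done":
            break
        desktop.perform(action)                      # tool use
        performed += 1
    return performed


def scripted_model(script: List[Action]) -> Callable[[str, str], Action]:
    """Hypothetical stand-in for the model: replays a fixed list of actions."""
    steps = iter(script)

    def choose(goal: str, observation: str) -> Action:
        # Once the script is exhausted, report that the task is finished.
        return next(steps, Action("done"))

    return choose


desktop = FakeDesktop()
performed = run_loop(
    "open Hacker News",
    desktop,
    scripted_model([Action("click", "Firefox icon"),
                    Action("type", "news.ycombinator.com")]),
)
print(performed, desktop.log)
```

The step cap matters for the cost question raised above: each iteration sends one more screenshot to the model, so the number of loop iterations is what drives the bill in a real setup.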

By @bhouston - 4 months
Is there an easy way to use Claude as a Co-Pilot in VS Code? If it is better at coding, it would be great to have it integrated.
By @zone411 - 4 months
It improves to 25.9 over the previous version of Claude 3.5 Sonnet (24.4) on NYT Connections: https://github.com/lechmazur/nyt-connections/.
By @vok - 4 months
This "Computer use" demo:

https://www.youtube.com/watch?v=jqx18KgIzAE

shows Sonnet 3.5 using the Google web UI in an automated fashion. Do Google's terms really permit this? Will Google permit this when it is happening at scale?

By @gumboshoes - 4 months
For me, one of the more useful steps on macOS will be when local AI can manipulate anything that has an AppleScript library. The hooks are there and decently documented. For meta purposes, having AI work with a third-party app like Keyboard Maestro or Raycast will even further expand the pre-built possibilities without requiring the local AI to reinvent steps or tools at the time of each prompt.
By @cube2222 - 4 months
This looks quite fantastic!

Nice improvements in scores across the board, e.g.

> On coding, it [the new Sonnet 3.5] improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding.

I've been using Sonnet 3.5 for most of my AI-assisted coding and I'm already very happy (using it with the Zed editor, I love the "raw" UX of its AI assistant), so any improvements, especially seemingly large ones like this are very welcome!

I'm still extremely curious about how Sonnet 3.5 itself, and its new iteration are built and differ from the original Sonnet. I wonder if it's in any way based on their previous work[0] which they used to make golden-gate Claude.

[0]: https://transformer-circuits.pub/2024/scaling-monosemanticit...

By @FloatArtifact - 4 months
It will be interesting to see how this evolves. The UI automation use case is different from accessibility due to latency requirements. Latency matters a lot for accessibility, not so much for a UI automation testing apparatus.

I've often wondered what combining grammar-based speech recognition with an LLM could do for accessibility: low-domain natural-language speech recognition, augmented by grammar-based speech recognition for high-domain commands, for efficiency and accuracy, reducing voice strain and increasing recognition accuracy.

https://github.com/dictation-toolbox/dragonfly

By @cynicalpeace - 4 months
This bolsters my opinion that OpenAI is falling rapidly behind. Presumably due to Sam's political machinations rather than hard-driving technical vision, at least that's what it seems like, outside looking in.

Computer use seems it might be good for e2e tests.

By @lossolo - 4 months
Livebench updated

https://livebench.ai

  Model                      | Global | Reasoning | Coding | Math  | Data  | Language | IF
  ---------------------------|--------|-----------|--------|-------|-------|----------|------
  o1-preview-2024-09-12      | 66.02  | 68.00     | 50.85  | 62.92 | 63.97 | 72.66    | 77.72
  claude-3-5-sonnet-20241022 | 60.33  | 58.67     | 67.13  | 51.28 | 52.78 | 58.09    | 74.05
  claude-3-5-sonnet-20240620 | 59.80  | 58.67     | 60.85  | 53.32 | 56.74 | 56.94    | 72.30
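
To make the change between the two Sonnet snapshots easier to read, the rows above can be subtracted column by column. The sketch below only copies the numbers from the table; `COLUMNS`, `SCORES`, and `score_deltas` are names made up for this illustration, not part of any Livebench tooling.

```python
# Column order and scores copied from the Livebench table above.
COLUMNS = ["Global", "Reasoning", "Coding", "Math", "Data", "Language", "IF"]

SCORES = {
    "claude-3-5-sonnet-20241022": [60.33, 58.67, 67.13, 51.28, 52.78, 58.09, 74.05],
    "claude-3-5-sonnet-20240620": [59.80, 58.67, 60.85, 53.32, 56.74, 56.94, 72.30],
}


def score_deltas(new: str, old: str) -> dict:
    """Return new-minus-old for every column, rounded to two decimals."""
    return {
        name: round(n - o, 2)
        for name, n, o in zip(COLUMNS, SCORES[new], SCORES[old])
    }


deltas = score_deltas("claude-3-5-sonnet-20241022", "claude-3-5-sonnet-20240620")
for name, delta in deltas.items():
    print(f"{name:<10} {delta:+.2f}")
```

On these figures the upgrade gains the most in Coding (+6.28) and slips in Math (-2.04) and Data (-3.96), with the Global average moving by only +0.53.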
By @urbandw311er - 4 months
> we have provided three tools > bash shell

November 2024: AI is allowed to execute commands in a bash shell. What could possibly go wrong?

By @Hizonner - 4 months
Can this solve CAPTCHAs for me? It's starting to get to the point where limited biological brains can't do them.
By @mercacona - 4 months
I'm giving the new Sonnet a chance, although for my use as a writing companion so far, Opus has been king among all the models I've tried.

However, I've been using Opus as a writing companion for several months, especially when you have writer's block and ask it for alternative phrases, it was super creative. But in recent weeks I was noticing a degradation in quality. My impression is that the model was degrading. Could this be technically possible? Might it be some kind of programmed obsolescence to hype new models?

By @freetonik - 4 months
Fascinating. Though I expect people to be concerned about privacy implications of sending screenshots of the desktop, similar to the backlash Microsoft has received about their AI products. Giving the remote service actual control of the mouse and keyboard is a whole other level!

But I am very excited about this in the context of accessibility. Screen readers and screen control software are hard to develop and hard to learn to use. This sort of “computer use” with AI could open up so many possibilities for users with disabilities.

By @mmooss - 4 months
Of course there's great inefficiency in having the Claude software control a computer with a human GUI mediating everything, but it's necessary for many uses right now given how much we do where only human interfaces are easily accessible. If something like it takes off, I expect interfaces for AI software would be published, standardized, etc. Your customers may not buy software that lacks it.

But what I really want to see is a CLI. Watching their software crank out Bash, vim, Emacs!, etc. - that would be fascinating!

By @turnsout - 4 months
Wow, there's a whole industry devoted to what they're calling "Computer Use" (Robotic Process Automation, or RPA). I wonder how those folks are viewing this.
By @torginus - 4 months
> Claude's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks.

Nice, but I wonder why they didn't use UI automation/accessibility libraries, which have access to the semantic structure of apps/web pages, as well as accessing documents directly instead of having Excel display them for you.

By @sedatk - 4 months
> developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text.

So, this is how AI takes over the world.

By @ford - 4 months
Seems like both:

- AI Labs will eat some of the wrappers on top of their APIs - even complex ones like this. There are whole startups that are trying to build computer use.

- AI is fitting _some_ scaling law - the best models are getting better and the "previously-state-of-the-art" models are fractions of what they cost a couple years ago. Though it remains to be seen if it's like Moore's Law or if incremental improvements get harder and harder to make.

By @jatins - 4 months
How does the computer use work -- Is this a desktop app they are providing that can do actions on your computer? Didn't see any such mention in the post
By @Bjorkbat - 4 months
Tried my standard go-to for testing, asked it to generate a voronoi diagram using p5js. For the sake of job security I'm relieved to see it still can't do a relatively simple task with ample representation in the Google search results. Granted, p5js is kind of niche, but not terribly so. It's arguably the most popular library for creative coding.

In case you're wondering, I tried o1-preview, and while it did work, I was also initially perplexed why the result looked pixelated. Turns out, that's because many of the p5js examples online use a relatively simple approach where they just see which cell-center each pixel is closest to, more or less. I mean, it works, but it's a pretty crude approach.
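
The crude per-pixel approach described above can be sketched in a few lines. This is written in plain Python rather than p5js so it runs without a canvas, and `nearest_site_grid` plus the sample sites are made up for illustration. Every pixel gets a single discrete label, so cell borders come out as staircases, which is the pixelated look mentioned above.

```python
from typing import List, Tuple


def nearest_site_grid(width: int, height: int,
                      sites: List[Tuple[int, int]]) -> List[List[int]]:
    """Label each pixel with the index of its closest site (brute force).

    Cost is width * height * len(sites) distance checks. Squared distance is
    enough for comparison, and ties go to the lowest site index.
    """
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            best = min(
                range(len(sites)),
                key=lambda i: (x - sites[i][0]) ** 2 + (y - sites[i][1]) ** 2,
            )
            row.append(best)
        grid.append(row)
    return grid


# Hypothetical sample: three sites on an 8x8 "canvas", printed as text.
sites = [(1, 1), (6, 2), (3, 6)]
grid = nearest_site_grid(8, 8, sites)
for row in grid:
    print("".join("ABC"[label] for label in row))
```

A smoother result would need the cell boundaries themselves as geometry (polygon edges) rather than per-pixel labels, which is roughly what the SVG output mentioned below amounts to.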

Now, granted, you're probably not doing creative coding at your job, so this may not matter that much, but to me it was an example of pretty poor generalization capabilities. Curiously, Claude has no problem whatsoever generating a voronoi diagram as an SVG, but writing a script to generate said diagrams using a particular library eluded it. It knows how to do one thing but generalizes poorly when attempting to do something similar.

Really hard to get a real sense of capabilities when you're faced with experiences like this, all the while somehow it's able to solve 49% of real-world Python pull-requests from a certain dataset. In case you're wondering, one paper (https://cs.paperswithcode.com/paper/swe-bench-enhanced-codin...) found that 94% of the pull-requests on SWE-bench were created before the knowledge cutoff dates of the latest LLMs, so there's almost certainly a degree of data-leakage.

By @Centigonal - 4 months
They should just adopt Apple "version numbers:" Claude Sonnet (Late 2024).
By @mtgentry - 4 months
What are the licensing implications of this? If I’m Google, I’d be pissed that my software is being used without a human there looking at the ads.
By @flockonus - 4 months
Are these people aware that they can bump minor versions?

The mkt team vetoed Claude 3.6 ???

By @bbor - 4 months
Ok I know that we're in the post-nerd phase of computers, but version numbers are there for a reason. 3.6, please? 3.5.1??
By @runako - 4 months
I really don't get their model. They have very advanced models, but the service overall seems to be a jumble of priorities. Some examples:

Anthropic doesn't offer an unlimited chatbot service, only plans that give you "more" usage, whatever that means. If you have an API key, you are "unlimited," so they have the capability. Why doesn't the chatbot allow one to use their API key in the Claude app to get unlimited usage? (Yes, I know there are third-party BYOK tools. That's not the question.)

Claude appears to be smart enough to make an Excel spreadsheet with simple formulae. However, it is apparently prevented from making any kind of file. Why? What principle underlies that guardrail that does not also apply to Computer Use?

Really want to make Claude my daily driver, but right now it often feels too much like a research project.

By @hubraumhugo - 4 months
I've seen quite a few YC startups working on AI-powered RPA, and now it looks like a foundational model player is directly competing in their space. It will be interesting to see whether Anthropic will double down on this or leave it to third-party developers to build commercial applications around it.
By @joshuamcginnis - 4 months
Is there anything out there yet that will let me issue the command:

> Refactor the api folder with any recommended readability improvements or improvements that would help DRY up code without adding additional complexity.

Then I can just `git status` to see the changes?

By @attentive - 4 months
They need to work on their versioning.

"3.5 Sonnet (New)", WTAF? - just call it 3.6 Sonnet or something.

Is it "New" Sonnet? Is it "upgraded"? Is there a difference? How do I know which one I use?

I can understand claude-3-5-sonnet-20241022, but that's not what users see.

By @abc-1 - 4 months
I tried to get it to translate a document and it stopped after a few paragraphs and asked if I wanted it to keep going. This is not appropriate for my use case and it kept doing this even though I explicitly told it not to. The old version did not do this.
By @lutusp - 4 months
> "... and similar speed to the previous generation of Haiku."

To me this is the most annoying grammatical error. I can't wait for AI to take over all prose writing so this egregious construction finally vanishes from public fora. There may be some downsides -- okay, many -- but at least I won't have to read endless repetitions of "similar speed to ..." when the correct form is obviously "speed similar to".

In fact, in time this correct grammar may betray the presence of AI, since lowly biologicals (meaning us) appear not to either understand or fix this annoying error without computer help.

By @submeta - 4 months
That’s too much control for my taste. I don’t want anthropic to see my screen. I rather prefer a VS Code with integrated Claude. A version that can see all my dev files in a given folder. I don’t need it to run Chrome for me.
By @bluelightning2k - 4 months
This is what the Rabbit "large action model" pretended to be. Wouldn't be surprised to see them switch to this and claim they were never lying about their capabilities because it works now.

Pretty cool for sure.

By @RecycledEle - 4 months
How long until it is profitable to tell a cheap AI to "win this game by collecting resources and advancing in-game" and then sell the account on eBay?

I wonder what optimizations could be made? Could a gold farmer have the directions from one AI control many accounts? Could the AI program simpler bots for each bit of the game?

I can imagine not being smart enough to play against computers, because I am flagged as a bot. I can imagine a message telling me I am banned because "nobody but a stupid bot would score so low."

By @amai - 4 months
Finally a general tool to solve captchas for my web scrapers.
By @wesleyyue - 4 months
If anyone would like to try the new Sonnet in VSCode. I just updated https://double.bot to the new Sonnet. (disclaimer: I am the cofounder/creator)

---

Some thoughts:

* Will be interesting to see what we can build in terms of automatic development loops with the new computer use capabilities.

* I wonder if they are not releasing Opus because it's not done or because they don't have enough inference compute to go around, and Sonnet is close enough to state of the art?

By @gerash - 4 months
The "computer use" demos are interesting.

It's a problem we used to work on, and one that many other people have wanted to solve for at least 10 years. So it remains to be seen how well it works outside a demo.

What was surprising was the slow/human speed of operations. It types into the text boxes at a human speed rather than just dumping the text there. Is it so the human can better monitor what's happening or is it so it does not trigger Captchas ?

By @throwaway0123_5 - 4 months
This is incredibly cool but it seems like the potential damage from a "hallucination" in this mode is considerable, especially when they provide examples of it going very far off-track (looking up Yellowstone pictures). Would basically need constant monitoring for me not to be paranoid it did something stupid.

Also seems like a privacy issue with them sending screenshots of your device back to their servers.

By @maestrae - 4 months
anybody know how the hell they're combating / gonna combat captcha's, cloudflare blocking, etc. I remember playing in this space on a toy project and being utterly frustrated by anti-scraping. Maybe one good thing that will come out of this AI boom is that companies will become nicer to scrapers? Or maybe, they'll just cut sweetheart deals?
By @29decibel - 4 months
I am surprised it uses macOS as the demo, as I thought it would be harder to control vs Ubuntu. But maybe at the same time, macOS is the most predictable/reliable desktop environment? I noticed that they use virtual environment for the demo, curious how do they build that along with docker, is that leveraging the latest virtualization framework from Apple?
By @Tepix - 4 months
Interesting stuff, i look forward to future developments.

A comment about the video: Sam Runger talks wayyy too fast, in particular at the beginning.

By @msoad - 4 months
I skimmed through the computer use code. It's possible to build this with other AI providers too. For instance, you can ask the ChatGPT API to call functions for click, scroll, and type with specific parameters, and execute them using the OS's APIs (usually the A11y APIs).

Did I miss something? Did they have to make changes to the model for this?
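
A minimal sketch of the function-calling approach described above, under stated assumptions: the tool names (`click`, `scroll`, `type`), the JSON shape of a tool call, and the `FakeDesktop` class are all hypothetical. It is not Anthropic's or OpenAI's actual interface, and the desktop here only records actions rather than calling real OS or accessibility APIs.

```python
import json
from dataclasses import dataclass, field

# Hypothetical tools, reduced to the argument names each one requires.
# A real function-calling API would take full JSON-Schema definitions.
TOOLS = {
    "click": ["x", "y"],
    "scroll": ["dy"],
    "type": ["text"],
}

# Maps each tool name to the method that handles it.
HANDLERS = {"click": "click", "scroll": "scroll", "type": "type_text"}


@dataclass
class FakeDesktop:
    """Stand-in for real OS input APIs: it only logs what it was asked to do."""
    log: list = field(default_factory=list)

    def click(self, x: int, y: int) -> str:
        self.log.append(("click", x, y))
        return f"clicked at ({x}, {y})"

    def scroll(self, dy: int) -> str:
        self.log.append(("scroll", dy))
        return f"scrolled by {dy}"

    def type_text(self, text: str) -> str:
        self.log.append(("type", text))
        return f"typed {len(text)} characters"


def dispatch(desktop: FakeDesktop, tool_call_json: str) -> str:
    """Validate one model-issued tool call and run it against the desktop."""
    call = json.loads(tool_call_json)
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    missing = [key for key in TOOLS[name] if key not in args]
    if missing:
        return f"error: missing arguments {missing}"
    # Pass only the declared arguments so stray keys cannot reach the handler.
    declared = {key: args[key] for key in TOOLS[name]}
    return getattr(desktop, HANDLERS[name])(**declared)
```

In a real loop, the string returned by `dispatch` would be sent back to the model together with a fresh screenshot, and the model would then issue its next call.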

By @fernly - 4 months
Imagine the possibilities for cyber-crime. Surely you could program it to log in to a financial institution and transfer money. And if you had a list of user names and passwords from some large info breach? You could automate a LOT of transfers in a short amount of time...
By @tammer - 4 months
This demo is impressive although my initial reaction is a sort of grief that I wasn't born in the timeline where Alan Kay's vision of object-oriented computing was fully realized -- then we wouldn't have to manually reconcile wildly heterogeneous data formats and interfaces in the first place!
By @aprilthird2021 - 4 months
OpenAI must be scared at this point. Anthropic is clobbering them at the high end of the market and Meta is providing free AIs at the low end. OpenAI is pretty soon going to be in the valueless middle fighting with tons of other companies for relevance
By @alok-g - 4 months
Next stop after 'Computer Use' -- Multimodal input from a robot's sensors and generating various signals to control its actions.

Looking forward to seeing this in the coming few years. And hoping such a robot could be of help to many people, including the elderly.

By @myprotegeai - 4 months
How long until "computer use" is tricked into entering PII or PHI into an attackers website?
By @wewtyflakes - 4 months
I wonder if OpenAI will fast follow; usually they're the ones to throw down the gauntlet. That being said, you can play around with OpenAI with a similar architecture of vision + agent + exec + loop using Donobu, though it is constrained to web browsers.
By @lairv - 4 months
Offtopic but youtube doesn't allow me to view the embedded video, with a "Sign in to confirm you’re not a bot" message. I need to open a dedicated youtube tab to watch it

The barrier to scraping youtube has increased a lot recently, I can barely use yt-dlp anymore

By @abraxas - 4 months
Hopefully the coding improvements are meaningful because I find that as a coding assistant o1-preview beats it (at least the Claude 3.5 that was available yesterday) but I like Claude's demeanor more (I know this sounds crazy but it matters a bit to me)
By @nwnwhwje - 4 months
Any comments on alignment with Anthropic's mission? Last time I checked, Anthropic is about building SOTA models because that is the only way to do safety research. Making money and commercially useful stuff is a means to an end.
By @smcleod - 4 months
I wonder when it'll actually be available in the Bedrock AU region, because as of right now we're still stuck using mid-range models from a year ago.

Amazon has really neglected ap-southeast-2 when it comes to LLMs.

By @brid - 4 months
Looks like visual understanding of diagrams is improved significantly! For example, it was on par with Chat GPT 4o and Gemini 1.5 in parsing an ERD for a conceptual model, but now far excels over the others.
By @bergutman - 4 months
They need to get the price of 3.5 Haiku down. It's about 2x 4o-mini.
By @m3kw9 - 4 months
I suspect they are gonna need some local offload capabilities for Computer Use. The repeated screen reading can definitely be done locally on modern machines; otherwise the cost may be impractical.
By @thecolorgreen - 4 months
This looks really similar to rabbit's Large Action Model (LAM). Cool!

https://www.rabbit.tech/rabbit-os

By @TacticalCoder - 4 months
One suggestion, use the following prompt at a LLM:

    The combination of the words "computer use" is highly confusing. It's also "Yoda speak". For example it's hard for humans to parse the sentences *"Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku"*, *"Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku "* (it literally relies on the comma to make any sense) and *"Computer use for automated interaction"* (in the youtube vid's title: this one is just broken english). Please suggest terms that are not confusing for a new ability allowing an AI to control a computer as if it was a human.
By @amai - 4 months
This "computer use" feature is obviously perfect for automating GUI tests. Will it also work on screenshots of mobile devices like smartphones and tablets?
By @kingkongjaffa - 4 months
Interestingly new claude only knows content up to:

> I'm limited to what I know as of April 2024, which includes the initial Claude 3 family launch but not subsequent updates.

By @mclau156 - 4 months
Did they just invent a new world of warcraft or runescape bot?
By @punnerud - 4 months
Cursor AI already has the option to switch to using claude-3-5-sonnet-20241022 in the chat box.

I was about to try to add a custom API. I’m impressed by the speed of that team.

By @Alifatisk - 4 months
> Claude 3.5 Haiku matches the performance of Claude 3 Opus

Oh wow!

By @lostmsu - 4 months
Can anyone share a .http or curl or anything similar based session with computer tool use? Docker containers make me cry.
By @taytus - 4 months
Computer use won't allow you to log in to social media accounts, even if it is your account and credentials. Bummer.
By @robertkoss - 4 months
Does anyone know how I could check whether my Claude Sonnet version that I am using in the UI has been updated already?
By @crazystar - 4 months
Looks like it just takes a screenshot and can't scroll so it might miss things.

Claude 3.5 Haiku will be released later this month.

By @myprotegeai - 4 months
We are approaching FSD for the computer, with all of the lofty promises, and all of the horrible accidents.
By @iamsanteri - 4 months
I love how they don't seem to be calling it "AgenticAI" or something like that.
By @throwvc3 - 4 months
What I'd like to know is whether prompt caching is available to Claude on AWS Bedrock now.
By @vivekkairi - 4 months
aider benchmarks for claude 3.5 new are impressive. From 77.4% to 83.5% beating o1-preview.
By @netcraft - 4 months
since they didn't rev the version, does this mean if we were using 3.5 today it's just automatically using the new version? That doesn't seem great from a change management perspective

though I am looking forward to using the new one in cursor.ai

By @2-3-7-43-1807 - 4 months
wow, i almost got worried but the cute music and the funny little monster on the desk convinced me that this is all just fun and dandy and all will be good. the future is coming and we'll all be much more happy :)
By @bilsbie - 4 months
Does this make cursor obsolete?

You can just use any IDE you want and it will work with it.

By @veggieWHITES - 4 months
While I was initially impressed with its context window, I got so sick of fighting with Claude about what it was allowed to answer that I quit my subscription after 3 months.

Their whole policing AI models stance is commendable but ultimately renders their tools useless.

It actually started arguing with me about whether it was allowed to help implement a github repository's code as it might be copyrighted... it was MIT licensed open source from Google :/

By @brcmthrowaway - 4 months
This is bad news for SWEs!
By @esseti - 4 months
I checked the docs but could not find the answer. Does Claude have an API like the GPT Assistant, with the ability to give it a set of documents to work with?

It seems that you can only send a single message, so you cannot rely on an ability to "learn" from predefined documents.

By @tylerchilds - 4 months
computer use is really going to highlight how fragmented the desktop ecosystem is, but also this definitely paints more context on how microsoft wants to use their screenshot ai
By @iknownthing - 4 months
Can Claude create and run a CI/CD pipeline now from a prompt?
By @jonesn11 - 4 months
How does one get access to it without using the API??
By @ta93754829 - 4 months
eventually, we'll be able to eliminate the intermediate "computer", and just let the ai render everything we need to interact with
By @efields - 4 months
Captchas are toast.
By @netcraft - 4 months
im unclear, is haiku supposed to be similar to 4o-mini in usecase/cost/performance? If not, do they have an analog?
By @ta8645 - 4 months
Still can't use their services. They still require a phone number for some reason. What about those of us who don't have one?
By @mathiasrw - 4 months
Just to confirm: did they just release a model with the exact same name as the previous one?
By @jerrygoyal - 4 months
does anyone know what are some use cases for "computer use"?
By @nbzso - 4 months
Just a question: For this thingy to work, I must give the provider access to my computer? Good luck. :)

Just another reason to use ONLY local LLM's.

By @geniium - 4 months
This is amazing
By @g9yuayon - 4 months
Is it just me who feels that Anthropic has been innovating faster than ChatGPT in the past year?
By @postalcoder - 4 months
and i was just planning to go to sleep…
By @dtquad - 4 months
Now I am really curious how to programmatically create a sandboxed compute environment to do a self-hosted "Computer use" and see how well other models, including self-hosted Ollama models, can do this.
By @mannycalavera42 - 4 months
new VBA version just landed
By @anotherpaulg - 4 months
The new Sonnet tops aider's code editing leaderboard at 84.2%. Using aider's "architect" mode it sets the SOTA at 85.7% (with DeepSeek as the "editor" model).

  84% Claude 3.5 Sonnet 10/22
  80% o1-preview
  77% Claude 3.5 Sonnet 06/20
  72% DeepSeek V2.5
  72% GPT-4o 08/06
  71% o1-mini
  68% Claude 3 Opus

It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!

  92% Sonnet 10/22
  75% o1-preview
  72% Opus
  64% Sonnet 06/20
  49% GPT-4o 08/06
  45% o1-mini

https://aider.chat/docs/leaderboards/
By @theflyestpilot - 4 months
cries in UiPath
By @HanClinto - 4 months
Why not rev the numbers? "3.5" vs. "3.5 New" feels weird -- is there a particular reason why Anthropic doesn't want to call this 3.6 (or even 3.5.1)?
By @jampekka - 4 months
It's quite sad that application interoperability requires parsing bitmaps instead of exchanging structured information. Feels like a devastating failure in how we do computing.
By @mergisi - 4 months
My First Experience with Claude Computer Use - It's Mind-Blowing!

Just tested Claude's new Computer Use feature and had to share this simple but powerful test:

My Basic Prompt: "Please: 1. Search Amazon for 3 wireless earbuds: Find price Rating Brand name

2. Make a simple Excel file 'earbuds.xlsx': Put the information in a basic table Add colors to the headers Sort by price

3. Show me the results"

What blew my mind: - Claude actually looked at my screen - Moved the mouse by itself - Clicked buttons like a human - Created reports automatically

It's like having a virtual assistant that can really use your computer! No coding needed - just simple English instructions.

For those interested: https://mergisi.medium.com/8f56f683e307

By @baq - 4 months
Scary stuff.

'Hey Claude 3.5 New, pretend I'm a CEO of a big company and need to lay off 20% people, make me a spreadsheet and send it to HR. Oh make sure to not fire the HR department'

c.f. IBM 1979.

By @freediver - 4 months
Both new Sonnet and gpt-4o still fail at a simple:

"How many w's are in strawberry?"

gpt-4o: There are 2 "w's" in "strawberry."

Claude 3.5 Sonnet (new): Let me count the w's in "strawberry": 0 w's.

(same question with 'r' succeeds)

What is artificial about current gen of "artificial intelligence" is the way training (predict next token) and benchmarking (overfitting) is done. Perhaps a fresh approach is needed to achieve a true next step.
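
For reference, the letter-count question above has a deterministic answer that is easy to check outside a model. A minimal sketch follows; the helper name `count_letter` is my own, for illustration only.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "w"))  # prints 1
print(count_letter("strawberry", "r"))  # prints 3
```

So both quoted answers for "w" (2 and 0) are off by one in opposite directions.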