June 27th, 2024

Claude 3.5 Sonnet

Anthropic introduces Claude 3.5 Sonnet, a fast and cost-effective large language model with new features like Artifacts. Human evaluation tests show significant improvements over prior models. The article covers privacy measures and safety evaluations, explores the model's impact on engineering and coding work, and touches on recursive self-improvement in AI development.

The article covers Anthropic's introduction of Claude 3.5 Sonnet as a leading large language model (LLM). It highlights the model's speed, cost-effectiveness, and new features such as Artifacts. Updates to the larger and smaller models, Claude 3.5 Opus and Claude 3.5 Haiku, are expected later this year. Human evaluation tests show significant improvements over prior models across a range of tasks. The article also covers privacy measures and safety evaluations by the UK Artificial Intelligence Safety Institute. It discusses the model's performance on coding tasks, what that could mean for engineering work and the future of programming, and how recursive self-improvement in AI development might accelerate that work. Overall, the focus is on advances in LLM technology and the benefits and considerations that come with them.

Related

Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU

The article discusses the release of the open-source Llama 3 70B model, highlighting its performance compared to GPT-4 and Claude 3 Opus. It emphasizes training enhancements, data quality, and the competition between open and closed-source models.

Testing Generative AI for Circuit Board Design

A study tested Large Language Models (LLMs) like GPT-4o, Claude 3 Opus, and Gemini 1.5 for circuit board design tasks. Results showed varied performance, with Claude 3 Opus excelling in specific questions, while others struggled with complexity. Gemini 1.5 showed promise in parsing datasheet information accurately. The study emphasized the potential and limitations of using AI models in circuit board design.

Francois Chollet – LLMs won't lead to AGI – $1M Prize to find solution [video]

The video discusses limitations of large language models in AI, emphasizing genuine understanding and problem-solving skills. A prize incentivizes AI systems showcasing these abilities. Adaptability and knowledge acquisition are highlighted as crucial for true intelligence.

Anthropic: Collaborate with Claude on Projects

Claude.ai introduces a Projects feature for Pro and Team users to organize chats, enhance collaboration, and create artifacts like code snippets. North Highland reports productivity gains. Future updates prioritize user-friendly enhancements.

Anthropic: Expanding Access to Claude for Government

Anthropic expands access to its Claude 3 Haiku and Sonnet models for government users via AWS Marketplace, emphasizing responsible AI deployment and tailored service agreements to enhance citizen services and policymaking.

32 comments
By @tkgally - 7 months
This article was written before Anthropic added the Projects feature for Pro and Team users [1].

I spent some time yesterday experimenting with Projects, and, like Artifacts, it looks really useful. I like the idea of being able to have multiple projects going simultaneously, each with its own reference materials. I don’t need to use it in a team, but I can see how that could be useful, too.

The one problem I see is that the total context window for each project might start to seem too small pretty quickly. I assume, though, that Anthropic’s context windows will be getting larger as time goes on.

I wonder what other features Anthropic has in the works for Claude. My personal wish is for a voice interface, something like what OpenAI announced in May but has now put off until later this year.

[1] https://www.anthropic.com/news/projects

By @cortesi - 7 months
Claude 3.5 Sonnet's coding abilities are incredibly impressive. I think it lets an expert programmer move more than twice as fast. There are limits - to produce high quality code, not copy-and-paste pablum, you have to be able to give detailed step-by-step directions and critically evaluate the results. This means you can't produce code better than you would have written by yourself, you can only do it much faster.

As an experiment, I produced a set of bindings to Anthropic's API pair-programming with Claude. The project is of pretty good quality, and includes advanced features like streaming and type-safe definitions of tools. More than 95% of the code and docs was written by Claude, under close direction from me. The project is here:

https://github.com/cortesi/misanthropy

And I've shared part of the conversation that produced it in a video here:

https://twitter.com/cortesi/status/1806135130446307340

By @alastairr - 7 months
my 2p worth - my work involves a lot of summarisation, recommendation from a user preference statement. I've been able to do this with 4o / opus, but the consistency wasn't there, which required complex prompting chains to stabilise.

What I'm seeing with Sonnet 3.5 is a night-and-day step up in consistency. The responses don't seem that different in capability from opus / 4o when they respond well; it just does it with rock-solid consistency. That sounds a bit dull, but it's a huge step forward for me and I suspect for others.

By @m0zzie - 7 months
Can anyone comment on its coding ability?

Considering cancelling my subscription with OpenAI as I was previously using GPT-4 quite heavily as a multiplier for myself, guiding it and editing outputs as required, but GPT-4o feels significantly worse for this use case. It is certainly better in many other areas, but its coding ability is not great.

I tried to revert back to standard GPT-4 but it is now so slow to respond (higher load?) that it breaks my mental flow, so I'm exploring other options.

By @liquidise - 7 months
As someone building an AI company right now, my quick Pro/Con for 4o vs Claude 3.5:

Claude: subjectively sounds more human to me, and really nails data questions that 4o is lackluster at

4o: far better assistant logic reasoning. I can trivially break Claude's assistant (system prompt) instructions within the user prompt, where 4o succeeds in all of these tests.

Pricing and output speed, for our purposes, are functionally identical. Exciting to have a competitor in the space already who stands to keep openai honest.

By @thethirdone - 7 months
> You can say ‘the recent jumps are relatively small’ or you can notice that (1) there is an upper bound at 100 rapidly approaching for this set of benchmarks, and (2) the releases are coming quickly one after another and the slope of the line is accelerating despite being close to the maximum.

The graph does not look like it is accelerating. I actually struggle to imagine what about it convinced the author the progress is accelerating.

I would be very interested in a more detailed graph that shows individual benchmarks because it should be possible to see some benchmarks effectively be beaten and get a good idea of where all of the other benchmarks are on that trend. The 100 % upper bound is likely very hard to approach, but I don't know if the limit is like 99%, 95% or 90% for most benchmarks.
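
One way to make the "is it accelerating?" question concrete is to look at second differences of the scores, and at how much of the remaining gap to the ceiling each release closes. The sketch below is illustrative only: the function names are made up here, releases are assumed to be equally spaced, and the example numbers are synthetic rather than real benchmark results.

```python
from typing import List


def second_differences(scores: List[float]) -> List[float]:
    """Return s[i+2] - 2*s[i+1] + s[i]; positive values mean the slope is rising."""
    return [
        scores[i + 2] - 2 * scores[i + 1] + scores[i]
        for i in range(len(scores) - 2)
    ]


def gap_closed(scores: List[float], ceiling: float = 100.0) -> List[float]:
    """Fraction of the remaining gap to `ceiling` closed by each successive release."""
    return [
        (new - old) / (ceiling - old)
        for old, new in zip(scores, scores[1:])
        if ceiling > old
    ]


if __name__ == "__main__":
    synthetic = [50.0, 70.0, 82.0, 88.0]  # made-up numbers, not benchmark data
    print(second_differences(synthetic))  # [-8.0, -6.0]: raw slope is falling
    print(gap_closed(synthetic))          # share of remaining headroom closed per step
```

On this synthetic series the raw slope decelerates while the share of remaining headroom closed per release stays roughly flat, which is why a per-benchmark view with a known ceiling would say more than a single averaged line.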

By @dagaci - 7 months
Apparently my account was banned on Anthropic Sonnet after an "Automatic review". I'm 100% sure I did not make any "unsafe" queries; I've literally only briefly tested it, and that was weeks ago.

+1 OpenAI Subscription -1 Anthropic Sonnet->sudden-death-automatic-review-system

By @aappleby - 7 months
I don't need an AI to write code for me, but it is _astoundingly_ helpful to have it summarize various design options and new technology stacks without me having to scavenge Google for the obscure corner-cases I care about.

I have an idea for a project that involves streaming 3 gigabits of data per second from a USB 3.0 device out over a 10 gig Ethernet connection, and it was able to compare/contrast various levels of support for high-bandwidth USB 3 and Ethernet in multiple frameworks and languages.

And the whole conversation, with code examples, cost me 3 _cents_ of Anthropic credits.

My new fear is when people start asking AIs "Hey AI, here is my codebase, my org chart, and commit histories for all my employees - how can I reduce the number of humans I need to employ to get this project done?"

By @lowyek - 7 months
Must be fun working on cutting-edge competitive stuff for all 3 of these major teams. It's exciting to live in these times and see this all unfold before our eyes.
By @andrewstuart - 7 months
AI programming would be really useful if it moved towards me being able to make a fixed set of statements about the software, those statements are preserved permanently in the source code somehow, and the AI ensures that those statements remain true.

It's frustrating to work with the AI to implement something only to realise within a few interactions that it has forgotten or lost track of something I deemed to be a key requirement.

Surely the future of software has to start to include declarative statement prompts as part of the source code.
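
A rough sketch of what such preserved statements could look like today, assuming nothing beyond the Python standard library: each requirement is a plain-language sentence paired with an executable check, kept next to the code it constrains, and re-run after every change (AI-generated or not). The `requirement` decorator, the `REQUIREMENTS` registry, and the `slugify` example are all hypothetical names invented for illustration; this is not an existing tool.

```python
from typing import Callable, Dict, List

# Registry mapping each plain-language statement to the check that enforces it.
REQUIREMENTS: Dict[str, Callable[[], bool]] = {}


def requirement(statement: str) -> Callable[[Callable[[], bool]], Callable[[], bool]]:
    """Register the decorated check as the executable form of `statement`."""
    def register(check: Callable[[], bool]) -> Callable[[], bool]:
        REQUIREMENTS[statement] = check
        return check
    return register


def slugify(title: str) -> str:
    """Example code under constraint (hypothetical)."""
    return "-".join(title.lower().split())


@requirement("Slugs never contain spaces or uppercase letters.")
def _slug_is_clean() -> bool:
    slug = slugify("Hello Big World")
    return " " not in slug and slug == slug.lower()


def violated_requirements() -> List[str]:
    """Return the statements whose checks currently fail."""
    return [statement for statement, check in REQUIREMENTS.items() if not check()]


if __name__ == "__main__":
    # An empty list means every recorded statement still holds.
    print(violated_requirements())
```

The list of violated statements could be fed back into the next prompt, so a forgotten requirement surfaces as a named failure rather than something noticed a few interactions later.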

By @alach11 - 7 months
Claude 3.5 Sonnet took a solid lead on our internal benchmarks over gpt-4-turbo for extraction tasks against large documents.

It may not be great for every workflow, but it certainly hits a sweet spot for intelligence x cost on most of my workflows.

By @redkrc - 6 months
I want you to make this page the biggest and most professional page in history, more so than Apple's website and stronger than the strongest global brands. I want you to add extremely professional features so that the site and app become number 1 in the field of weight tracking and the gym. Write all the code yourself, because I have no programming experience at all and can't do it; it could take me years. I hope you will take part in this charitable humanitarian work.
By @silisili - 7 months
Well, I went to try it, but it requires a phone number for some bizarre reason. Fine, gave it my primary number, a google voice number I've had for a decade, and it won't accept it. That's the end of my Claude journey, forever.

If you want me to try your service, try using some flow with less friction than sandpaper, folks.

By @JCM9 - 7 months
This is highlighting what has happened with all forms of ML. Give a baseline set of folks the same dataset and they will end up with a model that performs about the same. Companies are one-upping each other but it’s very back and forth and just a case of release date. These models will become a complete commodity. The thing that could be proprietary is the data used to train them, which could lead to a sustained better model performance. The barrier to entry here is super high given training costs, but the ML skills are still a commodity.
By @tonyoconnell - 7 months
I have been using Claude Sonnet with Artifacts along with Vercel V0 to build Sveltekit pages and components really well. I create a UI in V0 and then simply copy the JSX into Claude and tell it to convert to Sveltekit. It creates the +page.svelte +page.server.ts and all the components almost perfectly.
By @SubiculumCode - 7 months
I'm fairly impressed with Sonnet's one-shot scripting performance for my use cases. However, I was using it to help me diagnose a GNOME keyring and ssh issue I was having, and it suggested that I 'rm -rf' my keyring files to test its solution out. A little drastic... maybe mv my password file first? Anyway, it sometimes seems even more cocky than last gen, and less careful by default.
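
The "mv first" alternative can be sketched as a reversible move-aside instead of a delete. This is a minimal illustration, not what Claude suggested: `move_aside` and `restore` are hypothetical helper names, and which path to pass (for example a keyring directory under the home folder) is an assumption that depends on the system.

```python
import shutil
import time
from pathlib import Path


def move_aside(target: Path) -> Path:
    """Rename `target` to a timestamped backup next to it and return the backup path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = target.with_name(f"{target.name}.bak-{stamp}")
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    shutil.move(str(target), str(backup))
    return backup


def restore(backup: Path, original: Path) -> None:
    """Undo `move_aside` by moving the backup back to its original location."""
    if original.exists():
        raise FileExistsError(f"refusing to overwrite: {original}")
    shutil.move(str(backup), str(original))
```

If the test shows the keyring files were not the problem, `restore(backup, original)` puts everything back, which `rm -rf` cannot do.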
By @renewiltord - 7 months
Claude Sonnet is freaking amazing. I used to have a safety test[0] that Claude failed. But it was a bogus safety test and fortunately someone here told me so and I immediately subscribed to it. It's amazing. The other day I ported a whole node.js script to Python with it. It was not flawless but it was pretty damned good. Such a mechanical process and I just had to review. Loved it.

0: https://news.ycombinator.com/item?id=39607069

By @nojvek - 7 months
I cancelled my OpenAI membership and am using Claude more and more. Sonnet is pretty fast and cheaper than 4o.

I'm legit elated that a smaller player is able to compete with large behemoths like OpenAI and Google. (I know they have Amazon backing them, but their team is much smaller. OpenAI is ~1000 employees now).

I'm building on top of their api. It's neat. I wish them the best.

By @Filligree - 7 months
I’ve been wanting to test it, but the API console wants me to fill in a VAT field. Not sure how to get one of those as an individual.
By @boyka - 7 months
These models are clearly great with language, be it natural language or code. However, I wonder where the expectation comes from that a static stochastic parrot should be able to compute arbitrary first order logic (in a series of one-shot next word predictions). Could any expert elaborate on how this would be solved by a transformer model?
By @GaggiX - 7 months
The incredible ability of Claude 3.5 Sonnet to create coherent SVG makes me wonder if the LLM was not just pretrained on text. Vision capabilities are usually added later using a vision encoder that does not affect the LLM's knowledge of the visual world, but in this case the LLM clearly has quite a strong understanding of the visual world.
By @thmixc - 7 months
Claude 3.5 Sonnet can solve the farmer and sheep problem with two small changes to the prompt: 1. change the word "person" to "human". 2. change the word "trips" to "trip or trips". (Claude is probably assuming that the answer has to be in multiple trips because of the word "trips")
By @sdwr - 7 months
Are there any tools out there that expose GPT or Claude to a codebase, and let it write PRs (semi) autonomously?
By @andrewchambers - 7 months
I wonder if OpenAI has a response ready or if they are just tackling other business problems like iOS integration now and the seemingly postponed low-latency chat launch. Either way I am looking forward to Claude 3.5 Opus.

It did seem slightly odd to me that openai made their supposedly best model free.

By @mrcwinn - 7 months
I’ve thoroughly enjoyed the product overall much more than ChatGPT. I do wish it had voice input that rivaled what OpenAI previewed. Excited for 3.5 Opus. For now I’ve canceled my OpenAI subscription and removed the app in favor of Claude.
By @Davidzheng - 7 months
Still not as good at math as gpt4o from benchmarks and also my experience
By @erdemo - 7 months
As a developer, I find Claude's code generation 2x better than GPT-4o. Of course it's subjective, but Claude is much more consistent for me.
By @willsmith72 - 7 months
am i right in that it has no online capabilities? that's a pretty big issue for me
By @nyellin - 7 months
Interesting. We're finding that it's more verbose than GPT-4o (and often ignores instructions to give output in a specific format), but does better when faced with ambiguity and potentially confusing situations.

Here is an example using http://github.com/robusta-dev/holmesgpt/ where we ask it to troubleshoot a payment processing workload in a Kubernetes cluster, with the GPT-4o output underneath:

  holmes ask --model="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0" "what is wrong with the payment processor?"
  AI: To investigate what's wrong with the payment processor, we'll need to gather more information about it. First, let's try to find the relevant Kubernetes resources related to the payment processor.

  Running `kubectl get -A --show-labels pod | grep payment`

  Thank you for your question about the payment processor. initial search, I've found two pods related to payment processing. Let's investigate the one that seems to be having issues:

  1. In the default namespace, there's a pod named "payment-processing-worker-5bb5bd75cd-x4g99" which is in a CrashLoopBackOff state. This indicates that the pod is repeatedly crashing and restarting.

  2. In the sock-shop namespace, there's a pod named "payment-ff86cd6f8-rs47b" which appears to be running normally.

  Let's focus on the problematic pod in the default namespace. To get more information about what's wrong, we'll need to look at its logs and description.

  Running `kubectl describe pod payment-processing-worker-5bb5bd75cd-x4g99 -n default`

  AI: Based on the information gathered, it appears that there is an issue with the payment processor in the Kubernetes cluster. Let me explain what's wrong:

  1 The payment processing worker pod is in a CrashLoopBackOff state:
    • Pod name: payment-processing-worker-5bb5bd75cd-x4g99
    • Namespace: default
    • Status: CrashLoopBackOff
  2 The pod is continuously restarting and failing:
    • Restart Count: 242
    • The container is in a "Waiting" state with the reason "CrashLoopBackOff"
  3 The root cause of the issue appears to be related to an environment variable:
    • The container is running a bash script that checks for the presence of the DEPLOY_ENV environment variable.
    • If DEPLOY_ENV is not set, the container exits immediately after printing an error message.
    • The command being run is:

       if [[ -z "${DEPLOY_ENV}" ]]; then echo Environment variable DEPLOY_ENV is undefined ; else while true; do echo hello; sleep 10;done; fi

  4 The environment section in the pod description shows:

    Environment: <none>

   This indicates that no environment variables are set for the container.
  5 The container is exiting with a status of "Completed" and an exit code of 0, which suggests that it's running the script, finding that DEPLOY_E

  To fix this issue:

  1 The DEPLOY_ENV environment variable needs to be set for the payment-processing-worker deployment.
  2 You should update the deployment configuration to include this environment variable.
  3 After setting the environment variable, the pod should stop crashing and enter a running state where it prints "hello" every 10 seconds.

  To resolve this, you'll need to update the deployment configuration for the payment-processing-worker. You can do this by editing the deployment and adding the DEPLOY_ENV environment variable with an appropriate value.

  Would you like me to show you how to update the deployment with the required environment variable?

Note that it explicitly picks up on the ambiguity in the question and notes that there are 2 payment related workloads and one is healthy.

GPT-4o:

  $ holmes ask "what is wrong with the payment processor?"

  Running `kubectl get -A --show-labels pod | grep payment-processor`                     
  Running `kubectl get -A --show-labels deployment | grep payment-processor`                    
  Running `kubectl get -A --show-labels service | grep payment-processor`              
  Running `kubectl get -A --show-labels pod | grep payment`
  Running `kubectl describe pod payment-processing-worker-5bb5bd75cd-x4g99 -n default`
  Running `kubectl logs payment-processing-worker-5bb5bd75cd-x4g99 -n default --previous`
  
  AI: payment-processing-worker-5bb5bd75cd-x4g99 in namespace default crashes due to missing DEPLOY_ENV environment variable.

  Fix: Define DEPLOY_ENV in the deployment spec.

  Example patch:

  env:
   - name: DEPLOY_ENV
     value: "production"
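
The root cause both transcripts arrive at can look odd at first: the container exits with code 0, yet the pod sits in CrashLoopBackOff. Pods managed by a Deployment run with `restartPolicy: Always`, so any exit, successful or not, triggers a restart with back-off. The sketch below models that control flow in plain Python; it is an illustration, not part of HolmesGPT or Kubernetes, and `container_command` and `restarts_under_always` are hypothetical names.

```python
from typing import List, Mapping, Optional, Tuple


def container_command(
    env: Mapping[str, str], max_ticks: int = 3
) -> Tuple[List[str], Optional[int]]:
    """Mirror the pod's bash one-liner.

    Returns the lines printed and the exit code; the exit code is None
    when the real script would still be looping.
    """
    if not env.get("DEPLOY_ENV"):  # same as bash `-z`: unset or empty
        return ["Environment variable DEPLOY_ENV is undefined"], 0
    # The real script loops forever; stop after max_ticks so the sketch terminates.
    return ["hello"] * max_ticks, None


def restarts_under_always(exit_code: Optional[int]) -> bool:
    """With restartPolicy: Always, any exit (even code 0) triggers a restart."""
    return exit_code is not None


if __name__ == "__main__":
    for env in ({}, {"DEPLOY_ENV": "production"}):
        lines, code = container_command(env)
        print(env, lines[0], code, restarts_under_always(code))
```

With an empty environment the model exits cleanly and is restarted every time, matching the restart count in the pod description; with `DEPLOY_ENV` set it keeps running, which is what the example patch above aims for.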
By @xixixao - 7 months
I tried Sonnet vs GPT 4 just now with:

> Given a body with momentum B and forques F, what is the differential of applying the forques to the momentum in PGA?

Claude gave a wrong answer, ChatGPT gave a correct one. I’m sticking with ChatGPT.