October 25th, 2024

Cerebras Inference now 3x faster: Llama3.1-70B breaks 2,100 tokens/s

Cerebras has enhanced its Inference platform, achieving a threefold speed increase for the Llama 3.1-70B model, which now processes 2,100 tokens per second and benefits real-time AI applications across various industries.


Cerebras has announced a significant update to its Inference platform, achieving a threefold performance increase for the Llama 3.1-70B model, which now runs at 2,100 tokens per second. This is 16 times faster than the leading GPU solutions and 8 times faster than GPUs running much smaller models. The update is seen as a major advancement for AI application development, enabling faster and more responsive applications across sectors such as pharmaceuticals and voice AI. The improvements stem from software and hardware optimizations, including enhanced kernels and asynchronous operations that maximize the capabilities of the Wafer Scale Engine. The platform's speed allows for more complex reasoning and faster response times, making it suitable for real-time applications. Companies like GlaxoSmithKline and LiveKit are already leveraging the technology to enhance their AI capabilities. The update maintains model precision and promises expanded model selection and API features in the future.
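
To put that throughput in perspective, here is a quick back-of-envelope sketch of end-to-end generation time at the quoted rate (the response lengths are illustrative assumptions, and time-to-first-token is ignored):

    # Rough generation-time estimate at the quoted 2,100 tokens/s.
    # The response lengths below are illustrative, not from the article.
    TOKENS_PER_SECOND = 2_100
    for out_tokens in (100, 500, 2_000):
        print(f"{out_tokens:>5} tokens -> {out_tokens / TOKENS_PER_SECOND:.2f} s")
    # 100 tokens -> 0.05 s, 500 tokens -> 0.24 s, 2000 tokens -> 0.95 s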

- Cerebras Inference now processes Llama 3.1-70B at 2,100 tokens per second, a threefold increase in speed.

- The performance is significantly faster than leading GPU solutions, enhancing real-time AI applications.

- Optimizations include improved software and hardware integration, maximizing the Wafer Scale Engine's potential.

- The update supports complex reasoning and faster response times, beneficial for various industries.

- Future expansions will include more model selections and enhanced API features.

15 comments
By @simonw - 4 months
It turns out someone has written a plugin for my LLM CLI tool already: https://github.com/irthomasthomas/llm-cerebras

You need an API key - I got one from https://cloud.cerebras.ai/ but I'm not sure if there's a waiting list at the moment - then you can do this:

    pipx install llm # or brew install llm or uv tool install llm
    llm install llm-cerebras
    llm keys set cerebras
    # paste key here
Then you can run lightning fast prompts like this:

    llm -m cerebras-llama3.1-70b 'an epic tail of a walrus pirate'
Here's a video of that running, it's very speedy: https://static.simonwillison.net/static/2024/cerebras-is-fas...
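
If you'd rather call the API directly than go through the llm CLI, it's OpenAI-compatible (as the Langroid comment below notes), so the standard openai client works. A minimal sketch, assuming the https://api.cerebras.ai/v1 base URL and the llama3.1-70b model id (check the Cerebras docs for the actual values) and an API key in CEREBRAS_API_KEY:

    # Minimal direct call with the openai Python client (v1+).
    # base_url and model id are assumptions; adjust to Cerebras' documented values.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.cerebras.ai/v1",  # assumed endpoint
        api_key=os.environ["CEREBRAS_API_KEY"],
    )
    resp = client.chat.completions.create(
        model="llama3.1-70b",  # assumed model id
        messages=[{"role": "user", "content": "Tell me a short tale about a walrus pirate."}],
    )
    print(resp.choices[0].message.content)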
By @obviyus - 4 months
Wonder if they'll eventually release Whisper support. Groq has been great for transcribing 1hr+ calls at a significantly lower price than OpenAI ($0.04/hr on Groq vs. $0.36/hr on OpenAI).
By @maz1b - 4 months
Cerebras really has impressed me with their technicality and their approach in the modern LLM era. I hope they do well, as I've heard they are en route to an IPO. It will be interesting to see if they can make a dent vs NVIDIA and other players in this space.
By @GavCo - 4 months
When Meta releases the quantized 70B it will give another > 2X speedup with similar accuracy: https://ai.meta.com/blog/meta-llama-quantized-lightweight-mo...
By @asabla - 4 months
Damn, those are some impressive speeds.

At that rate it doesn't matter if the first try results in an unwanted answer; you'll be able to run it once or twice more in quick succession.

I hope their hardware stays relevant as this field continues to evolve.
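
As a minimal sketch of that "just sample it again" idea: fire a few requests concurrently and keep one, reusing the assumed OpenAI-compatible endpoint and model id from the earlier sketch; the "longest answer wins" rule here is only a placeholder for a real selection step.

    # Fire n samples concurrently and keep one. Endpoint/model id are the same
    # assumptions as above; max-by-length is a stand-in for a real selection rule.
    import asyncio
    import os
    from openai import AsyncOpenAI

    client = AsyncOpenAI(
        base_url="https://api.cerebras.ai/v1",  # assumed endpoint
        api_key=os.environ["CEREBRAS_API_KEY"],
    )

    async def sample(prompt: str) -> str:
        resp = await client.chat.completions.create(
            model="llama3.1-70b",  # assumed model id
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8,
        )
        return resp.choices[0].message.content

    async def best_of(prompt: str, n: int = 3) -> str:
        candidates = await asyncio.gather(*(sample(prompt) for _ in range(n)))
        return max(candidates, key=len)

    print(asyncio.run(best_of("Explain wafer-scale inference in two sentences.")))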

By @d4rkp4ttern - 4 months
For those looking to easily build on top of this or other OpenAI-compatible LLM APIs -- you can have a look at Langroid[1] (I am the lead dev): you can easily switch to cerebras (or groq, or other LLMs/Providers). E.g. after installing langroid in your virtual env, and setting up CEREBRAS_API_KEY in your env or .env file, you can run a simple chat example[2] like this:

    python3 examples/basic/chat.py -m cerebras/llama3.1-70b
Specifying the model and setting up basic chat is simple (and there are numerous other examples in the examples folder in the repo):

    import langroid.language_models as lm
    import langroid as lr
    llm_config = lm.OpenAIGPTConfig(chat_model="cerebras/llama3.1-70b")
    agent = lr.ChatAgent(
        lr.ChatAgentConfig(llm=llm_config, system_message="Be helpful but concise")
    )
    task = lr.Task(agent)
    task.run()
[1] https://github.com/langroid/langroid [2] https://github.com/langroid/langroid/blob/main/examples/basi... [3] Guide to using Langroid with non-OpenAI LLM APIs https://langroid.github.io/langroid/tutorials/local-llm-setu...
By @fancyfredbot - 4 months
Wow, software is hard! Imagine an entire company working to build an insanely huge and expensive wafer scale chip and your super smart and highly motivated machine learning engineers get 1/3 of peak performance on their first attempt. When people say NVIDIA has no moat I'm going to remember this - partly because it does show that they do, and partly because it shows that with time the moat can probably be crossed...
By @a2128 - 4 months
I wonder at what point increasing LLM throughput only starts to serve negative uses of AI. This is already 2 orders of magnitude faster than humans can read. Are there any significant legitimate uses beyond just spamming AI-generated SEO articles and fake Amazon books more quickly and cheaply?
By @odo1242 - 4 months
What made it so much faster based on just a software update?
By @majke - 4 months
I wonder if there is a tokens/watt metric. AFAIU Cerebras uses plenty of power/cooling.
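
For a rough sense of scale, here's a back-of-envelope sketch using the ~23 kW peak power figure often cited for Cerebras CS-2/CS-3 systems (an assumption here, not from the article, and not the measured draw for this workload) divided by the quoted single-stream throughput; real serving is batched across users, so the aggregate efficiency would look different.

    # Very rough J/token estimate: assumed ~23 kW system power divided by the
    # quoted 2,100 tokens/s single-stream rate. Batched serving changes this a lot.
    TOKENS_PER_SECOND = 2_100
    SYSTEM_POWER_W = 23_000  # assumed peak power, not measured for this workload
    print(f"~{SYSTEM_POWER_W / TOKENS_PER_SECOND:.0f} J/token, "
          f"~{TOKENS_PER_SECOND / (SYSTEM_POWER_W / 1000):.0f} tokens per kW·s")
    # ~11 J/token, ~91 tokens per kW·s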
By @neals - 4 months
So what is inference?
By @anonzzzies - 4 months
Demo, API?
By @andrewstuart - 4 months
Could someone please bring Microsoft's Bitnet into the discussion and explain how its performance relates to this announcement, if at all?

https://github.com/microsoft/BitNet

"bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% to 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. "