August 27th, 2024

Cerebras Inference: AI at Instant Speed

Cerebras launched its AI inference solution, claiming to process up to 1,800 tokens per second, roughly 20 times faster than NVIDIA GPU-based systems, with competitive pricing and plans to support larger models.

Cerebras has launched its new AI inference solution, claiming to be the fastest in the world. The Cerebras inference system can process 1,800 tokens per second for the Llama 3.1 8B model and 450 tokens per second for the Llama 3.1 70B model, outperforming NVIDIA GPU-based systems by a factor of roughly 20. Pricing is competitive, at 10 cents per million tokens for the 8B model and 60 cents for the 70B model.

This performance is made possible by the third-generation Wafer Scale Engine, which integrates 44GB of SRAM on a single chip, allowing the entire model to be stored on-chip and eliminating external memory bottlenecks. The system is designed to handle models ranging from billions to trillions of parameters, with plans to support larger models in the future.

The inference API is available for developers, offering generous rate limits and easy integration with existing applications. The use of native 16-bit weights ensures high accuracy, with evaluations showing that 16-bit models outperform 8-bit counterparts in various tasks. The introduction of Cerebras inference is expected to enhance real-time AI capabilities and enable more complex workflows, setting a new standard for AI model deployment.
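For a sense of what integration could look like, here is a minimal sketch assuming an OpenAI-compatible chat endpoint; the base URL, model identifier, and environment variable name below are illustrative assumptions, not confirmed Cerebras values (check their docs for the real ones):

```python
# Minimal sketch of calling an OpenAI-compatible chat endpoint.
# Base URL, model name, and env var are assumptions for illustration only.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],  # assumed env var name
)

response = client.chat.completions.create(
    model="llama3.1-8b",  # assumed model identifier
    messages=[{"role": "user",
               "content": "Summarize wafer-scale inference in one sentence."}],
)
print(response.choices[0].message.content)
```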

- Cerebras inference claims to be the fastest AI inference solution, processing up to 1,800 tokens per second.

- The system is significantly cheaper than competitors, with pricing at 10 cents per million tokens for the 8B model.

- It utilizes a unique wafer-scale design to eliminate memory bandwidth bottlenecks.

- The API is available for developers, offering high rate limits and easy integration.

- Future support for larger models is planned, enhancing the capabilities of AI applications.

Related

Optimizing AI Inference at Character.ai

Character.AI optimizes AI inference for LLMs, handling 20,000+ queries/sec globally. Innovations like Multi-Query Attention and int8 quantization reduced serving costs by 33x since late 2022, aiming to enhance AI capabilities worldwide.

Tenstorrent Unveils High-End Wormhole AI Processors, Featuring RISC-V

Tenstorrent launches Wormhole AI chips on RISC-V, emphasizing cost-effectiveness and scalability. Wormhole n150 offers 262 TFLOPS, n300 doubles power with 24 GB GDDR6. Priced from $999, undercutting NVIDIA. New workstations from $1,500.

Llama 3.1 Official Launch

Meta introduces Llama 3.1, an open-source AI model available in 8B, 70B, and 405B versions. The 405B model is highlighted for its versatility in supporting various use cases, including multi-lingual agents and analyzing large documents. Users can leverage coding assistants, real-time or batch inference, and fine-tuning capabilities. Meta emphasizes open-source AI and offers subscribers updates via a newsletter.

Groq Supercharges Fast AI Inference for Meta Llama 3.1

Groq launches Llama 3.1 models with LPU™ AI technology on GroqCloud Dev Console and GroqChat. Mark Zuckerberg praises ultra-low-latency inference for cloud deployments, emphasizing open-source collaboration and AI innovation.

Benchmarks show even an old Nvidia RTX 3090 is enough to serve LLMs to thousands

An analysis by Backprop shows the Nvidia RTX 3090 can effectively serve large language models to thousands of users, sustaining 12.88 tokens per second per request with 100 concurrent requests.

15 comments
By @mikewarot - 3 months
>Cerebras is the only platform to enable instant responses at a blistering 450 tokens/sec. All this is achieved using native 16-bit weights for the model, ensuring the highest accuracy responses.

As near as I can tell from the model card [1], the majority of the math for this model is 4096x4096 multiply-accumulates. So there should be roughly 70B / 16.8M, about 4,000, of these per token in the Llama 3 70B model.

A 16x16-bit multiplier is about 9,000 transistors, according to a quick Google. A full 4096x4096 MAC array should thus be about 150 billion transistors, if you include the bias values. There are plenty of transistors on this chip to have many of them operating in parallel.

According to [2], a switching transition in a 7nm process node is about 0.025 femtojoules (10^-15 watt-seconds) per transistor. At a clock rate of 1 GHz, that's about 25 nanowatts/transistor. Scaling that by 50% transitions (a 50/50 chance any given gate in the MAC flips each cycle) gets you about 2 kW for each 4096x4096 MAC array running at 1 GHz.

There are enough transistors, and enough RAM on the wafer, to fit the entire model. Even with a single 4096x4096 MAC array, a clock rate of 1 GHz should result in a total time of about 4 µs/token, or 250,000 tokens/second.

[1] https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

[2] https://mpedram.com/Papers/7nm-finfet-libraries-tcasII.pdf
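A quick Python rendering of the arithmetic in this comment; every input is the commenter's assumption, not a measured figure:

```python
# Back-of-envelope check of the estimate above (commenter's assumptions).
PARAMS = 70e9                   # Llama 3 70B parameters
MAC_DIM = 4096                  # assumed 4096x4096 multiply-accumulate block
TRANSISTORS_PER_MULT = 9_000    # rough count for a 16x16-bit multiplier
ENERGY_PER_TOGGLE = 0.025e-15   # joules per transition, ~7nm node (from [2])
CLOCK_HZ = 1e9                  # 1 GHz
ACTIVITY = 0.5                  # assume half the gates toggle per cycle

mac_passes_per_token = PARAMS / MAC_DIM**2                      # ~4,200
array_transistors = MAC_DIM**2 * TRANSISTORS_PER_MULT           # ~1.5e11
power_watts = array_transistors * ENERGY_PER_TOGGLE * CLOCK_HZ * ACTIVITY
tokens_per_sec = CLOCK_HZ / mac_passes_per_token                # one pass per cycle

print(f"{mac_passes_per_token:.0f} MAC passes/token, ~{power_watts/1e3:.1f} kW, "
      f"~{tokens_per_sec:,.0f} tokens/s upper bound")
```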

By @russ - 3 months
It’s insanely fast.

Here’s an AI voice assistant I built that uses it:

https://cerebras.vercel.app

By @ansk - 3 months
Is batched inference for LLMs memory bound? My understanding is that sufficiently large batched matmuls will be compute bound and flash attention has mostly removed the memory bottleneck in the attention computation. If so, the value proposition here -- as well as with other memorymaxxing startups like Groq -- is primarily on the latency side of things. Though my personal impression is that latency isn't really a huge issue right now, especially for text. Even OpenAI's voice models are (purportedly) able to be served with a latency which is a low multiple of network latency, and I expect there is room for improvement here as this is essentially the first generation of real-time voice LLMs.
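A rough roofline-style sketch of the crossover the comment is asking about; the throughput and bandwidth constants are assumed, roughly A100-class round numbers, and KV-cache traffic is ignored:

```python
# When does batched decode stop being memory-bound? (assumed hardware numbers)
PARAMS = 70e9          # model parameters
BYTES_PER_WEIGHT = 2   # 16-bit weights
FLOPS = 312e12         # assumed dense 16-bit tensor throughput
BANDWIDTH = 2e12       # assumed HBM bandwidth, bytes/s

def decode_step_time(batch: int) -> tuple[float, float]:
    """Time to stream the weights once vs. time to do the matmul FLOPs."""
    mem_time = PARAMS * BYTES_PER_WEIGHT / BANDWIDTH   # independent of batch
    compute_time = 2 * PARAMS * batch / FLOPS          # ~2 FLOPs per weight per token
    return mem_time, compute_time

for b in (1, 32, 128, 512):
    mem, comp = decode_step_time(b)
    bound = "memory" if mem > comp else "compute"
    print(f"batch {b:4d}: mem {mem*1e3:6.1f} ms, compute {comp*1e3:6.1f} ms -> {bound}-bound")
```

Under these assumed numbers the crossover lands at a batch size in the low hundreds, which is consistent with the comment's framing that the main win below that point is latency rather than throughput.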
By @tedivm - 3 months
I want to know what the current power requirements are, as well as the cost for the machine. The last time I looked at one of these it was an absolute beast (although very impressive).
By @cedws - 3 months
Q: apparently allowing LLMs to “think” by asking them to walk through a problem and generate preamble tokens before an answer improves quality. With this kind of speedup, would it be practical/effective to achieve better output quality by baking a “thinking” step into every prompt? Say, a few thousand tokens before the actual reply.
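A hypothetical sketch of what baking such a step into every prompt could look like; `generate` stands in for any completion call and is not a real API:

```python
# Hypothetical two-stage "think first, then answer" wrapper.
# `generate(prompt, max_tokens=None) -> str` is a stand-in, not a real API.
def answer_with_thinking(generate, question: str, budget_tokens: int = 2000) -> str:
    # Stage 1: let the model produce scratch-pad reasoning (not shown to the user).
    thoughts = generate(
        f"Think step by step about the question below. Do not answer yet.\n\n{question}",
        max_tokens=budget_tokens,
    )
    # Stage 2: ask for the final answer, conditioned on the scratch pad.
    return generate(
        f"Question: {question}\n\nYour notes:\n{thoughts}\n\nNow give the final answer only."
    )
```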
By @alecco - 3 months
> Thus to generate a 100 words a second requires moving the model 100 times per second – requiring vast amounts of memory bandwidth.

It's actually worse for the majority of GPU implementations for large models. The matrices don't fit shared memory so the model is loaded many, many times to shared memory (as tiles). Also, unless you are using Hopper distributed shared memory, CTAs can't even share across them.

It would be nice to see a Cerebras solution for pre-training and fine-tuning.
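The quoted claim restated as arithmetic, with an assumed round number for per-GPU bandwidth:

```python
# At batch size 1, every generated token streams all the weights once,
# so memory bandwidth caps tokens/s. 2 TB/s is an assumed round number.
MODEL_BYTES = 70e9 * 2   # Llama 3.1 70B at 16-bit weights
HBM_BW = 2e12            # assumed per-GPU HBM bandwidth, bytes/s

print(f"single-GPU ceiling: {HBM_BW / MODEL_BYTES:.1f} tokens/s per user")
print(f"bandwidth needed for 450 tok/s: {450 * MODEL_BYTES / 1e12:.0f} TB/s")
```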

By @moconnor - 3 months
70b runs on 4x CS-3 estimated at $2-3m each, let’s say total system cost $10m, drawing ~100kW power. They don’t mention batch size, so let’s start with batch size 1 and see where we get. At 100% utilisation for 3 years that’d be 42 billion tokens for a cost of $10m capital plus ~$0.5m power and cooling let’s say, or $250 per million tokens. They’re claiming they can sell their API access at $0.60/million. To break even on this they’d need a batch size of 420. I don’t know how deep their pipeline is but Llama 3.1 70b has 80 layers with 6 meaningful matmuls per layer so it’s not a crazy multiple of that.

A single A100 processes at 13 t/s/u for batch 32. That costs $10k to process 39 billion tokens over 3 years, which works out to about $0.25 per million tokens. If you have batch size 420 you can do it even cheaper.

TL;DR: Cerebras are certainly advertising at a loss-leading price and will only have a viable product if they can get extraordinarily high utilisation of their system at this price. I don’t think they can, so they’re basically screwed selling tokens. Maybe this is to attract attention in the hope of selling hardware to someone willing to pay a premium for very low latency, but I suspect it’s just a means of getting one more round of funding in the hope of reducing costs in the next version.
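The break-even arithmetic restated in Python; every input is the commenter's assumption (system cost, power, utilisation), not Cerebras data:

```python
# Re-running the comment's break-even estimate (commenter's assumptions only).
SECONDS_3Y = 3 * 365 * 24 * 3600   # ~9.5e7 seconds
CS3_TOKS_SEC = 450                 # claimed Llama 3.1 70B speed at batch 1
SYSTEM_COST = 10.5e6               # $10M hardware + ~$0.5M power/cooling
PRICE_PER_M = 0.60                 # advertised $/million tokens

tokens_3y = CS3_TOKS_SEC * SECONDS_3Y            # ~4.3e10 tokens at batch 1
cost_per_m = SYSTEM_COST / tokens_3y * 1e6       # ~$250 per million tokens
breakeven_batch = cost_per_m / PRICE_PER_M       # ~420 concurrent streams

print(f"cost at batch 1: ${cost_per_m:.0f}/M tokens; "
      f"break-even batch ~{breakeven_batch:.0f}")
```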

By @cchance - 3 months
OK, that speed is fucking ridiculous, are you kidding me?! I just tried the chat trial, wtf.
By @jedberg - 3 months
This is where we always assumed the industry was going. Expensive GPUs are great for training, but inference is getting so optimized that it will run on smaller and/or cheaper processors (per token).
By @smusamashah - 3 months
Can really fast inference (e.g. 1M tok/sec) make LLMs more intelligent? I am imagining you could run multiple agents simultaneously and choose or discard their outputs using other LLMs. Will the output look more like a real thought process, or will it stay the same?
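One hypothetical shape for that idea is best-of-n sampling with an LLM judge; `generate` and `judge` stand in for any fast completion calls and are not real APIs:

```python
# Hypothetical best-of-n sampling with an LLM judge.
# `generate(prompt) -> str` and `judge(prompt) -> str` are async stand-ins.
import asyncio

async def best_of_n(generate, judge, prompt: str, n: int = 8) -> str:
    # Fast inference makes firing off n drafts cheap in wall-clock terms.
    drafts = await asyncio.gather(*(generate(prompt) for _ in range(n)))
    # Ask a judge model to pick one; discard the rest.
    numbered = "\n\n".join(f"[{i}] {d}" for i, d in enumerate(drafts))
    choice = await judge(
        f"Question: {prompt}\n\nCandidate answers:\n{numbered}\n\n"
        "Reply with the number of the best answer only."
    )
    return drafts[int(choice.strip().strip("[]"))]
```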
By @smsx - 3 months
The numbers are pretty incredible. Will the competition be able to match them?
By @haensi - 3 months
Their CEO talks about this on the Gradient Dissent podcast [1].

[1]: https://m.youtube.com/watch?v=qNXebAQ6igs

By @m3kw9 - 3 months
Sure, but can they fit a bigger model? I don’t think they can string these together to fit bigger models like Llama 3.1 405B.
By @byyoung3 - 3 months
I wrote an article showing how to use the API to build a chatbot: https://api.wandb.ai/links/byyoung3/kv7vn1wt
By @brylie - 3 months
It's understandable that they're currently focused on inference speed, but features like structured output and prompt caching make it possible to build more capable LLM applications.

Does Cerebras support reliable structured output like the recent OpenAI GPT-4o release?