Cerebras Inference: AI at Instant Speed
Cerebras has launched its AI inference solution, claiming to process 1,800 tokens per second on Llama3.1 8B, roughly 20x faster than NVIDIA GPU-based systems, with competitive pricing and plans to support larger models.
Cerebras has launched its new AI inference solution, claiming to be the fastest in the world. The Cerebras inference system can process 1,800 tokens per second for the Llama3.1 8B model and 450 tokens per second for the Llama3.1 70B model, outperforming NVIDIA GPU-based systems by a factor of 20. Pricing is competitive: 10 cents per million tokens for the 8B model and 60 cents for the 70B model.

This performance is made possible by the third-generation Wafer Scale Engine, which integrates 44GB of SRAM on a single chip, allowing the entire model to be stored on-chip and eliminating external memory bottlenecks. The system is designed to handle models ranging from billions to trillions of parameters, with plans to support larger models in the future. The inference API is available for developers, offering generous rate limits and easy integration with existing applications. The use of native 16-bit weights preserves accuracy, with evaluations showing that 16-bit models outperform their 8-bit counterparts across a range of tasks. Cerebras inference is expected to enhance real-time AI capabilities, enable more complex workflows, and set a new standard for AI model deployment.
- Cerebras inference claims to be the fastest AI inference solution, processing up to 1,800 tokens per second.
- The system is significantly cheaper than competitors, with pricing at 10 cents per million tokens for the 8B model.
- It utilizes a unique wafer-scale design to eliminate memory bandwidth bottlenecks.
- The API is available for developers, offering high rate limits and easy integration.
- Future support for larger models is planned, enhancing the capabilities of AI applications.
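The API mentioned above is pitched as easy to drop into existing applications. Below is a minimal sketch of measuring tokens per second against such an endpoint, assuming an OpenAI-compatible chat-completions interface; the base URL, model name, and environment variable are illustrative assumptions, not values taken from the announcement.

```python
# Minimal sketch of timing a hosted Llama3.1 endpoint.
# Assumes an OpenAI-compatible chat-completions API; the base URL,
# model name, and env var below are placeholders, not confirmed values.
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],  # hypothetical env var
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="llama3.1-8b",  # assumed model identifier
    messages=[{"role": "user",
               "content": "Summarize wafer-scale inference in two sentences."}],
)
elapsed = time.perf_counter() - start

tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.0f} tokens/s")
```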
Related
Optimizing AI Inference at Character.ai
Character.AI optimizes AI inference for LLMs, handling 20,000+ queries/sec globally. Innovations like Multi-Query Attention and int8 quantization reduced serving costs by 33x since late 2022, aiming to enhance AI capabilities worldwide.
Tenstorrent Unveils High-End Wormhole AI Processors, Featuring RISC-V
Tenstorrent launches Wormhole AI chips on RISC-V, emphasizing cost-effectiveness and scalability. Wormhole n150 offers 262 TFLOPS, n300 doubles power with 24 GB GDDR6. Priced from $999, undercutting NVIDIA. New workstations from $1,500.
Llama 3.1 Official Launch
Llama introduces Llama 3.1, an open-source AI model available in 8B, 70B, and 405B versions. The 405B model is highlighted for its versatility in supporting various use cases, including multi-lingual agents and analyzing large documents. Users can leverage coding assistants, real-time or batch inference, and fine-tuning capabilities. Llama emphasizes open-source AI and offers subscribers updates via a newsletter.
Groq Supercharges Fast AI Inference for Meta Llama 3.1
Groq launches Llama 3.1 models with LPU™ AI technology on GroqCloud Dev Console and GroqChat. Mark Zuckerberg praises ultra-low-latency inference for cloud deployments, emphasizing open-source collaboration and AI innovation.
Benchmarks show even an old Nvidia RTX 3090 is enough to serve LLMs to thousands
An analysis by Backprop shows the Nvidia RTX 3090 can effectively serve large language models to thousands of users, achieving 12.88 tokens per second for 100 concurrent requests.
As near as I can tell from the model card[1], the majority of the math for this model is 4096x4096 multiply-accumulates. So, at 70B parameters divided by roughly 16.8M parameters per 4096x4096 matrix, there should be about 4,000 of these in the Llama3-70B model.
A 16x16-bit multiplier is about 9,000 transistors, according to a quick Google. A full 4096x4096 MAC array should thus be about 150 billion transistors, if you include the bias values. There are plenty of transistors on this chip to have many of them operating in parallel.
According to [2], a switching transition in the 7nm process node costs about 0.025 femtojoules (1 fJ = 10^-15 watt-seconds) per transistor. At a clock rate of 1 GHz, that's about 25 nanowatts per transistor. Scaling that by a 50% activity factor (a 50/50 chance any given gate in the MAC flips each cycle) gets you about 2 kW for each 4096x4096 MAC array running at 1 GHz.
There are enough transistors, and enough RAM on the wafer, to fit the entire model. Even with a single 4096x4096 MAC array completing one matrix per cycle, a clock rate of 1 GHz should result in a total time of about 4 microseconds per token, or roughly 250,000 tokens/second.
[1] https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
[2] https://mpedram.com/Papers/7nm-finfet-libraries-tcasII.pdf
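For readers who want to check the arithmetic, here is the same back-of-envelope calculation as a short script. Every constant comes from the comment above (9,000 transistors per 16-bit multiplier, 0.025 fJ per transition at 7nm, 50% activity factor, 1 GHz clock); none of it is measured data.

```python
# Back-of-envelope check of the numbers in the comment above.
# All figures are the commenter's estimates, not measurements.

PARAMS            = 70e9          # Llama3-70B parameters
MAT_DIM           = 4096
MAT_PARAMS        = MAT_DIM ** 2  # ~16.8M parameters per 4096x4096 matrix
TRANSISTORS_MUL   = 9_000         # per 16x16-bit multiplier
ENERGY_PER_TOGGLE = 0.025e-15     # joules per transistor transition (7nm)
ACTIVITY          = 0.5           # fraction of gates toggling per cycle
CLOCK_HZ          = 1e9           # 1 GHz

n_matrices = PARAMS / MAT_PARAMS                  # ~4,000 matrices per token
array_transistors = MAT_PARAMS * TRANSISTORS_MUL  # ~150 billion
power_w = array_transistors * ACTIVITY * ENERGY_PER_TOGGLE * CLOCK_HZ  # ~1.9 kW

seconds_per_token = n_matrices / CLOCK_HZ         # one matrix per cycle -> ~4 us
tokens_per_second = 1 / seconds_per_token         # ~250,000

print(f"matrices/token:    {n_matrices:,.0f}")
print(f"array transistors: {array_transistors:,.0f}")
print(f"power:             {power_w / 1e3:.1f} kW")
print(f"tokens/second:     {tokens_per_second:,.0f}")
```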
Here’s an AI voice assistant I built that uses it:
It's actually worse for the majority of GPU implementations for large models. The matrices don't fit in shared memory, so the model is loaded into shared memory many, many times (as tiles). Also, unless you are using Hopper's distributed shared memory, CTAs can't even share across them.
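As a rough illustration of the tiling cost described above: in a standard tiled GEMM, each weight tile is staged into shared memory once per tile of activation rows it multiplies, so global-memory traffic for the weights grows with the number of batch tiles. The sketch below models that traffic; the tile size, shapes, and function name are illustrative assumptions, not measurements of any real kernel.

```python
# Rough model of weight traffic for a tiled GPU matmul, illustrating the
# comment above. Tile size and matrix shapes are illustrative assumptions.

def weight_bytes_loaded(rows: int, cols: int, batch: int,
                        tile: int = 128, bytes_per_param: int = 2) -> int:
    """Bytes of the weight matrix read from global memory for one
    (batch x cols) @ (cols x rows) tiled matmul: each weight tile is
    staged into shared memory once per tile of batch rows."""
    weight_bytes = rows * cols * bytes_per_param
    batch_tiles = -(-batch // tile)   # ceil division: passes over the weights
    return weight_bytes * batch_tiles

small = weight_bytes_loaded(4096, 4096, batch=1)     # decode, batch 1
big   = weight_bytes_loaded(4096, 4096, batch=1024)  # large batch
print(f"batch 1:    {small / 2**20:.0f} MiB of weight reads per layer")
print(f"batch 1024: {big / 2**20:.0f} MiB of weight reads per layer")
```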
It would be nice to see a Cerebras solution for pre-training and fine-tuning.
A single A100 processes at about 13 tokens/s per user at batch size 32. A $10k card processing roughly 39 billion tokens over 3 years works out to about $0.25 per million tokens. If you can run batch size 420 you can do it even cheaper.
TL;DR: Cerebras are certainly advertising at a loss-leading price and will only have a viable product if they can get extraordinarily high utilisation of their system at this price. I don’t think they can, so they’re basically screwed selling tokens. Maybe this is to attract attention in the hope of selling hardware to someone willing to pay a premium for very low latency, but I suspect it’s just a means of getting one more round of funding in the hope of reducing costs in the next version.
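A quick check of the A100 cost arithmetic above, using the commenter's own figures (13 tokens/s per user at batch 32, a $10k card amortized over 3 years of continuous use):

```python
# Reproducing the A100 cost estimate from the comment above.
# 13 tokens/s per user at batch 32 and the $10k card price are the
# commenter's figures, amortized over 3 years of continuous use.

TOK_PER_SEC_PER_USER = 13
BATCH                = 32
CARD_PRICE_USD       = 10_000
SECONDS_IN_3_YEARS   = 3 * 365 * 24 * 3600

total_tokens = TOK_PER_SEC_PER_USER * BATCH * SECONDS_IN_3_YEARS  # ~39 billion
cost_per_million = CARD_PRICE_USD / (total_tokens / 1e6)          # ~$0.25

print(f"total tokens over 3 years: {total_tokens / 1e9:.1f}B")
print(f"cost per million tokens:   ${cost_per_million:.2f}")
```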
Does Cerebras support reliable structured output, like OpenAI's recent GPT-4o?