Researchers run high-performing LLM on the energy needed to power a lightbulb
Researchers at UC Santa Cruz developed an energy-efficient method for running large language models. By using custom hardware and ternary numbers, they achieved high performance with minimal power consumption, potentially revolutionizing model power efficiency.
Researchers at UC Santa Cruz have developed a method to significantly improve the energy efficiency of large language models while maintaining performance. By eliminating matrix multiplication, the most computationally expensive operation, and running their algorithm on custom hardware, they were able to power a billion-parameter-scale language model on just 13 watts, about the energy needed to power a lightbulb. The approach uses ternary numbers to reduce computation to summing rather than multiplying, and the resulting model achieved the same performance as state-of-the-art models while being over 50 times more efficient than on typical hardware. The researchers' custom hardware let the model run beyond human-readable throughput on minimal power consumption. The team believes that further optimization could lead to even greater energy efficiency, potentially revolutionizing the way large language models are powered in the future.
Related
Optimizing AI Inference at Character.ai
Character.AI optimizes AI inference for LLMs, handling 20,000+ queries/sec globally. Innovations like Multi-Query Attention and int8 quantization reduced serving costs by 33x since late 2022, aiming to enhance AI capabilities worldwide.
OpenPipe Mixture of Agents: Outperform GPT-4 at 1/25th the Cost
OpenPipe's cost-effective agent mixture surpasses GPT-4, promising advanced language processing at a fraction of the cost. This innovation could disrupt the market with its high-performance, affordable language solutions.
Testing Generative AI for Circuit Board Design
A study tested Large Language Models (LLMs) like GPT-4o, Claude 3 Opus, and Gemini 1.5 for circuit board design tasks. Results showed varied performance, with Claude 3 Opus excelling in specific questions, while others struggled with complexity. Gemini 1.5 showed promise in parsing datasheet information accurately. The study emphasized the potential and limitations of using AI models in circuit board design.
How to run an LLM on your PC, not in the cloud, in less than 10 minutes
You can easily set up and run large language models (LLMs) on your PC using tools like Ollama, LM Studio, and Llama.cpp. Ollama supports AMD GPUs and AVX2-compatible CPUs, with straightforward installation across different systems. It offers commands for managing models and now supports select AMD Radeon cards.
Francois Chollet – LLMs won't lead to AGI – $1M Prize to find solution [video]
The video discusses limitations of large language models in AI, emphasizing genuine understanding and problem-solving skills. A prize incentivizes AI systems showcasing these abilities. Adaptability and knowledge acquisition are highlighted as crucial for true intelligence.
Code: https://github.com/ridgerchu/matmulfreellm
---
Like others before them, the authors train LLMs whose parameters are ternary digits, or trits, taking values in {-1, 0, 1}.
What's new is that the authors then build a custom hardware solution on an FPGA and run billion-parameter LLMs consuming only 13W, moving LLM inference closer to brain-like efficiency.
Sure, it's on an FPGA, and it's only a lab experiment, but we're talking about an early proof of concept, not a commercial product.
As far as I know, this is the first energy-efficient hardware implementation of tritwise LLMs. That seems like a pretty big deal to me.
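For the curious, the weight quantization itself is simple to sketch. Something like the "absmean" recipe from the BitNet b1.58 line of work this builds on - a rough illustration in NumPy, not the authors' actual code (that's at https://github.com/ridgerchu/matmulfreellm):

    import numpy as np

    def quantize_ternary(W, eps=1e-8):
        # Per-tensor "absmean" scale, then round each weight to the
        # nearest of {-1, 0, +1}. Every parameter becomes one trit
        # plus a single shared float scale for the whole tensor.
        scale = np.abs(W).mean() + eps
        W_trit = np.clip(np.round(W / scale), -1, 1).astype(np.int8)
        return W_trit, scale

The point is that the full-precision weight matrix disappears; what's left is trits plus one scalar per tensor.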
Compared to what? I wouldn't defend LLMs as "worth their electricity" quite yet, and they are definitely less efficient than a lot of other software, but I'd still like to see how this compares to gaming consoles, email servers, the advertising industry's hosting costs, cryptocurrency, and so on. It just doesn't seem worth singling out the carbon footprint of AI yet.
The actual paper is here: https://arxiv.org/abs/2406.02528
The key part from the summary:
> To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency.
There is a lot of unnecessary obfuscation of the numbers going on in the abstract as well, which is unfortunate. Instead of quoting the numbers they call it “billion-parameter scale” and “beyond human readable throughput”.
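To make the quoted "lightweight operations" concrete: multiplying by a trit isn't really a multiplication at all - each weight just selects add, subtract, or skip. A toy NumPy version of the idea (illustrative only; the FPGA obviously implements this very differently):

    import numpy as np

    def ternary_matvec(W_trit, x, scale):
        # y = scale * (W_trit @ x), computed without any weight
        # multiplications: +1 adds, -1 subtracts, 0 is skipped.
        y = np.empty(W_trit.shape[0], dtype=x.dtype)
        for i, row in enumerate(W_trit):
            y[i] = x[row == 1].sum() - x[row == -1].sum()
        return scale * y

An accumulate-only datapath like that is cheap in both area and energy on an FPGA, which is presumably a big part of the 13W figure.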
That's a really, _really_ big difference in memory usage, and since this scales sub-linearly (a 300M-param model uses 0.21GB, a 13B model uses 4.19GB), a 70B model would fit on an RTX 4090. I think people currently often run 34B models with 4-bit quants on that, so I would like to see some larger models trained on more tokens with this approach.
Also, their 2.7B model took 173 hours on 8 NVIDIA H100 GPUs, and that also seems to scale roughly linearly with parameter count, so a company with access to a small cluster of those DGX pods (say 8) could train such a model in about 30 days (back-of-envelope below) - though the 100B-token training set might be lackluster for SotA; maybe someone else could chime in on that.
If anyone can offer insight that would be greatly appreciated
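For what it's worth, a quick back-of-envelope using only the numbers quoted above (my arithmetic, not from the paper) seems to support both estimates:

    # Memory: bytes per parameter implied by the reported figures.
    b300m = 0.21e9 / 300e6        # ~0.70 bytes/param at 300M
    b13b = 4.19e9 / 13e9          # ~0.32 bytes/param at 13B
    mem70b = 70e9 * b13b / 1e9    # ~22.6 GB at the 13B rate
    # -> fits in an RTX 4090's 24 GB, as suggested above.

    # Training: 2.7B took 173 h on 8x H100; assume linear scaling in
    # parameter count and perfect scaling across 64 GPUs
    # (8 DGX pods of 8 H100s each), for a hypothetical 70B model.
    hours = 173 * (70 / 2.7) * (8 / 64)   # ~561 h
    days = hours / 24                     # ~23 days, i.e. "about 30 days"
                                          # once real-world overhead is added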
My professor at the time was on his last legs in terms of impact, back when ANNs were looked down on, right before someone had the bright idea of using video cards.
I hope he’s doing well/retired on a high note.
The FPGA ternary implementation is also really interesting!
Discussion a few weeks ago: https://news.ycombinator.com/item?id=40620955