June 25th, 2024

Researchers run high-performing LLM on the energy needed to power a lightbulb

Researchers at UC Santa Cruz developed an energy-efficient method for large language models. By using custom hardware and ternary numbers, they achieved high performance with minimal power consumption, potentially revolutionizing model power efficiency.

Researchers at UC Santa Cruz have developed a method that significantly improves the energy efficiency of large language models while maintaining performance. By eliminating matrix multiplication, the most computationally expensive element of these models, and running their algorithm on custom hardware, they ran a billion-parameter-scale language model on just 13 watts, roughly the power drawn by a lightbulb. The approach uses ternary numbers, which reduces computation to summing rather than multiplying. The resulting model reportedly achieved the same performance as state-of-the-art models while being over 50 times more efficient than on typical hardware. On the custom hardware, the model generated text faster than a human can read while consuming minimal power. The team believes that further optimization could lead to even greater energy efficiency, potentially changing how large language models are powered in the future.
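To illustrate why ternary weights turn multiplication into summation, here is a minimal sketch in plain Python. It is not the researchers' implementation; the function name and the example values are made up for illustration. With every weight in {-1, 0, 1}, each output is just a running sum: add the input where the weight is 1, subtract it where the weight is -1, and skip it where the weight is 0.

```python
def ternary_matvec(weights, x):
    """Matrix-vector product for weights in {-1, 0, 1} using only add/subtract."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # weight +1: add the input
            elif w == -1:
                acc -= xi      # weight -1: subtract the input
            # weight 0: the input is skipped entirely
        out.append(acc)
    return out


# Hypothetical 2x3 ternary weight matrix and input vector.
weights = [[1, 0, -1],
           [-1, -1, 1]]
x = [0.5, 2.0, -1.5]
print(ternary_matvec(weights, x))  # [2.0, -4.0]
```

The result matches an ordinary matrix-vector product, but no multiplication instruction is needed, which is the property custom hardware can exploit.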

Related

Optimizing AI Inference at Character.ai

Character.AI optimizes AI inference for LLMs, handling 20,000+ queries/sec globally. Innovations like Multi-Query Attention and int8 quantization reduced serving costs by 33x since late 2022, aiming to enhance AI capabilities worldwide.

OpenPipe Mixture of Agents: Outperform GPT-4 at 1/25th the Cost

OpenPipe's cost-effective agent mixture surpasses GPT-4, promising advanced language processing at a fraction of the cost. This innovation could disrupt the market with its high-performance, affordable language solutions.

Testing Generative AI for Circuit Board Design

A study tested Large Language Models (LLMs) like GPT-4o, Claude 3 Opus, and Gemini 1.5 for circuit board design tasks. Results showed varied performance, with Claude 3 Opus excelling in specific questions, while others struggled with complexity. Gemini 1.5 showed promise in parsing datasheet information accurately. The study emphasized the potential and limitations of using AI models in circuit board design.

How to run an LLM on your PC, not in the cloud, in less than 10 minutes

You can easily set up and run large language models (LLMs) on your PC using tools like Ollama, LM Suite, and Llama.cpp. Ollama supports AMD GPUs and AVX2-compatible CPUs, with straightforward installation across different systems. It offers commands for managing models and now supports select AMD Radeon cards.

Francois Chollet – LLMs won't lead to AGI – $1M Prize to find solution [video]

The video discusses limitations of large language models in AI, emphasizing genuine understanding and problem-solving skills. A prize incentivizes AI systems showcasing these abilities. Adaptability and knowledge acquisition are highlighted as crucial for true intelligence.

20 comments
By @cs702 - 7 months
Paper: https://arxiv.org/abs/2406.02528 -- always better than a press release.

Code: https://github.com/ridgerchu/matmulfreellm

---

Like others before them, the authors train LLMs using parameters consisting of ternary digits, or trits, with values in {-1, 0, 1}.

What's new is that the authors then build a custom hardware solution on an FPGA and run billion-parameter LLMs consuming only 13W, moving LLM inference closer to brain-like efficiency.

Sure, it's on an FPGA, and it's only a lab experiment, but we're talking about an early proof of concept, not a commercial product.

As far as I know, this is the first energy-efficient hardware implementation of tritwise LLMs. That seems like a pretty big deal to me.

By @blixt - 7 months
> It costs $700,000 per day in energy costs to run ChatGPT 3.5, according to recent estimates, and leaves behind a massive carbon footprint in the process.

Compared to what? I wouldn't defend LLMs as "worth their electricity" quite yet, and they are definitely less efficient than a lot of other software, but I'd still like to see how this compares to gaming consoles, or email servers, the advertising industry hosting costs, cryptocurrency, and so on. Just doesn't seem worth pointing out the carbon footprint of AI just yet.

By @Aurornis - 7 months
The press release is devoid of useful information, unsurprisingly. You can run an LLM under almost any energy envelope if you’re willing to wait long enough for the result. Total energy consumed and the time difference are the more important metrics.

The actual paper is here: https://arxiv.org/abs/2406.02528

The key part from the summary:

> To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency.

There is a lot of unnecessary obfuscation of the numbers going on in the abstract as well, which is unfortunate. Instead of quoting the numbers they call it “billion-parameter scale” and “beyond human readable throughput”.
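To make the point about total energy concrete, here is a rough sketch of the arithmetic. Only the 13 W figure comes from the abstract; the throughput numbers are placeholders I made up, since the abstract does not quote them. Energy per token is power divided by throughput, so a low-wattage device only wins if it is not proportionally slower.

```python
def joules_per_token(power_watts, tokens_per_second):
    """Energy spent per generated token, in joules (watts = joules/second)."""
    return power_watts / tokens_per_second


def total_energy_wh(power_watts, tokens_per_second, n_tokens):
    """Total energy in watt-hours to generate n_tokens at a steady rate."""
    seconds = n_tokens / tokens_per_second
    return power_watts * seconds / 3600.0


# 13 W is from the abstract; 26 tokens/s is a HYPOTHETICAL throughput.
print(joules_per_token(13.0, 26.0))    # 0.5 J per token

# Both numbers here are HYPOTHETICAL, chosen only for comparison.
print(joules_per_token(100.0, 50.0))   # 2.0 J per token
```

Swap in measured throughput from the paper to get a real comparison; the wattage alone says nothing about the energy cost of a finished response.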

By @Escapado - 7 months
> "For the largest model size of 13B parameters, the MatMul-free LM uses only 4.19 GB of GPU memory and has a latency of 695.48 ms, whereas Transformer++ requires 48.50 GB of memory and exhibits a latency of 3183.10 ms"

That's a really, _really_ big difference in memory usage, and since this scales sub-linearly (the 300M-parameter model uses 0.21 GB, the 13B model uses 4.19 GB), a 70B model would fit on an RTX 4090. I think people currently often run 34B models with 4-bit quants on that card, so I would like to see some larger models trained on more tokens with this approach.

Also, their 2.7B model took 173 hours on 8 NVIDIA H100 GPUs, and that seems to scale roughly linearly with parameter count, so a company with access to a small cluster of those DGX pods (say 8) could train such a model in about 30 days. The 100B-token training set might be lackluster for SotA, though; maybe someone else could chime in on that.
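Here is a back-of-the-envelope sketch of both estimates, using only the figures quoted in this comment. The assumptions are mine and are loud ones: a power law fitted through just two memory data points, training time that grows linearly with parameter count at a fixed token budget, perfect scaling across GPUs, and 8 GPUs per DGX pod (so 8 pods means 64 GPUs).

```python
import math

# Figures as quoted in the comment: (parameters in billions, memory in GB).
SMALL = (0.3, 0.21)
LARGE = (13.0, 4.19)


def fit_power_law(p1, p2):
    """Return (a, k) so that memory = a * params**k passes through both points."""
    (x1, y1), (x2, y2) = p1, p2
    k = math.log(y2 / y1) / math.log(x2 / x1)
    a = y1 / x1 ** k
    return a, k


def estimate_memory_gb(params_b):
    """Extrapolate memory use, ASSUMING the two-point power law keeps holding."""
    a, k = fit_power_law(SMALL, LARGE)
    return a * params_b ** k


def training_days(params_b, ref_params_b=2.7, ref_hours=173.0,
                  ref_gpus=8, gpus=64):
    """ASSUMES time is linear in parameters and GPUs scale perfectly."""
    hours = ref_hours * (params_b / ref_params_b) * (ref_gpus / gpus)
    return hours / 24.0


_, exponent = fit_power_law(SMALL, LARGE)
print(f"fitted exponent: {exponent:.2f}")          # below 1 means sub-linear
print(f"70B memory estimate: {estimate_memory_gb(70):.1f} GB")
print(f"70B training estimate: {training_days(70):.1f} days on 64 GPUs")
```

Under these assumptions the 70B memory estimate stays below the 24 GB of an RTX 4090 and the training estimate lands in the same ballpark as the "about 30 days" guess, but two data points cannot confirm the scaling law, so treat both numbers as rough.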

By @syntaxing - 7 months
For reference, this is a real prototype for BitNet b1.58, which uses {-1, 0, 1} as the weights, which simplifies the matrix multiplication [0].

[0] https://arxiv.org/abs/2402.17764
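For anyone wondering how full-precision weights end up in {-1, 0, 1}, here is a minimal sketch of absmean quantization as I understand it from [0]: divide the weights by their mean absolute value, round to the nearest integer, and clip to [-1, 1]. The function name and example matrix are made up for illustration, and this is not the authors' code; in particular, it shows only the rounding step, not the quantization-aware training around it.

```python
def absmean_ternarize(weights, eps=1e-8):
    """Map a 2-D list of float weights to {-1, 0, 1} plus a scale factor."""
    flat = [abs(w) for row in weights for w in row]
    gamma = sum(flat) / len(flat)          # mean absolute weight

    def round_clip(value):
        # Round to the nearest integer, then clamp into [-1, 1].
        return max(-1, min(1, round(value)))

    ternary = [[round_clip(w / (gamma + eps)) for w in row] for row in weights]
    return ternary, gamma


# Hypothetical full-precision weights.
weights = [[0.9, -0.1],
           [-1.4, 0.4]]
ternary, gamma = absmean_ternarize(weights)
print(ternary)   # [[1, 0], [-1, 1]]
```

The scale `gamma` is kept alongside the ternary matrix so outputs can be rescaled afterwards; small weights collapse to 0 and large ones saturate at 1 or -1.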

By @ijustlovemath - 7 months
I'm just curious how you close timing on a billion parameter model! I used to TA a digital design course that heavily involved FPGAs, and any kind of image or sprite usage, even on the order of megabytes, would crank up compile times insanely and sometimes even fail 30-45min into the build!

If anyone can offer insight that would be greatly appreciated

By @Retr0id - 7 months
That's 13 Watts apparently, in non-American units.

By @bottom999mottob - 7 months
It looks like this is a quantization method to flatten matrices for vector addition. Can anyone explain how this could allow LLMs to reach current benchmarks without losing performance?

By @DennisP - 7 months
Don't FPGAs have some overhead? How much better could this be on a custom ASIC?

By @nobodyandproud - 7 months
It’s cool that my ANN course from twenty years back gives me enough background to understand this and quantization.

My professor at the time was on his last legs in terms of impact, back when ANNs were looked down on, right before someone had the bright idea of using video cards.

I hope he’s doing well/retired on a high note.

By @truckerbill - 7 months
I'd like to play around with ideas like this (training in ternary for example). Does anyone have any links to good reading materials or resources?

The FPGA trinary implementation is also really interesting!

By @ChrisArchitect - 7 months
[dupe]

Discussion a few weeks ago: https://news.ycombinator.com/item?id=40620955

By @foreverpiano - 7 months
Could this work be packaged with some easy-to-use APIs, so that it is easy to apply to diffusion or other models?

By @datameta - 7 months
I'm curious about the specs of the FPGA used - I imagine fairly high-end?

By @moffkalast - 7 months
Lightbulbs are probably not the best thing to compare against; they're famously high in energy consumption and terribly inefficient. Even LEDs are absolutely awful and barely crack 30% total efficiency; most of what they make is heat.

By @daralthus - 7 months
Does anyone know what the patent situation around this is?

By @throwawaymaths - 7 months
Incandescent, I presume. Mistral-7b on an Nvidia 3060 draws about 100-odd watts of power.

By @onesphere - 7 months
And for candle power?

By @m3kw9 - 7 months
Why not release the FPGA code and the LLM algorithm so we can replicate it?