June 25th, 2024

Researchers upend AI status quo by eliminating matrix multiplication in LLMs

Researchers have made AI language models more efficient by eliminating matrix multiplication. The MatMul-free method reduces power consumption and operating costs, and challenges the assumption that matrix multiplication is necessary for high-performing models.

Read original article

Researchers have developed a new method to make AI language models more efficient by eliminating matrix multiplication (MatMul), the computation at the heart of neural network operations and the main workload GPUs are built to accelerate. The study, led by researchers from several universities and tech companies, introduces a MatMul-free approach that could significantly reduce the power consumption and operational costs of AI systems. A custom 2.7 billion parameter model built without MatMul achieved performance comparable to conventional large language models (LLMs), and the team demonstrated a 1.3 billion parameter model running on custom-programmed FPGA hardware that consumed only about 13 watts of power. This challenges the conventional belief that matrix multiplication is essential for high-performing language models and could make large models more accessible and sustainable, especially on resource-constrained devices like smartphones. The work, though not yet peer-reviewed, aims to pave the way for more efficient and hardware-friendly AI architectures and offers a promising alternative to current GPU-intensive approaches.
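To make the idea concrete, here is a minimal, illustrative Python/NumPy sketch (my own; the function names and the exact scaling recipe are assumptions, not the authors' code). Weights are first quantized to {-1, 0, +1}, after which every "multiplication" in a dense layer collapses into adding or subtracting selected inputs plus a single rescale:

import numpy as np

def ternarize(w, eps=1e-5):
    # Absmean-style ternary quantization (an assumption modeled on
    # BitNet-b1.58-style work, not necessarily this paper's exact recipe):
    # scale by the mean absolute weight, round, and clip to {-1, 0, +1}.
    scale = np.abs(w).mean() + eps
    w_q = np.clip(np.round(w / scale), -1, 1)
    return w_q.astype(np.int8), scale

def ternary_linear(x, w_q, scale):
    # With ternary weights, each output is formed by summing the inputs whose
    # weight is +1 and subtracting those whose weight is -1: no multiplies,
    # only accumulation plus one per-layer rescale at the end.
    out = np.zeros((x.shape[0], w_q.shape[1]))
    for j in range(w_q.shape[1]):
        col = w_q[:, j]
        out[:, j] = x[:, col == 1].sum(axis=1) - x[:, col == -1].sum(axis=1)
    return out * scale

# Toy usage: a 4-in, 3-out layer on a batch of 2.
x = np.random.randn(2, 4)
w_q, scale = ternarize(np.random.randn(4, 3))
y = ternary_linear(x, w_q, scale)

This accumulate-instead-of-multiply structure is what makes the approach a natural fit for FPGAs and other custom hardware.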

Related

20x Faster Background Removal in the Browser Using ONNX Runtime with WebGPU

Running ONNX models in the browser with ONNX Runtime, WebGPU, and WebAssembly yields a 20x speedup for background removal while reducing server load, improving scalability, and improving data security. With WebGPU support the models run efficiently enough for near real-time performance, which IMG.LY is using to make its design tools more accessible and efficient.

Optimizing AI Inference at Character.ai

Character.AI optimizes AI inference for LLMs, handling 20,000+ queries/sec globally. Innovations like Multi-Query Attention and int8 quantization reduced serving costs by 33x since late 2022, aiming to enhance AI capabilities worldwide.
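Multi-Query Attention, one of the techniques named above, is easy to sketch: all query heads share a single key/value head, which shrinks the KV cache (and with it, serving cost) roughly in proportion to the number of heads. Below is an illustrative NumPy sketch of the idea, not Character.AI's implementation:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(q, k, v):
    # q: (heads, seq, d) per-head queries; k, v: (seq, d) one shared head.
    # Every query head attends against the same keys and values, so only
    # one K/V head ever needs to be cached per token.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    return weights @ v                  # (heads, seq, d)

heads, seq, d = 8, 16, 64
out = multi_query_attention(
    np.random.randn(heads, seq, d),
    np.random.randn(seq, d),
    np.random.randn(seq, d),
)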

Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU

The article discusses the release of the open-source Llama3 70B model, highlighting its performance compared to GPT-4 and Claude3 Opus, and emphasizes training enhancements, data quality, and the competition between open- and closed-source models.

How to run an LLM on your PC, not in the cloud, in less than 10 minutes

You can easily set up and run large language models (LLMs) on your PC using tools like Ollama, LM Suite, and Llama.cpp. Ollama supports AVX2-compatible CPUs and, more recently, select AMD Radeon GPUs; installation is straightforward across systems, and it offers simple commands for managing models.
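If you prefer scripting over the command line, the official ollama Python package wraps the same local server. A minimal sketch, assuming Ollama is installed and the model has already been pulled (e.g. with "ollama pull llama3"):

# pip install ollama
import ollama

response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Explain MatMul-free LLMs in one sentence."}],
)
print(response["message"]["content"])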

Researchers run high-performing LLM on the energy needed to power a lightbulb

Researchers at UC Santa Cruz developed an energy-efficient method for large language models. By using custom hardware and ternary numbers, they achieved high performance with minimal power consumption, potentially revolutionizing model power efficiency.

11 comments
By @tomohelix - 5 months
The relevant paper: https://arxiv.org/abs/2406.02528

In summary, they forced the model to process data in a ternary system and then built a custom FPGA chip to process the data more efficiently. Tested to be "comparable" to small models (3B); theoretically it scales to 70B, and it's unknown for SOTAs (>100B params).

We have always known custom chips are more efficient, especially for tasks like these where it is basically approximating an analog process (i.e. the brain). What is impressive is how fast it is progressing. These 3B param models would demolish GPT-2, which was, what, 4-5 years old? And they would be pure scifi tech 10 years ago.

Now they can run on your phone.

A machine, running locally on your phone, that can listen and respond to anything a human may say. Who could have confidently claimed this 10 years ago?

By @anon291 - 5 months
Note that the architecture does use matmuls. They just defined ternary matmuls to not be 'real' matrix multiplication. I mean... it is certainly a good thing for power consumption to be wrangling fewer bits, but from a semantic standpoint, it is matrix multiplication.
By @JKCalhoun - 5 months
"Call my broker, tell him to sell all my NVDA!"

Combined with the earlier paper this year that claimed LLMs work fine (and faster) with trinary numbers (rather than floats? or long ints?), the idea of running a quick LLM locally is looking better and better.

By @ChrisArchitect - 5 months
[dupe]

Some more discussion a few weeks ago: https://news.ycombinator.com/item?id=40620955

By @bee_rider - 5 months
Noooooooo

The whole point of AI was to sell premium GEMMs and come up with funky low precision accelerators.

By @mysteria - 5 months
There's additional discussion on the same research in an earlier thread [1].

https://news.ycombinator.com/item?id=40787349

By @MiguelX413 - 5 months
By @aixpert - 5 months
These quantizations are throwing away an advantage of analog computers: the ability to handle imprecise "floats".
By @skeledrew - 5 months
Heh, Nvidia may want to take steps to bury this. Will likely be a humongous loss for them if it pans out.