Hardware Acceleration of LLMs: A comprehensive survey and comparison
The paper reviews hardware acceleration techniques for Large Language Models, comparing frameworks across platforms like FPGA and GPU, addressing evaluation challenges, and contributing to advancements in natural language processing.
The paper "Hardware Acceleration of LLMs: A comprehensive survey and comparison" by Nikoletta Koilia and Christoforos Kachris provides an extensive review of research efforts aimed at enhancing the performance of Large Language Models (LLMs) through hardware acceleration. It discusses various frameworks developed for accelerating transformer networks, focusing on processing platforms such as FPGA, ASIC, In-Memory, and GPU. The authors conduct both qualitative and quantitative comparisons of these frameworks, evaluating factors such as speedup, performance (GOPs), and energy efficiency (GOPs/W). A significant challenge in this comparison arises from the use of different process technologies across studies, which complicates fair evaluation. To address this, the authors extrapolate performance and energy-efficiency results to a common process technology, allowing a more equitable comparison. They also implement parts of the LLMs on various FPGA chips to support their analysis. This work aims to provide a clearer understanding of the trade-offs involved in hardware acceleration for LLMs, contributing to ongoing developments in natural language processing.
- The paper surveys hardware acceleration techniques for Large Language Models (LLMs).
- It compares various frameworks based on processing platforms like FPGA, ASIC, and GPU.
- The authors address challenges in fair comparisons due to differing process technologies.
- Performance and energy efficiency results are extrapolated to a common technology for better evaluation.
- The study contributes to advancements in natural language processing through hardware optimization.
Related
- Memory bandwidth is increasingly recognized as a bottleneck in LLM performance, driving interest in new technologies like Compute-in-Memory (CIM).
- There is curiosity about hybrid architectures that combine FPGA, ASIC, and in-memory technologies to enhance performance and flexibility.
- Some commenters express a desire for more accessible resources and clarity regarding the content on platforms like Arxiv.
- Discussion around the implications of rapid advancements in LLMs raises questions about the longevity and adaptability of specialized hardware.
- Links to related research and articles are shared, indicating a collaborative effort to deepen understanding of the topic.
As early as the 90s it was observed that CPU speed (FLOPs) was improving faster than memory bandwidth. In 1995 William Wulf and Sally McKee predicted this divergence would lead to a “memory wall”, where most computations would be bottlenecked by data access rather than arithmetic operations.
Over the past 20 years peak server hardware FLOPS has been scaling at 3x every 2 years, outpacing the growth of DRAM and interconnect bandwidth, which have only scaled at 1.6 and 1.4 times every 2 years, respectively.
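To make those rates concrete, here is a quick back-of-the-envelope sketch that just compounds the 3x / 1.6x / 1.4x per-2-year figures above (the multipliers are the only inputs; everything else is arithmetic):

```python
# Compounding the per-2-year scaling factors quoted above:
# 3x FLOPS, 1.6x DRAM bandwidth, 1.4x interconnect bandwidth.

def scale(factor_per_2yr: float, years: float) -> float:
    """Total growth after `years`, compounding every 2 years."""
    return factor_per_2yr ** (years / 2)

years = 20
flops = scale(3.0, years)            # ~59,000x
dram_bw = scale(1.6, years)          # ~110x
interconnect_bw = scale(1.4, years)  # ~29x

print(f"FLOPS grew ~{flops:,.0f}x, DRAM BW ~{dram_bw:,.0f}x, "
      f"interconnect BW ~{interconnect_bw:,.0f}x")
print(f"FLOPS outgrew DRAM bandwidth by ~{flops / dram_bw:,.0f}x")
```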
Thus for training and inference of LLMs, the performance bottleneck is increasingly shifting toward memory bandwidth. Particularly for autoregressive Transformer decoder models, it can be the dominant bottleneck.
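A rough roofline-style sketch shows why: at batch size 1, each generated token has to stream essentially all of the weights from memory while doing only ~2 FLOPs per weight, so the arithmetic intensity is far below what modern accelerators need to stay compute-bound. The model size and hardware figures below are illustrative assumptions (roughly a 7B-parameter fp16 model on an A100-class GPU), not numbers from the paper:

```python
# Back-of-the-envelope sketch of why batch-1 autoregressive decoding is
# memory-bandwidth bound. Model size and hardware numbers are illustrative.

params = 7e9                  # model parameters (assumed 7B model)
bytes_per_param = 2           # fp16 weights
flops_per_token = 2 * params  # ~1 multiply-add per parameter per token

weight_bytes = params * bytes_per_param            # bytes streamed per token
arith_intensity = flops_per_token / weight_bytes   # FLOPs per byte moved

peak_flops = 312e12   # ~FP16 tensor throughput of an A100-class GPU
peak_bw = 2e12        # ~HBM bandwidth in bytes/s
machine_balance = peak_flops / peak_bw   # FLOPs/byte needed to be compute-bound

print(f"arithmetic intensity of decode: {arith_intensity:.1f} FLOPs/byte")
print(f"machine balance point:          {machine_balance:.0f} FLOPs/byte")

# Time per token is set by the slower of the two limits:
t_compute = flops_per_token / peak_flops
t_memory = weight_bytes / peak_bw
print(f"compute-limited time/token: {t_compute*1e3:.2f} ms")
print(f"memory-limited time/token:  {t_memory*1e3:.2f} ms  <- dominates")
```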
This is driving the need for new technologies like compute-in-memory (CIM), also known as processing-in-memory (PIM): hardware in which operations are performed directly on the data in memory, rather than transferring it to CPU registers first, thereby improving latency and power consumption and possibly sidestepping the great “memory wall”.
Notably, to compare ASIC and FPGA hardware across varying semiconductor process sizes, the paper uses a fitted polynomial to extrapolate results to a common 16nm node:
> Based on the article by Aaron Stillmaker and B. Baas titled "Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7nm," we extrapolated the performance and the energy efficiency on a 16nm technology to make a fair comparison
But extrapolation for CIM/PIM is not done because they claim:
> As the in-memory accelerators the performance is not based only on the process technology, the extrapolation is performed only on the FPGA and ASIC accelerators where the process technology affects significantly the performance of the systems.
Which strikes me as an odd claim at face value, but perhaps others here could offer further insight on that decision.
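For intuition, the normalization step looks roughly like the sketch below. The scaling factors here are placeholders made up for illustration, not the fitted coefficients from Stillmaker and Baas; the real equations come from their paper:

```python
# Hedged sketch of the kind of normalization described above: projecting a
# reported accelerator result from its native process node to a common 16 nm
# node. The factors below are placeholders, NOT the Stillmaker & Baas fits.

# Hypothetical multipliers: (delay_scale, energy_scale) going from the given
# node to 16 nm. delay_scale < 1 means the circuit would run faster at 16 nm.
NODE_TO_16NM = {
    45: (0.45, 0.30),   # placeholder values
    28: (0.70, 0.55),   # placeholder values
    16: (1.00, 1.00),
}

def to_16nm(gops: float, gops_per_watt: float, node_nm: int):
    """Project performance (GOPs) and energy efficiency (GOPs/W) to 16 nm."""
    delay_scale, energy_scale = NODE_TO_16NM[node_nm]
    gops_16 = gops / delay_scale            # shorter gate delay -> higher throughput
    eff_16 = gops_per_watt / energy_scale   # lower energy/op -> higher GOPs/W
    return gops_16, eff_16

# e.g. an accelerator reported at 45 nm with 120 GOPs and 50 GOPs/W:
print(to_16nm(120, 50, 45))
```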
Links below for further reading.
https://arxiv.org/abs/2403.14123
https://en.m.wikipedia.org/wiki/In-memory_processing
http://vcl.ece.ucdavis.edu/pubs/2017.02.VLSIintegration.Tech...
Each cell would have 4 input bits, 1 each from the neighbors, and 4 output bits, again, one to each neighbor. In the middle would be 64 bits of shift register from a long scan chain, the output of which goes to 4 16:1 multiplexers, and 4 bits of latch.
Through the magic of graph coloring, a checkerboard pattern would be used to clock all of the cells to allow data to flow in any direction without preference, and without race conditions. All of the inputs to any given cell would be stable.
This allows the flexibility of an FPGA, without the need to worry about timing issues or race conditions, glitches, etc. This also keeps all the lines short, so everything is local and fast/low power.
What it doesn't do is use gates efficiently, or give the fastest path for logic. Every single operation happens effectively in parallel; all computation is pipelined.
I've had this idea since about 1982... I really wish someone would pick it up and run with it. I call it the BitGrid.
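A minimal Python sketch of such a cell and the checkerboard update, based on the description above (the class names, the N/E/S/W ordering, and the toroidal wrap-around are assumptions added for illustration):

```python
# One BitGrid cell as described above: 4 input bits, 4 output bits, and 64
# configuration bits acting as four 16-entry lookup tables (one 16:1 mux per
# output, selected by the 4 inputs). The latch is modeled by holding each
# cell's outputs until its clock phase.

class BitGridCell:
    def __init__(self, config_bits):
        assert len(config_bits) == 64
        # Split the 64-bit scan-chain contents into four 16-entry LUTs.
        self.luts = [config_bits[i * 16:(i + 1) * 16] for i in range(4)]
        self.outputs = [0, 0, 0, 0]   # latched outputs (N, E, S, W)

    def update(self, inputs):
        """inputs: 4 bits, one from each neighbor (N, E, S, W)."""
        sel = inputs[0] | (inputs[1] << 1) | (inputs[2] << 2) | (inputs[3] << 3)
        self.outputs = [lut[sel] for lut in self.luts]

# Checkerboard clocking: on each half-cycle only cells of one "color" update,
# so every updating cell sees stable, latched inputs from its neighbors.
def step(grid, phase):
    h, w = len(grid), len(grid[0])
    for y in range(h):
        for x in range(w):
            if (x + y) % 2 != phase:
                continue
            north = grid[(y - 1) % h][x]
            east  = grid[y][(x + 1) % w]
            south = grid[(y + 1) % h][x]
            west  = grid[y][(x - 1) % w]
            # Each cell consumes the bit its neighbor drives toward it.
            inputs = [north.outputs[2],   # north neighbor's south-facing output
                      east.outputs[3],    # east neighbor's west-facing output
                      south.outputs[0],   # south neighbor's north-facing output
                      west.outputs[1]]    # west neighbor's east-facing output
            grid[y][x].update(inputs)
```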
https://arxiv.org/pdf/2406.08413 Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference
Every time I land on that site I'm so confused / lost in its interface (or lack thereof) that I usually end up leaving without getting to the content.