Hardware Acceleration of LLMs: A comprehensive survey and comparison
The paper reviews hardware acceleration techniques for Large Language Models, comparing frameworks across platforms like FPGA and GPU, addressing evaluation challenges, and contributing to advancements in natural language processing.
The paper "Hardware Acceleration of LLMs: A comprehensive survey and comparison" by Nikoletta Koilia and Christoforos Kachris provides an extensive review of research efforts aimed at enhancing the performance of Large Language Models (LLMs) through hardware acceleration. It discusses various frameworks developed for accelerating transformer networks, focusing on processing platforms such as FPGA, ASIC, In-Memory, and GPU. The authors conduct both qualitative and quantitative comparisons of these frameworks, evaluating factors such as speedup, performance (GOPs), and energy efficiency (GOPs/W). A significant challenge in this comparison arises from the use of different process technologies across studies, which complicates fair evaluation. To address this, the authors extrapolate performance and energy-efficiency results to a common process technology, allowing a more equitable comparison. They also implement parts of the LLMs on various FPGA chips to support their analysis. This work aims to provide a clearer understanding of the trade-offs involved in hardware acceleration for LLMs, contributing to ongoing developments in natural language processing.
- The paper surveys hardware acceleration techniques for Large Language Models (LLMs).
- It compares various frameworks based on processing platforms like FPGA, ASIC, and GPU.
- The authors address challenges in fair comparisons due to differing process technologies.
- Performance and energy efficiency results are extrapolated to a common technology for better evaluation.
- The study contributes to advancements in natural language processing through hardware optimization.
Related
- Memory bandwidth is increasingly recognized as a bottleneck in LLM performance, driving interest in new technologies like Compute-in-Memory (CIM).
- There is curiosity about hybrid architectures that combine FPGA, ASIC, and in-memory technologies to enhance performance and flexibility.
- Some commenters express a desire for more accessible resources and clarity regarding the content on platforms like Arxiv.
- Discussion around the implications of rapid advancements in LLMs raises questions about the longevity and adaptability of specialized hardware.
- Links to related research and articles are shared, indicating a collaborative effort to deepen understanding of the topic.
As early as the 90s it was observed that CPU speed (FLOPs) was improving faster than memory bandwidth. In 1995 William Wulf and Sally McKee predicted this divergence would lead to a “memory wall”, where most computations would be bottlenecked by data access rather than arithmetic operations.
Over the past 20 years peak server hardware FLOPS has been scaling at 3x every 2 years, outpacing the growth of DRAM and interconnect bandwidth, which have only scaled at 1.6 and 1.4 times every 2 years, respectively.
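To make those rates concrete, here is a quick back-of-the-envelope sketch that just compounds the 3x / 1.6x / 1.4x per-2-year figures above (the multipliers are the only inputs; everything else is arithmetic):

```python
# Compounding the per-2-year scaling factors quoted above:
# 3x FLOPS, 1.6x DRAM bandwidth, 1.4x interconnect bandwidth.

def scale(factor_per_2yr: float, years: float) -> float:
    """Total growth after `years`, compounding every 2 years."""
    return factor_per_2yr ** (years / 2)

years = 20
flops = scale(3.0, years)            # ~59,000x
dram_bw = scale(1.6, years)          # ~110x
interconnect_bw = scale(1.4, years)  # ~29x

print(f"FLOPS grew ~{flops:,.0f}x, DRAM BW ~{dram_bw:,.0f}x, "
      f"interconnect BW ~{interconnect_bw:,.0f}x")
print(f"FLOPS outgrew DRAM bandwidth by ~{flops / dram_bw:,.0f}x")
```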
Thus for training and inference of LLMs, the performance bottleneck is increasingly shifting toward memory bandwidth. Particularly for autoregressive Transformer decoder models, it can be the dominant bottleneck.
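A rough roofline-style sketch shows why: at batch size 1, each generated token has to stream essentially all of the weights from memory while doing only ~2 FLOPs per weight, so the arithmetic intensity is far below what modern accelerators need to stay compute-bound. The model size and hardware figures below are illustrative assumptions (roughly a 7B-parameter fp16 model on an A100-class GPU), not numbers from the paper:

```python
# Back-of-the-envelope sketch of why batch-1 autoregressive decoding is
# memory-bandwidth bound. Model size and hardware numbers are illustrative.

params = 7e9                  # model parameters (assumed 7B model)
bytes_per_param = 2           # fp16 weights
flops_per_token = 2 * params  # ~1 multiply-add per parameter per token

weight_bytes = params * bytes_per_param            # bytes streamed per token
arith_intensity = flops_per_token / weight_bytes   # FLOPs per byte moved

peak_flops = 312e12   # ~FP16 tensor throughput of an A100-class GPU
peak_bw = 2e12        # ~HBM bandwidth in bytes/s
machine_balance = peak_flops / peak_bw   # FLOPs/byte needed to be compute-bound

print(f"arithmetic intensity of decode: {arith_intensity:.1f} FLOPs/byte")
print(f"machine balance point:          {machine_balance:.0f} FLOPs/byte")

# Time per token is set by the slower of the two limits:
t_compute = flops_per_token / peak_flops
t_memory = weight_bytes / peak_bw
print(f"compute-limited time/token: {t_compute*1e3:.2f} ms")
print(f"memory-limited time/token:  {t_memory*1e3:.2f} ms  <- dominates")
```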
This is driving the need for new technologies like compute-in-memory (CIM), also known as processing-in-memory (PIM): hardware in which operations are performed directly on the data in memory, rather than transferring it to CPU registers first, thereby improving latency and power consumption and possibly sidestepping the great “memory wall”.
Notably, to compare ASIC and FPGA hardware across varying semiconductor process sizes, the paper uses a fitted polynomial to extrapolate results to a common 16nm node:
> Based on the article by Aaron Stillmaker and B. Baas titled "Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7nm," we extrapolated the performance and the energy efficiency on a 16nm technology to make a fair comparison
But extrapolation for CIM/PIM is not done because they claim:
> As the in-memory accelerators the performance is not based only on the process technology, the extrapolation is performed only on the FPGA and ASIC accelerators where the process technology affects significantly the performance of the systems.
Which strikes me as an odd claim at face value, but perhaps others here could offer further insight on that decision.
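For intuition, the normalization step looks roughly like the sketch below. The scaling factors here are placeholders made up for illustration, not the fitted coefficients from Stillmaker and Baas; the real equations come from their paper:

```python
# Hedged sketch of the kind of normalization described above: projecting a
# reported accelerator result from its native process node to a common 16 nm
# node. The factors below are placeholders, NOT the Stillmaker & Baas fits.

# Hypothetical multipliers: (delay_scale, energy_scale) going from the given
# node to 16 nm. delay_scale < 1 means the circuit would run faster at 16 nm.
NODE_TO_16NM = {
    45: (0.45, 0.30),   # placeholder values
    28: (0.70, 0.55),   # placeholder values
    16: (1.00, 1.00),
}

def to_16nm(gops: float, gops_per_watt: float, node_nm: int):
    """Project performance (GOPs) and energy efficiency (GOPs/W) to 16 nm."""
    delay_scale, energy_scale = NODE_TO_16NM[node_nm]
    gops_16 = gops / delay_scale            # shorter gate delay -> higher throughput
    eff_16 = gops_per_watt / energy_scale   # lower energy/op -> higher GOPs/W
    return gops_16, eff_16

# e.g. an accelerator reported at 45 nm with 120 GOPs and 50 GOPs/W:
print(to_16nm(120, 50, 45))
```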
Links below for further reading.
https://arxiv.org/abs/2403.14123
https://en.m.wikipedia.org/wiki/In-memory_processing
http://vcl.ece.ucdavis.edu/pubs/2017.02.VLSIintegration.Tech...
Each cell would have 4 input bits, 1 each from the neighbors, and 4 output bits, again, one to each neighbor. In the middle would be 64 bits of shift register from a long scan chain, the output of which goes to 4 16:1 multiplexers, and 4 bits of latch.
Through the magic of graph coloring, a checkerboard pattern would be used to clock all of the cells to allow data to flow in any direction without preference, and without race conditions. All of the inputs to any given cell would be stable.
This allows the flexibility of an FPGA, without the need to worry about timing issues or race conditions, glitches, etc. This also keeps all the lines short, so everything is local and fast/low power.
What it doesn't do is use gates efficiently, or give the fastest path for logic. Every single operation happens effectively in parallel; all computation is pipelined.
I've had this idea since about 1982... I really wish someone would pick it up and run with it. I call it the BitGrid.
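A minimal Python sketch of such a cell and the checkerboard update, based on the description above (the class names, the N/E/S/W ordering, and the toroidal wrap-around are assumptions added for illustration):

```python
# One BitGrid cell as described above: 4 input bits, 4 output bits, and 64
# configuration bits acting as four 16-entry lookup tables (one 16:1 mux per
# output, selected by the 4 inputs). The latch is modeled by holding each
# cell's outputs until its clock phase.

class BitGridCell:
    def __init__(self, config_bits):
        assert len(config_bits) == 64
        # Split the 64-bit scan-chain contents into four 16-entry LUTs.
        self.luts = [config_bits[i * 16:(i + 1) * 16] for i in range(4)]
        self.outputs = [0, 0, 0, 0]   # latched outputs (N, E, S, W)

    def update(self, inputs):
        """inputs: 4 bits, one from each neighbor (N, E, S, W)."""
        sel = inputs[0] | (inputs[1] << 1) | (inputs[2] << 2) | (inputs[3] << 3)
        self.outputs = [lut[sel] for lut in self.luts]

# Checkerboard clocking: on each half-cycle only cells of one "color" update,
# so every updating cell sees stable, latched inputs from its neighbors.
def step(grid, phase):
    h, w = len(grid), len(grid[0])
    for y in range(h):
        for x in range(w):
            if (x + y) % 2 != phase:
                continue
            north = grid[(y - 1) % h][x]
            east  = grid[y][(x + 1) % w]
            south = grid[(y + 1) % h][x]
            west  = grid[y][(x - 1) % w]
            # Each cell consumes the bit its neighbor drives toward it.
            inputs = [north.outputs[2],   # north neighbor's south-facing output
                      east.outputs[3],    # east neighbor's west-facing output
                      south.outputs[0],   # south neighbor's north-facing output
                      west.outputs[1]]    # west neighbor's east-facing output
            grid[y][x].update(inputs)
```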
https://arxiv.org/pdf/2406.08413 Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference
Every time I land on that site I'm so confused / lost in its interface (or lack thereof) that I usually end up leaving without getting to the content.