September 7th, 2024

How to evaluate performance of LLM inference frameworks

LLM inference frameworks face a "memory wall" limiting performance. Developers should choose frameworks wisely, apply optimizations cautiously, and structure applications for server or offline scenarios to enhance efficiency.


LLM inference frameworks are currently facing a "memory wall": a hardware-imposed limit on performance set by memory bandwidth rather than compute. Developers should focus on understanding their system's memory wall and pick an inference framework that comes close to that limit, rather than getting bogged down in the nuances of individual frameworks. Performance metrics such as requests per second can be misleading, especially because server and offline scenarios yield much higher throughput than single-stream scenarios.

Optimizations such as quantization and sparsity can improve performance, but they must be applied judiciously: aggressive pruning can cause significant accuracy loss, so it is generally advisable to use well-validated models in their published formats. The Lamini inference engine is designed to optimize performance across a range of GPUs, targeting the MLPerf server scenario for maximum throughput.

The memory wall affects transformer models during inference because the model's weights must be streamed from memory for every token generated. Higher-bandwidth memory could in theory raise this ceiling, but practical hardware constraints limit how far it can go. The MLPerf benchmark defines several scenarios for measuring LLM performance; the server and offline scenarios allow better performance by batching requests, and developers are encouraged to structure their applications to take advantage of them.
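
To make the memory wall concrete, the ceiling on single-stream decode speed can be estimated as memory bandwidth divided by the weight bytes that must be streamed per generated token. A minimal sketch of that arithmetic, using illustrative model-size and bandwidth figures that are assumptions rather than numbers from the article:

```python
# Rough roofline estimate: single-stream decoding streams every weight once
# per generated token, so memory bandwidth caps tokens/second.
# All numbers below are illustrative assumptions, not measurements.

def tokens_per_second_upper_bound(num_params: float, bytes_per_param: float,
                                  memory_bandwidth_gb_s: float) -> float:
    """Upper bound on decode throughput at batch size 1."""
    weight_bytes = num_params * bytes_per_param
    bandwidth_bytes_per_s = memory_bandwidth_gb_s * 1e9
    return bandwidth_bytes_per_s / weight_bytes

# Example: a 70B-parameter model in fp16 on a GPU with ~2 TB/s of HBM bandwidth.
bound = tokens_per_second_upper_bound(70e9, 2.0, 2000)
print(f"Memory-wall bound: ~{bound:.0f} tokens/s per stream")  # ~14 tokens/s
```

Any framework measured in the single-stream scenario sits at or below this bound; the question the article raises is how close a given framework gets to it.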

- LLM inference frameworks are limited by a hardware "memory wall."

- Performance metrics can vary significantly between single-stream and server scenarios.

- Caution is advised when applying optimizations like quantization and sparsity.

- The Lamini engine is optimized for high throughput on multiple GPU types.

- Structuring applications for server or offline scenarios can enhance performance, as sketched after this list.
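
The reason the server and offline scenarios score so much higher is that batching lets one pass over the weights serve every request in the batch. A minimal sketch of that effect, under the same illustrative assumptions as above (it ignores KV-cache traffic and the point at which compute or memory capacity becomes the bottleneck):

```python
# Batched decoding amortizes the per-step weight traffic across all requests
# in the batch, which is why server/offline scenarios report far higher
# aggregate throughput than single-stream. Numbers are illustrative.

def aggregate_tokens_per_second(num_params: float, bytes_per_param: float,
                                memory_bandwidth_gb_s: float,
                                batch_size: int) -> float:
    weight_bytes = num_params * bytes_per_param               # streamed once per decode step
    step_time_s = weight_bytes / (memory_bandwidth_gb_s * 1e9)
    return batch_size / step_time_s                           # one token per request per step

for batch in (1, 8, 64):
    tps = aggregate_tokens_per_second(70e9, 2.0, 2000, batch)
    print(f"batch={batch:3d}: ~{tps:.0f} tokens/s total")
```

Aggregate throughput grows roughly linearly with batch size until compute or memory capacity takes over, which is why structuring an application to batch requests into server or offline scenarios pays off.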

2 comments
By @Havoc - 3 months
>Aggressively pruning LLMs via quantization can significantly reduce their accuracy and you might be better off using a smaller model in the first place.

Not sure that is correct. Quantization charts suggest it's a fairly continuous spectrum, i.e. an aggressively quantized 13B ends up about the same as a no-quant 7B:

https://www.researchgate.net/figure/Performance-degradation-...

By @brrrrrm - 3 months
If you're hitting a memory wall, it means you're not scaling. This stuff really doesn't apply to scaled-up inference, but rather to local, small-batch execution.