Fast LLM Inference From Scratch (using CUDA)
The article outlines the development of a C++ and CUDA-based LLM inference engine, covering single-GPU throughput optimizations, the central role of memory bandwidth, and benchmarks against existing engines.
This article discusses the development of a large language model (LLM) inference engine using C++ and CUDA, focusing on optimizing single-GPU inference throughput without relying on existing libraries. The author emphasizes the importance of understanding the full stack of LLM inference, particularly as models are increasingly deployed on consumer devices. The implementation aims to load weights from common open models and achieve high token throughput, surpassing existing engines such as llama.cpp. The article provides a recap of LLM architectures, detailing the inference process, including the prefill and decode steps. It highlights the significance of memory bandwidth in inference performance, noting that modern hardware is often memory-bandwidth-bound. The author presents benchmarks comparing various inference engines, revealing that optimizations such as multithreading and weight quantization can significantly enhance throughput. The article also discusses the challenges of implementing efficient inference on CPUs, including the use of SIMD and the potential benefits of quantization. Overall, the piece serves as a guide for developers interested in building efficient LLM inference engines from scratch.
- The article focuses on building an LLM inference engine using C++ and CUDA without libraries.
- It emphasizes the importance of understanding LLM inference mechanics and optimizations.
- Memory bandwidth is identified as a critical factor affecting inference performance (a back-of-the-envelope sketch of this bound follows the list).
- The author presents benchmarks comparing different inference engines and highlights optimization techniques.
- The implementation aims to achieve high token throughput on consumer devices.
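To make the memory-bandwidth point concrete, here is a rough, illustrative estimate; the specific numbers below are assumptions (an 8B-parameter model in bf16 on a ~1 TB/s GPU such as an RTX 3090 Ti), not figures from the article. During decode, each generated token has to stream essentially all of the model weights from GPU memory, so an upper bound on decode throughput is roughly memory bandwidth divided by the size of the weights.

    /* Back-of-the-envelope decode roofline; all numbers are illustrative
       assumptions, not measurements from the article. */
    #include <stdio.h>

    int main(void) {
        double params          = 8e9;     /* Llama-3-8B-class model          */
        double bytes_per_param = 2.0;     /* bf16/fp16 weights               */
        double bandwidth       = 1008e9;  /* bytes/s, RTX 3090 Ti spec sheet */

        double weight_bytes = params * bytes_per_param;   /* ~16 GB          */
        double tokens_per_s = bandwidth / weight_bytes;   /* ~63 tokens/s    */

        /* Each decoded token reads the full weights at least once, so this
           bound ignores KV-cache traffic, activations, and compute time.   */
        printf("bandwidth-bound decode limit: ~%.0f tokens/s\n", tokens_per_s);
        return 0;
    }

The roughly 50 tokens/s figures reported in the comments below sit under this kind of bound, which is why decode is treated as memory-bandwidth-bound rather than compute-bound.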
Related
Benchmarking LLM Inference Back Ends: VLLM, LMDeploy, MLC-LLM, TensorRT-LLM, TGI
Selecting the right inference backend for large language models is crucial for user experience and cost efficiency. A benchmark study by BentoML compared various backends, highlighting LMDeploy's decoding performance, vLLM's low TTFT, and considerations beyond performance. BentoML and BentoCloud are recommended tools for efficient AI model deployment.
How to evaluate performance of LLM inference frameworks
LLM inference frameworks face a "memory wall" limiting performance. Developers should choose frameworks wisely, apply optimizations cautiously, and structure applications for server or offline scenarios to enhance efficiency.
LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs
The paper presents an FPGA-based accelerator for large language models, achieving 14.3-15.8 times speedup and 6.1 times power efficiency, enhancing deployment in resource-constrained environments.
Everything I've learned so far about running local LLMs
Local Large Language Models (LLMs) now run on modest hardware, enhancing accessibility. The llama.cpp software simplifies usage, while Hugging Face offers various models. Understanding specifications is vital for optimization.
Llama.cpp guide – Running LLMs locally on any hardware, from scratch
The guide on SteelPh0enix's blog details running large language models locally using llama.cpp, highlighting hardware options, quantization benefits, setup instructions, and encouraging non-commercial self-hosting experimentation.
To the author, if you're lurking here, I have a tangential question: how long did it take you to write this article? From the first line of code to the last line of this post?
As someone who works in GPGPU space, I can imagine myself writing an article of this sort. But the huge uncertainty around time needed has deterred me so far.
Programming these is a nightmare as you need to have several in flight concurrently for peak performance.
Perhaps you don't need the extra flops as you end up bandwidth bound?
Regardless, the good thing about the code in the blog is that it'll probably work pretty well on other accelerators if you port it to HIP or similar. If you use wgmma, I'm not sure it'll even be portable across Nvidia generations.
BTW, see that section of the readme about quantization: https://github.com/Const-me/Cgml/tree/master?tab=readme-ov-f...
PS: I'd appreciate it if anyone can recommend more material like this.
Does anyone know of something similar in Python? I want to share with my team something similar to this that goes into (almost) everything needed, at least conceptually, to efficiently serve an LLM.
It doesn't actually need to be performant, mind you (it's in Python); I just need something "conceptually complete" that is more "tutorial style" and concise than the vLLM codebase.
The CUDA version I am working on is a fork of:
https://github.com/jameswdelancey/llama3.c
I am compiling and running it this way:
nvcc -ptx -arch=sm_86 rung.cu -o rung.ptx
gcc -I/opt/cuda/targets/x86_64-linux/include -L/opt/cuda/lib64 -O3 -g -lcuda -lcublas -lcudart -lm -lmvec -o rung rung.c
./rung "llama3_8b_base.bin" -z "tokenizer.bin" -t 0 -i "Once upon a time"
I am getting around 48 to 49 tokens per second with bf16 on my 3090 Ti from this. Oddly, llama.cpp only gets around 1 token per second with bf16, but it gets around 51 to 52 tokens per second with fp16. I suspect the performance gap would close if I use CUDA graphs.
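For what it's worth, the CUDA graph capture-and-replay pattern is small. Below is a minimal, hypothetical sketch of what wrapping the per-token decode launches might look like; decode_step and num_tokens are placeholder names, all work is assumed to be enqueued asynchronously on a single stream with no synchronous copies inside the captured region, error checking is omitted, and the cudaGraphInstantiate call uses the CUDA 12 signature (older toolkits take extra error-node/log-buffer arguments).

    /* Hypothetical sketch: capture one decode step into a CUDA graph, then
       replay it per token so the CPU issues one launch instead of dozens. */
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t     graph;
    cudaGraphExec_t graph_exec;

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    decode_step(stream);                    /* launches are recorded, not run */
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, 0);

    for (int t = 0; t < num_tokens; ++t) {
        /* token id, position etc. must live in fixed device buffers that the
           captured kernels read; update them before each replay             */
        cudaGraphLaunch(graph_exec, stream);
        cudaStreamSynchronize(stream);
    }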
The code is ugly in comparison to the original and I am not finished optimizing it, which is why I have not yet pushed it to a GitHub branch. I still have a few ways to reduce the amount of communication between the GPU and CPU per token before using CUDA graphs: using the batched API for computing K and V, prefilling an array of pointers for the batched API across all layer iterations (so that all CPU-to-GPU pointer copies for the batched API are aggregated), and a few kernel-fusion opportunities, especially in rmsnorm once I figure out how to calculate the sum of squares quickly without cuBLAS. I also have yet to try using Nvidia's profiler.
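On the rmsnorm sum of squares: one common pattern, offered here only as a hedged sketch rather than what this port will necessarily use, is to give each vector a single thread block and reduce with warp shuffles, which avoids cuBLAS entirely. It assumes fp32 activations and a block size that is a multiple of 32 (at most 1024).

    /* Hypothetical sum-of-squares kernel for rmsnorm: one block per vector,
       block-stride loop over the hidden dimension, warp-shuffle reduction. */
    __global__ void sum_of_squares(const float *x, float *out, int n) {
        __shared__ float warp_sums[32];

        float acc = 0.0f;
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            acc += x[i] * x[i];

        /* reduce within each warp */
        for (int offset = 16; offset > 0; offset >>= 1)
            acc += __shfl_down_sync(0xffffffffu, acc, offset);
        if ((threadIdx.x & 31) == 0)
            warp_sums[threadIdx.x >> 5] = acc;
        __syncthreads();

        /* first warp reduces the per-warp partial sums */
        if (threadIdx.x < 32) {
            acc = (threadIdx.x < blockDim.x / 32) ? warp_sums[threadIdx.x] : 0.0f;
            for (int offset = 16; offset > 0; offset >>= 1)
                acc += __shfl_down_sync(0xffffffffu, acc, offset);
            if (threadIdx.x == 0)
                *out = acc;
        }
    }

The normalization and weight multiply could then be fused into the same kernel so the vector is only read once more, which is one of the fusion opportunities mentioned above.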
Technically, CUDA C is really extended C++, but I am using a C-like subset of C++ for the device code. For the host code, I have written it in ANSI C. There are a few artifacts (the __host__ and __device__ keywords) from an earlier version where I had tried using CUDA C before I realized it was really extended C++ and not extended C, but those are trivial to remove (and will be removed before I push it to a branch). It should be possible to compile the device code with Clang instead of Nvidia’s compiler to use a fully open source toolchain, although I have yet to try it.
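For the Clang route, the device-only build would probably look something like this (untested here; --cuda-path just mirrors the include path used above, and Clang emits PTX when asked for assembly output, so the gcc host step can stay unchanged):
clang++ --cuda-device-only --cuda-gpu-arch=sm_86 --cuda-path=/opt/cuda -S rung.cu -o rung.ptx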
I have another fork here that uses the MKL to make the prompt processing portion run faster:
https://github.com/ryao/llama3.c
When I am happy with the CUDA version, I will either push it to a branch or to its own set of files in master.