December 14th, 2024

Fast LLM Inference From Scratch (using CUDA)

The article walks through building a C++ and CUDA-based LLM inference engine from scratch, covering single-GPU throughput optimizations, the central role of memory bandwidth, and benchmarks against existing engines.


This article discusses the development of a large language model (LLM) inference engine written in C++ and CUDA, focusing on single-GPU inference throughput without relying on existing ML libraries. The author emphasizes the importance of understanding the full stack of LLM inference, particularly as models are increasingly deployed on consumer devices. The implementation aims to load weights from common open models and achieve high token throughput, with the goal of matching or surpassing existing engines such as llama.cpp. The article recaps the LLM architecture and the inference process, including the prefill and decode steps, and highlights memory bandwidth as the dominant factor in inference performance, since modern hardware is typically memory-bandwidth-bound rather than compute-bound. The author presents benchmarks comparing various inference engines, showing that optimizations such as multithreading and weight quantization can significantly improve throughput, and discusses the challenges of efficient CPU inference, including the use of SIMD and the benefits of quantization. Overall, the piece serves as a guide for developers interested in building efficient LLM inference engines from scratch.
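To make the prefill/decode distinction concrete: during decode a single sequence is extended one token at a time, so the dominant operation is a matrix-vector multiply in which every weight is read from GPU memory once per generated token. A minimal sketch of that memory-bound kernel (mine, not the article's code):

    // Sketch only: the core of the decode step is y = W x. Each weight of W is
    // streamed from GPU memory exactly once per token, so this loop is memory-bound.
    #include <cuda_runtime.h>

    __global__ void matvec_f32(const float *W, const float *x, float *y,
                               int rows, int cols) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output row
        if (row >= rows) return;
        float acc = 0.0f;
        for (int j = 0; j < cols; ++j)
            acc += W[row * cols + j] * x[j];              // reads a full row of W
        y[row] = acc;
    }

    // Launch: matvec_f32<<<(rows + 255) / 256, 256>>>(d_W, d_x, d_y, rows, cols);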

- The article focuses on building an LLM inference engine in C++ and CUDA without existing ML libraries.

- It emphasizes the importance of understanding LLM inference mechanics and optimizations.

- Memory bandwidth is identified as the critical factor limiting inference performance (see the back-of-envelope sketch after this list).

- The author presents benchmarks comparing different inference engines and highlights optimization techniques.

- The implementation aims to achieve high token throughput on consumer devices.
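The memory-bandwidth point can be checked with a back-of-envelope bound. The numbers below are illustrative assumptions (an 8B-parameter model in 16-bit weights on a GPU with roughly 1 TB/s of memory bandwidth), not figures taken from the article:

    // Upper bound on decode throughput if every weight must be read once per token.
    #include <cstdio>

    int main() {
        double params           = 8e9;    // assumed 8B-parameter model
        double bytes_per_weight = 2.0;    // fp16 / bf16
        double bandwidth        = 1.0e12; // ~1 TB/s GPU memory bandwidth (assumed)
        double bytes_per_token  = params * bytes_per_weight;
        printf("upper bound: %.1f tokens/sec\n", bandwidth / bytes_per_token);
        // Prints ~62.5 tokens/sec, the same ballpark as the ~50 tok/s numbers
        // reported in the comments below for an 8B model on a 3090 Ti.
        return 0;
    }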

Related

Benchmarking LLM Inference Back Ends: VLLM, LMDeploy, MLC-LLM, TensorRT-LLM, TGI

Selecting the right inference backend for large language models is crucial for user experience and cost efficiency. A benchmark study by BentoML compared various backends, highlighting LMDeploy's decoding performance, vLLM's low TTFT, and considerations beyond performance. BentoML and BentoCloud are recommended tools for efficient AI model deployment.

How to evaluate performance of LLM inference frameworks

LLM inference frameworks face a "memory wall" limiting performance. Developers should choose frameworks wisely, apply optimizations cautiously, and structure applications for server or offline scenarios to enhance efficiency.

LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs

The paper presents an FPGA-based accelerator for large language models, achieving 14.3-15.8 times speedup and 6.1 times power efficiency, enhancing deployment in resource-constrained environments.

Everything I've learned so far about running local LLMs

Local Large Language Models (LLMs) now run on modest hardware, enhancing accessibility. The llama.cpp software simplifies usage, while Hugging Face offers various models. Understanding specifications is vital for optimization.

Llama.cpp guide – Running LLMs locally on any hardware, from scratch

The guide on SteelPh0enix's blog details running large language models locally using llama.cpp, highlighting hardware options, quantization benefits, setup instructions, and encouraging non-commercial self-hosting experimentation.

12 comments
By @reasonableklout - 4 months
Hi, I'm the author. Thanks for sharing, was great to wake up to my blog post on the front page! Would love to hear any feedback or if I missed anything.
By @shihab - 4 months
Excellent, amazing article.

To the author, if you're lurking here, I have a tangential question- how long did it take you to write this article? From first line of code to the last line of this post?

As someone who works in GPGPU space, I can imagine myself writing an article of this sort. But the huge uncertainty around time needed has deterred me so far.

By @fancyfredbot - 4 months
I don't think this code can make use of the tensor cores, or the wgmma instructions that you typically need to get peak performance out of them.

Programming these is a nightmare as you need to have several in flight concurrently for peak performance.

Perhaps you don't need the extra flops as you end up bandwidth bound?

Regardless, the good thing about the code in the blog is that it'll probably work pretty well for other accelerators if you port it to HIP or similar. If you use wgmma, I'm not sure it'll even be portable across Nvidia generations.
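For readers who haven't touched the tensor-core APIs being discussed: the portable (pre-Hopper) route is the wmma intrinsics rather than wgmma. The sketch below is not from the post; it has one warp compute a single 16x16x16 fp16 tile and requires sm_70 or newer:

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes a 16x16 tile of C = A * B on the tensor cores.
    __global__ void wmma_tile(const half *A, const half *B, float *C) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);
        wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
    }

    // Launch with at least one full warp: wmma_tile<<<1, 32>>>(dA, dB, dC);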

By @Const-me - 4 months
I wonder how the perf in tokens/second compares to my version of Mistral: https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral...

BTW, see that section of the readme about quantization: https://github.com/Const-me/Cgml/tree/master?tab=readme-ov-f...

By @DigitalNoumena - 4 months
Great post! I've been looking to get into the guts of large scale model training (I'm half-way between the design and application layer of LLMs, mostly in python, sometimes a bit of c++) and this will be a great reference to have.

PS. appreciate it if anyone can recommend more material like this

By @diego898 - 4 months
This is great thank you!

Does anyone know of something similar in Python? I want to share with my team something similar to this that goes into (almost) everything (at least conceptually) needed to efficiently serve an LLM.

It doesn’t actually need to be performant, mind you (it’s in Python); I just need something “conceptually complete” that is more “tutorial style” and concise than the vLLM codebase.

By @guerrilla - 4 months
What are the prerequisites for this kind of thing? I've written ANNs back in college and understood backpropagation and gradient descent at some point, but I don't know most of the terms mentioned in the architectural overview. How big of an investment is this?
By @saagarjha - 4 months
Isn’t __shfl_down not recommended these days because of warp synchronization issues?
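For context: the mask-free __shfl_down was deprecated when Volta introduced independent thread scheduling, in favor of the _sync variants that take an explicit participation mask. A minimal warp-sum sketch (not from the article) using the recommended form:

    // Sum a value across a warp using the _sync shuffle (mask = all 32 lanes).
    __device__ float warp_reduce_sum(float val) {
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffffu, val, offset);
        return val;  // lane 0 holds the full sum
    }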
By @sakex - 4 months
Great article. I think what you should cover next is collective matmuls and sharding.
By @ryao - 4 months
I notice that the CUDA example uses C++. I had not planned to publish my own work in this area yet (as I am still working on it), but if anyone wants fast llama inference using CUDA in C, here are some files:

https://bpa.st/CA6A

https://bpa.st/WOSA

It is a fork of:

https://github.com/jameswdelancey/llama3.c

I am compiling and running it this way:

nvcc -ptx -arch=sm_86 rung.cu -o rung.ptx

gcc -I/opt/cuda/targets/x86_64-linux/include -L/opt/cuda/lib64 -O3 -g -lcuda -lcublas -lcudart -lm -lmvec -o rung rung.c

./rung "llama3_8b_base.bin" -z "tokenizer.bin" -t 0 -i "Once upon a time"

I am getting around 48 to 49 tokens per second with bf16 on my 3090 Ti from this. Oddly, llama.cpp only gets around 1 token per second with bf16, but it gets around 51 to 52 tokens per second with fp16. I suspect the performance gap would close if I use CUDA graphs.
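For reference, a rough sketch (not the fork's code) of the CUDA-graphs idea mentioned here: capture the per-token kernel launches into a graph once, then replay the whole graph with a single launch per token to cut CPU-side launch overhead. The kernel below is a hypothetical stand-in for the real decode step:

    #include <cuda_runtime.h>

    __global__ void decode_step_stub(float *state) { /* stand-in for per-token kernels */ }

    void decode_with_graph(float *d_state, int n_tokens) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaGraph_t graph;
        cudaGraphExec_t exec;

        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        decode_step_stub<<<1, 256, 0, stream>>>(d_state);  // all per-token launches go here
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

        for (int t = 0; t < n_tokens; ++t)
            cudaGraphLaunch(exec, stream);                 // one CPU launch per token
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(exec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
    }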

The code is ugly in comparison to the original and I am not finished optimizing it, which is why I have not yet pushed it to a GitHub branch. I still have a few ways to reduce the amount of communication between the GPU and CPU per token, before using CUDA graphs. In particular, using the batched API for computing K and V, prefilling an array of pointers for the batched API across all layer iterations (such that all CPU to GPU pointer copies for the batched API are aggregated) and a few opportunities for kernel fusion, especially in rmsnorm after I figure out how to calculate the sum of squares quickly without cublas. I also have yet to try using Nvidia’s profiler.
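One common cuBLAS-free pattern for that sum of squares (again a sketch, not the fork's code) is a grid-stride reduction: partial sums in shared memory per block, then one atomicAdd per block into a zero-initialized accumulator:

    #include <cuda_runtime.h>

    // Accumulates sum(x[i]^2) into *out; zero *out beforehand and launch with
    // 256 threads per block, e.g. sum_of_squares<<<64, 256>>>(d_x, d_out, n);
    __global__ void sum_of_squares(const float *x, float *out, int n) {
        __shared__ float partial[256];
        float acc = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            acc += x[i] * x[i];                            // grid-stride loop over x
        partial[threadIdx.x] = acc;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {     // tree reduction in shared memory
            if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) atomicAdd(out, partial[0]);  // one atomic per block
    }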

Technically, CUDA C is really extended C++, but I am using a C-like subset of C++ for the device code. For the host code, I have written it in ANSI C. There are a few artifacts (the __host__ and __device__ keywords) from an earlier version where I had tried using CUDA C before I realized it was really extended C++ and not extended C, but those are trivial to remove (and will be removed before I push it to a branch). It should be possible to compile the device code with Clang instead of Nvidia’s compiler to use a fully open source toolchain, although I have yet to try it.

I have another fork here that uses the MKL to make the prompt processing portion run faster:

https://github.com/ryao/llama3.c

When I am happy with the CUDA version, I will either push it to a branch or to its own set of files in master.

By @blastbking - 4 months
super cool post!!