Initial CUDA Performance Lessons
Malte Skarupke discusses optimizing CUDA performance, highlighting memory coalescing, specialized hardware like tensor cores, and the importance of understanding CUDA memory types and prioritizing parallelism in algorithm design.
Malte Skarupke shares insights on optimizing CUDA performance, emphasizing that CUDA is essentially C++ with additional features. He highlights the importance of memory coalescing, where threads should access adjacent memory locations to enhance speed, contrasting it with traditional C++ practices. Skarupke notes that modern PCs, particularly those with GPUs, rely heavily on specialized hardware for performance, with tensor cores significantly boosting capabilities in tasks like deep learning. He categorizes CUDA memory into three types: normal, shared, and registers, explaining that registers can hold more data in total than shared memory and are crucial for performance. The author also discusses the concept of warps, where 32 threads execute the same instruction simultaneously, allowing for efficient data sharing. He advises prioritizing parallelism in algorithm design, suggesting that launching multiple threads for different tasks can lead to better performance. Skarupke concludes that effective CUDA programming requires a shift in mindset, focusing on maximizing GPU utilization rather than merely optimizing code.
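As a rough illustration of the coalescing point, here is a minimal sketch (not from the article; the kernel names are made up): adjacent threads reading adjacent elements let the hardware service a warp's 32 loads with a few wide memory transactions, while strided access scatters the same warp across many cache lines.

```
// Minimal sketch of coalesced vs. strided global-memory access (illustrative only).
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                        // thread i touches element i: coalesced
}

__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(1LL * i * stride) % n];   // the warp scatters across memory
}
```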
- CUDA is a C++ extension with specific performance considerations.
- Memory coalescing is essential for efficient thread execution.
- Specialized hardware, like tensor cores, significantly enhances GPU performance.
- Understanding different memory types in CUDA is crucial for optimization.
- Prioritizing parallelism in algorithm design leads to better performance outcomes.
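On the warp and register points in the summary above, a minimal sketch (not from the article) of how the 32 threads of a warp can share data directly through registers via the shuffle intrinsics, without touching shared or global memory:

```
// Warp-level sum: each lane holds one value in a register; __shfl_down_sync
// moves values between lanes of the same warp, so no shared memory is needed.
__inline__ __device__ float warp_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // lane 0 ends up holding the sum of all 32 lanes
}
```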
Related
Show HN: UNet diffusion model in pure CUDA
The GitHub content details optimizing a UNet diffusion model in C++/CUDA to match PyTorch's performance. It covers custom convolution kernels, forward pass improvements, backward pass challenges, and future optimization plans.
How to Optimize a CUDA Matmul Kernel for CuBLAS-Like Performance: A Worklog
The author optimizes a CUDA matrix multiplication kernel, achieving 93.7% of cuBLAS performance through techniques like memory coalescing and shared memory caching, emphasizing GPU performance in deep learning applications.
Fast Multidimensional Matrix Multiplication on CPU from Scratch
The article examines multidimensional matrix multiplication performance on CPUs using Numpy and C++. It discusses optimization techniques and challenges in replicating Numpy's efficiency, emphasizing the importance of memory access patterns.
CPU Dispatching: Make your code both portable and fast (2020)
CPU dispatching improves software performance and portability by allowing binaries to select code versions based on CPU features at runtime, with manual and compiler-assisted approaches enhancing efficiency, especially using SIMD instructions.
Zen, CUDA, and Tensor Cores, Part I: The Silicon
The article compares Zen, CUDA, and Tensor cores, highlighting their physical structures and complexities. Zen 4 cores are larger and more intricate than CUDA and Tensor cores, with measurement challenges noted.
There are 65,536 registers per SM, not per thread block. You can control that indirectly by making your block occupy the whole SM, but this presents its own problems.
NVIDIA hardware caps threads at 1,024 per block (2,048 per SM) and shared memory at 48 KB (64 KB) per SM. So if one thread block consumes all of that, or close to it, you end up with one thread block per SM. You usually don't want that, because it lowers your occupancy. Additionally, if the kernel you're running is not compute-bound and does not need all the registers or shared memory allocated to it, having fewer blocks on the SM can leave some compute resources idle. GPUs are designed to thrive on parallelism, and limiting the number of active blocks can cause underutilization of the SM's cores, leading to poor performance. Finally, if each thread block occupies an entire SM, you limit the scalability of your kernel to the number of SMs on the GPU. For example, if your GPU has 60 SMs and each block uses one SM, you can only run 60 blocks in parallel, even if the problem you're solving could benefit from more parallelism. This can reduce the efficiency of the GPU for very large problem sizes.
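A minimal sketch of how you can check this trade-off for a concrete kernel with the CUDA runtime's occupancy API (my_kernel and the 256-thread block size are placeholders):

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(float* data) { /* placeholder */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("registers/SM: %d, shared mem/SM: %zu bytes, max threads/SM: %d\n",
           prop.regsPerMultiprocessor, prop.sharedMemPerMultiprocessor,
           prop.maxThreadsPerMultiProcessor);

    // How many 256-thread blocks can co-reside on one SM, given this kernel's
    // register and shared-memory usage?
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel,
                                                  /*blockSize=*/256,
                                                  /*dynamicSMemSize=*/0);
    printf("resident blocks per SM at 256 threads/block: %d\n", blocks_per_sm);
    return 0;
}
```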
I can't help but feel that with CUDA we're taking on new constraints (32 threads in a warp, what?), which are begging to be unified at some point.
> My mental model [for GPU threads] is that you’ve got a bunch of container ships that can travel at 10% of the speed of light. You’re using them to ship goods around the world. They’re very fast so most of the work is in setting up your harbors so that you can load and unload these container-ships in fractions of a second so that it can sail to do the next thing. It’s not easy to feed these beasts, but if you do it right you can do huge chunks of work in almost no time.
For example, LLM inference on a desktop (where you don't have a dozen concurrent sessions from multiple users) is guaranteed to be memory-bound, fetching gigabytes of the model's tensors for each generated token. For use cases like that, specialized tensor cores deliver about the same performance as well-written compute shaders running on general-purpose GPU cores.
However, AVX-512 is way slower than GPUs, because modern GPUs have memory with very high bandwidth. In my desktop computer the system memory is dual-channel DDR5, which delivers 75 GB/s; the VRAM on the discrete GPU delivers 670 GB/s.
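To make the memory-bound point concrete with rough numbers (assuming, for illustration, ~14 GB of weights streamed per token, e.g. a 7B-parameter model in FP16): the upper bound on token rate is roughly bandwidth ÷ bytes-per-token, i.e. 670 GB/s ÷ 14 GB ≈ 48 tokens/s from VRAM versus 75 GB/s ÷ 14 GB ≈ 5 tokens/s from dual-channel DDR5. Compute barely enters into it, which is why tensor cores and well-written compute shaders land in about the same place for this workload.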
Also, the 9950X has 32 threads, but it is hyperthreaded, so it only has 16 actual cores; the correct scaling factor is 16 cores × 16 SIMD lanes. Anyway, the final number is 8.678 32-bit float TFLOPS.
The RTX 4090 has 82.58 32-bit TFLOPS according to Nvidia, but it also costs far more than the 9950X ($1,600 vs. $650), so I find this comparison rather odd.
So it costs 2.46x as much and delivers 9.5x the performance.
If you normalize for cost, the performance advantage is about 3.8x, which is roughly the same number Intel reported years ago when they debunked the whole "GPUs are 100x better" nonsense.
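For a back-of-the-envelope sanity check on those headline numbers (using approximate boost clocks): the 4090's 82.58 TFLOPS works out to 16,384 FP32 lanes × 2 FLOPs per FMA × ~2.52 GHz, and the cost-normalized comparison is (82.58 ÷ 8.678) ÷ ($1,600 ÷ $650) ≈ 9.5 ÷ 2.46 ≈ 3.8–3.9x the FP32 throughput per dollar.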
Anyway, I really hate the CUDA terminology where they refer to SIMD lanes as "threads".
There are also a lot of things to consider where either the CPU or the GPU has an advantage, such as the following (a quick sketch of the GPU-side intrinsics follows these lists):
GPU advantages:
- Hardware sin/cos support (with Nvidia at least)
- abs/saturate are often just instruction modifiers
- Scaling by small powers of 2 is often free
- 16-bit floats are fully supported
CPU advantages:
- Doubles run at full speed, and you can interleave them with floats if you only need them for a few calculations
- Access to a wide variety of integer sizes and bit-manipulation functions; the GPU has some of this, but not nearly as broad
- Lower-level programming model
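A minimal sketch of the GPU-side items above (the fast-math and half-precision intrinsics are standard CUDA intrinsics; the kernel itself is illustrative):

```
#include <cuda_fp16.h>

// Illustrative kernel touching the GPU advantages listed above: hardware
// sin/cos via fast-math intrinsics, a cheap saturate, a power-of-two scale,
// and native 16-bit float arithmetic.
__global__ void gpu_features_demo(const float* in, float* out, __half* out_h, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];
    float s = __sinf(x);                  // hardware-accelerated sine (SFU)
    float c = __cosf(x);                  // hardware-accelerated cosine
    float clamped = __saturatef(s + c);   // clamp to [0, 1]
    float scaled  = clamped * 0.25f;      // multiply by a small power of two

    out[i]   = scaled;
    out_h[i] = __hmul(__float2half(scaled), __float2half(2.0f));  // FP16 math
}
```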
It's interesting from the perspective of maintenance too. You can bet most constants like warp sizes will change, so you get into things like having profiles, autotuners, or not sweating the small stuff.
We went more extreme, and nowadays focus several layers up: by accepting the (high!) constant overheads of tools like RAPIDS cuDF, we get in exchange the ability to easily crank out code with good saturation on the newest GPUs that any data scientist can edit and extend. Likewise, they just need to understand basics like data movement and columnar analytics data representations to build GPU pipelines. We have ~1 CUDA kernel left and many years of higher-level code.
As an example, this is one of the core methods of our new graph query language GFQL (think Cypher on pandas/spark, with optional GPU runtime), and it gets Graph500-level performance on cheap GPUs just by being data-parallel with high saturation per step: https://github.com/graphistry/pygraphistry/blob/master/graph... . Despite ping-ponging a ton because cuDF doesn't (yet) coalesce GPU kernel calls, V1 competes surprisingly well, and is easy to maintain and extend.
Compare a 500W GPU to all the cores of a 500W CPU, please. I'm not expecting the CPU (say, a 192-core AMD that does fast AVX-512) to beat the GPU on all data-parallel workloads, but the gap won't be the silly sort shown in the graphs in this blog.
Or compare one SM to one CPU core; that comparison has merit as well.
Best yet, we're finally getting some CPUs (well, APUs...) with in-package RAM. That makes the comparison more interesting as well.