Initial CUDA Performance Lessons
Malte Skarupke discusses optimizing CUDA performance, highlighting memory coalescing, specialized hardware like tensor cores, and the importance of understanding CUDA memory types and prioritizing parallelism in algorithm design.
Malte Skarupke shares insights on optimizing CUDA performance, emphasizing that CUDA is essentially C++ with additional features. He highlights the importance of memory coalescing, where threads should access adjacent memory locations to enhance speed, contrasting it with traditional C++ practices. Skarupke notes that modern PCs, particularly those with GPUs, rely heavily on specialized hardware for performance, with tensor cores significantly boosting capabilities in tasks like deep learning. He categorizes CUDA memory into three types: normal, shared, and registers, explaining that registers can hold more data in total than shared memory and are crucial for performance. The author also discusses the concept of warps, where 32 threads execute the same instruction simultaneously, allowing for efficient data sharing. He advises prioritizing parallelism in algorithm design, suggesting that launching multiple threads for different tasks can lead to better performance. Skarupke concludes that effective CUDA programming requires a shift in mindset, focusing on maximizing GPU utilization rather than merely optimizing code.
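As a rough illustration of the coalescing point, here is a minimal sketch (not from the article; the kernel names are made up): adjacent threads reading adjacent elements let the hardware service a warp's 32 loads with a few wide memory transactions, while strided access scatters the same warp across many cache lines.

```
// Minimal sketch of coalesced vs. strided global-memory access (illustrative only).
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                        // thread i touches element i: coalesced
}

__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(1LL * i * stride) % n];   // the warp scatters across memory
}
```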
- CUDA is a C++ extension with specific performance considerations.
- Memory coalescing is essential for efficient thread execution.
- Specialized hardware, like tensor cores, significantly enhances GPU performance.
- Understanding different memory types in CUDA is crucial for optimization.
- Prioritizing parallelism in algorithm design leads to better performance outcomes.
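On the warp and register points in the summary above, a minimal sketch (not from the article) of how the 32 threads of a warp can share data directly through registers via the shuffle intrinsics, without touching shared or global memory:

```
// Warp-level sum: each lane holds one value in a register; __shfl_down_sync
// moves values between lanes of the same warp, so no shared memory is needed.
__inline__ __device__ float warp_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // lane 0 ends up holding the sum of all 32 lanes
}
```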
Related
Show HN: UNet diffusion model in pure CUDA
The GitHub content details optimizing a UNet diffusion model in C++/CUDA to match PyTorch's performance. It covers custom convolution kernels, forward pass improvements, backward pass challenges, and future optimization plans.
How to Optimize a CUDA Matmul Kernel for CuBLAS-Like Performance: A Worklog
The author optimizes a CUDA matrix multiplication kernel, achieving 93.7% of cuBLAS performance through techniques like memory coalescing and shared memory caching, emphasizing GPU performance in deep learning applications.
Fast Multidimensional Matrix Multiplication on CPU from Scratch
The article examines multidimensional matrix multiplication performance on CPUs using Numpy and C++. It discusses optimization techniques and challenges in replicating Numpy's efficiency, emphasizing the importance of memory access patterns.
CPU Dispatching: Make your code both portable and fast (2020)
CPU dispatching improves software performance and portability by allowing binaries to select code versions based on CPU features at runtime, with manual and compiler-assisted approaches enhancing efficiency, especially using SIMD instructions.
Zen, CUDA, and Tensor Cores, Part I: The Silicon
The article compares Zen, CUDA, and Tensor cores, highlighting their physical structures and complexities. Zen 4 cores are larger and more intricate than CUDA and Tensor cores, with measurement challenges noted.
There are 65,536 registers per SM, not per thread block. You can control that indirectly by making your block occupy the whole SM, but this presents its own problems.
NVIDIA hardware caps threads at 1,024 per block (2,048 per SM) and shared memory at 48 KB (64 KB) per SM. So if one thread block consumes all of that, or close to it, you end up with one thread block per SM. You usually don't want that, because it lowers your occupancy. Additionally, if the kernel you're running is not compute-bound and does not need all the registers or shared memory allocated to it, having fewer blocks on the SM can leave some compute resources idle. GPUs are designed to thrive on parallelism, and limiting the number of active blocks can cause underutilization of the SM's cores, leading to poor performance. Finally, if each thread block occupies an entire SM, you limit the scalability of your kernel to the number of SMs on the GPU. For example, if your GPU has 60 SMs and each block uses one SM, you can only run 60 blocks in parallel, even if the problem you're solving could benefit from more parallelism. This can reduce the efficiency of the GPU for very large problem sizes.
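A minimal sketch of how you can check this trade-off for a concrete kernel with the CUDA runtime's occupancy API (my_kernel and the 256-thread block size are placeholders):

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(float* data) { /* placeholder */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("registers/SM: %d, shared mem/SM: %zu bytes, max threads/SM: %d\n",
           prop.regsPerMultiprocessor, prop.sharedMemPerMultiprocessor,
           prop.maxThreadsPerMultiProcessor);

    // How many 256-thread blocks can co-reside on one SM, given this kernel's
    // register and shared-memory usage?
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel,
                                                  /*blockSize=*/256,
                                                  /*dynamicSMemSize=*/0);
    printf("resident blocks per SM at 256 threads/block: %d\n", blocks_per_sm);
    return 0;
}
```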
I can't help but feel that with CUDA we're taking on new constraints (32 threads in a warp, what?), which are begging to be unified at some point.
> My mental model [for GPU threads] is that you’ve got a bunch of container ships that can travel at 10% of the speed of light. You’re using them to ship goods around the world. They’re very fast so most of the work is in setting up your harbors so that you can load and unload these container-ships in fractions of a second so that it can sail to do the next thing. It’s not easy to feed these beasts, but if you do it right you can do huge chunks of work in almost no time.
For example, LLM inference on a desktop (where you don't have a dozen concurrent sessions from multiple users) is guaranteed to be memory-bound, fetching gigabytes of the model's tensors for each generated token. For use cases like that, specialized tensor cores deliver about the same performance as well-written compute shaders running on general-purpose GPU cores.
However, AVX-512 is way slower than GPUs, because modern GPUs have memory with very high bandwidth. In my desktop computer the system memory is dual-channel DDR5, which delivers 75 GB/s; the VRAM on the discrete GPU delivers 670 GB/s.
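To make the memory-bound point concrete with rough numbers (assuming, for illustration, ~14 GB of weights streamed per token, e.g. a 7B-parameter model in FP16): the upper bound on token rate is roughly bandwidth ÷ bytes-per-token, i.e. 670 GB/s ÷ 14 GB ≈ 48 tokens/s from VRAM versus 75 GB/s ÷ 14 GB ≈ 5 tokens/s from dual-channel DDR5. Compute barely enters into it, which is why tensor cores and well-written compute shaders land in about the same place for this workload.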
Also, the 9950X has 32 threads, but it is hyperthreaded, so it only has 16 actual cores; the correct scaling factor is 16 cores × 16 SIMD lanes. Anyway, the final number is 8.678 32-bit float TFLOPS.
The RTX 4090 has 82.58 32-bit TFLOPS according to Nvidia, but it also costs far more than the 9950X ($1,600 vs. $650), so I find this comparison rather odd.
So it costs 2.46x as much and delivers 9.5x the performance.
If you normalize for cost, the performance advantage is about 3.8x, which is roughly the same number Intel reported years ago when they debunked the whole "GPUs are 100x better" nonsense.
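For a back-of-the-envelope sanity check on those headline numbers (using approximate boost clocks): the 4090's 82.58 TFLOPS works out to 16,384 FP32 lanes × 2 FLOPs per FMA × ~2.52 GHz, and the cost-normalized comparison is (82.58 ÷ 8.678) ÷ ($1,600 ÷ $650) ≈ 9.5 ÷ 2.46 ≈ 3.8–3.9x the FP32 throughput per dollar.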
Anyway, I really hate the CUDA terminology where they refer to SIMD lanes as "threads".
There are also a lot of things to consider where either the CPU or the GPU has an advantage, such as the following (a quick sketch of the GPU-side intrinsics follows these lists):
GPU advantages:
- Hardware sin/cos support (with Nvidia at least)
- abs/saturate are often just instruction modifiers
- Scaling by small powers of 2 is often free
- 16-bit floats are fully supported
CPU advantages:
- Doubles run at full speed, and you can interleave them with floats if you only need them for a few calculations
- Access to a wide variety of integer sizes and bit-manipulation functions; the GPU has some of this, but not nearly as broad
- Lower-level programming model
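A minimal sketch of the GPU-side items above (the fast-math and half-precision intrinsics are standard CUDA intrinsics; the kernel itself is illustrative):

```
#include <cuda_fp16.h>

// Illustrative kernel touching the GPU advantages listed above: hardware
// sin/cos via fast-math intrinsics, a cheap saturate, a power-of-two scale,
// and native 16-bit float arithmetic.
__global__ void gpu_features_demo(const float* in, float* out, __half* out_h, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];
    float s = __sinf(x);                  // hardware-accelerated sine (SFU)
    float c = __cosf(x);                  // hardware-accelerated cosine
    float clamped = __saturatef(s + c);   // clamp to [0, 1]
    float scaled  = clamped * 0.25f;      // multiply by a small power of two

    out[i]   = scaled;
    out_h[i] = __hmul(__float2half(scaled), __float2half(2.0f));  // FP16 math
}
```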
It's interesting from the perspective of maintenance too. You can bet most constants like warp sizes will change, so you get into things like having profiles, autotuners, or not sweating the small stuff.
We went more extreme, and nowadays focus several layers up: by accepting the (high!) constant overheads of tools like RAPIDS cuDF, we get in exchange the ability to easily crank out code with good saturation on the newest GPUs that any data scientist can edit and extend. Likewise, they just need to understand basics like data movement and columnar analytics data representations to build GPU pipelines. We have ~1 CUDA kernel left and many years of higher-level code.
As an example, this is one of the core methods of our new graph query language GFQL (think Cypher on pandas/spark, with optional GPU runtime), and it gets Graph500-level performance on cheap GPUs just by being data-parallel with high saturation per step: https://github.com/graphistry/pygraphistry/blob/master/graph... . Despite ping-ponging a ton because cuDF doesn't (yet) coalesce GPU kernel calls, V1 competes surprisingly well, and is easy to maintain and extend.
Compare a 500W GPU to all the cores of a 500W CPU, please. I'm not expecting the CPU (say, a 192-core AMD that does fast AVX-512) to beat the GPU on all data-parallel workloads, but the gap won't be the silly sort shown in the graphs in this blog.
Or compare one SM to one CPU core; that comparison has merit as well.
Best yet, we're finally getting some CPUs (well, APUs...) with in-package RAM. That makes the comparison more interesting as well.