PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch
The paper presents PyGraph, which improves CUDA Graph deployment in PyTorch by mitigating their performance pitfalls, reducing overheads, and integrating with the compilation toolchain, showing significant improvements over PyTorch 2.
The paper titled "PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch" introduces a new approach to improving the deployment of CUDA Graphs within the PyTorch framework. CUDA Graphs, a feature of NVIDIA GPUs, minimize CPU launch overhead by capturing a series of GPU tasks as a directed acyclic graph (DAG) and replaying it. However, the static nature of these graphs can introduce its own costs, notably data-copy overheads that may degrade performance. The authors present PyGraph, which incorporates three key optimizations: it broadens the applicability of CUDA Graphs, reduces the overhead of copying GPU kernel parameters, and deploys CUDA Graphs selectively based on a cost-benefit analysis. PyGraph integrates with PyTorch's compilation toolchain, enabling effective use of CUDA Graphs without manual code changes. Evaluation across various machine learning benchmarks shows significant performance improvements over PyTorch 2.
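The capture-and-replay mechanism described above can be sketched with PyTorch's public `torch.cuda.CUDAGraph` API. This is a minimal illustration, not PyGraph itself: the workload (a single elementwise multiply), the tensor size, and the warm-up choreography follow the general pattern the PyTorch documentation recommends, and the whole thing degrades to a skip message when no CUDA device (or no torch install) is available.

```python
# Minimal sketch of CUDA Graph capture and replay in PyTorch.
# The workload here is illustrative; real uses capture many kernel
# launches so that a single replay amortizes CPU launch overhead.
try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except Exception:
    HAVE_CUDA = False

def capture_and_replay():
    if not HAVE_CUDA:
        return "skipped: CUDA/torch unavailable"

    # Graphs require static input/output tensors: the graph records
    # fixed device addresses, so we reuse the same buffers on replay.
    static_x = torch.randn(1024, device="cuda")

    # Warm up on a side stream first, as the PyTorch docs advise,
    # so lazy initialization does not get baked into the graph.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        y = static_x * 2.0
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = static_x * 2.0          # launches are recorded, not executed

    # Update the static input in place, then replay the recorded DAG.
    static_x.copy_(torch.ones(1024, device="cuda"))
    g.replay()
    expected = torch.full((1024,), 2.0, device="cuda")
    return "replayed" if torch.allclose(y, expected) else "mismatch"

print(capture_and_replay())
```

Note the in-place `copy_` into `static_x`: because the graph is static, fresh inputs must be copied into the captured buffers before each replay, which is exactly the data-copy overhead the paper identifies.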
- PyGraph enhances the deployment of CUDA Graphs in PyTorch.
- It addresses performance issues related to static graph structures and data copying.
- The approach includes optimizations for broader deployment and reduced overhead.
- PyGraph integrates with PyTorch's compilation toolchain for ease of use.
- Performance evaluations show substantial improvements over PyTorch 2.
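The selective-deployment idea in the bullets above can be illustrated with a toy cost model: replay a region as a CUDA Graph only when the launch overhead saved outweighs the extra parameter-copy cost the static graph introduces. The function, parameter names, and every constant below are made-up illustrative values, not the paper's actual analysis or measurements.

```python
# Hypothetical cost-benefit check for applying a CUDA Graph to a region.
# All overhead constants are invented for illustration only.
def should_use_cuda_graph(num_kernel_launches: int,
                          param_copy_bytes: int,
                          launch_overhead_us: float = 5.0,
                          copy_cost_us_per_kb: float = 0.2) -> bool:
    # Benefit: CPU launch overhead avoided by replaying one graph
    # instead of issuing each kernel launch individually.
    saved_us = num_kernel_launches * launch_overhead_us
    # Cost: copying fresh kernel parameters into the graph's
    # static buffers before every replay.
    copy_us = (param_copy_bytes / 1024) * copy_cost_us_per_kb
    return saved_us > copy_us

# A launch-bound region with many small kernels benefits...
print(should_use_cuda_graph(num_kernel_launches=200, param_copy_bytes=4096))            # True
# ...while a region dominated by large parameter copies may not.
print(should_use_cuda_graph(num_kernel_launches=2, param_copy_bytes=64 * 1024 * 1024))  # False
```

The point of the sketch is only the shape of the decision: a compiler-driven pass can make this call per region automatically, which is what lets PyGraph apply CUDA Graphs without manual code changes.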
Related
Show HN: UNet diffusion model in pure CUDA
The GitHub content details optimizing a UNet diffusion model in C++/CUDA to match PyTorch's performance. It covers custom convolution kernels, forward pass improvements, backward pass challenges, and future optimization plans.
PyTorch 2.4 Now Supports Intel GPUs for Faster Workloads
PyTorch 2.4 introduces support for Intel Data Center GPU Max Series, enhancing AI workloads with minimal code changes. Future updates in 2.5 will expand functionality and benchmarks, inviting community contributions.
Check if your performance intuition still works with CUDA
CUDA, developed by NVIDIA, enhances computational speed on GPUs for parallel processing. The article explores performance optimizations for mathematical operations, highlighting the benefits of single-precision floats and manual optimizations.
AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition
Sakana AI launched The AI CUDA Engineer, automating PyTorch to CUDA kernel conversion, achieving 10 to 100 times speedups. A dataset of 17,000 kernels supports further optimization, despite some challenges.
Introduction to Graph Transformers
Graph Transformers improve processing of graph-structured data by capturing long-range dependencies and integrating edge information, enabling efficient handling of large datasets in applications like protein folding and fraud detection.
Something worth exploring later would be getting better support for the rest of the CUDA Graphs feature set into PyTorch, such as conditional nodes.
Uday Bondhugula, the lead developer of the Pluto framework for polyhedral compilation, is also at IISc; his group has spun out a startup.
Nice to see IISc support cool stuff like this (including their ArtPark initiative).