April 24th, 2025

PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

The paper presents PyGraph, which improves CUDA Graph deployment in PyTorch by widening where graphs can be applied, reducing parameter-copy overhead, and integrating with the compilation toolchain, showing significant speedups over PyTorch 2.


The paper titled "PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch" introduces a new approach to deploying CUDA Graphs within the PyTorch framework. CUDA Graphs, an NVIDIA GPU feature, reduce CPU launch overhead by capturing a series of GPU tasks as a directed acyclic graph (DAG) and replaying it with a single launch. However, the static nature of these graphs introduces data-copy overheads that can outweigh the launch-overhead savings. The authors present PyGraph, which incorporates three key optimizations: it broadens the set of programs on which CUDA Graphs can be deployed, reduces the overhead of copying GPU kernel parameters, and applies CUDA Graphs selectively based on a cost-benefit analysis. PyGraph integrates transparently with PyTorch's compilation toolchain, so CUDA Graphs are used without manual code changes. Evaluation across a range of machine learning benchmarks shows significant performance improvements over PyTorch 2 (a minimal sketch of the underlying capture-and-replay pattern follows the summary points below).

- PyGraph enhances the deployment of CUDA Graphs in PyTorch.

- It addresses performance issues related to static graph structures and data copying.

- The approach includes optimizations for broader deployment and reduced overhead.

- PyGraph integrates with PyTorch's compilation toolchain for ease of use.

- Performance evaluations show substantial improvements over PyTorch 2.
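
For readers new to the mechanism, here is a minimal sketch of the capture-and-replay pattern that CUDA Graphs provide and that PyGraph builds on. It uses only the standard torch.cuda.graph API and assumes a CUDA-capable device; it is not code from the paper:

```python
import torch

# Preallocated buffers: CUDA Graphs capture fixed device addresses, so
# the same tensors must back every replay.
x = torch.randn(4096, device="cuda")
y = torch.empty_like(x)

def work():
    # A short sequence of small kernels; run eagerly, each one pays its
    # own CPU-side launch cost.
    y.copy_(x)
    y.clamp_(min=0.0)       # ReLU
    y.mul_(2.0).add_(1.0)

# PyTorch requires warming up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    work()
torch.cuda.current_stream().wait_stream(s)

# Capture once, then replay the whole DAG with a single CPU launch.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    work()

g.replay()
torch.cuda.synchronize()
```

A cost-benefit gate in the spirit of the paper's third optimization could be as simple as timing the eager work() against g.replay() and keeping the graph only when replay wins, though the paper's actual analysis is presumably more involved.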

6 comments
By @damnitbuilds - about 5 hours
Python can be used for many types of graphs. This package is for CUDA Graphs, so wouldn't "PyCudaGraph" be a better name?
By @infocollector - about 13 hours
The lack of a readily available, installable package (pip install pygraph has no relation to this paper, as far as I can tell) makes it difficult to fully assess the reproducibility and practical applicability of the work.
By @saagarjha - about 7 hours
This is neat, although it would be nice to see it merged into PyTorch instead of just a paper :) The key (beyond "obvious" optimizations like not running graphs that are measured to be slower) seems to be that graphs "bake in" parameters, and if those change then the graph needs to be thrown away. The solution is to add a level of indirection, so that what gets captured is a pointer that can remain constant while the data behind it changes. This also removes the need to copy in and out of a graph-captured buffer, because you can just swap out the pointer instead (sketched in code below). Of course there is overhead to this approach (I don't think the authors actually explore this much) in that you throw away information (divisibility, for example) that would allow for constructing better kernels, but often this is still worth it. (Or you could pass this information through too.)

Something worth exploring later would be getting better support for the rest of CUDA graphs into PyTorch, like conditional nodes.
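
For concreteness, here is a minimal sketch of the baseline behavior this comment describes, using the standard PyTorch CUDA Graphs API (this shows the stock copy-into-static-buffer pattern, not PyGraph's internals, and assumes a CUDA device):

```python
import torch

# Standard PyTorch CUDA Graphs baseline: the captured graph bakes in
# static_in's device address, so fresh data must be copied into that
# buffer before every replay.
static_in = torch.zeros(1 << 20, device="cuda")
static_out = torch.empty_like(static_in)

# Warm up on a side stream before capture, as PyTorch requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    torch.sin(static_in, out=static_out)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    torch.sin(static_in, out=static_out)   # captures static_in's pointer

for batch in (torch.randn(1 << 20, device="cuda") for _ in range(3)):
    static_in.copy_(batch)   # the extra copy the indirection would remove
    g.replay()               # the graph only ever sees static_in's address
torch.cuda.synchronize()
```

With one more level of indirection, replay could instead retarget the captured pointer at batch's existing storage and skip the copy_, at the cost the comment mentions: pointer-derived facts such as alignment or divisibility are no longer fixed at capture time.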

By @tho423i43234 - about 13 hours
Nice to see work by IISc show up on HN.

Uday Bondhugula, the lead developer of the Pluto framework for polyhedral compilation, is also at IISc; his group has spun out a startup,

https://www.polymagelabs.com/

Nice to see IISc support cool stuff like this (including their ArtPark initiative).

By @OutOfHere - about 12 hours
I don't see any source code.