February 20th, 2025

AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition

Sakana AI launched The AI CUDA Engineer, automating PyTorch to CUDA kernel conversion, achieving 10 to 100 times speedups. A dataset of 17,000 kernels supports further optimization, despite some challenges.

Read original articleLink Icon
AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition

Sakana AI has introduced The AI CUDA Engineer, an innovative framework designed to automate the discovery and optimization of CUDA kernels, significantly enhancing the efficiency of AI systems. This framework leverages advanced large language models (LLMs) to convert standard PyTorch code into optimized CUDA kernels, achieving speedups of 10 to 100 times over traditional implementations. The process involves several stages: translating PyTorch code into CUDA, applying evolutionary optimization techniques, and maintaining an Innovation Archive that builds on past successful kernels. The AI CUDA Engineer has demonstrated its capability to outperform existing CUDA kernels, achieving state-of-the-art performance in various machine learning operations. A dataset of over 17,000 verified kernels has been released, which can be utilized for further optimization and fine-tuning of AI models. Despite its advancements, the framework faces challenges, including the potential for exploiting verification processes and limitations in utilizing advanced GPU features. Sakana AI envisions a future where AI systems can achieve efficiencies comparable to human intelligence, emphasizing the importance of using AI to enhance AI development.

- The AI CUDA Engineer automates the conversion of PyTorch code to optimized CUDA kernels.

- Speedups of 10 to 100 times over traditional implementations have been achieved.

- A dataset of over 17,000 verified CUDA kernels has been released for further research.

- The framework faces challenges related to verification and advanced GPU feature utilization.

- Sakana AI aims to make AI systems as efficient as human intelligence through this technology.

Link Icon 2 comments
By @ragnarok451 - 2 months
This was debunked - the agent was actually fooling the verification harness https://x.com/SakanaAILabs/status/1892992938013270019. One particular test that showed a 150x speedup is actually 3x slower.
By @01100011 - 2 months
Nvidia is doing work like this internally: https://developer.nvidia.com/blog/automating-gpu-kernel-gene...