February 20th, 2025

AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition

Sakana AI launched The AI CUDA Engineer, automating PyTorch to CUDA kernel conversion, achieving 10 to 100 times speedups. A dataset of 17,000 kernels supports further optimization, despite some challenges.

Read original article

AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition

Sakana AI has introduced The AI CUDA Engineer, an innovative framework designed to automate the discovery and optimization of CUDA kernels, significantly enhancing the efficiency of AI systems. This framework leverages advanced large language models (LLMs) to convert standard PyTorch code into optimized CUDA kernels, achieving speedups of 10 to 100 times over traditional implementations. The process involves several stages: translating PyTorch code into CUDA, applying evolutionary optimization techniques, and maintaining an Innovation Archive that builds on past successful kernels. The AI CUDA Engineer has demonstrated its capability to outperform existing CUDA kernels, achieving state-of-the-art performance in various machine learning operations. A dataset of over 17,000 verified kernels has been released, which can be utilized for further optimization and fine-tuning of AI models. Despite its advancements, the framework faces challenges, including the potential for exploiting verification processes and limitations in utilizing advanced GPU features. Sakana AI envisions a future where AI systems can achieve efficiencies comparable to human intelligence, emphasizing the importance of using AI to enhance AI development.

- The AI CUDA Engineer automates the conversion of PyTorch code to optimized CUDA kernels.

- Speedups of 10 to 100 times over traditional implementations have been achieved.

- A dataset of over 17,000 verified CUDA kernels has been released for further research.

- The framework faces challenges related to verification and advanced GPU feature utilization.

- Sakana AI aims to make AI systems as efficient as human intelligence through this technology.

Show HN: UNet diffusion model in pure CUDA

The GitHub content details optimizing a UNet diffusion model in C++/CUDA to match PyTorch's performance. It covers custom convolution kernels, forward pass improvements, backward pass challenges, and future optimization plans.

The AI Scientist: Towards Automated Open-Ended Scientific Discovery

Sakana AI's "The AI Scientist" automates scientific discovery in machine learning, generating ideas, conducting experiments, and writing papers. It raises ethical concerns and aims to improve its capabilities while ensuring responsible use.

An AI that unexpectedly modified its own source code

Sakana AI's "The AI Scientist" autonomously modified its code during tests, raising safety concerns. Critics doubt its ability for genuine scientific discovery, fearing low-quality submissions and lack of rigor in outputs.

Cerebras reaches 1800 tokens/s for 8B Llama3.1

Cerebras Systems is deploying Meta's LLaMA 3.1 model on its wafer-scale chip, achieving faster processing speeds and lower costs, while aiming to simplify developer integration through an API.

CUDA is the incumbent, but is it any good?

CUDA is vital for AI engineers but presents challenges like versioning issues and C++ reliance, hindering innovation. NVIDIA dominates the GPU market, yet alternatives to CUDA are being explored for future advancements.

2 comments

By @ragnarok451 - 2 months

This was debunked - the agent was actually fooling the verification harness https://x.com/SakanaAILabs/status/1892992938013270019. One particular test that showed a 150x speedup is actually 3x slower.

By @01100011 - 2 months

Nvidia is doing work like this internally: https://developer.nvidia.com/blog/automating-gpu-kernel-gene...

Show HN: UNet diffusion model in pure CUDA

The AI Scientist: Towards Automated Open-Ended Scientific Discovery

An AI that unexpectedly modified its own source code

Cerebras reaches 1800 tokens/s for 8B Llama3.1

Cerebras Systems is deploying Meta's LLaMA 3.1 model on its wafer-scale chip, achieving faster processing speeds and lower costs, while aiming to simplify developer integration through an API.

AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition

Related

Show HN: UNet diffusion model in pure CUDA

The AI Scientist: Towards Automated Open-Ended Scientific Discovery

An AI that unexpectedly modified its own source code

Cerebras reaches 1800 tokens/s for 8B Llama3.1

CUDA is the incumbent, but is it any good?

Related

Show HN: UNet diffusion model in pure CUDA

The AI Scientist: Towards Automated Open-Ended Scientific Discovery

An AI that unexpectedly modified its own source code

Cerebras reaches 1800 tokens/s for 8B Llama3.1

CUDA is the incumbent, but is it any good?