June 28th, 2024

Show HN: UNet diffusion model in pure CUDA

The GitHub repository documents the journey of optimizing UNet diffusion model training in C++/CUDA to match PyTorch's performance. It covers the implementation of the required kernels, including custom convolution kernels written for speed. The forward pass dominates training time, so optimization there focuses on the residual blocks and convolutions; the backward pass poses its own challenges, chiefly the convolution computations and the memory loads required for gradient calculations. Planned future work includes further optimizing the convolution kernels, tackling the backward-pass bottlenecks, and improving other kernels such as attention. The project draws inspiration from Karpathy's work and Boehm's blog on CUDA kernel optimization, and the writeup traces the project's evolution, emphasizing the performance gains from the initial versions to the latest optimizations.
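The summary doesn't reproduce the repository's kernels, but to make the discussion concrete, here is a minimal CUDA sketch of the kind of direct 2D convolution forward kernel such a project starts from. The kernel name, the NCHW layout, and the fixed 3x3/padding-1 shape are illustrative assumptions, not the project's actual code:

```cuda
// Naive direct 2D convolution forward pass (NCHW, 3x3 filter, padding 1).
// One thread computes one output element. Illustrative baseline only.
__global__ void conv2d_forward_naive(
    const float* __restrict__ x,   // input,  (N, C_in, H, W)
    const float* __restrict__ w,   // weight, (C_out, C_in, 3, 3)
    const float* __restrict__ b,   // bias,   (C_out)
    float* __restrict__ y,         // output, (N, C_out, H, W)
    int N, int C_in, int C_out, int H, int W)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = N * C_out * H * W;
    if (idx >= total) return;

    // Decode the flat index into (n, co, oh, ow).
    int ow = idx % W;
    int oh = (idx / W) % H;
    int co = (idx / (W * H)) % C_out;
    int n  = idx / (W * H * C_out);

    float acc = b[co];
    for (int ci = 0; ci < C_in; ++ci)
        for (int kh = 0; kh < 3; ++kh)
            for (int kw = 0; kw < 3; ++kw) {
                int ih = oh + kh - 1;  // padding = 1
                int iw = ow + kw - 1;
                if (ih < 0 || ih >= H || iw < 0 || iw >= W) continue;
                acc += x[((n * C_in + ci) * H + ih) * W + iw]
                     * w[((co * C_in + ci) * 3 + kh) * 3 + kw];
            }
    y[idx] = acc;
}

// Launch example: one thread per output element.
// conv2d_forward_naive<<<(N*C_out*H*W + 255) / 256, 256>>>(
//     x, w, b, y, N, C_in, C_out, H, W);
```

A baseline like this reloads each input element up to nine times from global memory; tiling inputs into shared memory and reusing them across neighboring output elements is the usual next step for this kind of optimization work.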

Related

20x Faster Background Removal in the Browser Using ONNX Runtime with WebGPU

Running background removal in the browser with ONNX Runtime using WebGPU and WebAssembly achieves a 20x speedup, reducing server load, improving scalability, and keeping user data on the client. With WebGPU support, ONNX models run efficiently enough for near real-time performance, which IMG.LY uses to make its design tools more accessible and efficient.

From bare metal to a 70B model: infrastructure set-up and scripts

The Imbue team trained a 70B-parameter model that they report surpasses GPT-4o, and shared a guide to their infrastructure setup covering health checks, patches, and tests, along with how they addressed challenges such as synchronization, hardware failures, and bottlenecks.

AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x

Nscale explores AI model optimization through GEMM tuning on AMD MI300x GPUs, leveraging rocBLAS and hipBLASlt. The tuned GEMM kernels deliver up to a 7.2x increase in throughput along with reduced latency, gains that particularly benefit large models and overall processing efficiency.
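GEMM (general matrix multiply) is the operation being tuned here: the libraries ship many kernel variants, and tuning selects the fastest one for each matrix shape. As a rough sketch of where that applies, the snippet below issues a single FP32 GEMM through hipBLAS on an AMD GPU; the matrix sizes are made up, and this is a generic illustration of the call path rather than Nscale's benchmark code:

```cpp
// Illustrative sketch: one FP32 GEMM, C = alpha * A * B + beta * C,
// issued through hipBLAS on an AMD GPU. GEMM tuning searches the space
// of kernel implementations for the fastest one at a given (m, n, k).
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>
#include <vector>

int main() {
    const int m = 4096, n = 4096, k = 4096;  // made-up shape
    const float alpha = 1.0f, beta = 0.0f;

    // Device buffers, column-major as BLAS expects.
    float *dA, *dB, *dC;
    hipMalloc(&dA, sizeof(float) * m * k);
    hipMalloc(&dB, sizeof(float) * k * n);
    hipMalloc(&dC, sizeof(float) * m * n);

    std::vector<float> hA(size_t(m) * k, 1.0f), hB(size_t(k) * n, 1.0f);
    hipMemcpy(dA, hA.data(), sizeof(float) * m * k, hipMemcpyHostToDevice);
    hipMemcpy(dB, hB.data(), sizeof(float) * k * n, hipMemcpyHostToDevice);

    hipblasHandle_t handle;
    hipblasCreate(&handle);

    // The library picks a GEMM kernel for this shape; tuning (e.g. via
    // rocBLAS/hipBLASlt tooling) changes which kernel gets picked.
    hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N,
                 m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);

    hipDeviceSynchronize();
    hipblasDestroy(handle);
    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}
```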

ML from Scratch, Part 3: Backpropagation (2019)

The article explains backpropagation in neural networks, working through the underlying equations, matrix operations, and activation functions. It grounds the method in linear algebra and calculus, frames training as fitting a model by optimizing its parameters, and uses binary cross-entropy as the loss to minimize. The central computation is the iterative, layer-by-layer calculation of deltas and the gradients derived from them.
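For concreteness, the deltas and gradients computed iteratively follow the standard backpropagation recursion. In common notation (which may differ from the article's), with pre-activations z^(l) = W^(l) a^(l-1) + b^(l) and activations a^(l) = sigma(z^(l)):

```latex
% Binary cross-entropy loss at the output a^{(L)} for target y:
\mathcal{L} = -\bigl[\, y \log a^{(L)} + (1 - y) \log\bigl(1 - a^{(L)}\bigr) \,\bigr]

% Output-layer delta:
\delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot \sigma'\bigl(z^{(L)}\bigr)

% Deltas propagate backward, layer by layer:
\delta^{(l)} = \bigl( W^{(l+1)\top} \delta^{(l+1)} \bigr) \odot \sigma'\bigl(z^{(l)}\bigr)

% Parameter gradients follow directly from the deltas:
\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)}\, a^{(l-1)\top},
\qquad
\frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}
```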
