July 19th, 2024

Kompute – Vulkan Alternative to CUDA


The GitHub repository for the Kompute project contains detailed information on its principles, features, architecture, and examples. It covers flexible Python modules, asynchronous and parallel processing, mobile optimization, memory management, unit testing, and advanced use cases. It also lists projects using Kompute and provides GPU multiplication examples, interactive notebooks, hands-on videos, and an explanation of its core components. Details about the Python package, the CMake-based C++ build system, development guidelines, and the motivations behind the project are also available. This comprehensive resource serves as a guide for understanding Kompute's capabilities and getting started with GPU compute tasks.

Related

Neko: Portable framework for high-order spectral element flow simulations

Neko is a portable framework for high-order spectral element flow simulations written in modern Fortran. It is object-oriented and supports various hardware, with detailed documentation, cloning guidelines, publications, and development acknowledgments. Additional support is available for inquiries.

Komorebi: Tiling Window Management for Windows

The "komorebi" project is a tiling window manager for Windows, extending Microsoft's Desktop Window Manager. It offers CLI control, installation guides, configuration details, and a supportive community for contributions and discussions.

Karpathy: Let's reproduce GPT-2 (1.6B): one 8XH100 node 24h $672 in llm.c

The GitHub repository focuses on the "llm.c" project by Andrej Karpathy, aiming to implement Large Language Models in C/CUDA without extensive libraries. It emphasizes pretraining GPT-2 and GPT-3 models.

gpu.cpp: A lightweight library for portable low-level GPU computation

The GitHub repository features gpu.cpp, a lightweight C++ library for portable GPU compute using WebGPU. It offers fast cycles, minimal dependencies, and examples like GELU kernel and matrix multiplication for easy integration.

Nvidia Warp (a Python framework for writing high-performance code)

Warp is a Python framework for high-performance simulation and graphics, compiling functions for CPU or GPU. It supports spatial computing, differentiable kernels, PyTorch, JAX, and USD file generation. Installation options include PyPI and CUDA.

10 comments
By @Conscat - 7 months
Vulkan has some advantages over OpenCL. You gain lower-level control over memory allocation and resource synchronization. ROCm has an infamous synchronization pessimization which doesn't exist in Vulkan. You can even explicitly allocate Vulkan resources at specific memory addresses, which means Vulkan can easily be used on embedded devices.

But some of the caveats for compute applications are currently:

- No bfloat16 in shaders

- No shader work graphs (GPU-driven shader control flow)

- No inline PTX (inline GCN/RDNA/GEN is available)

These may or may not be important to you. Vulkan recently gained an ability to seamlessly dispatch CUDA kernels if you need these in some places, but there aren't currently similar Vulkan extensions for HIP.

By @einpoklum - 7 months
This is _not_ an alternative to CUDA, nor to OpenCL. It has a high-level, opinionated API [1], which covers a part (a rather small part) of the API of each of those.

It may, _in principle_, have been developed - with much more work than has gone into it - into such an alternative; but I am actually not sure of that, since I have poor command of Vulkan. As someone who maintains C++ API wrappers for CUDA myself [2], I got suspicious, because I know that just doing that is a lot more code and a lot more work.

[1] - I assume it is opinionated to cater to CNN simulation for large language models, and basically not much more.

[2] - https://github.com/eyalroz/cuda-api-wrappers/

By @Remnant44 - 7 months
This looks great - I've been looking for a sustainable, cross-platform, cross-vendor GPU compute solution, and the alternatives are not really great. CUDA is Nvidia-only, Metal is Apple-only, etc. OpenCL has been the closest match, but it seems to be on the way out.

Does anyone have real world experience using Vulkan compute shaders versus, say, OpenCL? Does Kompute make things as straightforward as it seems?

By @pjmlp - 7 months
Alternatives can only become one if they support the same set of C, C++, Fortran, and PTX compiler backends, with a similar level of IDE integration, graphical GPGPU debugging, and frameworks.

Until then they are wannabe alternatives, for a subset of use cases, with lesser tooling.

It always feels like those proposing CUDA alternatives don't understand what they are trying to replace, and that is already the first error.

By @kcb - 7 months
A key component of CUDA is that the kernels are written in C/C++ and not some shader language you would only be familiar with if you were into graphics.
By @JackYoustra - 7 months
Anyone have a comparison to something like wgsl's compute shader mode over stuff like wgpu? I've never seriously written in either.
By @cowmix - 7 months
Pytorch already has Vulkan support -- and Kompute does not support Pytorch yet. That's going to slow adoption of this project.
By @axsaucedo - 7 months
Kompute author here - thank you very much for sharing our work!

If you are interested to learn more, do join the community through our discord here: https://discord.gg/MaH5Jv5zwv

For some background, this project started after seeing various renowned machine learning frameworks like Pytorch and Tensorflow integrating Vulkan as a backend. The Vulkan SDK offers a great low-level interface that enables highly specialized optimizations - however, it comes at the cost of highly verbose code, requiring 800-2000 lines of code before you can even begin writing application code. This has resulted in each of these projects having to implement the same baseline to abstract the non-compute-related features of the Vulkan SDK.

This large amount of non-standardised boilerplate can result in limited knowledge transfer, a higher chance of unique framework implementation bugs being introduced, etc. We are aiming to address this with Kompute. As of today, we are part of the Linux Foundation and slowly contributing to the cross-vendor GPGPU revolution.

Some of the key features / highlights of Kompute:

* C++ SDK with flexible Python package

* BYOV: bring-your-own-Vulkan design to play nice with existing Vulkan applications

* Asynchronous & parallel processing support through GPU family queues

* Explicit relationships for GPU and host memory ownership and memory management: https://kompute.cc/overview/memory-management.html

* Robust codebase with 90% unit test code coverage: https://kompute.cc/codecov/

* Mobile enabled via Android NDK across several architectures

Relevant blog posts:

Machine Learning: https://towardsdatascience.com/machine-learning-and-data-pro...

Mobile development: https://towardsdatascience.com/gpu-accelerated-machine-learn...

Game development (we need to update to Godot4): https://towardsdatascience.com/supercharging-game-developmen...

By @EVa5I7bHFq9mnYK - 7 months
Can't we make a chip that only does one thing: multiply and add a lot of 32x32 matrices in parallel? I think that would be enough for all AI needs and easy to program.
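The primitive this comment proposes can be sketched in NumPy (an illustration of the idea only, not any real hardware interface; all names here are hypothetical): a batched fused multiply-add over 32x32 tiles, onto which larger matrix multiplies can be decomposed by tiling.

```python
import numpy as np

def fma_32x32(a, b, c):
    """The proposed single primitive: D = A @ B + C on 32x32 tiles."""
    assert a.shape[-2:] == (32, 32) and b.shape[-2:] == (32, 32)
    return np.matmul(a, b) + c

def tiled_matmul(A, B, tile=32):
    """A larger square matmul expressed purely in terms of the tile primitive."""
    n = A.shape[0]
    assert A.shape == B.shape == (n, n) and n % tile == 0
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            acc = np.zeros((tile, tile), dtype=A.dtype)
            for k in range(0, n, tile):
                # Accumulate one tile of the output using only fused multiply-adds.
                acc = fma_32x32(A[i:i+tile, k:k+tile], B[k:k+tile, j:j+tile], acc)
            C[i:i+tile, j:j+tile] = acc
    return C
```

This is roughly what tensor-core-style units already expose in hardware, which is why the tiling loop above is the workhorse of most GPU GEMM kernels.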
By @ein0p - 7 months
All you really need from these in Transformers-dominated 2024 are GEMM and GEMV, plus fused RMS norm and some element-wise primitives to apply RoPE and residuals. And all of that must be brain-dead easy to install and access, and it should be cross-platform. And yet no such thing exists as far as I can tell.
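For concreteness, the non-GEMM primitives named in this comment are tiny. A hedged NumPy sketch (reference semantics only, using the common half-split RoPE convention; conventions vary between implementations):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMS norm: divide by the root-mean-square of the last axis, then scale.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def rope(x, pos, base=10000.0):
    # Rotary position embedding: rotate (x1, x2) pairs of dimensions
    # by position-dependent angles (half-split pairing convention).
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

def gemv(W, x):
    # GEMV (matrix-vector product) is what dominates token-by-token decoding.
    return W @ x
```

The hard part, as the comment notes, is not the math but shipping fused, fast versions of these on every platform behind one easy-to-install API.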