Run CUDA, unmodified, on AMD GPUs
SCALE is a GPGPU programming toolkit that lets CUDA applications run on AMD GPUs without modification. It mimics the NVIDIA CUDA Toolkit installation, targets multiple GPU vendors, and aims for full CUDA compatibility while offering optional language extensions.
SCALE is a GPGPU programming toolkit that enables CUDA applications to be compiled for AMD GPUs without modifying the original CUDA program or its build system. It supports multiple GPU vendors and CUDA APIs, with ongoing development to expand compatibility. SCALE accepts CUDA programs as-is, with no language porting required, and mimics the NVIDIA CUDA Toolkit installation so that existing build tools keep working. The toolkit has been tested against open-source projects such as NVIDIA Thrust and Blender Cycles to validate compatibility and functionality. Supported GPUs currently include AMD gfx1030 and gfx1100, with plans to extend support to other models. SCALE comprises an nvcc-compatible compiler, implementations of the CUDA runtime and driver APIs for AMD GPUs, and wrapper libraries for CUDA-X APIs. Unlike other solutions, SCALE compiles CUDA code directly for AMD GPUs, aiming for full compatibility with NVIDIA CUDA while offering optional language extensions for improved efficiency. Users can contact the developers for support or to request additional features.
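To make the claim concrete (my illustration, not an example from the SCALE announcement), this is the shape of ordinary, unmodified CUDA source that such a toolkit has to accept: a plain .cu file using the CUDA runtime API and the usual <<<>>> launch syntax.

    // vecadd.cu -- ordinary CUDA; nothing here is AMD- or SCALE-specific.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Host buffers.
        float *ha = new float[n], *hb = new float[n], *hc = new float[n];
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        // Device buffers and copies via the CUDA runtime API.
        float *da, *db, *dc;
        cudaMalloc((void**)&da, bytes);
        cudaMalloc((void**)&db, bytes);
        cudaMalloc((void**)&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

        printf("c[0] = %f\n", hc[0]);  // expect 3.0
        cudaFree(da); cudaFree(db); cudaFree(dc);
        delete[] ha; delete[] hb; delete[] hc;
        return 0;
    }

With NVIDIA's toolchain this builds as nvcc vecadd.cu -o vecadd; the article's claim is that the same source and the same nvcc-style invocation go through SCALE's nvcc-compatible compiler unchanged, producing a binary for a supported AMD gfx target.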
Related
AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x
Nscale explores AI model optimization through GEMM tuning, leveraging rocBLAS and hipBLASlt for AMD MI300x GPUs. Results show up to 7.2x throughput increase and reduced latency, benefiting large models and enhancing processing efficiency.
GPUs can now use PCIe-attached memory or SSDs to boost VRAM capacity
Major companies like AMD, Intel, and Nvidia are considering supporting Panmnesia's CXL IP for GPU memory expansion using PCIe-attached memory or SSDs. Panmnesia's low-latency solution outperforms traditional methods, showing promise for AI/HPC applications. Adoption by key players remains uncertain.
gpu.cpp: A lightweight library for portable low-level GPU computation
The GitHub repository features gpu.cpp, a lightweight C++ library for portable GPU compute using WebGPU. It offers fast cycles, minimal dependencies, and examples like GELU kernel and matrix multiplication for easy integration.
- Concerns about compatibility and performance: Many commenters doubt that complex CUDA code can "just work" on AMD hardware without significant issues.
- Legal and licensing issues: Several comments highlight potential legal challenges and licensing restrictions, particularly with Nvidia's proprietary libraries.
- Open source and transparency: There is disappointment that SCALE is not open source, with some suggesting that open alternatives like ZLUDA might be more viable.
- AMD's market position: Some believe AMD has missed opportunities in the GPU market, particularly in AI and ML, and hope SCALE can help level the playing field.
- Technical feasibility and implementation: Commenters are curious about the technical details and feasibility of SCALE, with some expressing skepticism about its long-term viability and performance.
Chasing bug-for-bug compatibility is a fool's errand. The important users of CUDA are open source. AMD can implement support directly in the upstream projects like pytorch or llama.cpp. And once support is there it can be maintained by the community.
Presumably that stuff doesn't "just work" but they don't want to mention it?
Edit: not sure why I just sort of expect projects to be open source or at least source available these days.
Maybe AMD fears antitrust action, or maybe there is something about its underlying hardware approach that would limit competitiveness, but the company seems to have left billions of dollars on the table during the crypto mining GPU demand spike and now during the AI boom demand spike.
At the time, not only did they target AMD (with less compatibility than they have now), but also outperformed the default LLVM ptx backend, and even NVCC, when compiling for Nvidia GPUs!
But I can't help but think: if something like this can be done to this extent, I wonder what went wrong, and why it has been such a struggle for OpenCL to unify the two fragmented communities. While this is very practical and has a significant impact for people who develop GPGPU/AI applications, for the heterogeneous computing community as a whole, relying on and promoting a proprietary interface/API/language to become THE interface for working with different GPUs sounds like bad news.
Can someone educate me on why OpenCL seems to be out of the picture in the comments and in recent discussions related to this topic?
If they plan to open it up, it can be something useful to add to options of breaking CUDA lock-in.
E.g., how does Cycles compare on AMD vs Nvidia?
Working from CUDA source that doesn't use inline PTX, targeting amdgpu is roughly a regex find-and-replace to get HIP, which implements pretty much the same functionality.
Some of the details would be dubious, e.g. the atomic models probably don't match, and Volta has a different instruction-pointer model, but it could all be done correctly.
AMD won't do this. CUDA isn't a very nice thing in general and the legal team would have kittens. But other people totally could.
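Purely as an illustration of that point (my sketch, not the commenter's), here is the vector add from above after the mechanical rename: the header and the cuda*/hip* prefixes change, while the kernel and the <<<>>> launch syntax stay identical.

    // vecadd_hip.cpp -- the CUDA example above after a mechanical CUDA -> HIP rename.
    #include <cstdio>
    #include <hip/hip_runtime.h>                              // was <cuda_runtime.h>

    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;        // kernel body untouched
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *ha = new float[n], *hb = new float[n], *hc = new float[n];
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        hipMalloc((void**)&da, bytes);                        // was cudaMalloc
        hipMalloc((void**)&db, bytes);
        hipMalloc((void**)&dc, bytes);
        hipMemcpy(da, ha, bytes, hipMemcpyHostToDevice);      // was cudaMemcpy / cudaMemcpyHostToDevice
        hipMemcpy(db, hb, bytes, hipMemcpyHostToDevice);

        vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);      // launch syntax unchanged
        hipMemcpy(hc, dc, bytes, hipMemcpyDeviceToHost);

        printf("c[0] = %f\n", hc[0]);
        hipFree(da); hipFree(db); hipFree(dc);                // was cudaFree
        delete[] ha; delete[] hb; delete[] hc;
        return 0;
    }

This version builds with hipcc instead of nvcc; SCALE's difference, per the article, is that it skips the translation step entirely and consumes the CUDA spelling directly.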
Our jerry-rigged solution for now is writing kernels that are the same source for both OpenCL and CUDA, with a few macros doing a bit of adaptation (e.g. the syntax for constructing a struct). This requires no special library or complicated runtime work - but it does have the downside of forcing our code to be C'ish rather than C++'ish, which is quite annoying if you want to write anything that's templated.
Note that all of this regards device-side, not host-side, code. For the host-side, I would like, at some point, to take the modern-C++ CUDA API wrappers (https://github.com/eyalroz/cuda-api-wrappers/) and derive from them something which supports CUDA, OpenCL and maybe HIP/ROCm. Unfortunately, I don't have the free time to do this on my own, so if anyone is interested in collaborating on something like that, please drop me a line.
-----
You can find the OpenCL-that-is-also-CUDA mechanism at:
https://github.com/eyalroz/gpu-kernel-runner/blob/main/kerne...
and
https://github.com/eyalroz/gpu-kernel-runner/blob/main/kerne...
(the files are provided alongside a tool for testing, profiling and debugging individual kernels outside of their respective applications.)
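As a minimal sketch of the macro approach described in that comment (my own illustration, with hypothetical names such as KERNEL, GLOBAL_MEM and GLOBAL_ID rather than the actual definitions in the linked gpu-kernel-runner files), the adapter layer can be as small as this, with the kernel body kept C-ish so it compiles under both toolchains:

    /* Adapter macros: hypothetical names, chosen for illustration only. */
    #ifdef __OPENCL_VERSION__
      #define KERNEL            __kernel
      #define GLOBAL_MEM        __global
      #define GLOBAL_ID(dim)    get_global_id(dim)
    #else  /* compiled as CUDA */
      #define KERNEL            __global__
      #define GLOBAL_MEM
      #define GLOBAL_ID(dim)    (blockIdx.x * blockDim.x + threadIdx.x)  /* 1-D grids only */
    #endif

    /* The same source compiles as an OpenCL kernel and as a CUDA kernel. */
    KERNEL void saxpy(GLOBAL_MEM const float* x, GLOBAL_MEM float* y,
                      float a, int n)
    {
        int i = (int) GLOBAL_ID(0);
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

The struct-construction caveat the commenter mentions is presumably the same kind of thing, e.g. OpenCL's (float4)(a, b, c, d) vector literal versus CUDA's make_float4(a, b, c, d), which is exactly the sort of difference that can hide behind one more macro.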
But can this help me directly? Or would OpenAI have to use this tool for me to benefit?
It is not immediately clear to me (but I am a beginner in this space).
1) Kudos
2) Finally!
Does NCCL just work? If not, what would be involved in getting it to work?
How do I find out which do I have?
How big of a deal is this?
This sounds like DirectX vs OpenGL debate when I was younger lol