Run CUDA, unmodified, on AMD GPUs
SCALE is a GPGPU programming toolkit that lets CUDA applications run on AMD GPUs without modification. It mimics the NVIDIA CUDA Toolkit installation, targets multiple GPU vendors, and aims for full CUDA compatibility while offering optional language extensions.
SCALE is a GPGPU programming toolkit that enables CUDA applications to be compiled for AMD GPUs without modifying the original CUDA program or its build system. It supports multiple GPU vendors and CUDA APIs, with ongoing development to expand compatibility. SCALE accepts CUDA programs as-is, with no language porting required, and mimics the NVIDIA CUDA Toolkit installation so that existing build tools keep working. The toolkit has been tested against open-source projects such as NVIDIA Thrust and Blender Cycles to validate compatibility and functionality. Supported GPUs currently include AMD gfx1030 and gfx1100, with plans to extend support to other models. SCALE comprises an nvcc-compatible compiler, implementations of the CUDA runtime and driver APIs for AMD GPUs, and wrapper libraries for CUDA-X APIs. Unlike other solutions, SCALE compiles CUDA code directly for AMD GPUs, aiming for full compatibility with NVIDIA CUDA while offering optional language extensions for improved efficiency. Users can contact the developers for support or to request additional features.
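To make the claim concrete (my illustration, not an example from the SCALE announcement), this is the shape of ordinary, unmodified CUDA source that such a toolkit has to accept: a plain .cu file using the CUDA runtime API and the usual <<<>>> launch syntax.

    // vecadd.cu -- ordinary CUDA; nothing here is AMD- or SCALE-specific.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Host buffers.
        float *ha = new float[n], *hb = new float[n], *hc = new float[n];
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        // Device buffers and copies via the CUDA runtime API.
        float *da, *db, *dc;
        cudaMalloc((void**)&da, bytes);
        cudaMalloc((void**)&db, bytes);
        cudaMalloc((void**)&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

        printf("c[0] = %f\n", hc[0]);  // expect 3.0
        cudaFree(da); cudaFree(db); cudaFree(dc);
        delete[] ha; delete[] hb; delete[] hc;
        return 0;
    }

With NVIDIA's toolchain this builds as nvcc vecadd.cu -o vecadd; the article's claim is that the same source and the same nvcc-style invocation go through SCALE's nvcc-compatible compiler unchanged, producing a binary for a supported AMD gfx target.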
Related
AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x
Nscale explores AI model optimization through GEMM tuning, leveraging rocBLAS and hipBLASlt for AMD MI300x GPUs. Results show up to 7.2x throughput increase and reduced latency, benefiting large models and enhancing processing efficiency.
GPUs can now use PCIe-attached memory or SSDs to boost VRAM capacity
Major companies like AMD, Intel, and Nvidia are considering supporting Panmnesia's CXL IP for GPU memory expansion using PCIe-attached memory or SSDs. Panmnesia's low-latency solution outperforms traditional methods, showing promise for AI/HPC applications. Adoption by key players remains uncertain.
gpu.cpp: A lightweight library for portable low-level GPU computation
The GitHub repository features gpu.cpp, a lightweight C++ library for portable GPU compute using WebGPU. It offers fast cycles, minimal dependencies, and examples like GELU kernel and matrix multiplication for easy integration.
- Concerns about compatibility and performance: Many commenters doubt that complex CUDA code can "just work" on AMD hardware without significant issues.
- Legal and licensing issues: Several comments highlight potential legal challenges and licensing restrictions, particularly with Nvidia's proprietary libraries.
- Open source and transparency: There is disappointment that SCALE is not open source, with some suggesting that open alternatives like ZLUDA might be more viable.
- AMD's market position: Some believe AMD has missed opportunities in the GPU market, particularly in AI and ML, and hope SCALE can help level the playing field.
- Technical feasibility and implementation: Commenters are curious about the technical details and feasibility of SCALE, with some expressing skepticism about its long-term viability and performance.
Chasing bug-for-bug compatibility is a fool's errand. The important users of CUDA are open source. AMD can implement support directly in the upstream projects like pytorch or llama.cpp. And once support is there it can be maintained by the community.
Presumably that stuff doesn't "just work" but they don't want to mention it?
Edit: not sure why I just sort of expect projects to be open source or at least source available these days.
Maybe AMD fears antitrust action, or maybe there is something about its underlying hardware approach that would limit competitiveness, but the company seems to have left billions of dollars on the table during the crypto mining GPU demand spike and now during the AI boom demand spike.
At the time, not only did they target AMD (with less compatibility than they have now), but also outperformed the default LLVM ptx backend, and even NVCC, when compiling for Nvidia GPUs!
But I can't help but think: if something like this can be done to this extent, I wonder what went wrong, and why it has been such a struggle for OpenCL to unify the two fragmented communities. While this is very practical and has a significant impact for people who develop GPGPU/AI applications, for the heterogeneous computing community as a whole, relying on and promoting a proprietary interface/API/language to become THE interface for working with different GPUs sounds like bad news.
Can someone educate me on why OpenCL seems to be out of the picture in the comments and in recent discussions related to this topic?
If they plan to open it up, it can be something useful to add to options of breaking CUDA lock-in.
E.g., how does Cycles compare on AMD vs Nvidia?
Working from CUDA source that doesn't use inline PTX, targeting amdgpu is roughly a regex find-and-replace to get HIP, which implements pretty much the same functionality.
Some of the details would be dubious, e.g. the atomic models probably don't match, and Volta has a different instruction-pointer model, but it could all be done correctly.
AMD won't do this. CUDA isn't a very nice thing in general and the legal team would have kittens. But other people totally could.
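Purely as an illustration of that point (my sketch, not the commenter's), here is the vector add from above after the mechanical rename: the header and the cuda*/hip* prefixes change, while the kernel and the <<<>>> launch syntax stay identical.

    // vecadd_hip.cpp -- the CUDA example above after a mechanical CUDA -> HIP rename.
    #include <cstdio>
    #include <hip/hip_runtime.h>                              // was <cuda_runtime.h>

    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;        // kernel body untouched
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *ha = new float[n], *hb = new float[n], *hc = new float[n];
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        hipMalloc((void**)&da, bytes);                        // was cudaMalloc
        hipMalloc((void**)&db, bytes);
        hipMalloc((void**)&dc, bytes);
        hipMemcpy(da, ha, bytes, hipMemcpyHostToDevice);      // was cudaMemcpy / cudaMemcpyHostToDevice
        hipMemcpy(db, hb, bytes, hipMemcpyHostToDevice);

        vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);      // launch syntax unchanged
        hipMemcpy(hc, dc, bytes, hipMemcpyDeviceToHost);

        printf("c[0] = %f\n", hc[0]);
        hipFree(da); hipFree(db); hipFree(dc);                // was cudaFree
        delete[] ha; delete[] hb; delete[] hc;
        return 0;
    }

This version builds with hipcc instead of nvcc; SCALE's difference, per the article, is that it skips the translation step entirely and consumes the CUDA spelling directly.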
Our jerry-rigged solution for now is writing kernels that are the same source for both OpenCL and CUDA, with a few macros doing a bit of adaptation (e.g. the syntax for constructing a struct). This requires no special library or complicated runtime work - but it does have the downside of forcing our code to be C'ish rather than C++'ish, which is quite annoying if you want to write anything that's templated.
Note that all of this regards device-side, not host-side, code. For the host-side, I would like, at some point, to take the modern-C++ CUDA API wrappers (https://github.com/eyalroz/cuda-api-wrappers/) and derive from them something which supports CUDA, OpenCL and maybe HIP/ROCm. Unfortunately, I don't have the free time to do this on my own, so if anyone is interested in collaborating on something like that, please drop me a line.
-----
You can find the OpenCL-that-is-also-CUDA mechanism at:
https://github.com/eyalroz/gpu-kernel-runner/blob/main/kerne...
and
https://github.com/eyalroz/gpu-kernel-runner/blob/main/kerne...
(the files are provided alongside a tool for testing, profiling and debugging individual kernels outside of their respective applications.)
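As a minimal sketch of the macro approach described in that comment (my own illustration, with hypothetical names such as KERNEL, GLOBAL_MEM and GLOBAL_ID rather than the actual definitions in the linked gpu-kernel-runner files), the adapter layer can be as small as this, with the kernel body kept C-ish so it compiles under both toolchains:

    /* Adapter macros: hypothetical names, chosen for illustration only. */
    #ifdef __OPENCL_VERSION__
      #define KERNEL            __kernel
      #define GLOBAL_MEM        __global
      #define GLOBAL_ID(dim)    get_global_id(dim)
    #else  /* compiled as CUDA */
      #define KERNEL            __global__
      #define GLOBAL_MEM
      #define GLOBAL_ID(dim)    (blockIdx.x * blockDim.x + threadIdx.x)  /* 1-D grids only */
    #endif

    /* The same source compiles as an OpenCL kernel and as a CUDA kernel. */
    KERNEL void saxpy(GLOBAL_MEM const float* x, GLOBAL_MEM float* y,
                      float a, int n)
    {
        int i = (int) GLOBAL_ID(0);
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

The struct-construction caveat the commenter mentions is presumably the same kind of thing, e.g. OpenCL's (float4)(a, b, c, d) vector literal versus CUDA's make_float4(a, b, c, d), which is exactly the sort of difference that can hide behind one more macro.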
But can this help me directly? Or would OpenAI have to use this tool for me to benefit?
It is not immediately clear to me (but I am a beginner in this space).
1) Kudos
2) Finally!
Does NCCL just work? If not, what would be involved in getting it to work?
How do I find out which do I have?
How big of a deal is this?
This sounds like DirectX vs OpenGL debate when I was younger lol