January 29th, 2025

DeepSeek's AI breakthrough bypasses industry-standard CUDA, uses PTX

DeepSeek trained a 671-billion-parameter AI model using 2,048 Nvidia GPUs, achieving tenfold efficiency over competitors. The result raised concerns about Nvidia's stock but may democratize access to AI technology.

DeepSeek has made significant advancements in the AI sector by training its Mixture-of-Experts (MoE) language model, which boasts 671 billion parameters, on a cluster of 2,048 Nvidia H800 GPUs. The process took approximately two months and achieved a roughly tenfold increase in efficiency compared to industry leaders like Meta. The key to this result lies in DeepSeek's use of Nvidia's assembly-like PTX (Parallel Thread Execution) programming, which allows fine-grained optimizations that standard CUDA programming cannot express. These optimizations include advanced pipelining and low-level reconfiguration of the GPUs, reportedly dedicating a subset of each chip's streaming multiprocessors to cross-node communication. DeepSeek's success has raised concerns among investors, leading to a significant drop in Nvidia's stock value, as some believe that demand for high-performance hardware may decrease. Industry experts, including former Intel CEO Pat Gelsinger, suggest that DeepSeek's innovations could democratize AI technology, making it accessible to a wider range of devices. However, the financial investment required for such developments remains unclear.

- DeepSeek's AI model trained with 671 billion parameters using 2,048 Nvidia H800 GPUs.

- Achieved 10X efficiency compared to competitors like Meta.

- Utilized Nvidia's PTX programming for advanced optimizations.

- Nvidia's stock dropped significantly following DeepSeek's announcement.

- Potential for broader AI applications in less expensive devices.
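
The article doesn't include any of DeepSeek's code, but a minimal sketch of what "dropping below CUDA into PTX" means in practice may help: CUDA C++ can embed hand-written PTX instructions via inline assembly. The kernel below is hypothetical; it reads the warp lane ID from a PTX special register (%laneid) that plain CUDA C++ does not expose directly.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical illustration: read the PTX special register %laneid.
// Plain CUDA C++ has no builtin for it; inline PTX fills the gap.
__global__ void lane_ids(int *out) {
    unsigned lane;
    // "%%" escapes the register name inside the asm string; "=r" binds
    // the result to a 32-bit register.
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    out[threadIdx.x] = (int)lane;
}

int main() {
    int *d_out, h_out[32];
    cudaMalloc(&d_out, 32 * sizeof(int));
    lane_ids<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 32; ++i) printf("%d ", h_out[i]);  // prints 0..31
    printf("\n");
    cudaFree(d_out);
    return 0;
}
```

In real codebases the pattern is the same: a few performance-critical PTX islands inside otherwise ordinary CUDA, not a wholesale rewrite.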

13 comments
By @hintymad - 3 months
When I grew up, the media in my country kept telling us how great the geek culture in the US was, how deep down the stack the geeks were willing to go, and both adults and us kids were left in awe. The entire nation, from what I could tell, routinely reflected on why we couldn't be like the US: educating and nurturing generations of geeks to be the best engineers and scientists in the world.

Well, it was quite a reverse culture shock after I moved to the US. I definitely didn't know that "teacher's pet" was a thing, or that my coworker, a brilliant engineer who went to a highly reputed public school, had been chased off his school bus simply because he used some poetic words, or that geeks were not all that respected in schools, or that "a mile wide and an inch deep, with great leadership" was what people in the US revered. In the meantime, I guess other countries more or less picked up the baton of US culture and grew their own geeks.

By @DiabloD3 - 3 months
Haha, what a shoddy headline. "Bypasses" and "industry-standard" have no place here.

CUDA is not an industry standard. Vulkan is an industry standard. They did not bypass CUDA... that's like saying that if I use Vulkan I'm bypassing OpenGL. PTX is an alternative low-level API provided by Nvidia because of how awful CUDA is for high-performance code.

What DeepSeek wrote could only have been written in PTX or Vulkan.

Any other company could have done this, and the low-latency traders on Wall Street that use Nvidia write their stuff in PTX for obvious reasons.

OpenAI was, is, and always will be absolutely incompetent when it comes to using their hardware effectively... and they're no different from any other company. Reading is not a goddamned superpower! Just read the docs!

By @nialse - 3 months
What it does show is that CUDA leaves serious performance optimization on the table despite its gigantic code base. Using compression to reduce memory bandwidth is a well known trick in quantization, and in other scenarios since forever. There has been little competitive pressure on Nvidia to go further since their software stack leaves the competition in the dust. This time, they may actually need to step up their efforts, due to customer pressure. Good times!
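
For readers unfamiliar with the trick being alluded to, here is a minimal sketch of dequantization-on-load (invented names and block size, not DeepSeek's code): weights stored as int8 plus one float scale per block move a quarter of the bytes of fp32, and the decompression back to float happens in registers, which is close to free in a bandwidth-bound kernel.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Sketch of bandwidth compression via quantization: 1 byte per weight
// instead of 4, rescaled in registers. Launch with 256 threads, e.g.
// dequant_dot<<<1, 256>>>(w_q, scales, x, out, n);
__global__ void dequant_dot(const int8_t *w_q, const float *scales,
                            const float *x, float *out, int n) {
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        // A per-128-value block scale turns the int8 code back into a float.
        float w = (float)w_q[i] * scales[i / 128];
        acc += w * x[i];
    }
    // Standard shared-memory tree reduction of the partial sums.
    __shared__ float partial[256];
    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) *out = partial[0];
}
```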
By @lvl155 - 3 months
You know what IS ridiculous? That people are willing to drop down to assembly but still won't use AMD GPUs despite the price difference.
By @kristjansson - 3 months
This is ridiculous. Since the actual training code for DeepSeek is _not_ public, this is based only on the technical report, which mentions PTX one (1) time, in §3.2.2, Efficient Implementation of Cross-Node All-to-All Communication:

> Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.

So they have some intrinsic in some part of their training framework. That's it.
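
The report doesn't say which PTX instructions were customized, so the following is only an illustrative guess at the genre: PTX allows per-instruction cache-policy operators, such as the ".cs" (cache-streaming, evict-first) qualifier, which keeps one-shot communication buffers from displacing hotter data in L2. CUDA exposes some of this through the __ldcs()/__stcs() intrinsics, but inline PTX is the fully general escape hatch.

```cuda
#include <cuda_runtime.h>

// Hedged sketch, not DeepSeek's code: a load carrying the PTX ".cs"
// (cache-streaming, evict-first) policy, so data touched once does not
// crowd out the L2 working set of other SMs.
__device__ __forceinline__ float load_streaming(const float *ptr) {
    float v;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(v) : "l"(ptr));
    return v;
}

// Toy kernel: copy a buffer while marking both the loads and the stores
// as streaming.
__global__ void copy_streaming(const float *src, float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = load_streaming(src + i);
        asm volatile("st.global.cs.f32 [%0], %1;"
                     :: "l"(dst + i), "f"(v) : "memory");
    }
}
```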

By @almostgotcaught - 3 months
lol this is the wackiest non-news; every serious project has at least some parts of their kernels implemented in PTX/AMDGCN.
By @t2oi4h324jl234 - 3 months
This is strange. PTX is still at the IR level?

IIRC this is still relatively hardware-agnostic. Can you actually get very far by doing this? From a quick perusal, DeepSeek also uses Triton in the codebase.
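
For what it's worth: yes, PTX is a virtual ISA, closer in spirit to LLVM IR than to machine code, and ptxas (or the driver's JIT) lowers it to architecture-specific SASS. A sketch of the pipeline on a toy kernel, using the standard CUDA toolkit tools:

```cuda
// toy.cu -- a trivial kernel for inspecting the CUDA -> PTX -> SASS pipeline.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Inspecting the stages:
//   nvcc -ptx toy.cu -o toy.ptx              # CUDA C++ -> PTX (virtual ISA)
//   ptxas -arch=sm_90 toy.ptx -o toy.cubin   # PTX -> SASS for one real GPU
//   cuobjdump -sass toy.cubin                # disassemble the SASS
// Hand-written PTX slots in at the middle stage, which is why it stays
// portable across GPU generations in a way raw SASS does not.
```

Triton likewise emits PTX on Nvidia hardware, so hand-tuned PTX and Triton kernels can coexist in the same codebase.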

By @hulitu - 3 months
> DeepSeek's AI breakthrough bypasses industry-standard CUDA, uses PTX

Isn't CUDA an Nvidia child? This sounds like "Microsoft = industry standard".

By @lousken - 3 months
I would like to assume that low-level optimization is a crucial part of AI training and that OpenAI, Meta, and others aren't wasting billions on this.
By @sdedovic - 3 months
Maybe I'm misunderstanding, but doesn't CUDA compile to PTX? Is the implication that they wrote in something other than CUDA in order to generate the PTX?
By @konradha - 3 months
Next article: What's SASS?
By @gamblor956 - 3 months
tl;dr: they wrote low-level code instead of using a higher-level framework like their competitors have been doing, so they were able to hand-tune the performance.

This gives them a few months' head start before Meta and Google start doing the same thing.