January 29th, 2025

DeepSeek's AI breakthrough bypasses industry-standard CUDA, uses PTX

DeepSeek trained a 671-billion-parameter AI model using 2,048 Nvidia GPUs, achieving tenfold efficiency over competitors. The result raised concerns about Nvidia's stock but may democratize access to AI technology.

DeepSeek has made significant advancements in the AI sector by training its Mixture-of-Experts (MoE) language model, which boasts 671 billion parameters, on a cluster of 2,048 Nvidia H800 GPUs. The process took approximately two months and achieved a roughly tenfold increase in efficiency compared to industry leaders like Meta. The key to this result lies in DeepSeek's use of Nvidia's assembly-like PTX (Parallel Thread Execution) programming, which allows fine-grained optimizations that standard CUDA programming cannot express. These optimizations include advanced pipelining and low-level reconfiguration of the GPUs, reportedly dedicating a subset of each chip's streaming multiprocessors to cross-node communication. DeepSeek's success has raised concerns among investors, leading to a significant drop in Nvidia's stock value, as some believe that demand for high-performance hardware may decrease. Industry experts, including former Intel CEO Pat Gelsinger, suggest that DeepSeek's innovations could democratize AI technology, making it accessible to a wider range of devices. However, the financial investment required for such developments remains unclear.

- DeepSeek's AI model trained with 671 billion parameters using 2,048 Nvidia H800 GPUs.

- Achieved 10X efficiency compared to competitors like Meta.

- Utilized Nvidia's PTX programming for advanced optimizations.

- Nvidia's stock dropped significantly following DeepSeek's announcement.

- Potential for broader AI applications in less expensive devices.
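
The article doesn't include any of DeepSeek's code, but a minimal sketch of what "dropping below CUDA into PTX" means in practice may help: CUDA C++ can embed hand-written PTX instructions via inline assembly. The kernel below is hypothetical; it reads the warp lane ID from a PTX special register (%laneid) that plain CUDA C++ does not expose directly.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical illustration: read the PTX special register %laneid.
// Plain CUDA C++ has no builtin for it; inline PTX fills the gap.
__global__ void lane_ids(int *out) {
    unsigned lane;
    // "%%" escapes the register name inside the asm string; "=r" binds
    // the result to a 32-bit register.
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    out[threadIdx.x] = (int)lane;
}

int main() {
    int *d_out, h_out[32];
    cudaMalloc(&d_out, 32 * sizeof(int));
    lane_ids<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 32; ++i) printf("%d ", h_out[i]);  // prints 0..31
    printf("\n");
    cudaFree(d_out);
    return 0;
}
```

In real codebases the pattern is the same: a few performance-critical PTX islands inside otherwise ordinary CUDA, not a wholesale rewrite.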

13 comments
By @hintymad - 3 months
When I grew up, the media in my country kept telling us how great the geek culture in the US was, how deep down the stack the geeks were willing to go, and both adults and us kids were left in awe. The entire nation, from what I could tell, routinely reflected on why we couldn't be like the US: educating and nurturing generations of geeks to be the best engineers and scientists in the world.

Well, it was quite a reverse culture shock after I moved to the US. I definitely didn't know that "teacher's pet" was a thing, or that my coworker, a brilliant engineer who went to a highly reputed public school, had been chased off his school bus simply because he used some poetic words, or that geeks were not all that respected in schools, or that "a mile wide and an inch deep, with great leadership" was what people in the US revered. In the meantime, I guess other countries more or less picked up the baton of US culture and grew their own geeks.

By @DiabloD3 - 3 months
Haha, what a shoddy headline. "Bypasses" and "industry-standard" have no place here.

CUDA is not an industry standard. Vulkan is an industry standard. They did not bypass CUDA... that's like saying that if I use Vulkan I'm bypassing OpenGL. PTX is an alternative low-level API provided by Nvidia because of how awful CUDA is for high-performance code.

What DeepSeek wrote could only have been written in PTX or Vulkan.

Any other company could have done this, and the low-latency traders on Wall Street that use Nvidia write their stuff in PTX for obvious reasons.

OpenAI was, is, and always will be absolutely incompetent when it comes to using their hardware effectively... and they're no different from any other company. Reading is not a goddamned superpower! Just read the docs!

By @nialse - 3 months
What it does show is that CUDA leaves serious performance optimization on the table despite its gigantic code base. Using compression to reduce memory bandwidth is a well known trick in quantization, and in other scenarios since forever. There has been little competitive pressure on Nvidia to go further since their software stack leaves the competition in the dust. This time, they may actually need to step up their efforts, due to customer pressure. Good times!
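
For readers unfamiliar with the trick being alluded to, here is a minimal sketch of dequantization-on-load (invented names and block size, not DeepSeek's code): weights stored as int8 plus one float scale per block move a quarter of the bytes of fp32, and the decompression back to float happens in registers, which is close to free in a bandwidth-bound kernel.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Sketch of bandwidth compression via quantization: 1 byte per weight
// instead of 4, rescaled in registers. Launch with 256 threads, e.g.
// dequant_dot<<<1, 256>>>(w_q, scales, x, out, n);
__global__ void dequant_dot(const int8_t *w_q, const float *scales,
                            const float *x, float *out, int n) {
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        // A per-128-value block scale turns the int8 code back into a float.
        float w = (float)w_q[i] * scales[i / 128];
        acc += w * x[i];
    }
    // Standard shared-memory tree reduction of the partial sums.
    __shared__ float partial[256];
    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) *out = partial[0];
}
```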
By @lvl155 - 3 months
You know what IS ridiculous? That people are willing to drop down to assembly but still won't use AMD GPUs despite the price difference.
By @kristjansson - 3 months
This is ridiculous. Since the actual training code for DeepSeek is _not_ public, this is based only on the technical report, which mentions PTX one (1) time, in §3.2.2, Efficient Implementation of Cross-Node All-to-All Communication:

> Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.

So they have some intrinsic in some part of their training framework. That's it.
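
The report doesn't say which PTX instructions were customized, so the following is only an illustrative guess at the genre: PTX allows per-instruction cache-policy operators, such as the ".cs" (cache-streaming, evict-first) qualifier, which keeps one-shot communication buffers from displacing hotter data in L2. CUDA exposes some of this through the __ldcs()/__stcs() intrinsics, but inline PTX is the fully general escape hatch.

```cuda
#include <cuda_runtime.h>

// Hedged sketch, not DeepSeek's code: a load carrying the PTX ".cs"
// (cache-streaming, evict-first) policy, so data touched once does not
// crowd out the L2 working set of other SMs.
__device__ __forceinline__ float load_streaming(const float *ptr) {
    float v;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(v) : "l"(ptr));
    return v;
}

// Toy kernel: copy a buffer while marking both the loads and the stores
// as streaming.
__global__ void copy_streaming(const float *src, float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = load_streaming(src + i);
        asm volatile("st.global.cs.f32 [%0], %1;"
                     :: "l"(dst + i), "f"(v) : "memory");
    }
}
```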

By @almostgotcaught - 3 months
lol this is the wackiest non-news; every serious project has at least some parts of their kernels implemented in PTX/AMDGCN.
By @t2oi4h324jl234 - 3 months
This is strange. PTX is still at the IR level?

IIRC this is still relatively hardware-agnostic. Can you actually get very far by doing this? From a quick perusal, DeepSeek also uses Triton in the codebase.
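
For what it's worth: yes, PTX is a virtual ISA, closer in spirit to LLVM IR than to machine code, and ptxas (or the driver's JIT) lowers it to architecture-specific SASS. A sketch of the pipeline on a toy kernel, using the standard CUDA toolkit tools:

```cuda
// toy.cu -- a trivial kernel for inspecting the CUDA -> PTX -> SASS pipeline.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Inspecting the stages:
//   nvcc -ptx toy.cu -o toy.ptx              # CUDA C++ -> PTX (virtual ISA)
//   ptxas -arch=sm_90 toy.ptx -o toy.cubin   # PTX -> SASS for one real GPU
//   cuobjdump -sass toy.cubin                # disassemble the SASS
// Hand-written PTX slots in at the middle stage, which is why it stays
// portable across GPU generations in a way raw SASS does not.
```

Triton likewise emits PTX on Nvidia hardware, so hand-tuned PTX and Triton kernels can coexist in the same codebase.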

By @hulitu - 3 months
> DeepSeek's AI breakthrough bypasses industry-standard CUDA, uses PTX

Isn't CUDA an Nvidia child? This sounds like "Microsoft = industry standard".

By @lousken - 3 months
I would like to assume that low-level optimization is a crucial part of AI training and that OpenAI, Meta, and others aren't wasting billions on this.
By @sdedovic - 3 months
Maybe I'm misunderstanding, but doesn't CUDA compile to PTX? Is the implication that they wrote in something other than CUDA in order to generate the PTX?
By @konradha - 3 months
Next article: What's SASS?
By @gamblor956 - 3 months
tl;dr: they wrote low-level code instead of using a higher-level framework like their competitors have been doing, so they were able to hand-tune the performance.

This gives them a few months' head start before Meta and Google start doing the same thing.