June 25th, 2024

Testing AMD's Giant MI300X

AMD introduces the Radeon Instinct MI300X to challenge NVIDIA in the GPU compute market. The MI300X features a chiplet setup, Infinity Cache, and the CDNA 3 architecture. It performs competitively against NVIDIA's H100 and excels in local memory bandwidth tests.

AMD has introduced the Radeon Instinct MI300X, aiming to challenge NVIDIA in the GPU compute market. The MI300X features a massive chiplet setup with eight compute dies, 256 MB Infinity Cache, and 5.3 TB/s of bandwidth. The CDNA 3 architecture improves latency and cache setup compared to previous generations, with the Infinity Cache providing a significant advantage for larger test sizes. The MI300X showcases impressive L2 cache latency matching RDNA 2's performance. In terms of cache and memory access, the MI300X demonstrates competitive performance against NVIDIA's H100, with notable advantages in bandwidth and cache capacity. Additionally, the MI300X excels in local memory bandwidth tests, outperforming other GPUs like the RX 6900 XT. Global memory atomics performance varies on the MI300X, showcasing the complexity of managing data movement and synchronization across the GPU. Overall, the MI300X offers substantial compute throughput, surpassing the H100 PCIe in various operations and packed executions.

Related

Unisoc and Xiaomi's 4nm Chips Said to Challenge Qualcomm and MediaTek

UNISOC and Xiaomi collaborate on 4nm chips challenging Qualcomm and MediaTek. UNISOC's chip features X1 big core + A78 middle core + A55 small core with Mali G715 MC7 GPU, offering competitive performance and lower power consumption. Xiaomi's Xuanjie chip includes X3 big core + A715 middle core + A510 small core with IMG CXT 48-1536 GPU, potentially integrating a MediaTek baseband. Xiaomi plans a separate mid-range phone line with Xuanjie chips, aiming to strengthen its market presence. The successful development of these 4nm chips by UNISOC and Xiaomi marks progress in domestically produced mobile chips, enhancing competitiveness.

TSMC experimenting with rectangular wafers vs. round for more chips per wafer

TSMC is developing an advanced chip packaging method to address AI-driven demand for computing power. Intel and Samsung are also exploring similar approaches to boost semiconductor capabilities amid the AI boom.

Intel's Gaudi 3 will cost half the price of Nvidia's H100

Intel's Gaudi 3 AI processor is priced at $15,650, half of Nvidia's H100. Intel aims to compete in the AI market dominated by Nvidia, facing challenges from cloud providers' custom AI processors.

Testing AMD's Bergamo: Zen 4c

AMD's Bergamo server CPU, based on Zen 4c cores, prioritizes core count over clock speed for power efficiency and density. It targets cloud providers and parallel applications, emphasizing memory performance trade-offs.

First 128TB SSDs will launch in the coming months

Phison's Pascari brand plans to release 128TB SSDs, competing with Samsung, Solidigm, and Kioxia. These SSDs target high-performance computing, AI, and data centers, with larger models expected soon. The X200 PCIe Gen5 Enterprise SSDs with CoXProcessor CPU architecture aim to meet the rising demand for storage solutions amidst increasing data volumes and generative AI integration, addressing businesses' data management challenges effectively.

18 comments
By @w-m - 7 months
Impressions from last week’s CVPR, a conference with 12k attendees on computer vision - Pretty much everyone is using NVIDIA GPUs, and pretty much everyone isn’t happy with the prices, and would like some competition in the space:

NVIDIA was there with 57 papers, a website dedicated to their research presented at the conference, a full day tutorial on accelerating deep learning, and ever present with shirts and backpacks in the corridors and at poster presentations.

AMD had a booth at the expo part, where they were raffling off some GPUs. I went up to them to ask what framework I should look into, when writing kernels (ideally from Python) for GPGPU. They referred me to the “technical guy”, who it turns out had a demo on inference on an LLM. Which he couldn’t show me, as the laptop with the APU had crashed and wouldn’t reboot. He didn’t know about writing kernels, but told me there was a compiler guy who might be able to help, but he wasn’t to be found at that moment, and I couldn’t find him when returning to the booth later.

I’m not at all happy with this situation. As long as AMD’s investment in software and evangelism remains at ~$0, I don’t see how any hardware they put out will make a difference. And you’ll continue to hear people walking away from their booth, saying “oh when I win it I’m going to sell it to buy myself an NVIDIA GPU”.

By @latchkey - 7 months
The news you've all been waiting for!

We are thrilled to announce that Hot Aisle Inc. proudly volunteered our system for Chips and Cheese to use in their benchmarking and performance showcase. This collaboration has demonstrated the exceptional capabilities of our hardware and further highlighted our commitment to cutting-edge technology.

Stay tuned for more exciting updates!

By @jsheard - 7 months
All eyes are of course on AI, but with 192GB of VRAM I wonder if this or something like it could be good enough for high end production rendering. Pixar and co still use CPU clusters for all of their final frame rendering, even though the task is ostensibly a better fit for GPUs, mainly because their memory demands have usually been so far ahead of what even the biggest GPUs could offer.

Much like with AI, Nvidia has the software side of GPU production rendering locked down tight though so that's just as much of an uphill battle for AMD.

By @snaeker58 - 7 months
I hate the state of AMD's software for non-gamers. ROCm is a war crime (which has improved dramatically in the last two years and still sucks).

But like many have said considering AMD was almost bankrupt their performance is impressive. This really speaks for their hardware division. If only they could get the software side of things fixed!

Also I wonder if NVIDIA has an employee of the decade plaque for CUDA. Because CUDA is the best thing that could’ve happened to them.

By @Pesthuf - 7 months
I feel like these huge graphics cards with insane amounts of RAM are the moat that AI companies have been hoping for.

We can't possibly hope to run the kinds of models that run on 192GB of VRAM at home.

By @elorant - 7 months
Even if the community provides support it could take years to reach the maturity of CUDA. So while it's good to have some competition, I doubt it will make any difference in the immediate future. Unless some of the big corporations in the market lean in heavily and support the framework.
By @spitfire - 7 months
I remember years ago one of the AMD APUs had the CPU and GPU on the same die, and could exchange ownership of CPU and GPU memory with just a pointer change or some other small accounting.

Has this returned? Because for dual GPU/CPU workloads (AlphaZero, etc.) that would deliver effective “infinite bandwidth” between GPU and CPU. Using an APU of course gets you huge amounts of slowish memory. But being able to fling things around with abandon would be an advantage, particularly for development.

By @omneity - 7 months
I'm surprised at the simplicity of the formula in the paragraph below. Could someone explain the relationship between model size, memory bandwidth and token/s as they calculated here?

> Taking LLaMA 3 70B as an example, in float16 the weights are approximately 140GB, and the generation context adds another ~2GB. MI300X’s theoretical maximum is 5.3TB/second, which gives us a hard upper limit of (5300 / 142) = ~37.2 tokens per second.

By @alkonaut - 7 months
Good. If there is even a slight suspicion that the best value will be team red in 5 or 10 years, then CUDA will look a lot less attractive already today.
By @alecco - 7 months
> Taking LLaMA 3 70B as an example, in float16 the weights are approximately 140GB, and the generation context adds another ~2GB. MI300X’s theoretical maximum is 5.3TB/second, which gives us a hard upper limit of (5300 / 142) = ~37.2 tokens per second.

I think they mean 37.2 forward passes per second. And at 4008 tokens per second (from "LLaMA3-70B Inference" chart) it means they were using a batch size of ~138 (if using that math, but probably not correct). Right?
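
For readers following the arithmetic in the quoted passage, here is a minimal Python sketch; it is not taken from the article. It assumes generation is purely memory-bandwidth bound, so each forward pass has to stream all the weights plus the generation context from memory once. The function names are hypothetical. Straight division gives about 37.3 passes per second (the quote rounds this to ~37.2), and dividing the chart's 4008 tokens per second by that rate implies a batch size nearer 108 than 138.

```python
def max_passes_per_second(bandwidth_gb_s: float, weights_gb: float, context_gb: float) -> float:
    """Upper bound on forward passes per second if every pass must read
    all weights and the generation context from memory exactly once."""
    return bandwidth_gb_s / (weights_gb + context_gb)


def implied_batch_size(measured_tokens_per_s: float, passes_per_s: float) -> float:
    """Tokens produced per pass, assuming each pass emits one token per
    sequence in the batch and runs at the bandwidth-bound rate."""
    return measured_tokens_per_s / passes_per_s


# Figures quoted in the thread: 5.3 TB/s, ~140 GB of weights, ~2 GB of context,
# and 4008 tokens/s from the "LLaMA3-70B Inference" chart.
passes = max_passes_per_second(5300, 140, 2)
batch = implied_batch_size(4008, passes)

print(f"bandwidth-bound passes per second: {passes:.1f}")
print(f"implied batch size: {batch:.0f}")
```

With a batch size of one, passes per second and tokens per second coincide, which is why the quoted passage can call the figure a tokens-per-second limit for a single sequence. Real throughput also depends on compute limits and on context size growing with batch size, which this sketch ignores.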

By @Filligree - 7 months
So just out of curiosity, what does this thing cost?
By @JonChesterfield - 7 months
Fantastic to see.

The MI300X does memory bandwidth better than anything else by a ridiculous margin, up and down the cache hierarchy.

It did not score very well on global atomics.

So yeah, that seems about right. If you manage to light up the hardware, lots and lots of number crunching for you.

By @tonetegeatinst - 7 months
I wonder if the human body could grow artificial kidneys so that I can just sell infinite kidneys and manage to afford a couple of these so I can do AI training on my own hardware.
By @Palmik - 7 months
It would be great to have real world inference benchmarks for LLMs. These aren't it.

That means e.g. 8xH100 with TensorRT-LLM / vLLM vs 8xMI300X with vLLM, running many concurrent requests with a reasonable # of input and output tokens. Run both in fp8 and fp16.

Most of the benchmarks I've seen had setups that no one would use in production. For example running on a single MI300X or 2xH100 -- this will likely be memory bound, you need to go to higher batch sizes (more VRAM) to be compute bound to properly utilize these. Or benchmarking requests with unrealistically low # of input tokens.

By @rbanffy - 7 months
Would be interesting to see a workstation based on the version with a couple x86 dies, the MI300A. Oddly enough, it’d need a discrete GPU.
By @pheatherlite - 7 months
Without first-class CUDA translation or cross compile, AMD is just throwing more transistors at the void
By @mugivarra69 - 7 months
I worked there. They see software as a cost center; they should fix their mentality.
By @pella - 7 months
from the summary:

"When it is all said and done, MI300X is a very impressive piece of hardware. However, the software side suffers from a chicken-and-egg dilemma. Developers are hesitant to invest in a platform with limited adoption, but the platform also depends on their support. Hopefully the software side of the equation gets into ship shape. Should that happen, AMD would be a serious competitor to NVIDIA."