Testing AMD's Giant MI300X
AMD introduces the Instinct MI300X to challenge NVIDIA in the GPU compute market. The MI300X features a chiplet design, Infinity Cache, and the CDNA 3 architecture; it delivers competitive performance against NVIDIA's H100 and excels in local memory bandwidth tests.
AMD has introduced the Instinct MI300X, aiming to challenge NVIDIA in the GPU compute market. The MI300X features a massive chiplet setup with eight compute dies, 256 MB of Infinity Cache, and 5.3 TB/s of memory bandwidth. The CDNA 3 architecture improves latency and the cache hierarchy compared to previous generations, with the Infinity Cache providing a significant advantage at larger test sizes, and L2 cache latency matches RDNA 2. In cache and memory access, the MI300X is competitive with NVIDIA's H100, with notable advantages in bandwidth and cache capacity. It also excels in local memory bandwidth tests, outperforming GPUs like the RX 6900 XT. Global memory atomics performance is more variable, reflecting the complexity of managing data movement and synchronization across such a large GPU. Overall, the MI300X offers substantial compute throughput, surpassing the H100 PCIe in many operations and in packed execution.
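As a rough illustration of the kind of measurement the article is built around, here is a minimal bandwidth sketch using PyTorch's ROCm build (on ROCm the torch.cuda namespace maps to HIP, so the same code runs on an MI300X). This is not Chips and Cheese's own microbenchmark; the tensor size and iteration count are arbitrary choices.

```python
# Rough sketch: measure achievable device memory bandwidth with a large copy.
# On ROCm builds of PyTorch the torch.cuda namespace maps to HIP.
import torch

def measure_copy_bandwidth(size_gib: float = 8.0, iters: int = 20) -> float:
    n = int(size_gib * 2**30) // 4            # number of float32 elements
    src = torch.rand(n, device="cuda")
    dst = torch.empty_like(src)

    # Warm up so allocation and lazy initialization don't skew the timing.
    for _ in range(3):
        dst.copy_(src)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dst.copy_(src)
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1000.0       # elapsed_time is in ms
    bytes_moved = 2 * n * 4 * iters                  # each copy reads src and writes dst
    return bytes_moved / seconds / 1e9               # GB/s

if __name__ == "__main__":
    print(f"~{measure_copy_bandwidth():.0f} GB/s effective copy bandwidth")
```

A plain copy only exercises DRAM-level bandwidth; the article's tests sweep footprint sizes so that each cache level, including the Infinity Cache, shows up separately.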
Related
Unisoc and Xiaomi's 4nm Chips Said to Challenge Qualcomm and MediaTek
UNISOC and Xiaomi collaborate on 4nm chips challenging Qualcomm and MediaTek. UNISOC's chip features X1 big core + A78 middle core + A55 small core with Mali G715 MC7 GPU, offering competitive performance and lower power consumption. Xiaomi's Xuanjie chip includes X3 big core + A715 middle core + A510 small core with IMG CXT 48-1536 GPU, potentially integrating a MediaTek baseband. Xiaomi plans a separate mid-range phone line with Xuanjie chips, aiming to strengthen its market presence. The successful development of these 4nm chips by UNISOC and Xiaomi marks progress in domestically produced mobile chips, enhancing competitiveness.
TSMC experimenting with rectangular wafers vs. round for more chips per wafer
TSMC is developing an advanced chip packaging method to address AI-driven demand for computing power. Intel and Samsung are also exploring similar approaches to boost semiconductor capabilities amid the AI boom.
Intel's Gaudi 3 will cost half the price of Nvidia's H100
Intel's Gaudi 3 AI processor is priced at $15,650, half of Nvidia's H100. Intel aims to compete in the AI market dominated by Nvidia, facing challenges from cloud providers' custom AI processors.
Testing AMD's Bergamo: Zen 4c
AMD's Bergamo server CPU, based on Zen 4c cores, prioritizes core count over clock speed for power efficiency and density. It targets cloud providers and parallel applications, emphasizing memory performance trade-offs.
First 128TB SSDs will launch in the coming months
Phison's Pascari brand plans to release 128TB SSDs, competing with Samsung, Solidigm, and Kioxia. These SSDs target high-performance computing, AI, and data centers, with larger models expected soon. The X200 PCIe Gen5 Enterprise SSDs with CoXProcessor CPU architecture aim to meet the rising demand for storage solutions amidst increasing data volumes and generative AI integration, addressing businesses' data management challenges effectively.
NVIDIA was there with 57 papers, a website dedicated to the research they presented at the conference, and a full-day tutorial on accelerating deep learning, and they were ever present with shirts and backpacks in the corridors and at poster presentations.
AMD had a booth in the expo area, where they were raffling off some GPUs. I went up to ask what framework I should look into for writing GPGPU kernels (ideally from Python). They referred me to the “technical guy”, who, it turns out, had a demo of LLM inference, which he couldn’t show me because the laptop with the APU had crashed and wouldn’t reboot. He didn’t know about writing kernels, but told me there was a compiler guy who might be able to help; he wasn’t around at that moment, and I couldn’t find him when I returned to the booth later.
I’m not at all happy with this situation. As long as AMD’s investment in software and evangelism remains at ~$0, I don’t see how any hardware they put out will make a difference. And you’ll continue to hear people walking away from their booth saying, “oh, when I win it I’m going to sell it to buy myself an NVIDIA GPU”.
We are thrilled to announce that Hot Aisle Inc. proudly volunteered our system for Chips and Cheese to use in their benchmarking and performance showcase. This collaboration has demonstrated the exceptional capabilities of our hardware and further highlighted our commitment to cutting-edge technology.
Stay tuned for more exciting updates!
Much like with AI, Nvidia has the software side of GPU production rendering locked down tight, so that's just as much of an uphill battle for AMD.
But, like many have said, considering AMD was almost bankrupt, their performance is impressive. That really speaks well of their hardware division. If only they could get the software side of things fixed!
Also I wonder if NVIDIA has an employee of the decade plaque for CUDA. Because CUDA is the best thing that could’ve happened to them.
We can't possibly hope to run the kinds of models that run on 192GB of VRAM at home.
Has this returned? Because for dual GPU/CPU workloads (AlphaZero, etc.) that would deliver effectively “infinite bandwidth” between GPU and CPU. Using an APU of course gets you huge amounts of slowish memory. But being able to fling things around with abandon would be an advantage, particularly for development.
> Taking LLaMA 3 70B as an example, in float16 the weights are approximately 140GB, and the generation context adds another ~2GB. MI300X’s theoretical maximum is 5.3TB/second, which gives us a hard upper limit of (5300 / 142) = ~37.2 tokens per second.
I think they mean 37.2 forward passes per second. And at 4008 tokens per second (from "LLaMA3-70B Inference" chart) it means they were using a batch size of ~138 (if using that math, but probably not correct). Right?
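A quick sketch of that arithmetic, using only the figures quoted above (the KV-cache handling is simplified, so the implied batch size is only a rough estimate; with these exact numbers it lands closer to ~107):

```python
# Back-of-the-envelope check of the bandwidth-bound generation rate quoted above.
# Assumes every forward pass streams the full fp16 weights plus context from HBM.
weights_gb = 140.0          # LLaMA 3 70B weights in float16, per the quote
context_gb = 2.0            # generation context, per the quote
bandwidth_gb_s = 5300.0     # MI300X theoretical peak memory bandwidth

passes_per_second = bandwidth_gb_s / (weights_gb + context_gb)
print(f"~{passes_per_second:.1f} forward passes/s upper bound")          # ~37.3

# If the measured 4008 tokens/s figure were purely bandwidth-bound, each pass
# would have to produce this many tokens, i.e. the implied batch size:
measured_tokens_per_second = 4008.0
print(f"implied batch size ~{measured_tokens_per_second / passes_per_second:.0f}")  # ~107
```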
The MI300X does memory bandwidth better than anything else by a ridiculous margin, up and down the cache hierarchy.
It did not score very well on global atomics.
So yeah, that seems about right. If you manage to light up the hardware, lots and lots of number crunching for you.
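For anyone curious what a global-atomics stress test looks like from Python, here is a minimal sketch in Triton, which targets ROCm as well as CUDA. The histogram workload is illustrative only, not the benchmark Chips and Cheese ran.

```python
# Minimal global-atomics sketch in Triton (runs on ROCm and CUDA backends).
# Every program instance issues atomic adds into a shared histogram buffer,
# which is roughly the access pattern that stresses global atomic throughput.
import torch
import triton
import triton.language as tl

@triton.jit
def atomic_histogram(x_ptr, hist_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    bins = tl.load(x_ptr + offsets, mask=mask, other=0)
    tl.atomic_add(hist_ptr + bins, 1, mask=mask)

def run(n: int = 1 << 24, num_bins: int = 256) -> torch.Tensor:
    x = torch.randint(0, num_bins, (n,), device="cuda", dtype=torch.int32)
    hist = torch.zeros(num_bins, device="cuda", dtype=torch.int32)
    grid = (triton.cdiv(n, 1024),)
    atomic_histogram[grid](x, hist, n, BLOCK_SIZE=1024)
    return hist

if __name__ == "__main__":
    print(run().sum().item())  # should equal n if every atomic landed
```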
Most of the benchmarks I've seen had setups that no one would use in production. For example, running on a single MI300X or 2xH100 -- this will likely be memory bound; you need to go to higher batch sizes (more VRAM) to be compute bound and properly utilize these. Or benchmarking requests with an unrealistically low # of input tokens.
That means e.g. 8xH100 with TensorRT-LLM / vLLM vs 8xMI300X with vLLM, running many concurrent requests with a reasonable # of input and output tokens. Ran both in fp8 and fp16.
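For reference, the kind of setup described above might look roughly like this with vLLM's offline API; the model id, parallelism degree, and sampling settings are illustrative, not the exact benchmark configuration:

```python
# Illustrative vLLM setup for an 8-GPU tensor-parallel Llama 3 70B run.
# Works the same way on 8xMI300X (ROCm) or 8xH100 (CUDA) builds of vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # example model id
    tensor_parallel_size=8,         # shard the weights across 8 GPUs
    dtype="float16",                # fp8 quantization is a separate option on supported builds
)

params = SamplingParams(max_tokens=512, temperature=0.7)
prompts = ["Summarize the MI300X memory hierarchy."] * 128  # many concurrent requests
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```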
"When it is all said and done, MI300X is a very impressive piece of hardware. However, the software side suffers from a chicken-and-egg dilemma. Developers are hesitant to invest in a platform with limited adoption, but the platform also depends on their support. Hopefully the software side of the equation gets into ship shape. Should that happen, AMD would be a serious competitor to NVIDIA."