August 9th, 2024

Grace Hopper, Nvidia's Halfway APU

Nvidia's Grace Hopper architecture integrates a CPU and GPU for high-performance computing, offering high memory bandwidth but facing significant latency issues, particularly in comparison to AMD's solutions.

Sentiment: Confusion, Skepticism, Frustration

Nvidia's Grace Hopper is a high-performance computing design that pairs a CPU and a GPU in a single, tightly coupled module, aiming to compete with AMD's integrated solutions. The Grace CPU provides 72 Neoverse V2 cores running at 3.44 GHz with 114 MB of L3 cache, and it is paired with an H100 GPU carrying 96 GB of HBM3 memory. Nvidia's NVLink C2C interconnect ties the two together with high bandwidth and hardware coherency, allowing the CPU to access GPU memory directly; a minimal sketch of this shared-memory model follows the key points below. The design targets parallel compute workloads that need high memory bandwidth and efficient data sharing between CPU and GPU. However, the system exhibits high latency, particularly when the CPU accesses HBM3, and sluggish system responsiveness during testing raises concerns about practical performance in real-world applications. Comparisons with AMD's offerings indicate that while Grace Hopper excels on bandwidth, it may not match AMD in every respect, particularly in latency-sensitive tasks.

- Nvidia's Grace Hopper architecture combines a CPU and GPU for high-performance computing.

- The system features high memory bandwidth but suffers from significant latency issues.

- NVLink C2C interconnect allows direct memory access between CPU and GPU.

- Grace Hopper is designed for parallel compute applications, targeting high bandwidth needs.

- Performance comparisons with AMD highlight both strengths and weaknesses in the architecture.
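
To make the coherent-access model above concrete, here is a minimal CUDA sketch; it is an illustration, not code from the article, and it assumes a system with unified/managed memory support. On GH200, NVLink C2C hardware coherency is advertised to make even plain malloc'd buffers GPU-visible, but cudaMallocManaged is used below as the more portable option.

```cuda
// Minimal sketch of CPU/GPU sharing one allocation over a coherent link.
// Assumes a unified-memory-capable system; build with: nvcc coherent.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;                // GPU writes into the shared buffer
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float)); // one pointer, valid on CPU and GPU

    for (int i = 0; i < n; i++) data[i] = 1.0f;  // CPU initializes in place

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();                     // wait for the GPU to finish

    printf("data[0] = %.1f\n", data[0]);         // CPU reads the GPU's result: 2.0
    cudaFree(data);
    return 0;
}
```

On a coherent platform like Grace Hopper the same pattern works without explicit copies; on discrete GPUs the managed-memory driver migrates pages on demand instead.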

AI: What people are saying
The comments reflect a range of opinions on Nvidia's Grace Hopper architecture and its competition with AMD.
  • Concerns about high memory latency in Nvidia's design compared to AMD's solutions.
  • Speculation on the future of AI computing, with some believing AMD's APUs could dominate if AI becomes more self-hosted.
  • Criticism of Nvidia's focus on datacenters at the expense of consumer markets.
  • Discussion on the potential benefits of integrating CPU and GPU into a single chip for simplicity and efficiency.
  • Mixed feelings about Nvidia's corporate culture and leadership, with some expressing frustration over its market strategies.
13 comments
By @erulabs - 8 months
If AI remains in the cloud, Nvidia wins. But I can’t help but think that if AI becomes “self-hosted”, if we return to a world where people own their own machines, AMD’s APUs and interconnect technology will be absolutely dominant. Training may still be Nvidia’s wheelhouse, but for a single device able to do all the things (inference, rendering, and computing), AMD, at least currently, would seem to be the winner. I’d love someone more knowledgeable in AI scaling to correct me here though.

Maybe that’s all far enough afield to make the current state of things irrelevant?

By @MobiusHorizons - 8 months
I am really surprised to see that the performance of the CPU, and especially the latency characteristics, are so poor. The article alludes to the design likely being tuned for specific workloads, which seems like a good explanation. But I can't help but wonder if throughput at the cost of high memory latency is just not a good strategy for CPUs, even with the excellent branch predictors and clever OOO work that modern CPUs bring to the table. Is this a bad take? Are we just not seeing the intended use-case where this thing really shines compared to anything else?
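
One way to see why raw memory latency still bites, even with good branch prediction and deep out-of-order windows, is a dependent pointer chase: each load's address comes from the previous load, so nothing can be overlapped or prefetched, and every step pays the full round trip. A rough sketch of that kind of microbenchmark, my own illustration rather than the article's actual test harness:

```cuda
// Rough pointer-chase sketch: every load's address depends on the previous
// load, so prefetchers and out-of-order execution cannot hide the latency.
// The buffer is sized to spill well past Grace's 114 MB L3.
#include <cstdio>
#include <vector>
#include <random>
#include <algorithm>
#include <chrono>

int main() {
    const size_t n = 1u << 25;                   // 32M entries * 8 B = 256 MB
    std::vector<size_t> next(n), order(n);
    for (size_t i = 0; i < n; i++) order[i] = i;
    std::shuffle(order.begin(), order.end(), std::mt19937_64(42));
    for (size_t i = 0; i < n; i++)               // build one random cycle over the buffer
        next[order[i]] = order[(i + 1) % n];

    const size_t steps = 1u << 24;
    size_t p = order[0];
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < steps; i++) p = next[p];   // serial dependent loads
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
    printf("~%.1f ns per dependent load (p=%zu)\n", ns, p);  // print p so the loop isn't optimized away
    return 0;
}
```
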
By @tedunangst - 8 months
Irrelevant, but the intro reminded me that nvidia also used to dabble in chipsets like nforce, back when there was supplier variety in such.
By @sirlancer - 8 months
In my tests of a Supermicro ARS-111GL-NHR with a Nvidia GH200 chipset, I found that my benchmarks performed far better with the RHEL 9 aarch64+64k kernel versus the standard aarch64 kernel. Particularly with LLM workloads. Which kernel was used in these tests?
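
For anyone trying to reproduce that comparison: the two RHEL variants differ in base page size (4 KiB for the standard aarch64 kernel, 64 KiB for aarch64+64k), and the running configuration can be confirmed from userspace. A trivial check, my own sketch rather than anything from the benchmarks above:

```cuda
// Tiny check of the running kernel's base page size: the standard RHEL
// aarch64 kernel reports 4096 bytes, the aarch64+64k variant reports 65536.
#include <cstdio>
#include <unistd.h>

int main() {
    long page = sysconf(_SC_PAGESIZE);
    printf("page size: %ld bytes (%s)\n", page,
           page == 65536 ? "64K kernel" : page == 4096 ? "4K kernel" : "other");
    return 0;
}
```
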
By @waynecochran - 8 months
Side note: The acronym APU was used in the title but not once defined or referenced in the article?
By @alexhutcheson - 8 months
Somewhat tangential, but did Nvidia ever confirm if they cancelled their project to develop custom cores implementing the ARM instruction set (Project Denver, and later Carmel)?

It’s interesting to me that they’ve settled on using standard Neoverse cores, when almost everything else is custom designed and tuned for the expected workloads.

By @rbanffy - 8 months
> The downside is Genoa-X has more than 1 GB of last level cache, and a single core only allocates into 96 MB of it.

I wonder if AMD could license the IBM Telum cache implementation where one core complex could offer unused cache lines to other cores, increasing overall occupancy.

Would be quite neat; even if cross-complex bandwidth and latency are not awesome, it should still be better than hitting DRAM.

By @bmacho - 8 months
> The first signs of trouble appeared when vi, a simple text editor, took more than several seconds to load.

Can it run vi?

By @jokoon - 8 months
It always made sense to have a single chip instead of two; I just want to buy a single package with both things on the same die.

That might make things much simpler for people who write kernels, drivers, and video games.

The history of CPUs and GPUs prevented that; it was always more profitable for CPU and GPU vendors to sell them separately.

Having two specialized chips makes more sense because it's flexible, but since frequencies are stagnating, having more cores makes sense, and AI means massively parallel workloads are no longer only for graphics.

Smartphones are much more modern in that regard. Nobody upgrades their GPU or CPU anymore; might as well have a single, soldered product that lasts a long time instead.

That may not be the end of building your own computer, but I just hope it will make things simpler and come in a smaller package.

By @dagmx - 8 months
The article talks about the difference in the prefetcher between the two Neoverse setups (Graviton and Grace Hopper). However, isn’t the prefetcher part of the core design in Neoverse? How would they differ?
By @astromaniak - 8 months
This is good for datacenters, but... Nvidia has stopped doing anything for the consumer market.
By @rkwasny - 8 months
Yeah so I also benchmarked GH200 yesterday and I am also a bit puzzled TBH:

https://github.com/mag-/gpu_benchmark

By @benreesman - 8 months
I’m torn: NVIDIA has a fucking insane braintrust of some of the most elite hackers in both software and extreme cutting edge digital logic. You do not want to meet an NVIDIA greybeard in a dark alley, they will fuck you up.

But this bullshit with Jensen signing girls’ breasts like he’s Robert Plant and telling young people to learn prompt engineering instead of C++ and generally pulling a pump and dump shamelessly while wearing a leather jacket?

Fuck that: if LLMs could write cuDNN-caliber kernels that’s how you would do it.

It’s ok in my book to live the rockstar life for the 15 minutes until someone other than Lisa Su ships an FMA unit.

The 3T cap and the forward PE and the market manipulation and the dated signature apparel are still cringe and if I had the capital and trading facility to LEAP DOOM the stock? I’d want as much as there is.

The fact that your CPU sucks ass just proves this isn’t about real competition just now.