July 9th, 2024

So you want to rent an NVIDIA H100 cluster? 2024 Consumer Guide

Key considerations when renting an NVIDIA H100 cluster include price, interconnect reliability, spare nodes, storage, support, and cluster management. The guide recommends testing before committing, monitoring GPU usage, and understanding the electricity sources behind the cluster in order to choose greener options.

The article discusses what to weigh when renting an NVIDIA H100 cluster: price, interconnect reliability, spare nodes for quick replacements, storage options, support channels for issue resolution, and management choices such as bare metal, VMs, or a managed SLURM setup. It stresses testing the cluster thoroughly before finalizing a rental agreement and monitoring GPU utilization once it is running. It also covers the environmental impact of running GPU clusters, recommending that renters understand the electricity sources powering the cluster, stay mindful of the associated CO2 emissions, and prioritize greener options.
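To make the CO2 point concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the article: the function name and every input value (GPU count, 700 W per GPU, PUE, grid carbon intensity) are assumptions chosen for illustration, and the estimate ignores CPUs, networking, and storage.

```python
def cluster_co2_kg(num_gpus, gpu_watts, hours, pue, grid_kg_per_kwh):
    """Rough CO2 estimate (kg) from GPU power draw alone, scaled by datacenter PUE."""
    # Energy at the wall: GPU power (kW) * runtime (h) * PUE overhead factor.
    energy_kwh = num_gpus * gpu_watts / 1000 * hours * pue
    # Emissions: energy times the grid's carbon intensity (kg CO2 per kWh).
    return energy_kwh * grid_kg_per_kwh


# Hypothetical inputs: 64 GPUs at 700 W each for 24 h, PUE 1.2, 0.4 kg CO2/kWh.
estimate = cluster_co2_kg(64, 700, 24, 1.2, 0.4)
print(f"{estimate:.0f} kg CO2")
```

Swapping in a provider's actual PUE and the carbon intensity of its local grid shows how much the choice of datacenter changes the result.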

Related

From bare metal to a 70B model: infrastructure set-up and scripts

The Imbue team trained a 70B parameter model, surpassing GPT-4o. They shared a guide on infrastructure setup, covering health checks, patches, tests, and addressing challenges like synchronization, failures, and bottlenecks.

AMD MI300X performance compared with Nvidia H100

The AMD MI300X AI GPU outperforms Nvidia's H100 in cache, latency, and inference benchmarks. It excels in caching performance and compute throughput, but AI inference performance varies. Real-world performance and ecosystem support are essential.

Infrastructure set-up & open-source scripts to train a 70B model from bare metal

The Imbue team trained a 70B parameter model, surpassing GPT-4o. They shared a guide for infrastructure setup, covering health checks, patches, tests, and addressing challenges like synchronization, failures, and bottlenecks.

Cubernetes

Justin Garrison built "Cubernetes," a visually appealing Kubernetes hardware lab for training and content creation. The $6310 setup included unique parts like Mac Cube cases and LP-179 computers with Intel AMT support. Creative solutions like 3D printing and magnetic connectors were used. Lights were controlled by an ATtiny85 and a Raspberry Pi Pico for visualizations. The project prioritized functionality and education.

Follow the Capex: Triangulating Nvidia

NVIDIA's Data Center revenue from major cloud providers like Amazon, Google, Microsoft, and Meta is analyzed. Microsoft stands out as a significant customer. The article emphasizes the increasing significance of AI infrastructure.

8 comments
By @latchkey - 6 months
Great post. The Ethernet section is especially interesting to me.

I'm building a cluster of 16x Dell XE9680s (128 AMD MI300X GPUs) [0], with 8x 2p200G Broadcom cards (running at 400G), all connected to a single Dell PowerSwitch Z9864F-ON, which should prevent any slowness. It will be connected over RoCEv2 [1].

We're going with Ethernet because we believe in open standards, and few talk about the fact that the lead time on IB was last quoted to me at 50+ weeks. As the article more or less points out, if you can't even deploy a cluster, the speed of the network matters less and less.

I can't wait to do some benchmarking on the system to see if we run into similar issues or not. Thankfully, we have a great Dell partnership, with full support, so I believe that we are well covered in terms of any potential issues.

Our datacenter is 100% green with a low PUE, and we are very proud of that as well. Hope to announce which one soon.

  [0] https://hotaisle.xyz/compute/

  [1] https://hotaisle.xyz/networking/
By @huqedato - 6 months
The only essential aspect this article doesn't answer: How much does it cost? All the rest is metadata. I would have preferred a clear table with vendors, prices and features. And less bla-bla.
By @barbazoo - 6 months
> Electricity sources and CO2 emissions

I love that they included this in their consideration and pointed out the impact running these GPUs has on the environment.

By @eigenvalue - 6 months
Lots of good and detailed information here, thanks. I'm curious why Ethernet interconnect is so unreliable in practice compared to InfiniBand. I would think that at this point, after a decade or more of current Ethernet standards, all the kinks would be worked out and the worst that would happen would be occasional latency spikes and a few lost packets that could be retransmitted quickly. Shouldn't the training frameworks be more robust to that sort of thing?
By @ec109685 - 6 months
How do the large clouds compare, in availability and cost, with finding a smaller provider and renting a dedicated cluster?
By @silverlake - 6 months
Good info! I use an HPC with SLURM. 40k GPUs shared by hundreds of users. It works well enough. I don’t know how the market for cloud-based clusters works. Why didn’t OP use AWS or Google for on-demand training? Is it just down to cost?
By @8organicbits - 6 months
Pretty sparse on pricing data; I guess everyone asked them to keep it private.
By @Jun8 - 6 months
Say you want to burn about $500 as a curiosity project for 8 nodes for a day. Any suggestions for what job to run?