August 4th, 2024

So Who Is Building That 100k GPU Cluster for XAI?

Elon Musk's xAI seeks to build a 100,000 GPU cluster for AI projects, which will require a large number of Nvidia GPUs. A "Gigafactory of Compute" is planned in Memphis, aiming for completion by late 2025.

Elon Musk's companies, including xAI, Tesla, and SpaceX, are in need of a significant number of GPUs for their AI and high-performance computing projects. Musk, who co-founded OpenAI in 2015, established xAI in March 2023 after leaving OpenAI amid a power struggle. xAI has raised $6.4 billion in funding to compete with major players like OpenAI, Google, and Amazon. Musk aims to build a 100,000 GPU cluster for xAI, with the upcoming Grok-2 model requiring 24,000 Nvidia H100 GPUs for training. The Grok-3 model, expected by the end of the year, will also need a similar GPU count.

After a deal for GPU capacity with Oracle fell through, Musk decided to create a "Gigafactory of Compute" in Memphis, Tennessee, to house the GPU cluster. The factory currently has 8 megawatts of power allocated, with plans to increase this to 50 megawatts. The full 100,000 GPU capacity may not be achieved until late 2025. Supermicro is expected to provide the water-cooled machines for the cluster, while Nvidia is likely supplying the networking infrastructure through its Spectrum-X technology. The storage solutions for the cluster remain unspecified, but companies like Vast Data may be involved. Overall, the ambitious project reflects Musk's intent to establish xAI as a formidable competitor in the AI landscape.
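The power figures above can be put in perspective with a rough back-of-envelope calculation. The sketch below is only an estimate under stated assumptions: it uses Nvidia's published 700 W maximum TDP for the H100 SXM module, and the `overhead_factor` of 1.5 (covering CPUs, networking, storage, and cooling) is a hypothetical illustrative value, not a figure from the article. The function names are likewise made up for this example.

```python
# Back-of-envelope power math for an H100 cluster.
# Assumption: 700 W is the published max TDP of an H100 SXM module.
H100_SXM_TDP_WATTS = 700


def gpus_supported(site_megawatts: float, overhead_factor: float = 1.5) -> int:
    """Estimate how many H100s a power allocation could feed.

    overhead_factor is an ASSUMED multiplier for everything that is not
    the GPU itself (host CPUs, networking, storage, cooling).
    """
    watts_per_gpu = H100_SXM_TDP_WATTS * overhead_factor
    return int(site_megawatts * 1_000_000 // watts_per_gpu)


def site_megawatts_needed(gpu_count: int, overhead_factor: float = 1.5) -> float:
    """Estimate the site power (in MW) needed for gpu_count H100s."""
    return gpu_count * H100_SXM_TDP_WATTS * overhead_factor / 1_000_000


if __name__ == "__main__":
    for mw in (8, 50):
        print(f"{mw} MW supports roughly {gpus_supported(mw):,} GPUs")
    print(f"100,000 GPUs need roughly {site_megawatts_needed(100_000):.0f} MW")
```

Under these assumptions, neither the 8 megawatt nor the 50 megawatt allocation would be enough to run 100,000 H100s at full load, which fits the summary's point that full capacity may not arrive until late 2025. A different overhead assumption shifts the numbers but not the overall picture.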

Related

XAI's Memphis Supercluster has gone live, with up to 100,000 Nvidia H100 GPUs

Elon Musk launched xAI's Memphis Supercluster with 100,000 Nvidia H100 GPUs for AI training, aiming for advancements by December. Whether the full cluster is online is unclear; SemiAnalysis estimates that 32,000 GPUs are operational. Plans for a 150MW data center expansion are pending utility agreements. xAI is partnering with Dell and Supermicro and is targeting full operation by fall 2025. The piece also notes Musk's humorous choice of launch time.

VCs are still pouring billions into generative AI startups

Investments in generative AI startups reached $12.3 billion in H1 2023, focusing on early-stage ventures. Challenges include legal issues and rising costs, making profitability elusive for many companies.

Ex-Twitter dev reminisces about finding 700 unused Nvidia GPUs after takeover

Tim Zaman, a former Twitter engineer, revealed that 700 Nvidia V100 GPUs were sitting idle in Twitter's data center after Elon Musk's acquisition, highlighting inefficiencies in resource management amid rising AI demands.

Four co's are hoarding billions worth of Nvidia GPU chips. Meta has 350K of them

Meta has launched Llama 3.1, a large language model outperforming ChatGPT 4o on some benchmarks. The model's development involved significant investment in Nvidia GPUs, reflecting high demand for AI training resources.

YC closes deal with Google for dedicated compute cluster for AI startups

Google Cloud has launched a dedicated Nvidia GPU and TPU cluster for Y Combinator startups, offering $350,000 in cloud credits and support to enhance AI development and innovation.

6 comments
By @ethagknight - 6 months
I just drove by the site for fun the other day. Probably 500 workers’ cars parked out front. They’ve installed what appear to be 6-8 large gas generators in the north end parking lot to provide more power until they get larger links set up to the new TVA combined cycle gas plant, a 1.2 GW facility across the street. That’s about all, onsite.

More impressive than that though is the 1000+ acres of empty land going south that is prepped for future industry. As a Memphian, I hope Elon sees the potential to build huge scale stuff here in the future. There’s a huge steel mill, a river terminal for direct access to oversized shipping, a wastewater plant next door, a large Valero oil refinery up the street, and a Canadian National intermodal facility down the road.

Literally everything one would need to build millions of robots or trucks or rocket engines or whatever.

https://maps.app.goo.gl/NuHUpg6CcndxyKHK9?g_st=com.google.ma...

By @lagniappe - 6 months
Dell

Specifically, the XE9680 racks (8xH100). 100k H100 GPUs, 300k B200 GPUs, liquid-cooled doors in the racks, 750 racks, and some other minor details.

By @echelon - 6 months
Looks like Musk could easily leapfrog OpenAI, Anthropic, and the rest. He was given all the money he needs to catch up and then some.

It doesn't look like there are any moats to speak of.

By @uncharted9 - 6 months
Cool article from the perspective of setting up a data center, but the big question remains - Why?

If the goal is to create a bigger, faster, smarter version of OpenAI's GPT, then I'm not sure this even sounds logical. The current crop of LLMs is just Q&A, hallucination-filled chatbots that are way past the peak of their hype cycle. There is still no business model or killer product-market fit. I don't know what the investors are expecting, but they are more than willing to throw more than 6B dollars at it. So idk. °_°