June 27th, 2024

Infrastructure set-up & open-source scripts to train a 70B model from bare metal

The Imbue team trained a 70B parameter model that outperformed zero-shot GPT-4o on reasoning-related tasks. They shared a guide to infrastructure setup that covers health checks, patches, and tests, and addresses challenges such as clock synchronization, machine failures, and bandwidth bottlenecks.

The Imbue team trained a 70B parameter model on their own infrastructure that outperformed zero-shot GPT-4o on reasoning-related tasks. They share a guide to setting up the infrastructure, including scripts for host health checks, NCCL patches, stress tests, and networking tests. The process involved provisioning machines, setting up InfiniBand, ensuring machine health, diagnosing issues, and improving tools. Challenges included clock synchronization, machine failures, and bandwidth bottlenecks, along with GPU errors, PCIe cable problems, and firmware updates. Setting up InfiniBand required rewiring connections and addressing temperature alerts. The cluster comprised 4,088 H100 GPUs across 511 computers (eight GPUs per computer), with a fully non-blocking InfiniBand network architecture. Communication for training the networks ran over InfiniBand, while Ethernet was used for data transfer. The team worked closely with partners to prepare the cluster for production use, so that every component functioned reliably for high-performance training.

Related

Lessons Learned from Scaling to Multi-Terabyte Datasets

Insights on scaling to multi-terabyte datasets, emphasizing that algorithms should be evaluated before scaling. The article covers tools such as Joblib and GNU Parallel for scaling on a single machine, the transition to multiple machines, and the performance and cost implications of each. It recommends AWS Batch, Dask, and Spark for parallel workloads and analytical tasks, and discusses tool selection based on team size and workload.

Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU

The article discusses the release of the open-source Llama3 70B model, highlighting its performance compared to GPT-4 and Claude3 Opus. It emphasizes training enhancements, data quality, and the competition between open and closed-source models.

Intel's Gaudi 3 will cost half the price of Nvidia's H100

Intel's Gaudi 3 AI processor is priced at $15,650, half of Nvidia's H100. Intel aims to compete in the AI market dominated by Nvidia, facing challenges from cloud providers' custom AI processors.

Testing AMD's Giant MI300X

AMD introduces the Instinct MI300X to challenge NVIDIA in the GPU compute market. The MI300X features a chiplet design, Infinity Cache, and the CDNA 3 architecture. It offers competitive performance against NVIDIA's H100 and excels in local memory bandwidth tests.

AMD MI300X performance compared with Nvidia H100

The AMD MI300X AI GPU outperforms Nvidia's H100 in cache, latency, and inference benchmarks. It excels in caching performance and compute throughput, but AI inference performance varies. Real-world performance and ecosystem support are essential.

13 comments
By @thejash - 7 months
In the span of a few months, with a small team of researchers and engineers, we trained a 70B parameter model from scratch on our own infrastructure that outperformed zero-shot GPT-4o on reasoning-related tasks. Using our cluster for high performance training meant that every component — InfiniBand, Ethernet, GPUs, and the nodes themselves — had to work perfectly. If even a single one of the over 12,000 connections was a little flaky, it could slow down the entire training run.

We're sharing open-source scripts and an end-to-end guide for infrastructure set-up that details the process of making everything work perfectly, and ensuring that it stays that way.

This is one part of a three-part toolkit on training a 70B model from scratch. The other two sections focus on evaluations and CARBS, our hyperparameter optimizer; you can find them here: https://imbue.com/research/70b-intro/

Thoughts and questions welcome! :)
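
The point above about a single flaky connection slowing the whole run can be illustrated with a small sketch. It assumes a simplified model in which every synchronous training step waits for all connections to finish, so the step time equals the slowest connection's time. The function names and the per-connection timings are hypothetical and chosen only for illustration; only the figure of roughly 12,000 connections comes from the comment.

```python
def step_time(connection_times):
    """In a synchronous step, everyone waits for the slowest connection."""
    return max(connection_times)


def simulate(num_connections=12_000, healthy_s=1.0, flaky_s=3.0, flaky_count=0):
    """Return the step time when `flaky_count` connections are slow.

    healthy_s and flaky_s are made-up per-step transfer times in seconds.
    """
    times = [healthy_s] * num_connections
    # Mark the last `flaky_count` connections as slow (which ones does not matter).
    for i in range(num_connections - flaky_count, num_connections):
        times[i] = flaky_s
    return step_time(times)


all_healthy = simulate(flaky_count=0)
one_flaky = simulate(flaky_count=1)
print(f"all healthy: {all_healthy}s per step, one flaky link: {one_flaky}s per step")
```

Under this model, one slow connection out of 12,000 sets the pace for every step, which is why per-host and per-link health checks matter so much at this scale.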

By @alias_neo - 7 months
> This post focuses on one cluster that had 4,092 H100 GPUs spread across 511 computers, with eight GPUs to a computer

Am I right in understanding, that's over $100 Million worth of GPUs?

I wonder what/when/if any of this will be within the realms of an enthusiast with a gaming-pc budget.
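
A rough check of the $100 million estimate above, as a sketch rather than an actual cost figure. The only price information on this page is the Related item on Intel's Gaudi 3, which puts it at $15,650 and describes that as half the price of an H100; taking that literally to get an H100 unit price is an assumption, and the function names are hypothetical. Note also that 511 computers with eight GPUs each gives 4,088 GPUs, slightly fewer than the quoted 4,092, so both counts are shown.

```python
GAUDI3_PRICE_USD = 15_650               # figure from the Related item on Gaudi 3
H100_PRICE_USD = 2 * GAUDI3_PRICE_USD   # assumption: "half the price" taken literally


def gpu_count(computers, gpus_per_computer=8):
    """Total GPUs for a number of identical hosts."""
    return computers * gpus_per_computer


def fleet_cost(num_gpus, unit_price_usd=H100_PRICE_USD):
    """GPU-only cost; ignores hosts, networking, power, and cooling."""
    return num_gpus * unit_price_usd


derived = gpu_count(511)   # 511 hosts x 8 GPUs
quoted = 4_092             # GPU count as quoted in the comment

print(f"derived GPU count: {derived}")
print(f"cost at derived count: ${fleet_cost(derived):,}")
print(f"cost at quoted count:  ${fleet_cost(quoted):,}")
```

With either count, the GPUs alone come to roughly $128 million under this price assumption, which is consistent with "over $100 million".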

By @renewiltord - 7 months
This is hella cool. Cisco has a new nvidia collab with 800G per-port. I don’t recall if it was RoCE or not. The infiniband is accessible by the GPUs here? Beautiful.

Thank you for sharing all this. One of the more directly useful posts.

By @loudmax - 7 months
This was discussed on the Latent Space podcast a few days ago: https://www.latent.space/p/llm-training-2024

That was a good episode, worth listening to for hearing justifications behind some of these decisions.

By @lifeisstillgood - 7 months
I am fascinated by the total electrical power drawn to build models - power and cooling, I guess. Do you have any numbers on that? (The point being that Zuckerberg suggested in a podcast that the next 1GW model was being planned - basically a data centre with a mid-sized power plant attached.)

By @omerhac - 7 months
This is such a valuable piece. I've learned so much reading it! And your open-source code is great as well.

Some open questions I have: 1) Why did you choose to set up your own cluster? How was the experience with your cloud partner regarding faulty machines / switches? 2) Which of your considerations in choosing the cluster architecture have proven the most valuable (apart from the all2all comms)? 3) Can you share a bit more about your logging infra apart from the fact that it was Loki based? 4) What necessitated the use of a local docker registry? Did you use other images apart from nvidia-container-runtime?

Thanks!

By @mmastrac - 7 months
Honest question: why is there so much PC hardware in the mix here? Why don't we have PCI + infiniband backends with GPUs and a little tiny orchestrating ARM controller and just let them all coordinate with each other? Is it just "momentum" from previous designs and/or lack of "market" for specialized GPU controllers?

By @instagib - 7 months
4,092 H100 GPUs.

They’re working on “self-coding”. No-code or minimal code solutions or?

Their website also has quite a few articles that people may be interested in: https://imbue.com/our-work/

By @weinzierl - 7 months
How much did it cost? Overall, from nothing to the usable model files, in hardware cost, development hours and ultimately electricity and cooling?

By @wkat4242 - 7 months
I wonder if it's possible for a huge number of hobbyists to team up and train a model together in a distributed manner like seti@home or folding@home. Or does this kind of workload not really lend itself to that approach?

Those things were of course characterised by the ability to spread the work into pretty self-contained work packages. Not sure if that can be done with model training.
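
One way to get a feel for the question above is to estimate how long it would take to move a single full set of gradients over a network link. This is a back-of-the-envelope sketch under loud assumptions: naive synchronous data-parallel training where each participant exchanges gradients for all 70B parameters every step, 2 bytes per parameter, and two illustrative link speeds (a 100 Mbit/s home uplink and a 400 Gbit/s datacenter link). None of these figures come from the article, and the function name is hypothetical.

```python
def transfer_seconds(num_params, bytes_per_param, link_bits_per_s):
    """Time to send one full copy of the gradients over one link."""
    total_bits = num_params * bytes_per_param * 8
    return total_bits / link_bits_per_s


PARAMS = 70_000_000_000          # 70B parameters (from the article title)
BYTES_PER_PARAM = 2              # assumption: 16-bit gradients
HOME_UPLINK = 100_000_000        # assumption: 100 Mbit/s home connection
DATACENTER_LINK = 400_000_000_000  # assumption: 400 Gbit/s per link

home = transfer_seconds(PARAMS, BYTES_PER_PARAM, HOME_UPLINK)
dc = transfer_seconds(PARAMS, BYTES_PER_PARAM, DATACENTER_LINK)

print(f"home uplink:     {home / 3600:.1f} hours per gradient exchange")
print(f"datacenter link: {dc:.1f} seconds per gradient exchange")
```

Under these assumptions a single exchange takes hours on a home connection versus seconds on a datacenter link, which suggests why tightly synchronized training does not split into self-contained work packages the way seti@home workloads do. Approaches that communicate less often or compress updates would change this arithmetic.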

By @john2x - 7 months
Once the model is trained, what happens to the hardware and infrastructure?

By @mikewarot - 7 months
It would be quite interesting to see the same hardware used to repeat the training, but with raw Unicode, instead of tokenized training data.

I'd like to see the difference in performance on spelling and rhymes.