August 27th, 2024

DisTrO – a family of low latency distributed optimizers

DisTrO is a GitHub project aimed at reducing inter-GPU communication in distributed training, with a preliminary report released on August 26, 2024, and plans for future publications and community collaboration.

DisTrO (Distributed Training Over-The-Internet) is a GitHub project aimed at creating low-latency distributed optimizers that reduce inter-GPU communication requirements by three to four orders of magnitude. A preliminary report detailing the project's findings was released on August 26, 2024. The repository notes plans to publish a paper and code, along with further developments, in the near future. The project also encourages community involvement and invites interested readers to join its Discord channel to collaborate on research and development of distributed training.

- DisTrO focuses on reducing inter-GPU communication for distributed training.

- A preliminary report was published on August 26, 2024.

- Future releases will include a paper and code.

- Community members are invited to join the project's Discord for collaboration.
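For a rough sense of what a reduction of that size means in practice, here is a back-of-envelope sketch. The model size, gradient precision, and link speeds below are illustrative assumptions for the sake of the example, not figures from the report; it compares the per-step gradient traffic of conventional data-parallel training against a 1,000x–10,000x reduction:

```python
# Illustrative arithmetic only -- the model size, gradient precision, and link
# speeds are assumptions for the sake of the example, not figures from the report.

params = 70e9                        # assumed model size: 70B parameters
bytes_per_grad = 2                   # fp16/bf16 gradients
grad_bytes = params * bytes_per_grad # roughly the data exchanged per optimizer step

home_link = 100e6 / 8                # 100 Mbit/s home connection, in bytes/s
dc_link = 400e9 / 8                  # 400 Gbit/s datacenter interconnect, in bytes/s

def fmt(seconds):
    return f"{seconds:,.1f} s" if seconds < 3600 else f"{seconds/3600:,.1f} h"

print(f"Full gradient exchange: {grad_bytes/1e9:,.0f} GB per step")
print(f"  over a home link:       {fmt(grad_bytes / home_link)}")
print(f"  over a datacenter link: {fmt(grad_bytes / dc_link)}")

for reduction in (1_000, 10_000):
    reduced = grad_bytes / reduction
    print(f"With a {reduction:,}x reduction: {reduced/1e6:,.0f} MB per step, "
          f"{fmt(reduced / home_link)} over a home link")
```

With these assumed numbers, a full exchange is about 140 GB per step (roughly three hours over a 100 Mbit/s link), while a 10,000x reduction brings it down to about 14 MB (on the order of a second) — the scale at which training over ordinary internet connections starts to look plausible.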

Related

DETRs Beat YOLOs on Real-Time Object Detection

DETRs outperform YOLOs with the RT-DETR model, which balances speed and accuracy by adjusting the number of decoder layers. RT-DETR-R50 / R101 achieve 53.1% / 54.3% AP on COCO at 108 / 74 FPS on a T4 GPU, and RT-DETR-R50 surpasses DINO-R50 by 2.2% AP while running about 21 times faster.

Dynolog: Open-Source System Observability

Dynolog is an open-source observability tool for optimizing AI applications on distributed CPU-GPU systems. It offers continuous monitoring of performance metrics, integrates with PyTorch Profiler and Kineto CUDA profiling library, and supports GPU monitoring for NVIDIA GPUs and CPU events for Intel and AMD CPUs. Developed in Rust, Dynolog focuses on Linux platforms to enhance AI model observability in cloud environments.

Disruptor-rs: better latency and throughput than crossbeam

The GitHub repository for the Rust "Disruptor" library, which provides low-latency inter-thread communication, offers usage guidance, code snippets, design decisions, performance evaluations, related projects, contribution guidelines, and future plans, making it a valuable resource for developers.

Launch HN: Outerport (YC S24) – Instant hot-swapping for AI model weights

Towaki and Allen are developing Outerport, a distribution network that optimizes AI model deployment, enabling quick model swapping and potentially reducing GPU costs by 40% through efficient management and orchestration.

TRON Project

The TRON Project, initiated in 1984, produced a family of real-time operating system specifications with global impact, most notably ITRON. The BTRON subproject faced setbacks, and in 2017 μT-Kernel 2.0 was transferred to the IEEE.

5 comments
By @arjvik - 8 months
There's no information about what this is, beyond a teaser of a loss graph. Really hoping this is something that gets released to the world, not hidden behind closed doors.
By @logicchains - 8 months
I'd love to believe it's true but I suspect they're overstating the result, or it's a fluke. Presumably teams at large firms like Meta would have put a lot of effort into checking whether not-synchronise-every-step training could match synchronise-every-step training before investing hundreds of millions of dollars into the low-latency, high-throughput network hardware necessary for the latter.
By @iamronaldo - 8 months
This seems huge, no? Couldn't this enable "community based" AI training at home?
By @simonw - 8 months
Most of the information about this is in this PDF (I hate when people publish interesting information exclusively in PDFs): https://raw.githubusercontent.com/NousResearch/DisTrO/main/A...

I converted it to Markdown (using Gemini 1.5 Pro) and pasted it into a Gist here: https://gist.github.com/simonw/46a33d66e069efe5c10b63625fdab...

From the abstract:

> Training large scale neural networks typically involves sharing gradients between all accelerators, which necessitates specialized, high-speed interconnects. To address this, we introduce DisTrO, a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by four to five orders of magnitude without relying on amortized analysis, enabling low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware.

This could be a HUGE deal.

Currently if you want to train giant LLMs you need a big pile of GPUs in the same location as each other due to the amount of information that needs to shuffle between them during training.
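As context for what that shuffling looks like, below is a minimal sketch of the conventional data-parallel pattern — the AdamW + all-reduce baseline the report compares against, not DisTrO's own method, which hasn't been published. Every worker averages every parameter's gradient with every other worker on every step:

```python
# Minimal sketch of conventional data-parallel training with gradient all-reduce --
# the AdamW + all-reduce baseline DisTrO is compared against, NOT DisTrO itself.
# Run with e.g. `torchrun --nproc_per_node=2 baseline.py` (hypothetical filename).
import torch
import torch.distributed as dist


def sync_gradients(model: torch.nn.Module) -> None:
    """Average every parameter's gradient across all workers each step."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # Traffic scales with the full parameter count -- this is the
            # communication that normally demands fast local interconnects.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size


def main() -> None:
    dist.init_process_group("gloo")          # "nccl" on a real GPU cluster
    model = torch.nn.Linear(1024, 1024)      # tiny stand-in for a large network
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                      # toy loop on random data
        x = torch.randn(8, 1024)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        sync_gradients(model)                # gradients cross the network every step
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Because each step moves data proportional to the full parameter count, this pattern normally requires NVLink/InfiniBand-class interconnects; DisTrO's claim is that the per-step exchange can be several orders of magnitude smaller.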

If DisTrO works as intended, it will be possible to train models using GPUs in different places - potentially enabling SETI@home style training where thousands of people with gaming PCs at home could donate their GPU time to a large training effort.

Their tweet about this has more: https://twitter.com/NousResearch/status/1828121648383566270

> Nous Research is proud to release a preliminary report on DisTrO (Distributed Training Over-the-Internet) a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by 1000x to 10,000x without relying on amortized analysis, and matches AdamW+All-Reduce in convergence rates. This enables low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware.

> DisTrO can increase the resilience and robustness of training LLMs by minimizing dependency on a single entity for computation. DisTrO is one step towards a more secure and equitable environment for all participants involved in building LLMs.

> Without relying on a single company to manage and control the training process, researchers and institutions can have more freedom to collaborate and experiment with new techniques, algorithms, and models. This increased competition fosters innovation, drives progress, and ultimately benefits society as a whole.