August 28th, 2024

Tesla's TTPoE at Hot Chips 2024: Replacing TCP for Low Latency Applications

Tesla unveiled the Tesla Transport Protocol over Ethernet (TTPoE) to improve low-latency data transfer for its Dojo supercomputer, enhancing performance and efficiency in machine learning applications for automotive technologies.

Tesla introduced the Tesla Transport Protocol over Ethernet (TTPoE) at Hot Chips 2024, aiming to replace TCP for low-latency applications in its Dojo supercomputer, which focuses on machine learning for automotive technologies. TTPoE is designed to raise data throughput by minimizing latency, which is crucial for moving large video data, such as the 1.7 GB tensors used in vision applications. Unlike traditional TCP, TTPoE simplifies connection management by eliminating the TIME_WAIT state and reducing the connection opening and closing sequences from three transmissions to two. The hardware-based protocol achieves microsecond-scale latency and is optimized for high-quality intra-supercomputer networks, avoiding the complexity of TCP's congestion control mechanisms. Instead of dynamically adjusting the congestion window based on network conditions, TTPoE uses a fixed congestion window managed in an SRAM buffer, which allows for straightforward packet retransmission. The protocol is implemented on a cost-effective "Dumb-NIC" designed to support numerous host nodes, boosting Dojo's performance without incurring high costs. Overall, TTPoE represents a significant advancement in supercomputer networking, providing a tailored solution that meets the specific needs of Tesla's applications.
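
As a minimal sketch of the teardown simplification, the snippet below contrasts TCP's active close (three transmissions plus a TIME_WAIT hold of roughly 2×MSL on the closing side) with a two-transmission close that frees its connection state immediately, which is how the summary describes TTPoE. The message names and the TTPoE-side behavior are assumptions drawn only from that summary, not Tesla's actual state machine.

```python
# Minimal sketch, assuming hypothetical message names: TCP's active close
# versus a two-transmission close with no lingering TIME_WAIT state.
# Not Tesla's actual state machine.
from dataclasses import dataclass

MSL_SECONDS = 60  # TCP maximum segment lifetime; commonly 30-120 s


@dataclass
class CloseResult:
    transmissions: int        # packets exchanged to tear the connection down
    lingering_state_s: float  # how long per-connection state is held afterwards


def tcp_active_close() -> CloseResult:
    # FIN ->, <- FIN/ACK, ACK ->  then the closer sits in TIME_WAIT for 2*MSL
    return CloseResult(transmissions=3, lingering_state_s=2 * MSL_SECONDS)


def ttpoe_style_close() -> CloseResult:
    # CLOSE ->, <- CLOSE/ACK  and the connection entry can be reused at once (assumed)
    return CloseResult(transmissions=2, lingering_state_s=0.0)


print("TCP active close :", tcp_active_close())
print("TTPoE-style close:", ttpoe_style_close())
```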

- Tesla introduced TTPoE to enhance low-latency data transfer for its Dojo supercomputer.

- TTPoE simplifies connection management compared to traditional TCP, reducing latency.

- The protocol uses a fixed congestion window and hardware-based management for efficiency (see the sketch after this list).

- TTPoE is designed for high-quality intra-supercomputer networks, not for the open internet.

- The implementation focuses on cost-effectiveness with "Dumb-NIC" technology to support scalability.
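
As a rough sketch of the fixed congestion window mentioned above, the sender below keeps a copy of every unacknowledged packet in a fixed-size buffer (standing in for the on-chip SRAM) and simply replays whatever is still buffered on loss, never growing or shrinking the window the way TCP does. The class, method, and parameter names, and the 32-packet window, are illustrative assumptions rather than Tesla's hardware design.

```python
# Minimal sketch of a fixed-window sender: unacked packets are held in a
# fixed-size buffer (standing in for on-chip SRAM) until acknowledged, and the
# window never adapts to loss. Names and sizes are illustrative assumptions.
from collections import OrderedDict


class FixedWindowSender:
    def __init__(self, window_packets: int = 32):
        self.window = window_packets    # fixed: no slow start, no backoff
        self.in_flight = OrderedDict()  # seq -> payload copy kept for replay
        self.next_seq = 0

    def can_send(self) -> bool:
        # Transmission stalls only when the retransmit buffer is full.
        return len(self.in_flight) < self.window

    def send(self, payload: bytes) -> int:
        assert self.can_send(), "window full: wait for acks"
        seq = self.next_seq
        self.in_flight[seq] = payload   # keep a copy for possible retransmission
        self.next_seq += 1
        # ... hand the packet to the wire here ...
        return seq

    def on_ack(self, acked_up_to: int) -> None:
        # A cumulative ack frees buffer slots, which immediately allows new sends.
        for seq in list(self.in_flight):
            if seq <= acked_up_to:
                del self.in_flight[seq]

    def on_timeout(self) -> list:
        # Replay everything still buffered; unlike TCP, the window is not shrunk.
        return list(self.in_flight)
```

A TCP sender in the same situation would also shrink its congestion window; skipping that step is only reasonable because, as the bullets above note, the protocol targets a well-provisioned intra-supercomputer network rather than the open internet.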

Related

P4TC Hits a Brick Wall

P4TC, a networking device programming language, faces integration challenges into the Linux kernel's traffic-control subsystem. Hardware support, code duplication, and performance concerns spark debate on efficiency and necessity. Stalemate persists amid technical and community feedback complexities.

Tenstorrent Unveils High-End Wormhole AI Processors, Featuring RISC-V

Tenstorrent launches Wormhole AI chips on RISC-V, emphasizing cost-effectiveness and scalability. Wormhole n150 offers 262 TFLOPS, n300 doubles power with 24 GB GDDR6. Priced from $999, undercutting NVIDIA. New workstations from $1,500.

AI Development Kits: Tenstorrent Update

Tenstorrent launches new AI development kits with PCIe cards Grayskull e75, e150, Wormhole n150, and n300. Emphasizes networking capabilities, offers developer workstations TT-LoudBox and TT-QuietBox with high-end components. Aims to enhance AI development.

Comparing TCP and QUIC (2022)

Geoff Huston compares TCP and QUIC protocols in the October 2022 ISP Column. QUIC is seen as a transformative protocol with enhanced privacy, speed, and flexibility, potentially replacing TCP on the Internet. QUIC offers improved performance for encrypted traffic and independent transport control for applications.

DisTrO – a family of low latency distributed optimizers

DisTrO is a GitHub project aimed at reducing inter-GPU communication in distributed training, with a preliminary report released on August 26, 2024, and plans for future publications and community collaboration.

AI: What people are saying
The introduction of Tesla's Transport Protocol over Ethernet (TTPoE) has generated a variety of reactions among commenters.
  • Many commenters question the necessity of creating a custom protocol when established solutions like TCP Offload Engines and InfiniBand already exist.
  • Concerns are raised about the protocol's design, particularly the lack of congestion control and the initial roundtrip delay before data transmission.
  • Some commenters highlight the potential inefficiencies and suggest that Tesla may be reinventing existing technologies rather than innovating.
  • There is skepticism regarding the performance of Tesla's system compared to current high-speed networking technologies.
  • Overall, the comments reflect a mix of technical critique and curiosity about Tesla's engineering decisions.
16 comments
By @SilverBirch - 8 months
This screams "Not Invented Here" syndrome. Massive yikes at the diagram showing TCP in software in the OSI model. There have been hardware-accelerated TCP stacks for decades. They're called TCP Offload Engines; they work great, and have done for ages. Why are you building one and giving it a new name? Seems like a pretty enormous amount of work and you would've gotten 90+% of the gains by just implementing a standard TOE. I guess the only good reason I can think to do this yourself is that they'd left it so late to get to this that all the companies that were good at this got bought (Solarflare, Mellanox, etc.).
By @cout - 8 months
This kind of computing must be a different kind of world than the one I work in. 80 microseconds of latency seems high to me when InfiniBand can do single-digit latency with unreliable datagrams, which turn out to be mostly reliable due to the credit system.
By @choilive - 8 months
This is all technically impressive, but was it all technically necessary? Was InfiniBand really just not good enough? All this R&D for a custom protocol and custom NICs seems to just be a massive flex of Tesla's engineering muscle.
By @xtacy - 8 months
It's also a bit odd that they do not implement congestion control. Congestion control is fundamental unless you only have point-to-point data transfers, which is rarely the case. All-reduce operations during training require N-to-1 data transfers. In these scenarios the sender needs to control its data transfer rate so as not to overwhelm not just the receiver, but also the network... if this is not done, it will cause congestion collapse (https://en.wikipedia.org/wiki/Network_congestion#:~:text=ser...).
By @speransky - 8 months
Tuned RoCE over UDP is really low latency, with no need to implement an extra layer of silicon. Maybe there is more motivation than described in the article.
By @sgt - 8 months
Isn't this what Dolphin Interconnect have been doing for a couple of decades? https://www.dolphinics.com/
By @jeroenvlek - 8 months
Seems like Tesla could really benefit from this soon-to-be-released optimizer that reduces inter-GPU communication [0].

[0] https://github.com/NousResearch/DisTrO/blob/main/A_Prelimina...

By @MisterTea - 8 months
Congrats. You just reinvented the wheel: http://doc.cat-v.org/plan_9/4th_edition/papers/il/
By @iamleppert - 8 months
There are already high-speed, low-latency video interfaces that have been around for ages: MIPI and HDMI.

There are ICs you can buy off the shelf for electronic routing and switching of these interfaces.

By @chronicileiee - 8 months
Meanwhile, high-frequency traders operate 1-2 orders of magnitude faster, in the tens to hundreds of nanoseconds.
By @darby_nine - 8 months
The "Tesla" in the article appears to refer to the car manufacturer, for those as confused as I.
By @gruturo - 8 months
If they are so concerned with low latency, how come they are wasting an entire roundtrip (the "OPEN + OPEN/ACK") before sending any data?

I mean, in TCP it's not allowed (even though, super-theoretically, it's not completely forbidden) to carry a payload in the initial TCP SYN. If you're so latency-obsessed that you create your own protocol, that's the first thing I'd address.

By @fragmede - 8 months
Interesting! No mention of UDP, or the application being run, or the GPU/TPUs on the nodes, so it'll have to be a mystery as to how much bang for their buck they're getting with this particular bit of work.

What's disappointing is that it's impossible to do a new protocol on the Internet because of all the middleboxes that drop packets that aren't ICMP, TCP, or UDP.

By @efitz - 8 months
Yeah I never FIN my connections eithRST
By @7e - 8 months
100Gbps Ethernet cards? The world has moved way past that for training. Their accelerator stack must be really slow if this is good enough for them.