November 1st, 2024

Forking Chromium to Bolt on a High-Throughput Shared Memory Ringbuffer

Recall.ai optimized its video processing by implementing a custom shared memory ring buffer, reducing CPU usage by 50% and saving over a million dollars annually on AWS costs.

Recall.ai has optimized its video processing infrastructure by forking Chromium to add a high-throughput shared memory ring buffer. The company runs on AWS and wanted to cut its CPU usage and cloud computing costs; its bots initially required four CPU cores. Profiling showed that most CPU time went to memory copying functions, particularly in the WebSocket implementation used to transport video data. Because raw video streams are so high-bandwidth, a more efficient transport mechanism was needed.
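To give a feel for why raw video makes copying expensive, here is a back-of-the-envelope calculation. The resolution, pixel format, and frame rate are illustrative assumptions, not figures taken from this summary.

```python
# Assumed stream parameters (hypothetical, for illustration only).
WIDTH, HEIGHT = 1920, 1080   # 1080p frame
FPS = 30                     # frames per second

# I420 (YUV 4:2:0) stores 12 bits per pixel, i.e. 3 bytes for every 2 pixels.
frame_bytes = WIDTH * HEIGHT * 3 // 2
bytes_per_second = frame_bytes * FPS

print(f"one frame:  {frame_bytes:,} bytes")
print(f"one second: {bytes_per_second:,} bytes (~{bytes_per_second / 1e6:.1f} MB/s)")
```

Under these assumptions every extra copy in the transport path moves roughly another 93 MB of memory per second per stream, which is consistent with memory copying showing up prominently in a profile.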

After evaluating options such as raw TCP/IP and Unix Domain Sockets, Recall.ai chose shared memory to eliminate the overhead of copying data between user space and kernel space. They designed a custom ring buffer that supports multiple producers and a single consumer, dynamic frame sizes, and zero-copy reads, keeping latency low. The change cut CPU usage by 50% and lowered their AWS costs by over a million dollars annually.
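The properties listed above (multiple producers, one consumer, variable frame sizes, zero-copy reads) can be sketched in a few dozen lines. This is a minimal single-process illustration, not Recall.ai's implementation: the class name `FrameRing`, the length-prefix layout, and the wrap marker are all assumptions, and a `threading.Lock` stands in for the cross-process synchronization a real shared memory version would need. The backing buffer could equally be the `buf` of a `multiprocessing.shared_memory.SharedMemory` block.

```python
import struct
import threading

_HDR = struct.Struct("<I")   # 4-byte little-endian length prefix per frame
_WRAP = 0xFFFFFFFF           # marker meaning "the next frame starts at offset 0"


def _align4(n: int) -> int:
    """Round n up to a multiple of 4 so every header stays 4-byte aligned."""
    return (n + 3) & ~3


class FrameRing:
    """Hypothetical ring buffer of length-prefixed, variable-size frames."""

    def __init__(self, buf):
        self._buf = memoryview(buf)
        self._cap = len(self._buf)
        assert self._cap >= 8 and self._cap % 4 == 0
        self._head = 0    # offset of the oldest unread frame
        self._tail = 0    # offset where the next frame will be written
        self._used = 0    # bytes reserved, including padding and wrap gaps
        self._lock = threading.Lock()   # lets several producers push safely

    def push(self, payload: bytes) -> bool:
        """Append one frame; return False if there is not enough free space."""
        need = _HDR.size + _align4(len(payload))
        with self._lock:
            to_end = self._cap - self._tail
            # Frames are kept contiguous: if this one does not fit before the
            # end of the buffer, skip the remainder and start again at 0.
            skip = to_end if need > to_end else 0
            if self._used + skip + need > self._cap:
                return False
            start = self._tail
            if skip:
                _HDR.pack_into(self._buf, self._tail, _WRAP)
                start = 0
            _HDR.pack_into(self._buf, start, len(payload))
            self._buf[start + 4:start + 4 + len(payload)] = payload
            self._tail = (start + need) % self._cap
            self._used += skip + need
            return True

    def _front(self):
        """Resolve a wrap marker at the head; return (offset, payload length)."""
        (length,) = _HDR.unpack_from(self._buf, self._head)
        if length == _WRAP:
            self._used -= self._cap - self._head
            self._head = 0
            (length,) = _HDR.unpack_from(self._buf, 0)
        return self._head, length

    def peek(self):
        """Return a memoryview of the oldest frame without copying, or None."""
        with self._lock:
            if self._used == 0:
                return None
            head, length = self._front()
            return self._buf[head + 4:head + 4 + length]

    def pop(self) -> None:
        """Release the oldest frame so producers can reuse its space."""
        with self._lock:
            if self._used == 0:
                return
            head, length = self._front()
            size = _HDR.size + _align4(length)
            self._head = (head + size) % self._cap
            self._used -= size


ring = FrameRing(bytearray(64))
ring.push(b"frame-1")
ring.push(b"a longer second frame")
view = ring.peek()          # a view into the ring's own storage, no copy made
print(bytes(view))
ring.pop()
```

The consumer reads through `peek()` and only calls `pop()` once it is done with the view; until then the frame's bytes still count as used, so no producer can overwrite them. Keeping each frame contiguous (at the cost of an occasional skipped gap at the end of the buffer) is what allows a single view to cover a whole frame.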

The project highlights the importance of performance optimization in cloud computing environments, particularly for high-demand applications like video processing.

- Recall.ai reduced CPU usage by 50% through a custom shared memory ring buffer.

- The optimization led to over a million dollars in annual savings on AWS costs.

- Profiling identified memory copying as a major source of CPU consumption.

- The new transport mechanism bypasses the inefficiencies of WebSocket fragmentation and masking.

- The implementation supports multiple producers and zero-copy reads for efficient data handling.
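The masking cost mentioned in the takeaways comes from the WebSocket protocol itself: RFC 6455 requires frames sent by a client to have their payload XORed with a 4-byte masking key, so every payload byte is touched at least once more on each side. The function below is a minimal sketch of that transform (the name `ws_mask` is hypothetical, and real implementations use vectorized code rather than a Python loop).

```python
def ws_mask(payload: bytes, key: bytes) -> bytes:
    """XOR each payload byte with key[i % 4], as described in RFC 6455."""
    if len(key) != 4:
        raise ValueError("masking key must be exactly 4 bytes")
    return bytes(b ^ key[i % 4] for i, b in enumerate(payload))


key = bytes.fromhex("37fa213d")     # key from the RFC's masked "Hello" example
masked = ws_mask(b"Hello", key)
print(masked.hex())

# Masking is its own inverse: applying it again restores the payload.
print(ws_mask(masked, key))
```

For a transport that stays on one machine, this per-byte work is pure overhead, which is part of what a shared memory ring buffer avoids.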
