Forking Chromium to Bolt on a High-Throughput Shared Memory Ringbuffer
Recall.ai optimized its video processing by implementing a custom shared memory ring buffer, reducing CPU usage by 50% and saving over a million dollars annually on AWS costs.
Recall.ai has optimized its video-processing infrastructure by forking Chromium to implement a high-throughput shared memory ring buffer. The company, which runs its bots on AWS, set out to reduce CPU usage and cloud computing costs; each bot initially required four CPU cores. Profiling revealed that most CPU time was spent in memory-copying functions, particularly in the WebSocket implementation used to transport the video data. The high bandwidth of raw video streams called for a more efficient transport mechanism.
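To put rough numbers on that bandwidth (an illustration, not figures from the article): a single 1080p stream at 30 fps in an uncompressed 4:2:0 format such as I420 is 1920 × 1080 × 1.5 bytes per frame, about 3.1 MB, or roughly 93 MB/s per stream before any protocol overhead. Pushing that volume through a WebSocket is particularly costly because RFC 6455 requires client-to-server frames to be masked with a 4-byte XOR key, so every payload byte is touched (and, in most implementations, copied) on its way through.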
After evaluating options like raw TCP/IP and Unix Domain Sockets, Recall.ai opted for shared memory to eliminate the overhead of copying data between user space and kernel space. They designed a custom ring buffer that supports multiple producers and a single consumer, dynamically sized frames, and zero-copy reads, keeping latency low and data handling efficient; a sketch of that kind of structure follows below. This implementation cut CPU usage by 50%, lowering their AWS costs by over a million dollars annually.
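The following is a minimal sketch of that kind of structure, not Recall.ai's actual implementation: a ring buffer over a memory-mapped file (standing in for a real shared memory segment) with length-prefixed, variable-size frames and an in-place, zero-copy read path. It is simplified to a single producer and single consumer, omits wakeup/signaling, and assumes the `memmap2` crate.

```rust
// Minimal sketch, not Recall.ai's code: a single-producer / single-consumer
// ring over a memory-mapped file standing in for the real shared memory
// segment. The design described above additionally handles multiple producers
// and consumer wakeup; both are omitted here. Assumes the `memmap2` crate.
use memmap2::MmapMut;
use std::fs::OpenOptions;
use std::sync::atomic::{AtomicU64, Ordering};

const HEADER: usize = 16;  // two u64 cursors: total bytes written, total bytes read
const PAD: u32 = u32::MAX; // length marker meaning "skip to the start of the buffer"

fn round_up8(n: usize) -> usize { (n + 7) & !7 }

pub struct Ring {
    _map: MmapMut,   // keeps the mapping alive
    base: *mut u8,   // start of the mapped region
    capacity: usize, // data region size (excludes header), multiple of 8
}

impl Ring {
    /// Producer and consumer processes both map the same file.
    pub fn open(path: &str, capacity: usize) -> std::io::Result<Ring> {
        assert!(capacity % 8 == 0);
        let file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
        file.set_len((HEADER + capacity) as u64)?;
        let mut map = unsafe { MmapMut::map_mut(&file)? };
        let base = map.as_mut_ptr();
        Ok(Ring { _map: map, base, capacity })
    }

    fn write_cursor(&self) -> &AtomicU64 { unsafe { &*(self.base as *const AtomicU64) } }
    fn read_cursor(&self) -> &AtomicU64 { unsafe { &*(self.base.add(8) as *const AtomicU64) } }
    fn data(&self, offset: usize) -> *mut u8 { unsafe { self.base.add(HEADER + offset) } }

    fn read_len(&self, offset: usize) -> u32 {
        let mut b = [0u8; 4];
        unsafe { std::ptr::copy_nonoverlapping(self.data(offset), b.as_mut_ptr(), 4) };
        u32::from_le_bytes(b)
    }
    fn write_len(&self, offset: usize, v: u32) {
        unsafe { std::ptr::copy_nonoverlapping(v.to_le_bytes().as_ptr(), self.data(offset), 4) };
    }

    /// Producer side: copy one variable-size frame into the ring.
    pub fn push(&self, frame: &[u8]) -> bool {
        let rec = round_up8(4 + frame.len()); // length prefix + payload, 8-aligned
        let r = self.read_cursor().load(Ordering::Acquire);
        let w = self.write_cursor().load(Ordering::Relaxed);
        let mut offset = (w % self.capacity as u64) as usize;
        // If the record would straddle the end of the buffer, pad the tail and wrap.
        let pad = if offset + rec > self.capacity { self.capacity - offset } else { 0 };
        if (w - r) as usize + pad + rec > self.capacity {
            return false; // not enough free space; the caller can retry or drop the frame
        }
        if pad > 0 {
            self.write_len(offset, PAD);
            offset = 0;
        }
        self.write_len(offset, frame.len() as u32);
        unsafe { std::ptr::copy_nonoverlapping(frame.as_ptr(), self.data(offset + 4), frame.len()) };
        // Publish: the consumer only sees the frame once the write cursor moves.
        self.write_cursor().store(w + (pad + rec) as u64, Ordering::Release);
        true
    }

    /// Consumer side: hand the next frame to `consume` directly out of shared
    /// memory (zero-copy); its space is reclaimed only after the closure returns.
    pub fn pop<R>(&self, consume: impl FnOnce(&[u8]) -> R) -> Option<R> {
        let w = self.write_cursor().load(Ordering::Acquire);
        let mut r = self.read_cursor().load(Ordering::Relaxed);
        if r == w {
            return None; // empty
        }
        let mut offset = (r % self.capacity as u64) as usize;
        let mut len = self.read_len(offset);
        if len == PAD {
            // The producer skipped the tail of the buffer; follow it back to the start.
            r += (self.capacity - offset) as u64;
            offset = 0;
            len = self.read_len(0);
        }
        let frame = unsafe { std::slice::from_raw_parts(self.data(offset + 4), len as usize) };
        let out = consume(frame);
        self.read_cursor().store(r + round_up8(4 + len as usize) as u64, Ordering::Release);
        Some(out)
    }
}
```

Both processes would call something like `Ring::open("/dev/shm/frames.ring", 64 * 1024 * 1024)` (a hypothetical path and size): the producer copies each frame in once with `push`, and the consumer processes it in place via `pop(|frame| ...)` without copying it out. The Release/Acquire pair on the two cursors is what makes a published frame's bytes visible to the other side before the cursor moves past it.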
The project highlights the importance of performance optimization in cloud computing environments, particularly for high-demand applications like video processing.
- Recall.ai reduced CPU usage by 50% through a custom shared memory ring buffer.
- The optimization led to over a million dollars in annual savings on AWS costs.
- Profiling identified memory copying as a major source of CPU consumption.
- The new transport mechanism bypasses the inefficiencies of WebSocket fragmentation and masking.
- The implementation supports multiple producers and zero-copy reads for efficient data handling.
Related
We increased our rendering speeds by 70x using the WebCodecs API
Revideo, a TypeScript framework, boosted rendering speeds by 70x using the WebCodecs API, overcoming the challenges of browser-based video encoding; limited audio processing and browser compatibility issues remain.
Claude just slashed the cost of building AI applications
ClaudeAI's new Prompt Caching feature allows developers to reuse text, potentially reducing input API costs by up to 90%, benefiting applications like AI assistants and prompting competitors to consider similar innovations.
Cerebras reaches 1800 tokens/s for 8B Llama3.1
Cerebras Systems is deploying Meta's LLaMA 3.1 model on its wafer-scale chip, achieving faster processing speeds and lower costs, while aiming to simplify developer integration through an API.
A good day to trie-hard: saving compute 1% at a time
Cloudflare launched the open-source Rust crate "trie-hard" to optimize CPU usage in HTTP request processing, reducing header clearing runtime to 0.93µs and achieving a 1.28% CPU utilization reduction.
Valkey achieved one million RPS 6 months after forking from Redis
Valkey 8.0 RC2 achieves over 1.19 million requests per second through advanced memory access techniques, including speculative execution and interleaving, with a guide for performance reproduction on AWS EC2.