A good day to trie-hard: saving compute 1% at a time
Cloudflare launched the open-source Rust crate "trie-hard" to optimize CPU usage in HTTP request processing, reducing header clearing runtime to 0.93µs and achieving a 1.28% CPU utilization reduction.
Read original article

Cloudflare has introduced a new open-source Rust crate named "trie-hard", aimed at reducing CPU utilization in its services, particularly in pingora-origin, the service responsible for processing HTTP requests bound for origin servers. The motivation for the work stems from the need to handle over 35 million requests per second while minimizing CPU time. The original function for clearing internal headers consumed 1.7% of the service's total CPU time, prompting the team to explore optimizations. An initial improvement changed how headers are removed, cutting the runtime from 3.65µs to 1.53µs. Further gains came from a more efficient data structure: a trie was identified as a good fit, leading to the creation of trie-hard, which compacts memory usage and speeds up the header-clearing check. Benchmarking showed that trie-hard reduced the average runtime to 0.93µs, for a total CPU utilization reduction of 1.28%. The implementation has been running in production since July 2024, with performance metrics confirming the expected improvements. The key takeaway is the importance of understanding code performance and using profiling tools to guide optimization (a simplified sketch of the header-clearing approach follows the summary points below).
- Cloudflare's new Rust crate "trie-hard" optimizes CPU usage in processing HTTP requests.
- The original function for clearing headers consumed 1.7% of CPU time, prompting optimization efforts.
- The trie data structure was implemented to enhance performance, reducing runtime to 0.93µs.
- The optimization achieved a total CPU utilization reduction of 1.28%.
- Performance metrics from production validate the improvements made through the new implementation.
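As a rough illustration of the approach described above: instead of issuing a remove call for every possible internal header name, scan the headers actually present on the request and drop those found in the internal set. The sketch below uses only the Rust standard library and is not Cloudflare's code; the header names, function signature, and representation of headers as a Vec of pairs are all illustrative.

use std::collections::HashSet;

// Illustrative subset of internal header names; the real list is much longer
// and private to Cloudflare.
const INTERNAL_HEADERS: &[&str] = &["x-cf-int-trace-id", "x-cf-int-colo", "x-cf-int-debug"];

// Remove internal headers by scanning the headers actually present on the
// request, rather than issuing one remove call per possible internal name.
// In production the lookup set would be built once up front, not per call.
fn clear_internal_headers(headers: &mut Vec<(String, String)>) {
    let internal: HashSet<&str> = INTERNAL_HEADERS.iter().copied().collect();
    headers.retain(|(name, _)| !internal.contains(name.to_ascii_lowercase().as_str()));
}

fn main() {
    let mut headers = vec![
        ("host".to_string(), "example.com".to_string()),
        ("x-cf-int-trace-id".to_string(), "abc123".to_string()),
    ];
    clear_internal_headers(&mut headers);
    assert_eq!(headers, vec![("host".to_string(), "example.com".to_string())]);
}

The blog post's further step, per the summary, was to replace the hash-set membership test with a trie lookup (trie-hard) to cut the per-request cost again.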
Related
The CIDR trie data structure
This blog post explores CIDR trie data structure in Rust for efficient IP address management. It introduces CIDR trie for fast country lookup, explaining implementation details and time complexity benefits, suggesting optimizations. It mentions trie_rs crate for advanced trie use.
Four lines of code… it was four lines of code
The programmer resolved a CPU utilization issue by removing unnecessary Unix domain socket code from a TCP and TLS service handler. This debugging process emphasized meticulous code review and system interaction understanding.
How we tamed Node.js event loop lag: a deep dive
Trigger.dev team resolved Node.js app performance issues caused by event loop lag. Identified Prisma timeouts, network congestion from excessive traffic, and nested loop inefficiencies. Fixes reduced event loop lag instances, aiming to optimize payload handling for enhanced reliability.
Comparing HTTP/3 vs. HTTP/2 Performance (2020)
Cloudflare has launched HTTP/3, enhancing internet performance with UDP, reducing head-of-line blocking, and offering 0-RTT support. Over 113,000 zones have activated it, showing mixed real-world performance results.
Speeding Up Your Website Using Cloudflare Cache
Pillser implemented Cloudflare caching to improve website performance and search rankings, reducing response times significantly. Challenges with stale content management and automated cache purging were also addressed.
- Several commenters question the statistical validity of the performance claims, suggesting more extensive sampling is needed.
- There is a discussion about alternative data structures, such as perfect hash tables and Bloom filters, which might offer better performance.
- Some users express concerns about the complexity and potential inefficiencies of the current header management approach.
- Commenters highlight the importance of memory access speed over comparison operations in performance optimization.
- There is interest in whether Rust's capabilities can optimize the trie implementation further, particularly through vectorization.
- An entire separate dictionary or other data structure.
- One single header containing all internal metadata.
- All headers have a prefix, and the internal ones start with I and the external ones start with E.
- All internal headers start with “CFInt”.
I would not have come up with a scheme in which headers in a particular list are internal. (What if someone else uses this name? What if something forgets to sanitize? What if different simultaneously running programs disagree in the list? What if the Connection header names a Cloudflare-internal header? What if the set-difference algorithm is annoyingly slow?)
The web is already full of obnoxiously ambiguous in-band signaling and header naming, and I find it bizarre that a company with Cloudflare’s scale uses such a tedious and error-prone mechanism internally.
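For contrast with the list-based scheme being criticized above, the prefix conventions in the last two bullets reduce the internal/external decision to a single string comparison with no lookup structure to maintain. A hypothetical sketch (the x-cfint- prefix, function name, and Vec-of-pairs representation are made up for illustration):

// Hypothetical prefix marking internal headers: anything carrying it is
// dropped before the request leaves the edge, with no name list to keep in sync.
const INTERNAL_PREFIX: &str = "x-cfint-";

fn clear_internal_headers(headers: &mut Vec<(String, String)>) {
    headers.retain(|(name, _)| !name.to_ascii_lowercase().starts_with(INTERNAL_PREFIX));
}

fn main() {
    let mut headers = vec![
        ("host".to_string(), "example.com".to_string()),
        ("x-cfint-trace-id".to_string(), "abc123".to_string()),
    ];
    clear_internal_headers(&mut headers);
    assert_eq!(headers.len(), 1);
}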
One thing the author leaves out is that this technique is technically a Bloom filter. I find these kinds of things fascinating since they came about in an era when computing was far more resource-bound than today (1970, in this case). But here we are, still finding real-world corners for the same old optimizations.
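For readers unfamiliar with the structure the commenter names (whether trie-hard's masking strictly qualifies is the commenter's characterization): a Bloom filter is a fixed-size bit array probed by k hash functions, so membership tests can return false positives but never false negatives. A generic sketch, unrelated to the post's code; the sizes and hash construction here are arbitrary choices for illustration.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A tiny Bloom filter: `bits` is the bit array, `k` the number of hash probes.
struct Bloom {
    bits: Vec<u64>,
    num_bits: usize,
    k: u64,
}

impl Bloom {
    fn new(num_bits: usize, k: u64) -> Self {
        Bloom { bits: vec![0; (num_bits + 63) / 64], num_bits, k }
    }

    // Derive the i-th probe position by hashing (item, i).
    fn probe<T: Hash>(&self, item: &T, i: u64) -> usize {
        let mut h = DefaultHasher::new();
        (item, i).hash(&mut h);
        (h.finish() as usize) % self.num_bits
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        for i in 0..self.k {
            let p = self.probe(item, i);
            self.bits[p / 64] |= 1u64 << (p % 64);
        }
    }

    // May return true for items never inserted (false positive); never a false negative.
    fn maybe_contains<T: Hash>(&self, item: &T) -> bool {
        (0..self.k).all(|i| {
            let p = self.probe(item, i);
            self.bits[p / 64] & (1u64 << (p % 64)) != 0
        })
    }
}

fn main() {
    let mut filter = Bloom::new(1024, 3);
    filter.insert(&"x-cf-int-trace-id");
    assert!(filter.maybe_contains(&"x-cf-int-trace-id"));
    // "host" is almost certainly reported absent, but a false positive is possible.
    println!("host present? {}", filter.maybe_contains(&"host"));
}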
A very simple analysis in R:
> prop.test(c(9, 4), c(1103,1171))
2-sample test for equality of proportions with continuity correction
data: c(9, 4) out of c(1103, 1171)
X-squared = 1.4915, df = 1, p-value = 0.222
alternative hypothesis: two.sided
95 percent confidence interval:
-0.002409831 0.011897193
sample estimates:
prop 1 prop 2
0.008159565 0.003415884
A p-value of 0.22 isn't below the magic 0.05, and the 95% confidence interval suggests the trie might actually be slightly worse. I imagine the trie is better, given the prior analysis, and there is weak evidence for this. But take (a lot) more samples and you can know with confidence how much better.
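For reference, the X-squared figure above is just the square of the continuity-corrected two-proportion z statistic; redoing the arithmetic by hand as a check (values rounded):

$$\hat p_1 = \tfrac{9}{1103} \approx 0.00816, \qquad \hat p_2 = \tfrac{4}{1171} \approx 0.00342, \qquad \hat p = \tfrac{9+4}{1103+1171} \approx 0.00572$$

$$z = \frac{\lvert \hat p_1 - \hat p_2 \rvert - \tfrac{1}{2}\left(\tfrac{1}{1103} + \tfrac{1}{1171}\right)}{\sqrt{\hat p \,(1 - \hat p)\left(\tfrac{1}{1103} + \tfrac{1}{1171}\right)}} \approx \frac{0.00474 - 0.00088}{0.00316} \approx 1.22, \qquad z^2 \approx 1.49$$

which matches the X-squared = 1.4915 reported by prop.test.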
I'd probably go with a custom perfect hash function. And Phil Bagwell's popcount trick. That would be faster than everyone else's solution that involves multiple lookups in memory.
The CPU is fast, memory is slow.
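The "popcount trick" the commenter mentions is the indexing scheme from Bagwell's hash array mapped tries: each node keeps a bitmap of which children exist plus a dense array of only those children, and a child's slot is the popcount of the bitmap bits below its bit, so a lookup touches one small node instead of a sparse 256-entry table. A minimal sketch of that indexing (not trie-hard's actual layout, just an illustration of the trick; the alphabet is restricted to lowercase ASCII letters for brevity):

// One node of a bitmap-compressed trie over 'a'..='z'. `bitmap` has bit i set
// when the child for letter i exists; `children` stores only the existing
// children, densely, in bit order.
struct Node {
    bitmap: u32,
    children: Vec<Node>,
    is_terminal: bool,
}

impl Node {
    // Find the child for `byte` with one bit test and one popcount,
    // no 256-entry pointer table and no hashing.
    fn child(&self, byte: u8) -> Option<&Node> {
        if !(b'a'..=b'z').contains(&byte) {
            return None;
        }
        let bit = 1u32 << (byte - b'a');
        if self.bitmap & bit == 0 {
            return None; // no such child
        }
        // Index = number of set bits strictly below `bit`.
        let index = (self.bitmap & (bit - 1)).count_ones() as usize;
        Some(&self.children[index])
    }
}

fn main() {
    // A node whose only children are for 'a' and 'n', stored densely as
    // children[0] and children[1].
    let node = Node {
        bitmap: (1 << 0) | (1 << (b'n' - b'a')),
        children: vec![
            Node { bitmap: 0, children: vec![], is_terminal: true }, // child for 'a'
            Node { bitmap: 0, children: vec![], is_terminal: true }, // child for 'n'
        ],
        is_terminal: false,
    };
    assert!(node.child(b'a').map_or(false, |c| c.is_terminal));
    assert!(node.child(b'n').is_some());
    assert!(node.child(b'z').is_none());
}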
It appears the "remove_header" call is this one: https://docs.rs/pingora-http/0.3.0/src/pingora_http/lib.rs.h... which calls .remove() on two other data structures, both of which bottom out in this mountain of code: https://docs.rs/http/latest/src/http/header/map.rs.html#1550
1) Is this worthwhile? Looks like ~500 CPU cores were saved (are these real cores, or does this include hyperthread cores?). I don't know cloudflare's costs, but this seems like single digit servers and probably savings only in the $XX,XXX range. Not nothing, but do you expect a positive ROI on engineering?
2) If you do want to go to this detail, did you consider in putting the filter at the deserialization step, preventing the headers from being created in the first place?
Origin: The point at which something comes into existence or from which it derives or is derived.
How can the request's destination server be the origin, if it is the destination server?
What about the other integer sizes that the trie supports?
Does anybody know why?
Then you wouldn't need to maintain a list of what is internal and you wouldn't have to do any trie lookups.
I have no experience with Rust, so this is a naive question: why not use an array of strings to store these headers?
The “trie-hard” crate is a solution to the layers of cruft at CF
...and amazingly, it's still up from 2010: https://lorgonblog.wordpress.com/2010/02/21/dawg-gone/