September 10th, 2024

A good day to trie-hard: saving compute 1% at a time

Cloudflare launched the open-source Rust crate "trie-hard" to optimize CPU usage in HTTP request processing, reducing header clearing runtime to 0.93µs and achieving a 1.28% CPU utilization reduction.

Cloudflare has introduced a new open-source Rust crate named "trie-hard" aimed at optimizing CPU utilization in its services, particularly in the pingora-origin service responsible for processing HTTP requests. The motivation for this work stems from the need to handle over 35 million requests per second while minimizing CPU time: the original function for clearing internal headers consumed 1.7% of the total CPU time, prompting the team to explore optimization strategies.

Initial improvements came from changing the method of header removal, which reduced the runtime from 3.65µs to 1.53µs. Further gains were sought through more efficient data structures. The trie was identified as a good fit, leading to the creation of trie-hard, which compacts memory usage and speeds up the header-clearing process. Benchmarking showed that trie-hard reduced the average runtime to 0.93µs, for a total CPU utilization reduction of 1.28%.

The implementation has been running in production since July 2024, with performance metrics confirming the expected improvements. The key takeaway emphasizes the importance of understanding code performance and using profiling tools for effective optimization.

- Cloudflare's new Rust crate "trie-hard" optimizes CPU usage in processing HTTP requests.

- The original function for clearing headers consumed 1.7% of CPU time, prompting optimization efforts.

- The trie data structure was implemented to enhance performance, reducing runtime to 0.93µs.

- The optimization achieved a total CPU utilization reduction of 1.28%.

- Performance metrics from production validate the improvements made through the new implementation.
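The approach described above can be sketched in miniature: a trie over header-name bytes lets a miss bail out at the first unmatched byte instead of comparing against every internal name. This is a simplified illustration, not trie-hard's actual bitmasked layout, and the header names are invented:

```rust
use std::collections::HashMap;

// Minimal trie over header-name bytes (a sketch, not trie-hard's packed
// representation): each node maps a byte to a child, and `terminal`
// marks the end of a stored header name.
#[derive(Default)]
struct Trie {
    children: HashMap<u8, Trie>,
    terminal: bool,
}

impl Trie {
    fn insert(&mut self, key: &str) {
        let mut node = self;
        for b in key.bytes() {
            node = node.children.entry(b).or_default();
        }
        node.terminal = true;
    }

    // A miss can exit as soon as a byte has no matching child,
    // without comparing against every stored name.
    fn contains(&self, key: &str) -> bool {
        let mut node = self;
        for b in key.bytes() {
            match node.children.get(&b) {
                Some(child) => node = child,
                None => return false,
            }
        }
        node.terminal
    }
}

fn main() {
    let mut internal = Trie::default();
    // Hypothetical internal header names, for illustration only.
    for name in ["cf-internal-id", "cf-internal-route"] {
        internal.insert(name);
    }
    // Filtering a request's headers down to the external ones:
    let headers = vec!["host", "cf-internal-id", "accept"];
    let kept: Vec<_> = headers.into_iter().filter(|h| !internal.contains(h)).collect();
    println!("{:?}", kept); // ["host", "accept"]
}
```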

AI: What people are saying
The comments on Cloudflare's "trie-hard" crate reveal a mix of skepticism and curiosity regarding its implementation and performance benefits.
  • Several commenters question the statistical validity of the performance claims, suggesting more extensive sampling is needed.
  • There is a discussion about alternative data structures, such as perfect hash tables and Bloom filters, which might offer better performance.
  • Some users express concerns about the complexity and potential inefficiencies of the current header management approach.
  • Commenters highlight the importance of memory access speed over comparison operations in performance optimization.
  • There is interest in whether Rust's capabilities can optimize the trie implementation further, particularly through vectorization.
31 comments
By @amluto - 2 months
If you had asked me to make a wild guess as to how Cloudflare stores internal headers and then removes them, I would have come up with some options:

- An entire separate dictionary or other data structure.

- One single header containing all internal metadata.

- All headers have a prefix, and the internal ones start with I and the external ones start with E.

- All internal headers start with “CFInt”.

I would not have come up with a scheme in which headers in a particular list are internal. (What if someone else uses this name? What if something forgets to sanitize? What if different simultaneously running programs disagree on the list? What if the Connection header names a Cloudflare-internal header? What if the set-difference algorithm is annoyingly slow?)

The web is already full of obnoxiously ambiguous in-band signaling and header naming, and I find it bizarre that a company with Cloudflare’s scale uses such a tedious and error-prone mechanism internally.

By @pragma_x - 2 months
It took me a moment to ponder the effectiveness of mapping UTF-8 characters into a bitmask. At first, it seemed unwise. Then I realized that 32 bits would get you `a-z` plus six special characters, and 64 bits would get you `A-Z` (uppercase) plus six more. That's plenty of room for HTTP headers, and for a blazing fast matching algorithm, since it just masks and compares a couple of integers. Even figuring out which bit corresponds to which character is a simple lookup into a 256-word table.
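The masking trick described above can be sketched roughly like this; the alphabet and bit assignment here are illustrative, not the crate's actual encoding:

```rust
// Map each byte legal in a header name to one bit of a u64, so asking
// "is this byte among a node's children?" becomes a single AND. The
// alphabet chosen here (a-z, 0-9, '-', '_') is illustrative.
const ALPHABET: &[u8] = b"abcdefghijklmnopqrstuvwxyz0123456789-_";

fn build_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for (i, &b) in ALPHABET.iter().enumerate() {
        table[b as usize] = 1u64 << i;
    }
    table
}

// Does `mask` (the set of bytes a node accepts) contain byte `b`?
// Bytes outside the alphabet map to 0 and can never match.
fn mask_contains(table: &[u64; 256], mask: u64, b: u8) -> bool {
    mask & table[b as usize] != 0
}

fn main() {
    let table = build_table();
    // A node that accepts only 'a' or 'd' as the next byte:
    let mask = table[b'a' as usize] | table[b'd' as usize];
    assert!(mask_contains(&table, mask, b'a'));
    assert!(!mask_contains(&table, mask, b'z'));
    println!("ok");
}
```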

One thing the author leaves out is that this technique is technically a Bloom filter. I find these kinds of things fascinating, since they came about in an era when computing was far more resource-bound than today (1970 in this case). But here we are, still finding real-world corners to use the same old optimizations.

https://en.wikipedia.org/wiki/Bloom_filter

By @gregsexton - 2 months
The presented data for demonstrating a win doesn't have enough power to actually show this - not enough samples were taken.

A very simple analysis in R:

    > prop.test(c(9, 4), c(1103,1171))

    2-sample test for equality of proportions with continuity correction

    data:  c(9, 4) out of c(1103, 1171)
    X-squared = 1.4915, df = 1, p-value = 0.222
    alternative hypothesis: two.sided
    95 percent confidence interval:
     -0.002409831  0.011897193
    sample estimates:
         prop 1      prop 2
    0.008159565 0.003415884

A p-value of 0.22 isn't below the magic 0.05, and the 95% confidence interval suggests that the trie might actually be slightly worse.

I imagine the trie is better, given the prior analysis, and there is weak evidence for this. But take (a lot) more samples and know with confidence how much better.

By @Validark - 2 months
It's silly to use Big-O to describe the number of comparisons when the goal is to analyze performance. Comparisons cost 1 cycle, and you can do many of them per cycle with instruction level parallelism and SIMD. The main bottleneck, the actual source of slowness, is memory. It costs thousands of cycles to go to memory, sometimes tens of thousands or hundreds of thousands (if you need a TLB walk or OS interrupt). If you want to use Big-O, use it to estimate the number of cache misses.

I'd probably go with a custom perfect hash function. And Phil Bagwell's popcount trick. That would be faster than everyone else's solution that involves multiple lookups in memory.

The CPU is fast, memory is slow.
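Bagwell's popcount trick mentioned above, as used in hash array mapped tries, can be sketched like so: a bitmap records which of a node's child slots are occupied, and counting the set bits below a slot gives its index in a dense array. The node layout here is a generic illustration, not a proposal for trie-hard's internals:

```rust
// A node keeps a bitmap of which of 64 possible child slots are occupied,
// plus a dense Vec holding only the occupied children in slot order.
struct SparseNode<T> {
    bitmap: u64,
    children: Vec<T>,
}

impl<T> SparseNode<T> {
    fn get(&self, slot: u32) -> Option<&T> {
        let bit = 1u64 << slot;
        if self.bitmap & bit == 0 {
            return None; // slot is empty: one AND, no memory access
        }
        // count_ones over the lower bits gives the dense index; on most
        // targets this compiles down to a single popcnt instruction.
        let idx = (self.bitmap & (bit - 1)).count_ones() as usize;
        self.children.get(idx)
    }
}

fn main() {
    // Occupied slots 3, 10, and 40, stored densely as three entries.
    let node = SparseNode {
        bitmap: (1u64 << 3) | (1 << 10) | (1 << 40),
        children: vec!["c3", "c10", "c40"],
    };
    assert_eq!(node.get(10), Some(&"c10"));
    assert_eq!(node.get(4), None);
    println!("ok");
}
```

The payoff is that sparse nodes stay small (better cache behavior) while lookups stay O(1) per node.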

By @mightyham - 2 months
I'm not very well versed in data structure optimization, but I was surprised they dismissed hash tables so quickly, especially when the table they are searching through is static. I find it somewhat hard to believe that a specially optimized hash table would not be faster than their trie implementation.
By @evmar - 2 months
This post uses a fancy data structure to construct a to_remove set, then iterates through it to remove these from the underlying header map.

It appears the "remove_header" call is this one: https://docs.rs/pingora-http/0.3.0/src/pingora_http/lib.rs.h... which calls .remove() on two other data structures, both of which bottom out in this mountain of code: https://docs.rs/http/latest/src/http/header/map.rs.html#1550

By @miso2024 - 2 months
Finally a blog post with a trie. Those trie leetcodes didn't go in vain ;)
By @aidenn0 - 2 months
Given that the set of items to match is static, I wonder if they tried a perfect hash table; that would reduce it to a few arithmetic operations followed by a single string compare. It would be interesting to see how it compares to the trie.
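A rough sketch of that idea: at build time, search for a multiplier that maps the static key set to distinct slots, after which a lookup is one hash plus a single string compare. The header names and hash function here are made up for illustration:

```rust
// Toy string hash parameterized by a multiplier.
fn hash(key: &str, mult: u64, buckets: usize) -> usize {
    let mut h: u64 = 0;
    for b in key.bytes() {
        h = h.wrapping_mul(mult).wrapping_add(b as u64);
    }
    (h % buckets as u64) as usize
}

// Search for a multiplier that places every key in a distinct slot.
fn build(keys: &[&'static str], buckets: usize) -> Option<(u64, Vec<Option<&'static str>>)> {
    'outer: for mult in 2..10_000u64 {
        let mut table = vec![None; buckets];
        for &k in keys {
            let slot = hash(k, mult, buckets);
            if table[slot].is_some() {
                continue 'outer; // collision: try the next multiplier
            }
            table[slot] = Some(k);
        }
        return Some((mult, table));
    }
    None
}

// Lookup: one hash, one string compare.
fn contains(table: &[Option<&str>], mult: u64, key: &str) -> bool {
    table[hash(key, mult, table.len())] == Some(key)
}

fn main() {
    let keys = ["cf-internal-a", "cf-internal-b", "cf-ray-id"]; // hypothetical names
    let (mult, table) = build(&keys, 8).expect("no collision-free multiplier found");
    assert!(contains(&table, mult, "cf-ray-id"));
    assert!(!contains(&table, mult, "host"));
    println!("ok");
}
```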
By @Scaevolus - 2 months
Did you try a (tiny) bloom filter? Doing a quick convolution of the header key and a test against a bloom filter (SIMD? Using builtin CRC ops? 256-bit bloom filter size?) might avoid walking the trie altogether in most cases at the cost of a few cycles.
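A sketch of that pre-filter, assuming a 256-bit filter and two simple multiplicative hashes (not the CRC instructions the comment floats): a negative answer means "definitely not internal" and skips the trie walk, while a positive answer still needs the exact lookup, because Bloom filters can report false positives.

```rust
// A tiny 256-bit Bloom filter over the internal header names.
struct Bloom {
    bits: [u64; 4], // 256 bits total
}

impl Bloom {
    fn new() -> Self {
        Bloom { bits: [0; 4] }
    }

    // Two cheap FNV/murmur-style mixes with different constants; real code
    // might use hardware CRC ops instead.
    fn positions(key: &str) -> [usize; 2] {
        let mut h1: u64 = 0xcbf29ce484222325;
        let mut h2: u64 = 0x9e3779b97f4a7c15;
        for b in key.bytes() {
            h1 = (h1 ^ b as u64).wrapping_mul(0x100000001b3);
            h2 = (h2 ^ b as u64).wrapping_mul(0xff51afd7ed558ccd);
        }
        [(h1 % 256) as usize, (h2 % 256) as usize]
    }

    fn insert(&mut self, key: &str) {
        for p in Self::positions(key) {
            self.bits[p / 64] |= 1u64 << (p % 64);
        }
    }

    // May return false positives, never false negatives.
    fn maybe_contains(&self, key: &str) -> bool {
        Self::positions(key)
            .iter()
            .all(|&p| self.bits[p / 64] & (1u64 << (p % 64)) != 0)
    }
}

fn main() {
    let mut bloom = Bloom::new();
    bloom.insert("cf-internal-id"); // hypothetical internal header
    assert!(bloom.maybe_contains("cf-internal-id")); // never a false negative
    // Most external names are rejected here without any trie walk.
    println!("host maybe internal: {}", bloom.maybe_contains("host"));
}
```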
By @unusualmonkey - 2 months
My 2c:

1) Is this worthwhile? Looks like ~500 CPU cores were saved (are these real cores, or does this include hyperthread cores?). I don't know cloudflare's costs, but this seems like single digit servers and probably savings only in the $XX,XXX range. Not nothing, but do you expect a positive ROI on engineering?

2) If you do want to go to this level of detail, did you consider putting the filter at the deserialization step, preventing the headers from being created in the first place?

By @hyperpape - 2 months
I'm interested to know why the regex crate didn't do better. I believe it should compile a search for multiple literal strings into an Aho-Corasick automaton, which is structured very much like a trie.
By @cwilby - 2 months
> Internally, we call the request’s destination server its “origin”

Origin: The point at which something comes into existence or from which it derives or is derived.

How can the request's destination server be the origin, if it is the destination server?

By @sophacles - 2 months
So this trie uses a custom u256 type that is a wrapper around an array of 4 u64s. Is Rust smart enough to vectorize it into AVX instructions for the bit twiddling?

What about for the other integer sizes that the trie supports?

By @maciek12 - 2 months
Am I missing something, or is the big-O notation in this article wrong? For example, in "Sorted sets like BTreeSet use comparisons for searching, and that makes them logarithmic over key length O(log(L)), but they are also logarithmic in size too" — how is the search done in logarithmic time? You could have a header that differs from an internal one only in the last character, and then you have to read the whole thing. Also, space-wise, B-trees are linear, not O(n log n). Additionally, at the end, when talking about the trie, how do you achieve O(log(L)) for misses? Tries are not balanced; they do not halve the possible set on each comparison (as I think the author states), and even if they did, I don't see how that achieves logarithmic time.
By @y04nn - 2 months
Why add and then remove HTTP headers on an existing request for routing, instead of adding a dedicated (custom) routing protocol above the HTTP stack? That would allow just dropping the added protocol on the outbound side, and it would be much more space efficient.
By @MehdiHK - 2 months
Slightly off topic: I was a bit surprised nobody has ported the space-efficient MARISA trie to Rust, or any other language.

Anybody know why?

https://github.com/s-yata/marisa-trie

By @IncreasePosts - 2 months
Off topic: how does 60M requests a second stack up with other big players like google, Facebook, TikTok?
By @alexchamberlain - 2 months
Jumping on the bandwagon of other ideas... I wonder if it would be quicker to filter out the internal headers when you write the request to the network (buffer)? I.e. something like `request_headers.filter(not_is_internal).for_each(...)`; that way you don't have to remove anything from the data structure at all, but it does require you to break down a layer of abstraction for performance benefits.
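That filter-at-serialization idea might look roughly like this; `is_internal`, the prefix it checks, and the header list are stand-ins for whatever the proxy actually uses:

```rust
// Hypothetical predicate: real code would consult the internal-header set.
fn is_internal(name: &str) -> bool {
    name.starts_with("cf-internal-")
}

// Never remove anything from the header map; just skip internal headers
// while writing the request out.
fn write_headers(headers: &[(&str, &str)], out: &mut String) {
    for (name, value) in headers.iter().filter(|(name, _)| !is_internal(name)) {
        out.push_str(name);
        out.push_str(": ");
        out.push_str(value);
        out.push_str("\r\n");
    }
}

fn main() {
    let headers = [
        ("host", "example.com"),
        ("cf-internal-route", "x"), // dropped at write time
        ("accept", "*/*"),
    ];
    let mut out = String::new();
    write_headers(&headers, &mut out);
    assert_eq!(out, "host: example.com\r\naccept: */*\r\n");
    println!("{out}");
}
```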
By @wolf550e - 2 months
I don't know Rust. Can someone explain how this data structure stores whether a node is itself a valid word or whether it only leads to more nodes? For example the node "do" in their (“and”, “ant”, “dad”, “do”, & “dot”) example. I guess it's a discriminated union provided by Rust algebraic data types or something like that, but why not show the full bit pattern in memory?
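In generic Rust (a sketch of the usual algebraic-data-type answer, not trie-hard's packed in-memory layout), a node typically carries an `Option` that is `Some` when the path to that node spells a complete word. "do" is then a node with a value *and* a child (`t` leading to "dot"):

```rust
use std::collections::BTreeMap;

// Each node ends a word iff `value` is Some; children lead onward.
struct Node {
    value: Option<&'static str>,
    children: BTreeMap<u8, Node>,
}

impl Node {
    fn new() -> Self {
        Node { value: None, children: BTreeMap::new() }
    }

    fn insert(&mut self, word: &'static str) {
        let mut node = self;
        for b in word.bytes() {
            node = node.children.entry(b).or_insert_with(Node::new);
        }
        node.value = Some(word);
    }

    fn lookup(&self, word: &str) -> Option<&'static str> {
        let mut node = self;
        for b in word.bytes() {
            node = node.children.get(&b)?; // miss: bail at the first gap
        }
        node.value
    }
}

fn main() {
    let mut trie = Node::new();
    for w in ["and", "ant", "dad", "do", "dot"] {
        trie.insert(w);
    }
    assert_eq!(trie.lookup("do"), Some("do"));   // a word that also has a child
    assert_eq!(trie.lookup("dot"), Some("dot"));
    assert_eq!(trie.lookup("da"), None);         // prefix only, not a word
    println!("ok");
}
```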
By @bhawks - 2 months
Alternatively at request ingress time you could take all original header names and save them, and at egress time you could use that list to drive which keys to emit?

Then you wouldn't need to maintain a list of what is internal and you wouldn't have to do any trie lookups.
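In miniature, that scheme might look like the following; the header names are invented, and a real proxy would also have to handle duplicates and hop-by-hop headers:

```rust
use std::collections::HashSet;

fn main() {
    // Ingress: remember exactly which header names the client sent.
    let original: HashSet<&str> = ["host", "accept"].into_iter().collect();

    // Internal processing adds headers freely...
    let headers = vec![
        ("host", "example.com"),
        ("x-made-up-internal", "1"),
        ("accept", "*/*"),
    ];

    // Egress: emit only names that were present originally — no
    // internal-header registry and no trie lookups.
    let out: Vec<_> = headers
        .into_iter()
        .filter(|(name, _)| original.contains(name))
        .collect();
    assert_eq!(out, vec![("host", "example.com"), ("accept", "*/*")]);
    println!("{:?}", out);
}
```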

By @kristianp - 2 months
I would have stopped at the first step, saving almost 1% of cpu time (0.993%). Creating a new fast trie data type for the other 0.287% seems inefficient. Surely there are other parts of the code that take more than 1% of the cpu time elsewhere to be looked at first?
By @o11c - 2 months
That late in the process, does it even need to support efficient lookup? Could it just be a Vec?
By @hencoappel - 2 months
Given it's such a hot part of their code, I'm genuinely surprised they didn't go down the state machine route for as optimal a solution, or even further and made use of vector instructions. Maybe Rust can autovectorise the code though.
By @Buckaroo9 - 2 months
Shouldn't this line: > "using regular expressions clocks in as taking about twice as long as the hashmap-based solution." be written as the opposite? That the RE takes half as long/is twice as fast?
By @alohashirt99 - 2 months
This was an interesting blog - a real world example of tuning Rust code with actual measurements.

I have no experience with Rust so this is a naive question- why not use an array of strings to store these headers?

By @gwittel - 2 months
Neat optimization! Would it have been feasible to spend a bit of cpu/memory and tag headers as internal upon the request construction? That way filtering on output is trivial.
By @xyst - 2 months
It’s a bit wild to me that one needs to add so many http headers internally that they need to get stripped off.

The “trie-hard” crate is a solution to the layers of cruft at CF.

By @mwexler - 2 months
I know, they aren't the first to use it, but I still appreciate the pun.
By @pianoben - 2 months
This solution reminds me of an old blog post, where two members of the F# team at Microsoft battled back and forth on making the best solver for some word game. Cloudflare's solution here looks a lot like the second-to-last solution; the final post involved collapsing the tree nodes into a 2D array of u32s for maximum cache-hits. It was a really neat solution!

...and amazingly, it's still up from 2010: https://lorgonblog.wordpress.com/2010/02/21/dawg-gone/

By @sixo - 2 months
this post makes me think I could work at cloudflare, and also that I shouldn't
By @mgaunard - 2 months
If it takes 3.5 us to remove a substring within a packet-sized string you're doing it wrong.