September 10th, 2024

When Bloom filters don't bloom (2020)

The author explored Bloom filters while deduplicating a massive set of IP addresses for a project on IP spoofing, but ran into performance problems caused by memory access times. A simpler hash table approach ultimately performed better on the large dataset.

The blog post recounts the author's exploration of Bloom filters while working on a project related to IP spoofing. Traditional methods for deduplicating a massive dataset of IP addresses proved inefficient, and Bloom filters, probabilistic data structures that test set membership while allowing some false positives, seemed like a suitable solution.

In practice, however, the filter did not fit into the CPU's cache, and profiling revealed that a significant share of CPU time went to memory fetches, making the filter slower than expected. Experimenting with fewer hash functions increased memory requirements significantly without improving speed.

The author ultimately switched to a simpler hash table, which delivered better performance and friendlier memory access patterns. The key takeaway: advanced data structures like Bloom filters may be far from optimal once the working set exceeds cache sizes, which highlights the importance of cache-optimized algorithms in modern computing.
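
To make the memory-fetch cost concrete, here is a minimal sketch of a classic Bloom filter lookup (an illustration, not the author's code), using double hashing to derive the k probe positions; when the bit array is far larger than the CPU caches, each probe is a likely cache miss:

    #include <stdbool.h>
    #include <stdint.h>

    /* Classic Bloom filter lookup (illustrative sketch, not the article's
     * code). Probe positions use double hashing: probe_i = h1 + i*h2 mod m. */
    struct bloom {
        uint64_t *bits;  /* bit array, m/64 words */
        uint64_t m;      /* number of bits, assumed a power of two */
        int k;           /* hash probes per key */
    };

    static bool bloom_maybe_contains(const struct bloom *bf,
                                     uint64_t h1, uint64_t h2)
    {
        for (int i = 0; i < bf->k; i++) {
            uint64_t bit = (h1 + (uint64_t)i * h2) & (bf->m - 1);
            /* Each probe can land anywhere in the array, so a filter far
             * bigger than L3 costs up to k cache misses per lookup. */
            if (!(bf->bits[bit >> 6] & (1ULL << (bit & 63))))
                return false;  /* definitely not seen before */
        }
        return true;  /* possibly seen (false positives allowed) */
    }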

- Bloom filters are efficient for membership testing but can struggle with large datasets that exceed CPU cache sizes.

- Memory access patterns significantly impact performance; random access is much slower than sequential access (see the microbenchmark sketch after this list).

- A simpler hash table approach can outperform more complex data structures in certain scenarios.

- Profiling tools are essential for identifying performance bottlenecks in data processing tasks.

- Optimizing for reduced memory accesses is often more beneficial than merely minimizing memory usage.
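
As a concrete illustration of the random-versus-sequential point above, here is a small self-contained microbenchmark sketch (the sizes, constants, and names are mine, not from the article). It sweeps a buffer far larger than any CPU cache, first in order, then in a scrambled order that defeats the hardware prefetcher:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    enum { N = 64 * 1024 * 1024 };  /* 2^26 x 8 bytes = 512 MB buffer */

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        uint64_t *buf = malloc((size_t)N * sizeof *buf);
        if (!buf)
            return 1;
        for (size_t i = 0; i < N; i++)
            buf[i] = i;

        /* Sequential sweep: the prefetcher hides almost all DRAM latency. */
        double t0 = now();
        uint64_t sum = 0;
        for (size_t i = 0; i < N; i++)
            sum += buf[i];
        double t1 = now();

        /* Scrambled sweep: a full-period LCG visits every index exactly
         * once in a pseudo-random order, so most loads are cache misses. */
        uint64_t idx = 1;
        for (size_t i = 0; i < N; i++) {
            idx = (idx * 6364136223846793005ULL + 1) & (N - 1);
            sum += buf[idx];
        }
        double t2 = now();

        printf("sequential %.2fs, random %.2fs (checksum %llu)\n",
               t1 - t0, t2 - t1, (unsigned long long)sum);
        free(buf);
        return 0;
    }

On typical hardware, built with cc -O2, the scrambled pass runs several times slower than the sequential one even though it touches exactly the same bytes.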

4 comments
By @the_gorilla - 3 months
I'm not really convinced by this article. His complaint is that bloom filters are too slow for his purpose, and the cause is the large memory requirements and random memory accesses causing constant cache misses. So, he has 40 million lines and 600 MB of data and it takes 12 seconds for his bloom filter to work. I'd have to dig into it a lot more but my instincts are telling me this is user error because these numbers don't really make sense even on older hardware.
By @chuckadams - 3 months
I wonder how the final version with the single hash table stacks up against:

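    # %x is a Perl hash counting lines; print a line only on first sight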
    cat file.txt | perl -ne 'print unless $x{$_}++'
By @hinkley - 3 months
> For example, source IPs belonging to a legitimate Italian ISP should not arrive in a Brazilian datacenter.

You sure about that?

If the entire network is healthy, yes; and going by way of two(?) transatlantic cables is probably not an efficient way to serve a page. NYC is probably closer to Italy by network topology than Brazil is.

But if CF had a data center in Italy and one in Germany, there are very valid reasons for Italian traffic to end up in Germany: a grid failure, flooding, or fire. So how far you can take this and still meet SLAs is a little tricky.

By @thomasmg - 3 months
The Bloom filter he uses has k=8, that is, 8 cache-line misses per entry (the k=8 is visible in the command-line output). A "blocked Bloom filter" requires only one cache-line miss per entry, at the cost of a slightly worse false positive rate.

With a good implementation, e.g. one using SIMD, it would be roughly 10 times faster. For example: https://github.com/FastFilter/fastfilter_cpp/blob/master/src... (there are others)
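
For illustration, here is a minimal sketch of the blocked idea (my own simplification, not the linked fastfilter_cpp code): one hash selects a single 64-byte block, and all k probe bits are derived inside that block, so a lookup touches at most one cache line:

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_WORDS 8  /* 8 x 64 bits = 512 bits = one 64-byte cache line */

    /* Blocked Bloom filter sketch: every key maps to one cache-line-sized
     * block, and all of its probe bits live inside that block. */
    struct blocked_bloom {
        uint64_t (*blocks)[BLOCK_WORDS];  /* nblocks blocks, 64 bytes each */
        uint64_t nblocks;                 /* assumed a power of two */
    };

    static bool bb_maybe_contains(const struct blocked_bloom *bf,
                                  uint64_t h, int k)
    {
        const uint64_t *blk = bf->blocks[h & (bf->nblocks - 1)];
        uint64_t x = h;
        for (int i = 0; i < k; i++) {
            x = x * 0xd1342543de82ef95ULL + 1;   /* cheap per-probe remix */
            unsigned pos = (unsigned)(x >> 55);  /* top 9 bits: 0..511 */
            if (!(blk[pos >> 6] & (1ULL << (pos & 63))))
                return false;  /* definitely absent */
        }
        return true;  /* possibly present; all probes hit one cache line */
    }

The trade-off is the slightly worse false positive rate mentioned above, since all of a key's bits cluster in one block.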