June 24th, 2024

Compressing graphs and indexes with recursive graph bisection (2016)

Graph reordering is applied to boost compression in graphs and indexes. A new compression-friendly technique using recursive graph bisection shows improved rates for large-scale graphs, enhancing storage and processing efficiency.

Read original articleLink Icon
Compressing graphs and indexes with recursive graph bisection (2016)

The paper discusses the application of graph reordering to enhance graph and index compression. By extending a theoretical model, the authors propose a compression-friendly reordering technique for social networks, web graphs, and document identifiers in inverted indexes. They introduce a novel algorithm based on recursive graph bisection, demonstrating improved compression rates over existing methods. The new approach is straightforward, enabling efficient parallel and distributed implementations on large-scale graphs. Experimental results show significant enhancements in compression rates for graphs with billions of vertices and hundreds of billions of edges. The study falls under the subjects of Data Structures and Algorithms as well as Social and Information Networks. The work aims to optimize data representation for better storage and processing efficiency in various applications.

Related

Lessons Learned from Scaling to Multi-Terabyte Datasets

Lessons Learned from Scaling to Multi-Terabyte Datasets

Insights on scaling to multi-terabyte datasets, emphasizing algorithm evaluation before scaling. Tools like Joblib and GNU Parallel for single machine scaling, transitioning to multiple machines, and comparing performance/cost implications. Recommendations for parallel workloads and analytical tasks using AWS Batch, Dask, and Spark. Considerations for tool selection based on team size and workload.

Group Actions and Hashing Unordered Multisets

Group Actions and Hashing Unordered Multisets

Group actions are used to analyze hash functions for unordered sets and multisets, ensuring order-agnostic hashing. By leveraging group theory, particularly abelian groups, hash functions' structure is explored, emphasizing efficient and order-independent hashing techniques.

B-Trees Require Fewer Comparisons Than Balanced Binary Search Trees

B-Trees Require Fewer Comparisons Than Balanced Binary Search Trees

B-trees require fewer comparisons than balanced binary search trees due to better access locality. Larger k values in B-trees lead to tighter bounds, outperforming AVL trees for k>=8, offering practical advantages.

Resilient Sync for Local First

Resilient Sync for Local First

The "Local-First" concept emphasizes empowering users with data on their devices, using Resilient Sync for offline and online data exchange. It ensures consistency, security, and efficient synchronization, distinguishing content changes and optimizing processes. The method offers flexibility, conflict-free updates, and compliance documentation, with potential enhancements for data size, compression, and security.

Surfing the (Human-Made) Internet

Surfing the (Human-Made) Internet

The internet's evolution prompts a return to its human side, advocating for personal sites, niche content, and self-hosted platforms. Strategies include exploring blogrolls, creating link directories, and using alternative search engines. Embrace decentralized social media and RSS feeds for enriched online experiences.

Link Icon 4 comments
By @ot - 5 months
I'm one of the authors of the paper, nice to see it on HN.

I remember when we first experimented with this, the compression improvement compared to our previous heuristic was massive, but the algorithm took a day to run on a double-digit-node Giraph cluster for a single index shard. I was very skeptical we'd ever be able to use it in production, given that we had to run it on thousands of index shards every few days.

Eventually we reimplemented it in C++, optimized all the data structures to make them fit in memory, and we were able to run it in a couple of hours on a single (beefy) machine. Over the years it has been optimized further.

The algorithm has been reproduced externally with an open-source implementation [1], which AFAIR was pretty good when I looked at it.

[1] https://culpepper.io/publications/mm+19-ecir.pdf

By @CrimsonCape - 5 months
I wonder if there's any development of this, since this paper is from 2016.

This is an example of an NP-hard problem that surprisingly broadly applies to everyday situations, just like the bin-packing problem that has been on the front page recently. Do you know of other NP hard problems with surprising applicability to everyday situations?

By @aurumque - 5 months
This is such a nice and straightforward result. Encode the vertex ID's using locality-preserving integers, and then store the deltas between the ID's instead of the ID's themselves.