October 8th, 2024

Improving Parquet Dedupe on Hugging Face Hub

Hugging Face is optimizing its Hub's Parquet file storage for better deduplication, addressing challenges with modifications and deletions, and considering collaboration with Apache Arrow for further improvements.


Hugging Face is improving the efficiency of its Hub's storage architecture, focusing on Parquet files, which currently account for over 2.2PB of the nearly 11PB it stores. Deduplication is critical because users frequently update datasets, and the goal is to store all versions compactly without requiring a full upload for each revision. The default storage algorithm uses byte-level Content-Defined Chunking (CDC), which handles insertions and deletions well but runs into problems with the Parquet layout: experiments with a 2GB Parquet file showed that appending new rows achieved 99.1% deduplication, modifications achieved only 89% because column headers are rewritten, and deleting rows produced significant amounts of new data because row groups are reorganized.

To improve deduplication, the team is considering using relative offsets instead of absolute offsets in the file structures and applying CDC at the row level; these changes could improve deduplication while maintaining compression. The team is also interested in collaborating with the Apache Arrow project to implement these ideas in the Parquet/Arrow code base and is exploring deduplication for other file types.
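
The post itself doesn't include code, but the byte-level CDC approach described above is easy to sketch: a rolling-style hash over the byte stream decides where chunks end, so any region whose bytes are unchanged between two versions produces the same chunks and deduplicates. The toy chunker below (the hash, mask, and size limits are arbitrary choices, not the Hub's actual parameters) also shows how a dedupe ratio like the 99.1% / 89% figures above could be estimated:

```python
import hashlib

# Toy content-defined chunking: cut a chunk whenever the low bits of a simple
# shift-and-add hash are all zero. Real chunkers (FastCDC, buzhash, ...) use
# tuned gear tables and normalized chunking; these constants are illustrative.
MASK = (1 << 16) - 1                    # boundary roughly every 64 KiB on average
MIN_CHUNK, MAX_CHUNK = 4 * 1024, 1024 * 1024

def cdc_chunks(data: bytes):
    """Yield (offset, length) pairs of content-defined chunks of `data`."""
    h, start = 0, 0
    for i, byte in enumerate(data):
        # The left shift means only recent bytes still influence the low 16
        # bits, so chunk boundaries depend on local content, not position.
        h = ((h << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, length
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data) - start

def dedupe_ratio(old: bytes, new: bytes) -> float:
    """Fraction of `new`'s bytes whose chunks already exist in `old`."""
    seen = {hashlib.sha256(old[o:o + n]).digest() for o, n in cdc_chunks(old)}
    reused = sum(n for o, n in cdc_chunks(new)
                 if hashlib.sha256(new[o:o + n]).digest() in seen)
    return reused / len(new)

# e.g. dedupe_ratio(open("v1.parquet", "rb").read(), open("v2.parquet", "rb").read())
```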

- Hugging Face is optimizing Parquet file storage to improve data deduplication.

- Current deduplication methods face challenges with modifications and deletions in Parquet files.

- Experiments show high deduplication rates for appending data but lower rates for modifications and deletions.

- Proposed improvements include using relative offsets and applying CDC at the row level (a rough sketch of the row-level idea follows this list).

- Collaboration with the Apache Arrow project is being considered for further enhancements.
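
To make the row-level CDC proposal concrete, here is a rough sketch with pyarrow that cuts Parquet row groups at content-defined boundaries (hashing each row and cutting when the hash crosses a threshold) instead of every N rows. The hash, mask, and minimum group size are invented for illustration; this is not the format change the post actually proposes, only a way to picture why content-defined boundaries keep edits local:

```python
import hashlib
import pyarrow as pa
import pyarrow.parquet as pq

def row_fingerprint(table: pa.Table, i: int) -> bytes:
    """Deterministic serialization of row i (sketch only; per-row Python access
    is slow and a real implementation would hash columns vectorized)."""
    return "|".join(str(col[i].as_py()) for col in table.columns).encode()

def content_defined_row_groups(table: pa.Table,
                               mask: int = (1 << 12) - 1,
                               min_rows: int = 1024):
    """Yield slices of `table` whose boundaries depend on row *content*, so
    inserting or deleting a few rows only disturbs nearby row groups instead
    of shifting every group after the edit."""
    start = 0
    for i in range(table.num_rows):
        digest = hashlib.sha1(row_fingerprint(table, i)).digest()
        is_boundary = (int.from_bytes(digest[:4], "big") & mask) == 0
        if is_boundary and (i - start + 1) >= min_rows:
            yield table.slice(start, i - start + 1)
            start = i + 1
    if start < table.num_rows:
        yield table.slice(start)

def write_with_content_defined_groups(table: pa.Table, path: str) -> None:
    with pq.ParquetWriter(path, table.schema) as writer:
        for group in content_defined_row_groups(table):
            writer.write_table(group)  # each call flushes its own row group(s)
```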

7 comments
By @ignoreusernames - 7 months
> Most Parquet files are bulk exports from various data analysis pipelines or databases, often appearing as full snapshots rather than incremental updates

I'm not really familiar with how they manage datasets, but all of the table formats (iceberg, delta and hudi) support appending and some form of "merge-on-read" deletes that could help with this use case. Instead of fully replacing datasets on each dump, more granular operations could be done. The issue is that this requires changing pipelines and some extra knowledge about the datasets themselves. A fun idea might involve taking a table format like iceberg and, instead of using parquet to store the data, just storing the column data with the metadata defined externally somewhere else. On each new snapshot, a set of transformations (sorting, splitting blocks, etc.) could be applied that minimizes the potential byte diff against the previous snapshot.
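
For reference, the incremental-update workflow described in this comment is roughly what the existing table formats already expose. A minimal sketch with the deltalake (delta-rs) Python bindings, using a made-up local table path and schema, might look like the following; how a delete is physically applied on disk (full file rewrite versus deletion metadata) depends on the format and engine version:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "./events_delta"  # hypothetical local table path

# Initial snapshot (only needed once).
write_deltalake(path, pd.DataFrame({"id": [1, 2], "label": ["cat", "dog"]}))

# Later dumps append only the new rows instead of replacing the whole dataset.
write_deltalake(path, pd.DataFrame({"id": [3], "label": ["bird"]}), mode="append")

# Row-level deletes are expressed as a predicate rather than a full re-export.
DeltaTable(path).delete("id = 2")
```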

By @YetAnotherNick - 7 months
I just don't understand how these guys can literally give terabytes of free storage and free data transfer to everyone. I did some rough cost calculations for my own storage and transfers, and if they used something like S3 it would have cost them thousands of dollars. And I don't pay them anything.
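
As a very rough back-of-envelope check (the prices below are approximate S3 list prices, which vary by region, tier, and discounts, and the usage numbers are made up):

```python
# Rough back-of-envelope estimate, not actual Hugging Face costs.
storage_per_gb_month = 0.023   # USD/GB-month, approximate S3 Standard list price
egress_per_gb = 0.09           # USD/GB, approximate internet egress list price

tb = 1024                      # GB per TB
stored_tb, downloaded_tb = 5, 10   # hypothetical per-user usage
monthly = stored_tb * tb * storage_per_gb_month + downloaded_tb * tb * egress_per_gb
print(f"~${monthly:,.0f} / month")  # ~$1,039 / month for this made-up usage
```
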
By @kwillets - 7 months
One additional thought regarding query performance is that content-defined row groups allow localized joins and aggregations which are much faster than the globally-shuffled kind.

If the sharding key matches (or is a subset of) a join or group-by key, then identical values are local to a single shard, which can be processed independently.

This type of thing is typically done at large granularity (e.g. one shard per MPP compute node), but there are also benefits down to the core or thread level.

Another tip is that if no shard key is defined, hash the whole row as a default.
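
A tiny sketch of the locality argument: when both sides of a join are sharded by a hash of the join key, equal keys always land in the same shard, so each shard can be joined on its own with no global shuffle. The shard count and sample rows below are invented for illustration:

```python
from collections import defaultdict

NUM_SHARDS = 4

def shard_by_key(rows, key):
    """Assign each row to a shard based on a hash of its join/group-by key."""
    shards = defaultdict(list)
    for row in rows:
        shards[hash(row[key]) % NUM_SHARDS].append(row)
    return shards

def local_join(left_rows, right_rows, key):
    """Join shard-by-shard: identical keys always hash to the same shard,
    so no data has to move between shards."""
    left_shards = shard_by_key(left_rows, key)
    right_shards = shard_by_key(right_rows, key)
    for shard_id, left in left_shards.items():
        index = defaultdict(list)
        for r in right_shards.get(shard_id, []):
            index[r[key]].append(r)
        for l in left:
            for r in index[l[key]]:
                yield {**l, **r}

# With no natural shard key, a whole-row hash such as
# hash(tuple(sorted(row.items()))) could serve as a default, as the comment suggests.
users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]
events = [{"uid": 1, "event": "x"}, {"uid": 1, "event": "y"}]
print(list(local_join(users, events, "uid")))
```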

By @kwillets - 7 months
I'm surprised that Parquet didn't maintain the Arrow practice of using mmap-able relative offsets for everything, although those could be considered relative to the beginning of the file.
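
A small illustration of why absolute offsets interact badly with chunk-level dedupe: if the metadata records positions from the start of the file, growing or shrinking an early block shifts every later offset and rewrites those metadata bytes, whereas offsets measured from each block's own start stay byte-identical. The toy "footer" below is invented for the example and is not Parquet's real metadata layout:

```python
def page_index(block_sizes, page_size=500, relative=True):
    """Toy footer: record the offset of each page in each block, either
    relative to the block's own start or absolute from the start of the file."""
    index, block_start = [], 0
    for size in block_sizes:
        pages = [off if relative else block_start + off
                 for off in range(0, size, page_size)]
        index.append(pages)
        block_start += size
    return index

v1 = [1000, 2000]   # two blocks
v2 = [1500, 2000]   # only block 0 grew

print(page_index(v1, relative=False))  # [[0, 500], [1000, 1500, 2000, 2500]]
print(page_index(v2, relative=False))  # [[0, 500, 1000], [1500, 2000, 2500, 3000]]
print(page_index(v1, relative=True))   # [[0, 500], [0, 500, 1000, 1500]]
print(page_index(v2, relative=True))   # [[0, 500, 1000], [0, 500, 1000, 1500]]
# With relative offsets the entries for the untouched block are identical,
# so those metadata bytes can dedupe against the previous version.
```
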
By @jmakov - 7 months
Wouldn't it be easier to extend delta-rs to support deduplication?
By @skadamat - 7 months
Love this post and the visuals! Great work
By @kwillets - 7 months
How does this compare to rsync/rdiff?