October 8th, 2024

Improving Parquet Dedupe on Hugging Face Hub

Hugging Face is optimizing its Hub's Parquet file storage for better deduplication, addressing challenges with modifications and deletions, and considering collaboration with Apache Arrow for further improvements.


Hugging Face is improving the efficiency of its Hub's storage architecture, focusing on Parquet files, which currently account for over 2.2PB of the nearly 11PB it stores. Deduplication is critical because users frequently update datasets, and the goal is to store all versions compactly without requiring a full upload for each revision. The default storage algorithm uses byte-level Content-Defined Chunking (CDC), which handles insertions and deletions well but runs into problems with the Parquet layout: experiments with a 2GB Parquet file showed that appending new rows achieved 99.1% deduplication, modifications achieved only 89% because column headers are rewritten, and deleting rows produced significant amounts of new data because row groups are reorganized.

To improve deduplication, the team is considering using relative offsets instead of absolute offsets in the file structures and applying CDC at the row level; these changes could improve deduplication while maintaining compression. The team is also interested in collaborating with the Apache Arrow project to implement these ideas in the Parquet/Arrow code base and is exploring deduplication for other file types.
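
The post itself doesn't include code, but the byte-level CDC approach described above is easy to sketch: a rolling-style hash over the byte stream decides where chunks end, so any region whose bytes are unchanged between two versions produces the same chunks and deduplicates. The toy chunker below (the hash, mask, and size limits are arbitrary choices, not the Hub's actual parameters) also shows how a dedupe ratio like the 99.1% / 89% figures above could be estimated:

```python
import hashlib

# Toy content-defined chunking: cut a chunk whenever the low bits of a simple
# shift-and-add hash are all zero. Real chunkers (FastCDC, buzhash, ...) use
# tuned gear tables and normalized chunking; these constants are illustrative.
MASK = (1 << 16) - 1                    # boundary roughly every 64 KiB on average
MIN_CHUNK, MAX_CHUNK = 4 * 1024, 1024 * 1024

def cdc_chunks(data: bytes):
    """Yield (offset, length) pairs of content-defined chunks of `data`."""
    h, start = 0, 0
    for i, byte in enumerate(data):
        # The left shift means only recent bytes still influence the low 16
        # bits, so chunk boundaries depend on local content, not position.
        h = ((h << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, length
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data) - start

def dedupe_ratio(old: bytes, new: bytes) -> float:
    """Fraction of `new`'s bytes whose chunks already exist in `old`."""
    seen = {hashlib.sha256(old[o:o + n]).digest() for o, n in cdc_chunks(old)}
    reused = sum(n for o, n in cdc_chunks(new)
                 if hashlib.sha256(new[o:o + n]).digest() in seen)
    return reused / len(new)

# e.g. dedupe_ratio(open("v1.parquet", "rb").read(), open("v2.parquet", "rb").read())
```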

- Hugging Face is optimizing Parquet file storage to improve data deduplication.

- Current deduplication methods face challenges with modifications and deletions in Parquet files.

- Experiments show high deduplication rates for appending data but lower rates for modifications and deletions.

- Proposed improvements include using relative offsets and applying CDC at the row level (a rough sketch of the row-level idea follows this list).

- Collaboration with the Apache Arrow project is being considered for further enhancements.
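
To make the row-level CDC proposal concrete, here is a rough sketch with pyarrow that cuts Parquet row groups at content-defined boundaries (hashing each row and cutting when the hash crosses a threshold) instead of every N rows. The hash, mask, and minimum group size are invented for illustration; this is not the format change the post actually proposes, only a way to picture why content-defined boundaries keep edits local:

```python
import hashlib
import pyarrow as pa
import pyarrow.parquet as pq

def row_fingerprint(table: pa.Table, i: int) -> bytes:
    """Deterministic serialization of row i (sketch only; per-row Python access
    is slow and a real implementation would hash columns vectorized)."""
    return "|".join(str(col[i].as_py()) for col in table.columns).encode()

def content_defined_row_groups(table: pa.Table,
                               mask: int = (1 << 12) - 1,
                               min_rows: int = 1024):
    """Yield slices of `table` whose boundaries depend on row *content*, so
    inserting or deleting a few rows only disturbs nearby row groups instead
    of shifting every group after the edit."""
    start = 0
    for i in range(table.num_rows):
        digest = hashlib.sha1(row_fingerprint(table, i)).digest()
        is_boundary = (int.from_bytes(digest[:4], "big") & mask) == 0
        if is_boundary and (i - start + 1) >= min_rows:
            yield table.slice(start, i - start + 1)
            start = i + 1
    if start < table.num_rows:
        yield table.slice(start)

def write_with_content_defined_groups(table: pa.Table, path: str) -> None:
    with pq.ParquetWriter(path, table.schema) as writer:
        for group in content_defined_row_groups(table):
            writer.write_table(group)  # each call flushes its own row group(s)
```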

7 comments
By @ignoreusernames - 7 months
> Most Parquet files are bulk exports from various data analysis pipelines or databases, often appearing as full snapshots rather than incremental updates

I'm not really familiar with how they manage datasets, but all of the table formats (iceberg, delta and hudi) support appending and some form of "merge-on-read" deletes that could help with this use case. Instead of fully replacing datasets on each dump, more granular operations could be done. The issue is that this requires changing pipelines and some extra knowledge about the datasets themselves. A fun idea might involve taking a table format like iceberg and, instead of using parquet to store the data, just storing the column data with the metadata defined externally somewhere else. On each new snapshot, a set of transformations (sorting, splitting blocks, etc.) could be applied that minimizes the potential byte diff against the previous snapshot.
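
For reference, the incremental-update workflow described in this comment is roughly what the existing table formats already expose. A minimal sketch with the deltalake (delta-rs) Python bindings, using a made-up local table path and schema, might look like the following; how a delete is physically applied on disk (full file rewrite versus deletion metadata) depends on the format and engine version:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "./events_delta"  # hypothetical local table path

# Initial snapshot (only needed once).
write_deltalake(path, pd.DataFrame({"id": [1, 2], "label": ["cat", "dog"]}))

# Later dumps append only the new rows instead of replacing the whole dataset.
write_deltalake(path, pd.DataFrame({"id": [3], "label": ["bird"]}), mode="append")

# Row-level deletes are expressed as a predicate rather than a full re-export.
DeltaTable(path).delete("id = 2")
```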

By @YetAnotherNick - 7 months
I just don't understand how these guys can literally give terabytes of free storage and free data transfer to everyone. I did some rough cost calculations for my own storage and transfers, and if they used something like S3 it would have cost them thousands of dollars. And I don't pay them anything.
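
As a very rough back-of-envelope check (the prices below are approximate S3 list prices, which vary by region, tier, and discounts, and the usage numbers are made up):

```python
# Rough back-of-envelope estimate, not actual Hugging Face costs.
storage_per_gb_month = 0.023   # USD/GB-month, approximate S3 Standard list price
egress_per_gb = 0.09           # USD/GB, approximate internet egress list price

tb = 1024                      # GB per TB
stored_tb, downloaded_tb = 5, 10   # hypothetical per-user usage
monthly = stored_tb * tb * storage_per_gb_month + downloaded_tb * tb * egress_per_gb
print(f"~${monthly:,.0f} / month")  # ~$1,039 / month for this made-up usage
```
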
By @kwillets - 7 months
One additional thought regarding query performance is that content-defined row groups allow localized joins and aggregations which are much faster than the globally-shuffled kind.

If the sharding key matches (or is a subset of) a join or group-by key, then identical values are local to a single shard, which can be processed independently.

This type of thing is typically done at large granularity (e.g. one shard per MPP compute node), but there are also benefits down to the core or thread level.

Another tip is that if no shard key is defined, hash the whole row as a default.
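
A tiny sketch of the locality argument: when both sides of a join are sharded by a hash of the join key, equal keys always land in the same shard, so each shard can be joined on its own with no global shuffle. The shard count and sample rows below are invented for illustration:

```python
from collections import defaultdict

NUM_SHARDS = 4

def shard_by_key(rows, key):
    """Assign each row to a shard based on a hash of its join/group-by key."""
    shards = defaultdict(list)
    for row in rows:
        shards[hash(row[key]) % NUM_SHARDS].append(row)
    return shards

def local_join(left_rows, right_rows, key):
    """Join shard-by-shard: identical keys always hash to the same shard,
    so no data has to move between shards."""
    left_shards = shard_by_key(left_rows, key)
    right_shards = shard_by_key(right_rows, key)
    for shard_id, left in left_shards.items():
        index = defaultdict(list)
        for r in right_shards.get(shard_id, []):
            index[r[key]].append(r)
        for l in left:
            for r in index[l[key]]:
                yield {**l, **r}

# With no natural shard key, a whole-row hash such as
# hash(tuple(sorted(row.items()))) could serve as a default, as the comment suggests.
users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]
events = [{"uid": 1, "event": "x"}, {"uid": 1, "event": "y"}]
print(list(local_join(users, events, "uid")))
```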

By @kwillets - 7 months
I'm surprised that Parquet didn't maintain the Arrow practice of using mmap-able relative offsets for everything, although those could be considered relative to the beginning of the file.
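
A small illustration of why absolute offsets interact badly with chunk-level dedupe: if the metadata records positions from the start of the file, growing or shrinking an early block shifts every later offset and rewrites those metadata bytes, whereas offsets measured from each block's own start stay byte-identical. The toy "footer" below is invented for the example and is not Parquet's real metadata layout:

```python
def page_index(block_sizes, page_size=500, relative=True):
    """Toy footer: record the offset of each page in each block, either
    relative to the block's own start or absolute from the start of the file."""
    index, block_start = [], 0
    for size in block_sizes:
        pages = [off if relative else block_start + off
                 for off in range(0, size, page_size)]
        index.append(pages)
        block_start += size
    return index

v1 = [1000, 2000]   # two blocks
v2 = [1500, 2000]   # only block 0 grew

print(page_index(v1, relative=False))  # [[0, 500], [1000, 1500, 2000, 2500]]
print(page_index(v2, relative=False))  # [[0, 500, 1000], [1500, 2000, 2500, 3000]]
print(page_index(v1, relative=True))   # [[0, 500], [0, 500, 1000, 1500]]
print(page_index(v2, relative=True))   # [[0, 500, 1000], [0, 500, 1000, 1500]]
# With relative offsets the entries for the untouched block are identical,
# so those metadata bytes can dedupe against the previous version.
```
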
By @jmakov - 7 months
Wouldn't it be easier to extend delta-rs to support deduplication?
By @skadamat - 7 months
Love this post and the visuals! Great work
By @kwillets - 7 months
How does this compare to rsync/rdiff?