Improving Parquet Dedupe on Hugging Face Hub
Hugging Face is optimizing its Hub's Parquet file storage for better deduplication, addressing challenges with modifications and deletions, and considering collaboration with Apache Arrow for further improvements.
Hugging Face is improving the efficiency of its Hub's storage architecture, focusing on Parquet files, which account for over 2.2PB of the Hub's nearly 11PB of dataset storage. Deduplication is critical because users frequently update datasets, and the goal is to store every version compactly without requiring a full re-upload for each change.

The default storage algorithm uses byte-level Content-Defined Chunking (CDC), which handles insertions and deletions well but runs into trouble with the Parquet layout. In experiments on a 2GB Parquet file, appending new rows deduplicated 99.1% of the data, while modifying rows achieved only 89% because column headers are rewritten; deleting rows produced significant amounts of new data blocks because row groups are reorganized.

To improve deduplication, the team is considering using relative offsets instead of absolute offsets in the file structures and applying CDC at the row level; both changes could improve deduplication while preserving compression. The team is also interested in working with the Apache Arrow project to implement these ideas in the Parquet/Arrow code base, and is exploring deduplication for other file types.
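To make the chunking step concrete, here is a minimal sketch of byte-level CDC, assuming a toy Rabin-Karp rolling hash and illustrative window, mask, and minimum-chunk parameters; the Hub's actual chunker is a tuned production implementation, but the principle is the same: boundaries are chosen from local content, and reuse is measured by comparing chunk hashes.

```python
import hashlib

# Minimal sketch of byte-level content-defined chunking (CDC). The rolling
# hash, window, mask, and minimum chunk size below are toy choices for
# illustration; production chunkers (e.g. FastCDC or gearhash variants) are
# tuned, but the idea is the same: chunk boundaries depend on local content,
# so an insertion or deletion only shifts boundaries near the edit.

WINDOW = 64                    # bytes the rolling hash "sees"
BASE = 257
MOD = (1 << 61) - 1
BASE_POW = pow(BASE, WINDOW, MOD)
AVG_MASK = (1 << 16) - 1       # targets roughly 64 KiB average chunks
MIN_CHUNK = 1 << 13            # avoid degenerate tiny chunks

def chunk_offsets(data: bytes):
    """Yield (start, end) offsets of content-defined chunks of `data`."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD
        if i >= WINDOW:
            # Slide the window: drop the byte that just fell out of it.
            h = (h - data[i - WINDOW] * BASE_POW) % MOD
        if i - start >= MIN_CHUNK and (h & AVG_MASK) == 0:
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)

def dedupe_ratio(old: bytes, new: bytes) -> float:
    """Fraction of bytes in `new` whose chunks already exist in `old`."""
    seen = {hashlib.sha256(old[s:e]).digest() for s, e in chunk_offsets(old)}
    reused = sum(e - s for s, e in chunk_offsets(new)
                 if hashlib.sha256(new[s:e]).digest() in seen)
    return reused / len(new) if new else 1.0
```

Reading two versions of a Parquet file as raw bytes and calling `dedupe_ratio(v1_bytes, v2_bytes)` gives a rough analogue of the percentages quoted above (the pure-Python loop is far too slow for multi-gigabyte files, but fine for small tests).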
- Hugging Face is optimizing Parquet file storage to improve data deduplication.
- Current deduplication methods face challenges with modifications and deletions in Parquet files.
- Experiments show high deduplication rates for appending data but lower rates for modifications and deletions.
- Proposed improvements include using relative offsets and applying CDC at the row level (see the sketch after this list).
- Collaboration with the Apache Arrow project is being considered for further enhancements.
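As a thought experiment for the row-level CDC proposal, the sketch below cuts chunk boundaries from a hash of each serialized row rather than from raw bytes, so deleting or inserting a row only disturbs the chunk that contains it. This is purely an illustration of the idea floated in the post, not the actual Parquet/Arrow design; the chunk-size mask and the row serialization are assumptions.

```python
# Speculative sketch of "CDC at the row level": chunk boundaries are decided
# per row from that row's own hash, so they are stable under edits elsewhere.
import hashlib
from typing import Iterable, List

ROW_MASK = (1 << 10) - 1   # ~1024 rows per chunk on average (assumed knob)

def row_level_chunks(rows: Iterable[bytes]) -> List[List[bytes]]:
    """Group serialized rows into chunks whose boundaries depend only on row content."""
    chunks, current = [], []
    for row in rows:
        current.append(row)
        digest = hashlib.sha1(row).digest()
        # Cut a chunk when this row's hash satisfies the boundary condition.
        if (int.from_bytes(digest[:4], "big") & ROW_MASK) == 0:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

# Deleting one row only changes the chunk containing it; every other chunk
# hashes identically across the two versions.
v1 = [f"row-{i}".encode() for i in range(100_000)]
v2 = v1[:500] + v1[501:]                      # delete a single row
c1 = {hashlib.sha256(b"".join(c)).digest() for c in row_level_chunks(v1)}
c2 = {hashlib.sha256(b"".join(c)).digest() for c in row_level_chunks(v2)}
print(f"chunks reused: {len(c1 & c2)} / {len(c2)}")
```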
Related
Memory Efficient Data Streaming to Parquet Files
Estuary Flow has developed a 2-pass write method for streaming data into Apache Parquet files, minimizing memory usage while maintaining performance, suitable for real-time data integration and analytics.
Memory Management in DuckDB
DuckDB optimizes query processing with effective memory management, using a streaming execution engine and disk spilling for large datasets. Its buffer manager enhances performance by caching frequently accessed data.
HuggingFace to Replace Git LFS with Xet
Hugging Face has acquired XetHub to enhance AI development, improving storage and versioning for large files, facilitating efficient updates, and supporting growth in the AI community and infrastructure team.
Datomic and Content Addressable Techniques
Latacora has developed a data collection system using Datomic, focusing on deduplication and efficient querying. It supports dynamic schema inference, real-time analysis, and visualizations for tracking client environment changes.
The sorry state of Java deserialization
The article examines Java deserialization challenges in reading large datasets, highlighting performance issues with various methods. Benchmark tests show modern tools outperform traditional ones, emphasizing the need for optimization and custom serialization.
I'm not really familiar with how datasets are managed by them, but all of the table formats (Iceberg, Delta, and Hudi) support appending and some form of "merge-on-read" deletes that could help with this use case. Instead of always fully replacing datasets on each dump, more granular operations could be done. The issue is that this requires changing pipelines and some extra knowledge about the datasets themselves. A fun idea might be to take a table format like Iceberg and, instead of using Parquet to store the data, just store the column data, with the metadata defined externally somewhere else. On each new snapshot, a set of transformations (sorting, splitting blocks, etc.) could be applied that minimizes the potential byte diff against the previous snapshot.
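For intuition, here is a minimal, library-free sketch of that append-plus-merge-on-read pattern using pyarrow: data files are immutable and only ever added, deletes accumulate as a key log, and readers reconcile the two at scan time. The directory layout, file naming, and the assumed `id` key column are all illustrative; Iceberg, Delta, and Hudi implement much richer versions with manifests and transaction logs.

```python
import glob
import json
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds
import pyarrow.parquet as pq

def append_snapshot(root: str, new_rows: pa.Table, snapshot_id: int) -> None:
    """Each snapshot only adds an immutable file; existing bytes never change,
    so chunk-level dedupe of the older files stays intact."""
    pq.write_table(new_rows, f"{root}/data-{snapshot_id:05d}.parquet")

def log_deletes(root: str, deleted_ids: list, snapshot_id: int) -> None:
    """Deletes are recorded as metadata instead of rewriting row groups."""
    with open(f"{root}/deletes-{snapshot_id:05d}.json", "w") as f:
        json.dump(deleted_ids, f)

def read_current(root: str) -> pa.Table:
    """Merge-on-read: apply the accumulated delete log while scanning.
    Assumes every data file has an `id` key column."""
    data_files = sorted(glob.glob(f"{root}/data-*.parquet"))
    table = ds.dataset(data_files, format="parquet").to_table()
    deleted: set = set()
    for path in sorted(glob.glob(f"{root}/deletes-*.json")):
        with open(path) as f:
            deleted.update(json.load(f))
    if not deleted:
        return table
    keep = pc.invert(pc.is_in(table["id"], value_set=pa.array(sorted(deleted))))
    return table.filter(keep)
```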
If the sharding key matches (or is a subset of) a join or group-by key, then identical values are local to a single shard, which can be processed independently.
This type of thing is typically done at coarse granularity (e.g., one shard per MPP compute node), but there are also benefits down to the core or thread level.
Another tip is that if no shard key is defined, hash the whole row as a default.
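A small sketch of that routing rule, with a hypothetical `user_id` shard key: rows sharing the key hash to the same shard and can be aggregated without any cross-shard communication, and when no key is declared the whole row is hashed as the fallback.

```python
# Sketch of hash-based shard routing; the column names and shard count are
# hypothetical, and the hashing scheme is just one reasonable choice.
import hashlib
from typing import Mapping, Optional, Sequence

def shard_of(row: Mapping[str, object],
             key_cols: Optional[Sequence[str]],
             n_shards: int) -> int:
    if key_cols:
        # Shard key defined: identical key values always map to the same shard.
        material = "|".join(str(row[c]) for c in key_cols)
    else:
        # No shard key: hash the whole (sorted) row as a default.
        material = "|".join(f"{k}={v}" for k, v in sorted(row.items()))
    digest = hashlib.sha1(material.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

# Both rows for user 42 land on the same shard, so a per-user aggregation
# (group-by or join on user_id) never needs to cross shard boundaries.
rows = [{"user_id": 42, "amount": 10}, {"user_id": 42, "amount": 5}]
assert len({shard_of(r, ["user_id"], 16) for r in rows}) == 1
```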