Parquet and ORC's many shortfalls for machine learning, and what to do about it?
Apache Parquet and ORC face challenges in machine learning due to metadata overhead, a lack of native vector type support, and difficulty complying with data privacy regulations, necessitating architectural improvements for better efficiency.
Apache Parquet and ORC, two popular columnar storage formats, face significant challenges when applied to machine learning (ML) workloads. Originally designed for traditional SQL analytics, they excel at large scans and simple aggregations, but struggle with modern ML tasks, which often involve wide, sparse datasets with thousands of features. Performance degrades because a reader must deserialize the file's metadata before it can access any specific column, an overhead that grows with the number of features. The formats also lack native support for vector types, which are central to many ML algorithms, limiting their efficiency. Compliance with data privacy regulations such as GDPR and CCPA poses further challenges: the immutable file layout complicates the physical deletion of individual users' data, leading to inefficiencies and potential non-compliance. To address these issues, future columnar storage formats will need architectural redesigns optimized for ML workloads, including cheaper metadata access, native vector handling, and efficient in-place deletion mechanisms.
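To make the metadata point concrete, here is a minimal sketch, using PyArrow, of reading one column from a deliberately wide Parquet file; the file and column names are hypothetical. Even a single-column projection first parses the footer, whose size scales with the total number of columns.

```python
import pyarrow as pa
import pyarrow.parquet as pq

NUM_FEATURES = 5000  # a deliberately wide ML feature table

# One float column per feature; three dummy rows.
table = pa.table(
    {f"feature_{i}": pa.array([0.0, 1.0, 2.0]) for i in range(NUM_FEATURES)}
)
pq.write_table(table, "wide_features.parquet")

# A single-column read still parses footer metadata for all columns.
meta = pq.ParquetFile("wide_features.parquet").metadata
print(meta.num_columns)      # 5000
print(meta.serialized_size)  # footer bytes deserialized before any data I/O

one = pq.read_table("wide_features.parquet", columns=["feature_42"])
print(one.schema)            # only the projected column comes back
```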
- Parquet and ORC are not optimized for modern machine learning workloads.
- Performance issues arise from metadata overhead in wide and sparse datasets.
- Lack of native support for vector types limits efficiency in ML applications (see the first sketch after this list).
- Compliance with data privacy regulations complicates data deletion processes (see the second sketch after this list).
- Future formats need architectural improvements to better support ML requirements.
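For the vector-type gap, a common workaround is to store embeddings as fixed-size list columns. The sketch below (PyArrow again; all file and column names hypothetical) shows the shape of that workaround: on disk the format sees only a repeated float column, so dimensionality checks and vectorized distance kernels are left to the application.

```python
import pyarrow as pa
import pyarrow.parquet as pq

DIM = 4  # embedding dimensionality, known only by convention

# Fixed-size lists are the usual stand-in for a missing vector type.
embeddings = pa.array(
    [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
    type=pa.list_(pa.float32(), DIM),
)
table = pa.table({"user_id": pa.array([1, 2]), "embedding": embeddings})
pq.write_table(table, "embeddings.parquet")

# The format neither enforces nor advertises the fixed dimensionality.
print(pq.read_table("embeddings.parquet").schema)
```

And for the deletion problem, a minimal sketch of why honoring a single right-to-be-forgotten request is expensive: Parquet files are immutable, so removing one user's rows means reading, filtering, and rewriting every file that might contain them, which is exactly the cost an efficient in-place deletion mechanism would avoid.

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

USER_TO_FORGET = 2  # hypothetical user id from the file written above

# No in-place delete: filter the surviving rows and rewrite the file.
table = pq.read_table("embeddings.parquet")
mask = pc.not_equal(table["user_id"], USER_TO_FORGET)
pq.write_table(table.filter(mask), "embeddings.parquet")
```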
Related
Memory Efficient Data Streaming to Parquet Files
Estuary Flow has developed a 2-pass write method for streaming data into Apache Parquet files, minimizing memory usage while maintaining performance, suitable for real-time data integration and analytics.
The sorry state of Java deserialization
The article examines Java deserialization challenges in reading large datasets, highlighting performance issues with various methods. Benchmark tests show modern tools outperform traditional ones, emphasizing the need for optimization and custom serialization.
Improving Parquet Dedupe on Hugging Face Hub
Hugging Face is optimizing its Hub's Parquet file storage for better deduplication, addressing challenges with modifications and deletions, and considering collaboration with Apache Arrow for further improvements.
Nulls: Revisiting null representation in modern columnar formats
The paper "NULLS!" examines null value handling in columnar formats, critiques outdated methods, introduces the SmartNull strategy for optimization, and highlights layout efficiency based on data characteristics and null ratios.
Should you ditch Spark for DuckDB or Polars?
DuckDB and Polars are emerging as alternatives to Apache Spark for small workloads, outperforming it in smaller configurations, while Spark excels in larger setups, maintaining strong performance and cost-effectiveness.