January 7th, 2025

Parquet and ORC's many shortfalls for machine learning, and what to do about it?

Apache Parquet and ORC face challenges in machine learning due to metadata overhead, lack of vector type support, and compliance issues with data privacy regulations, necessitating architectural improvements for better efficiency.

Apache Parquet and ORC, two popular columnar storage formats, face significant challenges when applied to machine learning (ML) workloads. Originally designed for traditional SQL analytics, they excel at large data scans and simple aggregations, but they struggle with modern ML datasets, which are often wide and sparse with numerous features. Accessing specific columns requires deserializing the file metadata first, and this overhead is most pronounced in datasets with thousands of features. The formats also lack native support for vector types, which many ML algorithms depend on. Compliance with data privacy regulations such as GDPR and CCPA is a further challenge: the current architecture complicates the physical deletion of user data, which makes deletion inefficient and risks non-compliance with legal requirements. To address these issues, future columnar storage formats need architectural redesigns that target ML workloads, including better metadata access, native vector handling, and efficient in-place deletion mechanisms.
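To make the metadata overhead concrete, here is a minimal toy sketch in Python. It is not the Parquet or ORC footer layout, and not the format proposed in the article: the JSON encoding, the `feature_N` names, and all function names are assumptions made for illustration. It compares a footer serialized as one blob, which must be decoded in full to find a single column, with a footer that carries a fixed-width offset table so one entry can be decoded alone.

```python
import json
import struct

def _entry(i):
    # Hypothetical per-column metadata: name plus location of the column chunk.
    return {"name": f"feature_{i}", "offset": i * 1024, "length": 1024}

def build_monolithic_footer(num_columns):
    """Serialize every column's metadata as a single blob."""
    return json.dumps([_entry(i) for i in range(num_columns)]).encode("utf-8")

def lookup_monolithic(footer, name):
    """Find one column; the whole footer is decoded first."""
    entries = json.loads(footer.decode("utf-8"))
    for entry in entries:
        if entry["name"] == name:
            return entry, len(footer)  # (metadata, bytes decoded)
    raise KeyError(name)

def build_indexed_footer(num_columns):
    """Serialize each entry separately behind a fixed-width offset table."""
    blobs = [json.dumps(_entry(i)).encode("utf-8") for i in range(num_columns)]
    index = bytearray(struct.pack("<I", num_columns))
    position = 0
    for blob in blobs:
        index += struct.pack("<II", position, len(blob))  # (start, size)
        position += len(blob)
    return bytes(index) + b"".join(blobs)

def lookup_indexed(footer, column_id):
    """Find one column by position; only its own entry is decoded."""
    (num_columns,) = struct.unpack_from("<I", footer, 0)
    if not 0 <= column_id < num_columns:
        raise IndexError(column_id)
    start, size = struct.unpack_from("<II", footer, 4 + 8 * column_id)
    body = 4 + 8 * num_columns  # entries begin after the count and the table
    blob = footer[body + start : body + start + size]
    # Bytes decoded: the count (4), one table slot (8), and one entry.
    return json.loads(blob.decode("utf-8")), 4 + 8 + size
```

In this toy model, looking up the last of 5,000 columns decodes the entire monolithic footer, while the indexed lookup decodes a few dozen bytes regardless of how wide the table is.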

- Parquet and ORC are not optimized for modern machine learning workloads.

- Performance issues arise from metadata overhead in wide and sparse datasets.

- Lack of native support for vector types limits efficiency in ML applications.

- Compliance with data privacy regulations complicates data deletion processes.

- Future formats need architectural improvements to better support ML requirements.
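The difference between logical and physical deletion can be sketched with a toy column in Python. This is an illustration under simplifying assumptions, not the mechanism from the article: the `FixedWidthColumn` class and its methods are hypothetical, and the values are stored uncompressed at a fixed width so they can be overwritten in place. Real Parquet and ORC pages are typically encoded and compressed, which is part of why overwriting a single value there is harder.

```python
import struct

VALUE_FMT = "<q"                          # one little-endian 64-bit integer
VALUE_SIZE = struct.calcsize(VALUE_FMT)   # 8 bytes per value

class FixedWidthColumn:
    """Toy uncompressed column with a per-row deletion flag (hypothetical)."""

    def __init__(self, values):
        self.data = bytearray(
            b"".join(struct.pack(VALUE_FMT, v) for v in values)
        )
        self.deleted = bytearray(len(values))  # 0 = live, 1 = deleted

    def __len__(self):
        return len(self.deleted)

    def _check(self, row):
        if not 0 <= row < len(self):
            raise IndexError(row)

    def logical_delete(self, row):
        """Hide the row from readers; its bytes remain in storage."""
        self._check(row)
        self.deleted[row] = 1

    def physical_delete(self, row):
        """Overwrite the row's bytes in place, then hide it from readers."""
        self._check(row)
        start = row * VALUE_SIZE
        self.data[start : start + VALUE_SIZE] = bytes(VALUE_SIZE)
        self.deleted[row] = 1

    def raw_value(self, row):
        """Read the stored bytes directly, ignoring the deletion flag."""
        self._check(row)
        return struct.unpack_from(VALUE_FMT, self.data, row * VALUE_SIZE)[0]

    def live_values(self):
        """Values a normal reader would see."""
        return [self.raw_value(r) for r in range(len(self)) if not self.deleted[r]]
```

After `logical_delete`, the value is invisible to `live_values` but still recoverable through `raw_value`, which is the situation privacy regulations object to. After `physical_delete`, the stored bytes are gone and the column keeps its size, so no other rows need to be rewritten.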

1 comment
By @abadid - 4 months
This article summarizes research from my lab in collaboration with ByteDance published in CIDR (a computer science conference held in Amsterdam two weeks from now) on a new columnar format designed for ML workloads.