Query Engines: Gatekeepers of the Parquet File Format
Mainstream query engines struggle to read newer Parquet encodings, forcing DuckDB to fall back to older encodings for compatibility. This costs compression efficiency, prompting calls for query engine developers to adopt the newer features for better data management.
Mainstream query engines currently face limitations in reading newer Parquet encodings, which compels systems like DuckDB to revert to older encodings, resulting in less efficient compression. Apache Parquet is a widely used column-oriented data storage format that allows for efficient querying and smaller file sizes compared to formats like CSV and JSON. While DuckDB has made strides in supporting various Parquet encodings, it opts not to write the latest encodings by default to ensure compatibility with other query engines that may not support them. This backward compatibility is crucial for both DuckDB and Parquet, allowing older files to be read by newer systems. The article emphasizes the importance of query engines adopting newer Parquet features to minimize wasted storage space, as many terabytes of data are written in Parquet daily without utilizing its full potential. The authors advocate for a collective effort among query engine developers to implement these encodings, which could lead to significant reductions in data storage needs and improve overall efficiency in data management.
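As a rough illustration of the trade-off, the sketch below writes the same table from DuckDB twice: once with the compatible defaults and once opting into the newer encodings. This is a minimal sketch, assuming a recent DuckDB build; the PARQUET_VERSION copy option and its exact spelling are an assumption based on the article and should be checked against your DuckDB version's documentation.

```python
# Minimal sketch (assumed API): DuckDB's Python client writing Parquet with
# maximum-compatibility defaults versus the newer encodings.
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE t AS SELECT range AS id, range * 1.5 AS price FROM range(1000000)")

# Default write: older, widely readable encodings (safe for other engines).
con.sql("COPY t TO 'compat.parquet' (FORMAT PARQUET)")

# Opt in to newer encodings (e.g. DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT);
# the PARQUET_VERSION option is an assumption based on the article, and
# engines that have not implemented these encodings may fail to read the file.
con.sql("COPY t TO 'v2.parquet' (FORMAT PARQUET, PARQUET_VERSION V2)")
```

The two outputs capture exactly the compatibility trade-off the article describes: the second file is typically smaller, but only readers that implement the newer encodings can open it.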
- Mainstream query engines struggle to read newer Parquet encodings, leading to reliance on older ones.
- DuckDB supports various Parquet encodings but does not write the latest by default for compatibility reasons.
- Backward compatibility is essential for both DuckDB and Parquet to ensure older files remain accessible.
- Many terabytes of storage are wasted daily because newer Parquet encodings remain unimplemented (see the sketch after this list).
- Query engine developers are encouraged to adopt newer features to enhance data storage efficiency.
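To make the storage argument concrete, here is a minimal sketch using pyarrow (chosen for illustration; the article itself concerns DuckDB and other engines) that writes the same columns with default encodings and then with the newer DELTA_BINARY_PACKED and BYTE_STREAM_SPLIT encodings, and compares file sizes. Actual savings depend entirely on the data.

```python
# Illustrative sketch: compare file sizes for default vs. newer Parquet
# encodings. Column names and data here are made up for the example.
import os
import pyarrow as pa
import pyarrow.parquet as pq

n = 1_000_000
table = pa.table({
    "id": pa.array(range(n), type=pa.int64()),        # monotonically increasing ints
    "price": pa.array([i * 0.01 for i in range(n)]),  # float64 values
})

# Baseline: pyarrow defaults (dictionary/plain encodings, broadly readable).
pq.write_table(table, "baseline.parquet", compression="zstd")

# Newer encodings; readers that have not implemented them cannot open this file.
pq.write_table(
    table,
    "newer.parquet",
    compression="zstd",
    use_dictionary=False,  # per-column encodings require dictionary encoding off
    column_encoding={"id": "DELTA_BINARY_PACKED", "price": "BYTE_STREAM_SPLIT"},
)

for path in ("baseline.parquet", "newer.parquet"):
    print(path, os.path.getsize(path), "bytes")
```

On sorted or slowly varying numeric columns like these, the delta and byte-stream-split encodings usually shrink the file noticeably before general-purpose compression even runs, which is the kind of saving the article argues is being left on the table.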
Related
The sorry state of Java deserialization
The article examines Java deserialization challenges in reading large datasets, highlighting performance issues with various methods. Benchmark tests show modern tools outperform traditional ones, emphasizing the need for optimization and custom serialization.
Improving Parquet Dedupe on Hugging Face Hub
Hugging Face is optimizing its Hub's Parquet file storage for better deduplication, addressing challenges with modifications and deletions, and considering collaboration with Apache Arrow for further improvements.
Pg_parquet: An extension to connect Postgres and parquet
Crunchy Data released pg_parquet, an open-source PostgreSQL extension for reading and writing Parquet files, enabling data export/import, cloud storage integration, and schema inspection for enhanced analytical capabilities.
Should you ditch Spark for DuckDB or Polars?
DuckDB and Polars are emerging as alternatives to Apache Spark for small workloads, outperforming it in smaller configurations, while Spark excels in larger setups, maintaining strong performance and cost-effectiveness.
Parquet and ORC's many shortfalls for machine learning, and what to do about it?
Apache Parquet and ORC face challenges in machine learning due to metadata overhead, lack of vector type support, and compliance issues with data privacy regulations, necessitating architectural improvements for better efficiency.