January 29th, 2025

Query Engines: Gatekeepers of the Parquet File Format

Mainstream query engines have limited support for reading newer Parquet encodings, so DuckDB writes older encodings by default to stay compatible. This costs compression efficiency, and the authors call on query engine developers to adopt the newer features.

Mainstream query engines currently have limited support for reading newer Parquet encodings, which compels systems like DuckDB to fall back to older encodings that compress less efficiently. Apache Parquet is a widely used column-oriented storage format that allows efficient querying and produces smaller files than formats like CSV and JSON. DuckDB has made strides in supporting various Parquet encodings, but it does not write the latest ones by default, so that the files it produces remain readable by other query engines that may not support them. Backward compatibility matters to both DuckDB and Parquet because it lets newer systems read older files. The article stresses that query engines need to adopt the newer Parquet features to stop wasting storage space: many terabytes of data are written to Parquet every day without using the format's full potential. The authors call for a collective effort among query engine developers to implement these encodings, which could significantly reduce storage requirements and improve the efficiency of data management.
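To make the compression argument concrete, here is a minimal sketch of why a delta-style encoding can shrink a column before a general-purpose codec even runs. It is not DuckDB's or Parquet's actual implementation: the function names are hypothetical, the fixed-width delta layout is a simplified stand-in for encodings such as Parquet's DELTA_BINARY_PACKED, `zlib` stands in for whatever page compression codec a writer would use, and the timestamp-like column is synthetic.

```python
import struct
import zlib


def plain_encode(values):
    """Serialize each value as a little-endian int64 (loosely analogous to a plain encoding)."""
    return struct.pack(f"<{len(values)}q", *values)


def delta_encode(values):
    """Store the first value followed by successive differences, each as an int64.

    Simplified illustration only: real delta encodings also bit-pack the deltas.
    """
    if not values:
        return b""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    return struct.pack(f"<{len(deltas)}q", *deltas)


def delta_decode(data):
    """Invert delta_encode by taking a running sum of the stored differences."""
    count = len(data) // 8
    deltas = struct.unpack(f"<{count}q", data)
    values = []
    total = 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


def compressed_sizes(values):
    """Return (plain_size, delta_size) in bytes after zlib compression."""
    plain = zlib.compress(plain_encode(values))
    delta = zlib.compress(delta_encode(values))
    return len(plain), len(delta)


# Synthetic timestamp-like column: large values that grow by roughly 1000 per row.
timestamps = [1_700_000_000_000 + i * 1000 + (i * 7919) % 13 for i in range(10_000)]

plain_size, delta_size = compressed_sizes(timestamps)
print(f"plain + zlib: {plain_size} bytes")
print(f"delta + zlib: {delta_size} bytes")
```

On data like this the differences are small and repetitive, so the delta-encoded bytes compress far better than the raw values. A reader that does not understand the encoding cannot recover the column at all, which is the compatibility trade-off the article describes.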

- Mainstream query engines struggle to read newer Parquet encodings, leading to reliance on older encodings.

- DuckDB supports various Parquet encodings but does not write the latest by default for compatibility reasons.

- Backward compatibility is essential for both DuckDB and Parquet to ensure older files remain accessible.

- Many terabytes of data are written to Parquet daily without the newer encodings, which wastes storage space.

- Query engine developers are encouraged to adopt newer features to enhance data storage efficiency.

1 comment
By @wiml - 3 months
It would be nice to have an equivalent of caniuse for Parquet settings (encodings, compression codecs, other features)