October 30th, 2024

Nulls: Revisiting null representation in modern columnar formats

The paper "NULLS!" examines null value handling in columnar formats, critiques outdated methods, introduces the SmartNull strategy for optimization, and highlights layout efficiency based on data characteristics and null ratios.

The paper "NULLS!: Revisiting Null Representation in Modern Columnar Formats" discusses the handling of null values in columnar data storage formats, which are prevalent in real-world datasets. Traditional formats like Parquet and ORC primarily store non-null values contiguously, a method that has not significantly evolved in nearly two decades. The authors analyze various approaches to null representation, highlighting the advantages and disadvantages of each under different data distributions and encoding schemes. They introduce an optimization technique using AVX512 to address performance bottlenecks in existing methods. Additionally, the paper proposes a new null-filling strategy called SmartNull, which optimizes the compression ratio during encoding by intelligently determining the best representation for null values. The findings suggest that the effectiveness of null compression is influenced by factors such as decoding speed, data distribution, and the ratio of null values. The study concludes that the Compact layout is more efficient for high null ratios, while the Placeholder layout performs better when null ratios are low or when data is serially correlated.
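To make the two layouts named above concrete, here is a minimal sketch in plain Python. It assumes a per-row validity flag and an arbitrary filler value of 0 for the Placeholder layout; the function names are hypothetical and this is an illustration of the general idea, not the paper's implementation or any format's actual on-disk encoding.

```python
from typing import List, Optional, Tuple

Column = List[Optional[int]]


def encode_compact(column: Column) -> Tuple[List[bool], List[int]]:
    """Compact layout: validity flags plus only the non-null values, stored contiguously."""
    validity = [v is not None for v in column]
    values = [v for v in column if v is not None]
    return validity, values


def encode_placeholder(column: Column, fill: int = 0) -> Tuple[List[bool], List[int]]:
    """Placeholder layout: validity flags plus one slot per row; null rows hold a filler."""
    validity = [v is not None for v in column]
    values = [fill if v is None else v for v in column]
    return validity, values


def decode_compact(validity: List[bool], values: List[int]) -> Column:
    # The value position depends on how many non-null rows came before it,
    # so decoding walks the dense value list in step with the validity flags.
    it = iter(values)
    return [next(it) if ok else None for ok in validity]


def decode_placeholder(validity: List[bool], values: List[int]) -> Column:
    # Row i always lives at values[i]; the flag only says whether to trust it.
    return [v if ok else None for ok, v in zip(validity, values)]


column = [7, None, None, 9, None, 4]
compact = encode_compact(column)
placeholder = encode_placeholder(column)
```

In this sketch the Compact form stores three values for six rows, which is where it saves space at high null ratios, while the Placeholder form keeps row positions aligned with value positions, so decoding needs no running count of non-null rows.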

- The paper addresses the challenges of null representation in modern columnar formats.

- It critiques the approach used by Parquet and ORC, which store non-null values contiguously and have changed little in nearly two decades.

- The authors propose a new strategy, SmartNull, for optimizing null value representation.

- Performance analysis indicates that the Compact layout is more efficient at high null ratios, while the Placeholder layout performs better at low null ratios or on serially correlated data.

- The study emphasizes that the best null representation depends on decoding speed, data distribution, and the ratio of null values, so storage techniques should adapt to the data.
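The summary does not describe how SmartNull chooses its fill values, so the following is only a rough sketch of the general idea of null filling: in a layout that keeps a slot for every row, the value written into a null slot is free to choose, and some choices compress better than others. The two candidate strategies, the run-count cost measure, and all function names here are assumptions made for illustration, not the paper's algorithm.

```python
from itertools import groupby
from typing import Dict, List, Optional, Tuple

Column = List[Optional[int]]


def fill_constant(column: Column, fill: int = 0) -> List[int]:
    """Write the same constant into every null slot."""
    return [fill if v is None else v for v in column]


def fill_previous(column: Column, default: int = 0) -> List[int]:
    """Repeat the most recent non-null value (or `default` before the first one)."""
    out, last = [], default
    for v in column:
        if v is not None:
            last = v
        out.append(last)
    return out


def run_count(values: List[int]) -> int:
    """Number of runs a run-length encoder would emit; fewer runs means smaller output."""
    return sum(1 for _ in groupby(values))


def pick_fill(column: Column) -> Tuple[str, List[int]]:
    """Try each candidate fill and keep the one with the fewest runs."""
    candidates: Dict[str, List[int]] = {
        "constant": fill_constant(column),
        "previous": fill_previous(column),
    }
    best = min(candidates, key=lambda name: run_count(candidates[name]))
    return best, candidates[best]


column = [5, 5, None, 5, 8, None, None, 8]
choice, filled = pick_fill(column)
```

For this column, constant fill breaks the data into six runs while previous-value fill leaves two, so the sketch picks the latter. A separate validity mask would still record which slots are null, so the filler never changes query results.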

3 comments
By @zX41ZdbW - 6 months
How did it go through peer review without a comparison with ClickHouse?

> Our analysis shows that the Compact layout performs better when Null ratio is high and the Placeholder layout is better when the Null ratio is low or the data is serial-correlated.

ClickHouse uses a placeholder value with a separate stream of NULL masks. It also has the Sparse column format, which the paper calls Compact (though currently the Sparse format is used to encode default values more efficiently, not NULL values).

By @mwexler - 6 months
While the other authors are from Tsinghua University, two more recognizable names include Wes McKinney of Pandas and Apache Arrow fame and Andy Pavlo at CMU, who has done some fun work on columnar stores and database optimization.

Always fun to see the mix of authors globally linking up.