August 6th, 2024

Why Polars rewrote its Arrow string data type

Polars has refactored its string data structure for improved performance, implementing a new storage method inspired by Hyper/Umbra, allowing inline storage of small strings and enhancing filter operation efficiency.

Polars has undergone a significant refactor of its string and binary data structure, driven by the need for improved performance and control over its implementation. Previously, Polars relied on the Arrow2 crate, which limited its ability to modify the string type due to adherence to the Arrow specification. However, after forking parts of Arrow2 into a tailored version for Polars, the team was able to implement a new string storage method inspired by the Hyper/Umbra database system. This new approach allows for more efficient handling of string data, particularly for small strings, which can now be stored inline, reducing access overhead. The refactor also addresses performance issues related to gather and filter operations, which previously scaled linearly with string size. Benchmarks indicate that the new string type significantly improves performance, with operations now running in constant time regardless of string length. While there are some trade-offs, such as increased overhead for unique strings and the need for garbage collection, the overall benefits of the new implementation are substantial. The Polars team anticipates further performance enhancements as they continue to optimize their memory management.
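
The Hyper/Umbra-style design the summary describes can be sketched roughly. Below is a minimal, illustrative Rust model, not Polars' actual implementation: the type and field names are assumptions, and the 12-byte inline threshold and 4-byte prefix follow the general "German string" / Arrow string-view layout. Each view is a fixed-size record, so strings of at most 12 bytes need no buffer lookup at all, and gather copies only views, never payload bytes.

```rust
// Illustrative Umbra-style string views (names and layout are a sketch,
// not Polars' real types). Short strings live inside the view itself;
// long strings keep a 4-byte prefix plus a (buffer, offset) reference.

const INLINE_CAP: usize = 12;

#[derive(Clone, Copy, Debug, PartialEq)]
enum View {
    // len <= 12: all bytes stored inline in the view.
    Inline { len: u32, data: [u8; INLINE_CAP] },
    // len > 12: prefix for fast comparisons; payload in a shared buffer.
    Long { len: u32, prefix: [u8; 4], buffer_idx: u32, offset: u32 },
}

struct StringColumn {
    views: Vec<View>,
    buffers: Vec<Vec<u8>>, // shared byte buffers for long strings
}

impl StringColumn {
    fn new() -> Self {
        Self { views: Vec::new(), buffers: vec![Vec::new()] }
    }

    fn push(&mut self, s: &str) {
        let bytes = s.as_bytes();
        if bytes.len() <= INLINE_CAP {
            let mut data = [0u8; INLINE_CAP];
            data[..bytes.len()].copy_from_slice(bytes);
            self.views.push(View::Inline { len: bytes.len() as u32, data });
        } else {
            let buf = &mut self.buffers[0];
            let offset = buf.len() as u32;
            buf.extend_from_slice(bytes);
            let mut prefix = [0u8; 4];
            prefix.copy_from_slice(&bytes[..4]);
            self.views.push(View::Long {
                len: bytes.len() as u32,
                prefix,
                buffer_idx: 0,
                offset,
            });
        }
    }

    fn get(&self, i: usize) -> &str {
        match &self.views[i] {
            View::Inline { len, data } => {
                std::str::from_utf8(&data[..*len as usize]).unwrap()
            }
            View::Long { len, buffer_idx, offset, .. } => {
                let buf = &self.buffers[*buffer_idx as usize];
                let start = *offset as usize;
                std::str::from_utf8(&buf[start..start + *len as usize]).unwrap()
            }
        }
    }

    // gather copies only the fixed-size views, never the string bytes:
    // constant work per row, regardless of string length.
    fn gather(&self, indices: &[usize]) -> StringColumn {
        StringColumn {
            views: indices.iter().map(|&i| self.views[i]).collect(),
            buffers: self.buffers.clone(), // real implementations share, not clone
        }
    }
}
```

This also shows where the article's trade-offs come from: the old buffers survive a gather untouched (hence the garbage-collection need), and every unique string still pays for a 16-byte view on top of its payload.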

- Polars has rewritten its string data type for better performance and control.

- The new string storage method is based on the Hyper/Umbra database system.

- Performance benchmarks show significant improvements in filter operations.

- The refactor allows for inline storage of small strings, reducing access overhead.

- Future optimizations are expected as Polars continues to refine its data management.

AI: What people are saying
The comments on the article about Polars' string data structure refactor reveal several key concerns and insights from the community.
  • Concerns about performance, particularly regarding the linear time complexity of gather and filter operations on string sizes.
  • Discussion on the implications of short string optimization and its compatibility with existing libraries like DuckDB and Arrow.
  • Questions about memory management, including byte alignment and the challenges of reallocations during string operations.
  • References to similar concepts in other programming languages, such as Julia's ShortStrings.jl.
  • General appreciation for the improvements made in the latest release, alongside minor critiques regarding the article's clarity.
15 comments
By @Rygian - 4 months
> “short string optimization”: A short enough string can be stored “in place” [...] An optimization that’s impossible in Rust, by the way ;).

Author is not aware of https://docs.rs/compact_str/latest/compact_str/ or https://github.com/bodil/smartstring
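
The commenter's point can be made concrete even without those crates: a small-string optimization is expressible in entirely safe Rust. A minimal sketch (the 22-byte inline capacity is arbitrary; compact_str and smartstring use denser, unsafe-optimized layouts):

```rust
// Minimal safe-Rust small-string optimization: short strings are stored
// inline in a fixed buffer, long ones fall back to a heap String.
enum SsoString {
    Inline { len: u8, buf: [u8; 22] },
    Heap(String),
}

impl SsoString {
    fn new(s: &str) -> Self {
        let bytes = s.as_bytes();
        if bytes.len() <= 22 {
            let mut buf = [0u8; 22];
            buf[..bytes.len()].copy_from_slice(bytes);
            SsoString::Inline { len: bytes.len() as u8, buf }
        } else {
            SsoString::Heap(s.to_owned())
        }
    }

    fn as_str(&self) -> &str {
        match self {
            SsoString::Inline { len, buf } => {
                std::str::from_utf8(&buf[..*len as usize]).unwrap()
            }
            SsoString::Heap(s) => s,
        }
    }

    fn is_inline(&self) -> bool {
        matches!(self, SsoString::Inline { .. })
    }
}
```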

By @Rygian - 4 months
> When you access a transient string, the string itself won’t know whether the data it points to is still valid, so you as a programmer need to ensure that every transient string you use is actually still valid.

In my mind this reads identically to "if you're a security practitioner, worry about this bit here."

By @kristianp - 4 months
Don't you lose the in-memory interop with other libraries by doing this? I'm thinking that duckdb will no longer be able to read polars data that has been loaded into memory, as it can currently do, due to duckdb supporting Arrow. Isn't the benefit of arrow that it's supported by many languages and libraries as a standard?

Will there be an option to use the "compatible" string format?

By @xiaodai - 4 months
For short strings, this is basically the same concept as ShortStrings.jl in Julia.

https://github.com/JuliaString/ShortStrings.jl

By @lmeyerov - 4 months
I wonder if something like Arrow's custom extension types mechanism would streamline building novel data representations without full forks? It may highlight gaps in the extension mechanism.

For similar reasons, we've been curious about new compression modes on indexes

By @ismailmaj - 4 months
I might have missed it in the article, but I'm not sure why the prefix is stored for strings that can't be inlined.

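One plausible answer, sketched: the 4-byte prefix lets equality and ordering checks on long strings often resolve without ever dereferencing the out-of-line buffer, which matters for cache behavior in filters and joins. The types below are illustrative, not Polars' actual implementation:

```rust
// Illustrative long-string view: the prefix enables a fast-path comparison
// that avoids touching the shared byte buffer unless the prefixes tie.
struct LongView {
    len: u32,
    prefix: [u8; 4],
    offset: usize, // into a shared buffer, only consulted on a prefix tie
}

fn views_maybe_equal(a: &LongView, b: &LongView) -> bool {
    // Different lengths or prefixes: definitely unequal, no buffer access.
    a.len == b.len && a.prefix == b.prefix
}

fn eq(a: &LongView, b: &LongView, buffer: &[u8]) -> bool {
    if !views_maybe_equal(a, b) {
        return false; // fast path: no pointer chase into the buffer
    }
    // Slow path: prefixes tie, compare the full payloads.
    let sa = &buffer[a.offset..a.offset + a.len as usize];
    let sb = &buffer[b.offset..b.offset + b.len as usize];
    sa == sb
}
```
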
By @Joker_vD - 4 months
> As I mentioned above already pre-allocating the required size of data is hard. This leads to many reallocations and memcopy’s during the building of this type.

Well. Reallocations have to happen mostly because the virtual memory space is flat, so you can't just grow your allocations without the possibility of accidentally bumping into some other object. But having a non-flat virtual memory space is really inconvenient (Segment selectors! CHERI! And what about muh address arithmetic?) for other reasons, so here we are.

I toyed with the idea of having a specialized memory allocator for incrementally growing, potentially very large buffers by having it space allocations by, say, 16 GiB, and then there would be the "finalize" operation which would hand over the buffer's contents to malloc by asking malloc to allocate the exact final buffer size (rounded up to the page size) and then, instead of memcpy-ing the data, I'd persuade the OS to remap the physical pages of the existing allocation into the virtual address returned by malloc. The original buffer's virtual addresses then would become unmapped and could be reused.

Unfortunately, I couldn't quite persuade the OS to do that with user-available memory management API so it all came to nothing. I believe there was similar research in the early 90s and it failed because it too required custom OS modifications.

By @zeristor - 4 months
No mention of byte alignment, am I just thinking too low level for optimisation?

By @sixtram - 4 months
Does anyone know what tool and font were used in the illustrations in the article?

By @mijoharas - 4 months
> The huge (in my opinion pathological) downside of the this data type is that the gather and filter operation on this type scale linear in time complexity with the string sizes.

Can anyone explain why this is?
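
One plausible answer, sketched under the classic Arrow variable-length layout (an offsets array plus one contiguous data buffer): gather must produce a new contiguous buffer, so it has to copy each selected string's actual bytes, making the work proportional to total string bytes rather than row count. A minimal Rust model (types illustrative, not Arrow's real implementation):

```rust
// Classic offsets-based string column: row i's bytes are
// data[offsets[i]..offsets[i + 1]]. That contiguity invariant is what
// forces gather to copy payload bytes.
struct OffsetsColumn {
    offsets: Vec<usize>, // offsets.len() == rows + 1
    data: Vec<u8>,       // all string bytes, back to back
}

impl OffsetsColumn {
    fn from_strs(strs: &[&str]) -> Self {
        let mut offsets = vec![0];
        let mut data = Vec::new();
        for s in strs {
            data.extend_from_slice(s.as_bytes());
            offsets.push(data.len());
        }
        Self { offsets, data }
    }

    fn get(&self, i: usize) -> &str {
        std::str::from_utf8(&self.data[self.offsets[i]..self.offsets[i + 1]]).unwrap()
    }

    fn gather(&self, indices: &[usize]) -> Self {
        let mut out_offsets = vec![0];
        let mut out_data = Vec::new();
        for &i in indices {
            // Payload bytes must be copied to keep the new buffer contiguous:
            // O(string length) per row, hence linear scaling in string sizes.
            out_data.extend_from_slice(&self.data[self.offsets[i]..self.offsets[i + 1]]);
            out_offsets.push(out_data.len());
        }
        Self { offsets: out_offsets, data: out_data }
    }
}
```

A view-based layout avoids this by copying only fixed-size view structs and leaving the byte buffers in place, which is why the article reports constant-time gather regardless of string length.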

By @justinsaccount - 4 months
Is there a standalone C/C++ implementation of this string type anywhere?

By @renewiltord - 4 months
Haha, quite a huge improvement for a point release, 0.20.5 to 0.20.6.

This is a cool article. Good motivation. Good explanation. Plots. ~~One small thing is that the plots are missing a legend so I had to hover~~. Nevermind, I don't know why it didn't show (or why I thought that) because I can clearly see them on the x-axis now.

By @Kalanos - 4 months
Interesting. Strings were supposed to be one of Arrow's big features in comparison to NumPy, right?

By @padthai - 4 months
I am getting Vietnam flashbacks of all the small incompatibilities between Numpy arrays and Pandas Series