Why Polars rewrote its Arrow string data type
Polars has refactored its string data structure for improved performance, adopting a new storage layout inspired by Hyper/Umbra that stores small strings inline and speeds up filter operations.
Polars has undergone a significant refactor of its string and binary data structure, driven by the need for better performance and full control over the implementation. Previously, Polars relied on the Arrow2 crate, which limited its ability to modify the string type because of strict adherence to the Arrow specification. After forking the relevant parts of Arrow2 into a version tailored to Polars, the team was able to implement a new string storage method inspired by the Hyper/Umbra database system. The new layout handles string data more efficiently: small strings can now be stored inline, reducing access overhead, and it fixes performance problems with gather and filter operations, which previously scaled linearly with string size. Benchmarks indicate that with the new string type these operations run in constant time regardless of string length. There are trade-offs, such as higher overhead for mostly-unique strings and the need for garbage collection, but the overall benefits of the new implementation are substantial. The Polars team anticipates further performance gains as it continues to optimize its memory management.
- Polars has rewritten its string data type for better performance and control.
- The new string storage method is based on the Hyper/Umbra database system.
- Performance benchmarks show significant improvements in filter operations.
- The refactor allows for inline storage of small strings, reducing access overhead.
- Future optimizations are expected as Polars continues to refine its data management.
Related
Announcing Polars 1.0 (Blog Post)
Polars releases Python version 1.0 after 4 years, gaining popularity with 27.5K GitHub stars and 7M monthly downloads. Plans include improving performance, GPU acceleration, Polars Cloud, and new features.
Some Tricks from the Scrapscript Compiler
The Scrapscript compiler implements optimization tricks like immediate objects, small strings, and variants for better performance. It introduces immediate variants and const heap to enhance efficiency without complexity, seeking suggestions for future improvements.
Why German Strings Are Everywhere
CedarDB introduced "German Strings" for efficient data processing, adopted by systems like DuckDB, Apache Arrow, Polars, and Facebook Velox. German Strings optimize function calls, offer performance benefits, and provide controlled lifetimes for application use.
Polylith
Polylith is a software architecture that enhances backend development by promoting functional thinking, offering composable building blocks, and simplifying code sharing, while being adaptable across various programming languages.
Does PostgreSQL respond to the challenge of analytical queries?
PostgreSQL has advanced in handling analytical queries with foreign data wrappers and partitioning, improving efficiency through optimizer enhancements, while facing challenges in pruning and statistical data. Ongoing community discussions aim for further improvements.
- Concerns about performance, particularly the previously linear time complexity of gather and filter operations in string size.
- Discussion on the implications of short string optimization and its compatibility with existing libraries like DuckDB and Arrow.
- Questions about memory management, including byte alignment and the challenges of reallocations during string operations.
- References to similar concepts in other programming languages, such as Julia's ShortStrings.jl.
- General appreciation for the improvements made in the latest release, alongside minor critiques regarding the article's clarity.
Author is not aware of https://docs.rs/compact_str/latest/compact_str/ or https://github.com/bodil/smartstring
In my mind this reads identical to "if you're a security practitioner, worry about this bit here."
Will there be an option to use the "compatible" string format?
For similar reasons, we've been curious about new compression modes on indexes
Well. Reallocations have to happen mostly because the virtual memory space is flat, so you can't just grow your allocations without the risk of accidentally bumping into some other object. But a non-flat virtual memory space is really inconvenient for other reasons (Segment selectors! CHERI! And what about muh address arithmetic?), so here we are.
I toyed with the idea of a specialized memory allocator for incrementally growing, potentially very large buffers: it would space allocations by, say, 16 GiB, and offer a "finalize" operation that hands the buffer's contents over to malloc. Finalizing would ask malloc for the exact final buffer size (rounded up to the page size) and then, instead of memcpy-ing the data, persuade the OS to remap the physical pages of the existing allocation into the virtual address returned by malloc. The original buffer's virtual addresses would then become unmapped and could be reused.
Unfortunately, I couldn't quite persuade the OS to do that with user-available memory management API so it all came to nothing. I believe there was similar research in the early 90s and it failed because it too required custom OS modifications.
Can anyone explain why this is?
This is a cool article. Good motivation. Good explanation. Plots. ~~One small thing is that the plots are missing a legend so I had to hover~~. Nevermind, I don't know why it didn't show (or why I thought that) because I can clearly see them on the x-axis now.