September 9th, 2024

What is the best pointer tagging method?

Pointer tagging methods show varying performance depending on architecture and use case, with NaN boxing being particularly efficient for floating-point-heavy languages. Overall system performance is influenced more by memory access patterns and cache efficiency than by the tagging method.


Pointer tagging methods yield broadly similar performance, with specific advantages depending on the architecture and use case. The article discusses various pointer tagging techniques, including lower bits, lower byte, upper byte, upper bits, and NaN boxing, each using a different part of the pointer to store metadata. Benchmarks reveal that while untagged pointers generally perform better, certain tagged methods can pull ahead in specific scenarios, particularly when compiler optimizations apply. The performance of these methods can vary significantly between architectures, such as ARM and x86, due to differences in their instruction sets. Notably, NaN boxing is highlighted for its efficiency in languages that primarily use floating-point numbers, as it allows pointers to be embedded directly into floating-point values. Ultimately, the article concludes that while pointer tagging can optimize certain operations, overall system performance is influenced more heavily by memory access patterns and cache efficiency than by the choice of tagging method.

- Pointer tagging encodes metadata into pointers for compact representation.

- Performance varies by architecture; ARM and x86 show different efficiencies.

- NaN boxing is beneficial for floating-point-heavy languages.

- Untagged pointers generally outperform tagged ones, but specific scenarios favor tagging.

- Memory access patterns and cache efficiency are critical for overall system performance.

10 comments
By @chc4 - 8 months
> Currently, ARM is the only major vendor with support for TBI

is not true. Intel and AMD both have variants of TBI on their chips, called Linear Address Masking and Upper Address Ignore respectively. It's a bit of a mess, unfortunately, with both masking off different bits from the top of the address (and different bits than ARM TBI does), but it does exist.

By @jandrewrogers - 8 months
It is worth mentioning that on Intel x86 going all the way back to Haswell you can use PDEP/PEXT instructions to efficiently combine the high bits and the low bits into a single seamless tag. This can provide a lot of bit width. The caveat is AMD x86 implemented these instructions as uselessly slow microcode until quite recently, which created some portability issues.
By @naasking - 8 months
> However, that is not something that is easy to create a microbenchmark for. The benefit of nan-boxing is that we don’t have to dereference a pointer to get the float.

That's not the only benefit. The main benefit is arguably that you don't have to allocate floats on the heap and garbage collect them. Numerical code allocates lots of numbers, so having these all be inline rather than heap-allocated saves lots of space and time.

By @dzaima - 8 months
On NaN-boxing, it's possible to put tags in the top instead of low bits - 64-bit floats have 52 bits of mantissa, 4 of which are in the top 16; though you only get 7 tags due to having to leave qNaN & infinity (8 may be possible if you can guarantee that the zero tag never has zero payload), or 4 for potentially simpler tag checks. Or, going the other direction, you can double the tag count to 14 or 16 by also using the sign, at the cost of a "<<1" in the is-float check.
By @erik_seaberg - 8 months
I had never heard of "top byte ignore," but it reminds me of the macOS migration to "32-bit clean" software as the hardware began to support more than the low 24 address lines.

https://en.wikipedia.org/wiki/Classic_Mac_OS_memory_manageme...

The other approach is CompressedOops: instead of wasting pointer bits (or using them for tags), Java's HotSpot VM stores only a 32-bit offset for an 8-byte-aligned heap object, provided the entire heap is known to fit within 2^(32+3) bytes — 32 GB — of its base address.

https://news.ycombinator.com/item?id=22398251

And didn't somebody write about creating a large aligned arena for each type and essentially grabbing the base address of the arena as a (non-unique) type tag for its objects? Then the moving GC would use these arenas as semispaces.

By @tempodox - 8 months
I like how SBCL does it. Heap box addresses have their bit 0 set, which makes them odd and thus unfit for direct access. But real accesses are register indirect with offset, with odd offsets to get an even address. So each time you see an odd address offset in SBCL-generated assembly, you know you're dealing with a heap box. I can only surmise this was a deliberate choice to aid orientation when reading generated assembly. If so, somebody among the designers of SBCL has a heart for crazy people like me.
By @JonChesterfield - 8 months
I suspect, but haven't properly measured, that pointer tagging upsets speculative loads / branch prediction (when you're loading an address) to varying extent across different tagging schemes and different cpu implementations. I'd hope setting low bits are cheaper than high bits but really should write the microbenchmark to find out someday. Anyone know of existing attempts to characterise that?
By @o11c - 8 months
This doesn't mention split tagging (or of course, the best answer of "no tagging", which is often best implemented as "tag stored in debug mode, but not normally since static types are tracked").

If you can reduce your tag to a single bit (object vs primitive), a single byte of tag data can cover 8 variables, and a bigger integer can cover a whole 64 variables, plenty for most functions.

By @saagarjha - 8 months
Sorry, if the tests were run on a MacBook why are there Intel assembly snippets?