September 9th, 2024

What is the best pointer tagging method?

Pointer tagging methods show varying performance depending on architecture and use case, with NaN boxing being particularly efficient for floating-point-heavy languages. Overall system performance is influenced more by memory access patterns and cache efficiency than by the tagging method.


Pointer tagging methods yield broadly similar performance, with specific advantages depending on the architecture and use case. The article discusses various pointer tagging techniques, including lower bits, lower byte, upper byte, upper bits, and NaN boxing, each using a different part of the pointer to store metadata. Benchmarks reveal that while untagged pointers generally perform better, certain tagged methods can pull ahead in specific scenarios, particularly when compiler optimizations apply. The performance of these methods can vary significantly between architectures, such as ARM and x86, due to differences in their instruction sets. Notably, NaN boxing is highlighted for its efficiency in languages that primarily use floating-point numbers, as it allows pointers to be embedded directly into floating-point values. Ultimately, the article concludes that while pointer tagging can optimize certain operations, overall system performance is influenced more heavily by memory access patterns and cache efficiency than by the choice of tagging method.

- Pointer tagging encodes metadata into pointers for compact representation.

- Performance varies by architecture; ARM and x86 show different efficiencies.

- NaN boxing is beneficial for floating-point-heavy languages.

- Untagged pointers generally outperform tagged ones, but specific scenarios favor tagging.

- Memory access patterns and cache efficiency are critical for overall system performance.

10 comments
By @chc4 - 8 months
> Currently, ARM is the only major vendor with support for TBI

is not true. Intel and AMD both have variants of TBI on their chips, called Linear Address Masking and Upper Address Ignore respectively. It's a bit of a mess, unfortunately, with both masking off different bits from the top of the address (and different bits than ARM TBI does), but it does exist.

By @jandrewrogers - 8 months
It is worth mentioning that on Intel x86 going all the way back to Haswell you can use PDEP/PEXT instructions to efficiently combine the high bits and the low bits into a single seamless tag. This can provide a lot of bit width. The caveat is AMD x86 implemented these instructions as uselessly slow microcode until quite recently, which created some portability issues.
By @naasking - 8 months
> However, that is not something that is easy to create a microbenchmark for. The benefit of nan-boxing is that we don’t have to dereference a pointer to get the float.

That's not the only benefit. The main benefit is arguably that you don't have to allocate floats on the heap and garbage collect them. Numerical code allocates lots of numbers, so having these all be inline rather than heap-allocated saves lots of space and time.

By @dzaima - 8 months
On NaN-boxing, it's possible to put tags in the top instead of low bits - 64-bit floats have 52 bits of mantissa, 4 of which are in the top 16; though you only get 7 tags due to having to leave qNaN & infinity (8 may be possible if you can guarantee that the zero tag never has zero payload), or 4 for potentially simpler tag checks. Or, going the other direction, you can double the tag count to 14 or 16 by also using the sign, at the cost of a "<<1" in the is-float check.
By @erik_seaberg - 8 months
I had never heard of "top byte ignore," but it reminds me of the macOS migration to "32-bit clean" software as the hardware began to support more than the low 24 address lines.

https://en.wikipedia.org/wiki/Classic_Mac_OS_memory_manageme...

The other approach is CompressedOops: instead of wasting pointer bits (or using them for tags), Java's HotSpot VM stores only a 32-bit offset for an 8-byte-aligned heap object, provided the entire heap is known to fit within 2^(32+3) bytes — 32 GB — of its base address.

https://news.ycombinator.com/item?id=22398251

And didn't somebody write about creating a large aligned arena for each type and essentially grabbing the base address of the arena as a (non-unique) type tag for its objects? Then the moving GC would use these arenas as semispaces.

By @tempodox - 8 months
I like how SBCL does it. Heap box addresses have their bit 0 set, which makes them odd and thus unfit for direct access. But real accesses are register indirect with offset, with odd offsets to get an even address. So each time you see an odd address offset in SBCL-generated assembly, you know you're dealing with a heap box. I can only surmise this was a deliberate choice to aid orientation when reading generated assembly. If so, somebody among the designers of SBCL has a heart for crazy people like me.
By @JonChesterfield - 8 months
I suspect, but haven't properly measured, that pointer tagging upsets speculative loads / branch prediction (when you're loading an address) to varying extent across different tagging schemes and different cpu implementations. I'd hope setting low bits are cheaper than high bits but really should write the microbenchmark to find out someday. Anyone know of existing attempts to characterise that?
By @o11c - 8 months
This doesn't mention split tagging (or of course, the best answer of "no tagging", which is often best implemented as "tag stored in debug mode, but not normally since static types are tracked").

If you can reduce your tag to a single bit (object vs primitive), a single byte of tag data can cover 8 variables, and a bigger integer can cover a whole 64 variables, plenty for most functions.

By @saagarjha - 8 months
Sorry, if the tests were run on a MacBook why are there Intel assembly snippets?