August 11th, 2024

Another variable-length integer encoding

The article presents two encoding schemes for small integers in binary formats: metric varint and imperial varint, highlighting their efficiency, advantages, and the use of zig-zag encoding for signed integers.

The article discusses the need for efficient encoding of small integer values in binary file formats, introducing two encoding schemes: metric varint and imperial varint. Metric varint, commonly used in Google’s Protocol Buffers, encodes integers using a variable-length sequence of bytes, where the most-significant bit (MSB) serves as a continuation marker. This method allows for efficient serialization and deserialization of small integers, reducing file size without employing traditional compression techniques. The article also highlights the limitations of using two's complement for signed integers, proposing zig-zag encoding to optimize space for both positive and negative values. The imperial varint is then introduced as an alternative, which modifies the metric varint by using big-endian order and consolidating continuation bits at the start of the encoded value. This change simplifies the encoding and decoding processes, making them faster and more efficient. The author encourages the use of imperial varints for future binary file format designs, citing their advantages over the traditional metric varint.
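
As a rough sketch of the metric-varint and zig-zag ideas described above (an illustration of the general technique, not code from the article): each byte carries seven payload bits, the MSB is set on every byte except the last, and zig-zag maps signed values to unsigned ones so small magnitudes still encode in one or two bytes.

    #include <stddef.h>
    #include <stdint.h>

    /* Metric varint (protobuf-style): 7 payload bits per byte, MSB = 1 means
       "another byte follows". Returns the number of bytes written (at most 10
       for a 64-bit value). */
    static size_t write_metric_varint(uint64_t v, unsigned char *out) {
        size_t n = 0;
        while (v >= 0x80) {
            out[n++] = (unsigned char)(v | 0x80);  /* low 7 bits plus continuation bit */
            v >>= 7;
        }
        out[n++] = (unsigned char)v;               /* final byte, MSB clear */
        return n;
    }

    /* Zig-zag: interleave negatives with positives (0, -1, 1, -2, ... map to
       0, 1, 2, 3, ...) so small-magnitude signed values stay small. */
    static uint64_t zigzag_encode(int64_t v) {
        return ((uint64_t)v << 1) ^ (v < 0 ? ~UINT64_C(0) : 0);
    }

    static int64_t zigzag_decode(uint64_t u) {
        return (int64_t)(u >> 1) ^ -(int64_t)(u & 1);
    }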

- Metric varint is widely used for encoding small integers in binary formats.

- Zig-zag encoding optimizes space for signed integers in varint encoding.

- Imperial varint improves upon metric varint by simplifying the encoding and decoding processes (see the decoder sketch after this list).

- Both encoding schemes aim to reduce file size without traditional compression.

- The article suggests considering imperial varints for future binary file format designs.
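
A decoder sketch for the prefix style described above, assuming a layout where, for an n-byte encoding, the first byte begins with (n-1) zero bits followed by a one bit and the payload follows in big-endian order; the article's exact bit layout may differ, and the GCC/Clang __builtin_clz intrinsic is assumed.

    #include <assert.h>
    #include <stdint.h>

    /* Prefix-style ("imperial") varint decode sketch: the number of leading
       zero bits in the first byte gives the count of extra bytes; the bits
       after the marker and the following bytes hold the payload, big-endian.
       Handles 1..8-byte encodings (up to 56 payload bits). */
    static uint64_t read_prefix_varint(const unsigned char **buf) {
        unsigned char first = *(*buf)++;
        assert(first != 0);                               /* longer encodings not handled here */
        int extra = __builtin_clz((unsigned)first) - 24;  /* extra bytes after the first */
        uint64_t value = first & ((1u << (7 - extra)) - 1); /* payload bits of the first byte */
        for (int i = 0; i < extra; i++)
            value = (value << 8) | *(*buf)++;             /* big-endian payload bytes */
        return value;
    }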

Related

The Byte Order Fiasco

Handling endianness in C/C++ programming poses challenges, emphasizing correct integer deserialization to prevent undefined behavior. Adherence to the C standard is crucial to avoid unexpected compiler optimizations. Code examples demonstrate proper deserialization techniques using masking and shifting for system compatibility. Mastery of these concepts is vital for robust C code, despite available APIs for byte swapping.
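
For context, the shift-and-mask pattern that article advocates looks roughly like this (a sketch, not the article's own code):

    #include <stdint.h>

    /* Read a 32-bit big-endian value from a byte buffer without relying on the
       host's byte order or on type punning: shifts and ORs are well defined
       regardless of endianness. */
    static uint32_t load_be32(const unsigned char *p) {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }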

Some Tricks from the Scrapscript Compiler

The Scrapscript compiler implements optimization tricks like immediate objects, small strings, and variants for better performance. It introduces immediate variants and const heap to enhance efficiency without complexity, seeking suggestions for future improvements.

Summing ASCII encoded integers on Haswell at almost the speed of memcpy

Matt Stuchlik presents a high-performance algorithm for summing ASCII-encoded integers on Haswell systems. It utilizes SIMD instructions, lookup tables, and efficient operations to achieve speed enhancements, showcasing innovative approaches in integer sum calculations.

The Elegance of the ASCII Table

The article explores the elegance and historical importance of the ASCII table in computing. It discusses design choices, historical context, compatibility with Unicode, practical applications, and enduring relevance in technology.

Counting Bytes Faster Than You'd Think Possible

Matt Stuchlik's high-performance computing method counts bytes with a value of 127 in a 250MB stream, achieving 550 times faster performance using SIMD instructions and an innovative memory read pattern.

13 comments
By @sethev - 5 months
> Note that even though the goal, and end result, is that the file is smaller, this is not a compression scheme, since it only works when your values are distributed in one particular way

This statement doesn't sound right. I think he's trying to say it's not a general purpose compression scheme, but every compression scheme only works on certain patterns. In fact, most possible streams of bytes are not compressible at all. Even the general purpose approaches only work on very specific types of inputs (such as natural language) which contain a lot of redundancy.

By @maxmuen - 5 months
IMHO, spending 12.5% of all bits just to store the length seems wasteful. The encoding in QUIC[1] puts the length in the leading 2 bits. Very fast to read and write, with just one caveat: at most 62 bits are allowed.

[1]: https://www.rfc-editor.org/rfc/rfc9000.html#name-variable-le...
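
For reference, the QUIC encoding (RFC 9000, "Variable-Length Integer Encoding") stores a length tag in the two most significant bits of the first byte, 00/01/10/11 selecting a 1-, 2-, 4-, or 8-byte field, with the remaining 6, 14, 30, or 62 bits holding the value in network byte order. A minimal decoder sketch:

    #include <stdint.h>

    /* Decode a QUIC variable-length integer (RFC 9000): the two most
       significant bits of the first byte encode the total length (1, 2, 4, or
       8 bytes); the remaining bits hold the value, big-endian. */
    static uint64_t read_quic_varint(const unsigned char **buf) {
        const unsigned char *p = *buf;
        int len = 1 << (p[0] >> 6);           /* 1, 2, 4, or 8 bytes */
        uint64_t v = p[0] & 0x3F;             /* strip the two length bits */
        for (int i = 1; i < len; i++)
            v = (v << 8) | p[i];
        *buf = p + len;
        return v;
    }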

By @Veserv - 5 months
Unary-encoded length prefix and big-endian serialization.

There is no good reason to use big-endian serialization. I think they are doing that because they only know about "count leading zeros (CLZ)" to extract the unary length prefix. "Count trailing zeros (CTZ)" is also a commonly supported operation (and can be efficiently emulated with CLZ if it does not exist) which makes it easy to extract the unary length prefix even if it is little-endian.
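
A sketch of the little-endian variant the comment describes (not a specific published format): put the unary length tag in the low bits of the first byte so a count-trailing-zeros instruction extracts it, and append the payload bytes in little-endian order. Assumes the GCC/Clang __builtin_ctz intrinsic and 1-to-8-byte encodings.

    #include <assert.h>
    #include <stdint.h>

    /* Little-endian prefix varint sketch: the number of trailing zero bits in
       the first byte gives the count of extra bytes; the remaining high bits
       of the first byte plus the following bytes hold the payload. */
    static uint64_t read_le_prefix_varint(const unsigned char **buf) {
        unsigned char first = *(*buf)++;
        assert(first != 0);                              /* longer encodings not handled here */
        int extra = __builtin_ctz((unsigned)first);      /* extra bytes after the first */
        uint64_t value = (uint64_t)first >> (extra + 1); /* payload bits of the first byte */
        for (int i = 0; i < extra; i++)
            value |= (uint64_t)*(*buf)++ << (7 - extra + 8 * i); /* little-endian payload */
        return value;
    }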

By @MaskRay - 5 months
The scheme proposed in this blog post is also called PrefixVarInt.

Signed integers can be represented with either zigzag encoding or sign extension. For the most common one-byte encoding, zigzag encoding is a worse scheme. https://maskray.me/blog/2024-03-09-a-compact-relocation-form...

My blog post is about a relocation format. I investigated a few schemes and concluded that LEB128 is the best for my use case. There are multiple reasons, including a super simple implementation:

    // Shared reader for both ULEB128 and SLEB128. The threshold argument is
    // 128 for unsigned and 64 for signed: subtracting 128 strips the
    // continuation bit on non-final bytes, and for SLEB128 it also
    // sign-extends a final byte whose 7-bit group has the sign bit set.
    static uint64_t read_leb128(unsigned char **buf, uint64_t sleb_uleb) {
      uint64_t acc = 0, shift = 0, byte;
      do {
        byte = *(*buf)++;
        acc |= (byte - 128*(byte >= sleb_uleb)) << shift;
        shift += 7;
      } while (byte >= 128);
      return acc;
    }

    uint64_t read_uleb128(unsigned char **buf) { return read_leb128(buf, 128); }
    int64_t read_sleb128(unsigned char **buf) { return read_leb128(buf, 64); }

By @mananaysiempre - 5 months
So, UTF-8 without the self-synchronization part. It’s a reasonable encoding, sure. I’m not sure what the cost of making the whole thing big endian is, though (guess I need to look at simdutf8!); at first blush it feels like making this wholly little-endian-adapted, with the continuation bits = byte count in unary located in the low bits, would result in very cheap decoding on little-endian machines.
By @Retr0id - 5 months
I like this a lot.

A third approach not mentioned in the article is to use some form of length prefix (which is itself a fixed number of bits long). This is usually cheaper to decode, but a fixed-length prefix disproportionately "wastes" space in short numbers.

The approach in the article gives the best of both worlds, all while being very cheap to decode. I'm guessing it might also be fairly SIMD-friendly.

By @fritzo - 5 months
It seems like a more SIMD-friendly version would separately store a byte mask and the packed byte data. Then you can simultaneously expand 16 int32s using e.g. AVX512's _mm512_maskz_expand_epi8(mask,_mm512_loadu_epi8(data)).
By @Const-me - 5 months
The encoding proposed by that blog post matches the encoding used in MKV container, see section #4 “Variable-Size Integer” in that specification: https://www.rfc-editor.org/rfc/rfc8794.html

All modern CPUs have an instruction to swap the byte order of an integer, and many programming languages have an intrinsic function to emit it. Quite useful for both the encoder and the decoder of these integers.
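
For example, GCC and Clang expose __builtin_bswap64 (MSVC has _byteswap_uint64, and C++23 adds std::byteswap); on a little-endian host, reading a big-endian 8-byte field can be a plain load plus a single swap. A sketch assuming a little-endian host and a GCC/Clang toolchain:

    #include <stdint.h>
    #include <string.h>

    /* Load an 8-byte big-endian field on a little-endian host: memcpy avoids
       alignment and aliasing issues, and __builtin_bswap64 compiles to a
       single byte-swap instruction on most targets. A big-endian host would
       skip the swap. */
    static uint64_t load_be64(const unsigned char *p) {
        uint64_t v;
        memcpy(&v, p, sizeof v);
        return __builtin_bswap64(v);
    }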

By @rocqua - 5 months
Cool, it certainly makes sense to let the parser know ahead of time how many bytes to read. Not just for cleaner code, but for faster code as well.
By @ivanjermakov - 5 months
How does this compare in compression ratio to VLQ and VU128?
By @ralferoo - 5 months
> And, for reasons we’ll see in a bit, we flip the meaning of the continuation bits: we’ll use 0 to mean “another byte follows”, and 1 to mean “this is the last byte”.

EDIT: just noticed that I'd skim-read over: "imperial varint places all of the continuation bits at the beginning of the encoded value", which makes my comment completely wrong.

~Doesn't even mention possibly one of the most useful features of encoding this way, which is that you can never [1] end up with a 0 byte in the output, meaning that you can treat the output as a C string.~

~[1] or at least never assuming you're trying to create the shortest encoding~

By @jeffbee - 5 months
Unary length prefix is a solid technique. I would be careful with the performance claims, though. Sometimes what the machine can do is surprising and the performance of a system that looks like it needs to loop and branch is faster than you expected. The protobuf project, unsurprisingly for a project of its age, has several different varint parsing strategies. One is described at https://github.com/protocolbuffers/protobuf/blob/main/src/go... and another at https://github.com/protocolbuffers/protobuf/blob/main/src/go...