Another variable-length integer encoding
The article presents two encoding schemes for small integers in binary formats: metric varint and imperial varint, highlighting their efficiency, advantages, and the use of zig-zag encoding for signed integers.
Read original articleThe article discusses the need for efficient encoding of small integer values in binary file formats, introducing two encoding schemes: metric varint and imperial varint. Metric varint, commonly used in Google’s Protocol Buffers, encodes integers using a variable-length sequence of bytes, where the most-significant bit (MSB) serves as a continuation marker. This method allows for efficient serialization and deserialization of small integers, reducing file size without employing traditional compression techniques. The article also highlights the limitations of using two's complement for signed integers, proposing zig-zag encoding to optimize space for both positive and negative values. The imperial varint is then introduced as an alternative, which modifies the metric varint by using big-endian order and consolidating continuation bits at the start of the encoded value. This change simplifies the encoding and decoding processes, making them faster and more efficient. The author encourages the use of imperial varints for future binary file format designs, citing its advantages over the traditional metric varint.
- Metric varint is widely used for encoding small integers in binary formats.
- Zig-zag encoding optimizes space for signed integers in varint encoding.
- Imperial varint improves upon metric varint by simplifying the encoding and decoding processes.
- Both encoding schemes aim to reduce file size without traditional compression.
- The article suggests considering imperial varints for future binary file format designs.
Related
The Byte Order Fiasco
Handling endianness in C/C++ programming poses challenges, emphasizing correct integer deserialization to prevent undefined behavior. Adherence to the C standard is crucial to avoid unexpected compiler optimizations. Code examples demonstrate proper deserialization techniques using masking and shifting for system compatibility. Mastery of these concepts is vital for robust C code, despite available APIs for byte swapping.
Some Tricks from the Scrapscript Compiler
The Scrapscript compiler implements optimization tricks like immediate objects, small strings, and variants for better performance. It introduces immediate variants and const heap to enhance efficiency without complexity, seeking suggestions for future improvements.
Summing ASCII encoded integers on Haswell at almost the speed of memcpy
Matt Stuchlik presents a high-performance algorithm for summing ASCII-encoded integers on Haswell systems. It utilizes SIMD instructions, lookup tables, and efficient operations to achieve speed enhancements, showcasing innovative approaches in integer sum calculations.
The Elegance of the ASCII Table
The article explores the elegance and historical importance of the ASCII table in computing. It discusses design choices, historical context, compatibility with Unicode, practical applications, and enduring relevance in technology.
Counting Bytes Faster Than You'd Think Possible
Matt Stuchlik's high-performance computing method counts bytes with a value of 127 in a 250MB stream, achieving 550 times faster performance using SIMD instructions and an innovative memory read pattern.
This statement doesn't sound right. I think he's trying to say it's not a general purpose compression scheme, but every compression scheme only works on certain patterns. In fact, most possible streams of bytes are not compressible at all. Even the general purpose approaches only work on very specific types of inputs (such as natural language) which contain a lot of redundancy.
[1]: https://www.rfc-editor.org/rfc/rfc9000.html#name-variable-le...
There is no good reason to use big-endian serialization. I think they are doing that because they only know about "count leading zeros (CLZ)" to extract the unary length prefix. "Count trailing zeros (CTZ)" is also a commonly supported operation (and can be efficiently emulated with CLZ if it does not exist) which makes it easy to extract the unary length prefix even if it is little-endian.
Signed integers can be represented with either zigzag encoding or sign extension. For the most common one-byte encoding, zigzag encoding is a worse scheme. https://maskray.me/blog/2024-03-09-a-compact-relocation-form...
My blog post is about a relocation format. I investigated a few schemes and concluded that LEB128 is the best for my use case. There are multiple reasons including super simple implementation:
static uint64_t read_leb128(unsigned char **buf, uint64_t sleb_uleb) {
uint64_t acc = 0, shift = 0, byte;
do {
byte = *(*buf)++;
acc |= (byte - 128*(byte >= sleb_uleb)) << shift;
shift += 7;
} while (byte >= 128);
return acc;
}
uint64_t read_uleb128(unsigned char **buf) { return read_leb128(buf, 128); }
int64_t read_sleb128(unsigned char **buf) { return read_leb128(buf, 64); }
A third approach not mentioned in the article is to use some form of length prefix (which is itself a fixed number of bits long). This is usually cheaper to decode, but a fixed-length prefix disproportionally "wastes" space in short numbers.
The approach in the article gives the best of both worlds, all while being very cheap to decode. I'm guessing it might also be fairly SIMD-friendly.
All modern CPUs have instruction to swap order of bytes in the integers to big endian. And many programming languages have an intrinsic function to emit these instructions. Quite useful for both encoder and decoder of these integers.
EDIT: just noticed that I'd skim-read over: "imperial varint places all of the continuation bits at the beginning of the encoded value", which makes my comment completely wrong.
~Doesn't even mention possibly one of the most useful features of encoding this way, which is that you can never [1] end up with a 0 byte in the output, meaning that you can treat the output as a C string.~
~[1] or at least never assuming you're trying to create the shortest encoding~
Related
The Byte Order Fiasco
Handling endianness in C/C++ programming poses challenges, emphasizing correct integer deserialization to prevent undefined behavior. Adherence to the C standard is crucial to avoid unexpected compiler optimizations. Code examples demonstrate proper deserialization techniques using masking and shifting for system compatibility. Mastery of these concepts is vital for robust C code, despite available APIs for byte swapping.
Some Tricks from the Scrapscript Compiler
The Scrapscript compiler implements optimization tricks like immediate objects, small strings, and variants for better performance. It introduces immediate variants and const heap to enhance efficiency without complexity, seeking suggestions for future improvements.
Summing ASCII encoded integers on Haswell at almost the speed of memcpy
Matt Stuchlik presents a high-performance algorithm for summing ASCII-encoded integers on Haswell systems. It utilizes SIMD instructions, lookup tables, and efficient operations to achieve speed enhancements, showcasing innovative approaches in integer sum calculations.
The Elegance of the ASCII Table
The article explores the elegance and historical importance of the ASCII table in computing. It discusses design choices, historical context, compatibility with Unicode, practical applications, and enduring relevance in technology.
Counting Bytes Faster Than You'd Think Possible
Matt Stuchlik's high-performance computing method counts bytes with a value of 127 in a 250MB stream, achieving 550 times faster performance using SIMD instructions and an innovative memory read pattern.