July 16th, 2024

Optimizing a Bignum Library for Fun

Austin Z. Henley from Carnegie Mellon University optimizes a bignum library by switching to 30-bit digits, boosting performance significantly. Despite improvements, he plans to add more core features for a comprehensive library.


Austin Z. Henley, an Associate Teaching Professor at Carnegie Mellon University, delves into optimizing a bignum library for fun. Initially implementing a basic arbitrary-precision number library using decimal digits, he later enhances the storage by using larger base "digits" for more efficient memory usage and reduced computational steps. By transitioning to 30-bit digits, the library can handle significantly larger numbers with fewer digits. The addition and multiplication functions are updated to operate on these 30-bit digits, resulting in substantial performance improvements. Benchmarking the new implementation against the original base-10 digits version reveals an 83% improvement in addition and a 99% improvement in multiplication speed. Further exploration leads to experimenting with the Karatsuba algorithm for faster multiplication, showcasing a 59% improvement with larger numbers. Despite these advancements, Henley acknowledges the need to incorporate core features like negative numbers, subtraction, division, exponentiation, bitwise operations, type conversions, and rigorous testing for a more comprehensive bignum library.

Related

Identifying Leap Years (2020)

David Turner explores optimizing leap year calculations for performance gains by using bitwise operations and integer bounds. He presents efficient methods, mathematical proofs, and considerations for signed integers, highlighting limitations pre-Gregorian calendar.

Y292B Bug

The Y292B bug is a potential timekeeping issue in Unix systems due to a rollover in the year 292,277,026,596. Solutions involve using dynamic languages or GNU Multiple Precision Arithmetic Library in C, emphasizing the need for kernel-level fixes.

Do not taunt happy fun branch predictor

The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.

Q Numbers

Tim Bray's ongoing fragment delves into Quamina's numeric representation and finite automata intersection. It addresses efficient event pattern matching, normalizing numbers for consistent comparison, and potential precision improvements. Introduced "Q numbers" aim for accurate matching despite performance considerations.

Summing ASCII encoded integers on Haswell at almost the speed of memcpy

Matt Stuchlik presents a high-performance algorithm for summing ASCII-encoded integers on Haswell systems. It utilizes SIMD instructions, lookup tables, and efficient operations to achieve speed enhancements, showcasing innovative approaches in integer sum calculations.

AI: What people are saying
The article on optimizing a bignum library by Austin Z. Henley from Carnegie Mellon University has sparked a detailed discussion among readers.
  • Several commenters share their experiences and tips on implementing and optimizing bignum libraries, emphasizing the importance of word size, endianity, and CPU-specific optimizations.
  • There is a discussion on the use of different algorithms and techniques, such as Knuth's classical algorithms, Karatsuba's algorithm, and the use of assembly for performance improvements.
  • Some commenters express surprise or disappointment at the lack of certain algorithm discussions, like Karatsuba's, and the choice of terminology used in the article.
  • Others share resources and links to their own implementations and related projects, providing additional context and tools for those interested in bignum libraries.
  • A few comments touch on the broader implications and potential improvements in programming languages and libraries, such as the idea of making arbitrary precision integers the default in Go.
16 comments
By @worstspotgain - 3 months
I've had to implement BigNum on an embedded system that had very little RAM and where the initial optimized version of ModExp took many minutes to complete. After much hair-pulling, the final version took 40 seconds.

First, you should work with unsigned numbers, and use a power of 2 as your word size. The fastest choice for a word size is very operation- and CPU-dependent.

A key trick is to lay out your bignums in the same order as the endianity of each word in memory, e.g. least-significant word first on little-endian systems. This will allow you to choose your word size dynamically for each operation: in memory, a number with M words of N bits each is identical to a number with M / 2 words of N * 2 bits each.

For multiplication, identify the CPU instruction with the widest result, then use half that size as your word size. Each step through the arrays generates a word result in the low half and a word carry in the top half. The carry gets added to the result of the next step, possibly overflowing.

For addition, use the widest result as your word size. This can also overflow.

How you deal with overflows is very CPU-dependent. You can use adc/addc as someone else mentioned, which will be faster on embedded and may be faster on fatter chips. Alternatively, you can halve the word size and use the top half as the carry.

If addc is not available, you can test for overflows as follows:

    uint32_t a = ..., b = ...;
    uint32_t res = a + b;      /* wraps modulo 2^32 on overflow */
    uint32_t carry = res < a;  /* 1 if the addition wrapped */
On overflow, res must necessarily be less than both a and b, so no need to check b.

If SIMD instructions are available, that will almost always be the fastest choice by far. While it doesn't change the above guidelines in principle, there are often e.g. optimized overflow mechanisms.

By @haberman - 3 months
Coincidentally I was just writing a bignum library from scratch two weeks ago.

A few interesting things I learned:

1. Knuth volume 2 has an extensive discussion of the problem space. I've only gotten a chance to skim it so far, but it looks interesting and approachable.

2. I need to support bitwise operations, which operate on a two's complement representation, so I figured it would be simpler to use two's complement internally, despite seeing that most (all?) bignum libraries use signed magnitude. I'm starting to regret this: two's complement introduces a lot of complexity.

The most fun thing about working on a bignum library is that it makes the algorithms you learned in grade school for add/subtract/multiply/divide relevant again. The basic ("classical") algorithms on bignum are basically exactly the same thing you learned in grade school, except on a much larger base than base-10.

By @minimize - 3 months
For maximum efficiency, you should work in binary instead of base 10. Handling carries becomes more straightforward with the right primitives, for example __builtin_addc with GCC: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins...

You can also implement it in C if you want a more portable solution: https://github.com/983/bigint/blob/ee0834c65a27d18fa628e6c52...

If you scroll around, you can also find my implementations for multiplication and such.

By @globular-toast - 3 months
Funny, I did this myself 10 years ago. Shit, it's been that long..?

For my undergrad project I wrote a computer algebra system to do symbolic integration. The supervisor was a hardcore, old school C guy, so naturally I was going to just use C and no libraries. He told me I'd need bignums first, so I got to work (this is because many algorithms like polynomial GCD create massive numbers as they go, even if the inputs and final outputs are generally very small).

I just couldn't figure out how to do better than the largest power of 10 per digit at the time. Working with non-base-10 arithmetic was a mind fuck for me back then. So I did it with digits holding 10^9 and the classical algorithms from Knuth. Division is the hardest!

At some point I discovered the GNU multiple precision library (GMP) and made my program work with that instead of mine. I was shocked at how much faster GMP was! I finished my project with my own code, but I knew I had to come back to do it better.

The breakthrough came when I got a copy of Hacker's Delight. It has stuff like how to detect overflow after it's happened (in C). Something twigged and then I just understood how to fill each word completely rather than use a power of 10. I don't know what confused me before.

But, of course, the real way to do it is to use assembly. You can't get close to high performance in C alone. In assembly you get the overflow bit. It's actually easier in a way! So you write tiny platform specific bits for the digits and build on that in C. My add and subtract were then as fast as GMP. I lost interest when it came to implement faster multiplication algorithms.

Code in case anyone is interested: https://github.com/georgek/bignums

By @lifthrasiir - 3 months
Modulo is surprisingly expensive even when you combine it with a quotient. It is almost always better to use binary "limbs", in this case 31 or 32 bits wide, because decimal parsing and printing should be much rarer than individual operations in general.
By @guyomes - 3 months
For a small survey of practical efficient methods for bignum arithmetic operations, the Algorithms section of the documentation of GMP [1] is excellent.

[1]: https://gmplib.org/manual/Algorithms

By @tgot - 3 months
I think that your description is almost excellent, but that you're fundamentally misleading in describing what you are doing as a "30-bit" digit.

It's a 10^9 digit mathematically, occupying 30 bits of storage. You do briefly mention that it's 10^9, but repeatedly say 30-bits.

By @nj5rq - 3 months
Fascinating article, I have always wondered how these big number libraries work.

As a side question, does anyone know which program the author used to make that "addition" and "multiplication" performance graph? Thanks.

By @kazinator - 3 months
LOL, I've been there: https://www.kylheku.com/cgit/txr/commit/mpi?id=98dedd310b1d5...

(Patch was originally from 2011; it was bugfixed once, and then in 2015 converted to git commit:

https://www.kylheku.com/cgit/txr/commit/mpi-patches/faster-s... )

Another one: faster highest-bit search:

https://www.kylheku.com/cgit/txr/tree/mpi-patches/bit-search...

That does use GCC built-ins today: https://www.kylheku.com/cgit/txr/commit/?id=15b7c542dc44899e...

By @parentheses - 3 months
I am so surprised that there's no exploration of Karatsuba's algorithm. That's what makes the Python implementation perform well.

I actually came here hoping to find discussion on Karatsuba. https://en.m.wikipedia.org/wiki/Karatsuba_algorithm

By @chubot - 3 months
Very useful post! Also it's cool to see how many people in this thread have worked on this problem -- lots of new info here I haven't seen

I wonder if anyone is interested in implementing big numbers in Oils? It's a Unix shell with TWO complete implementations - the "executable spec" in Python, and an automatic translation to pure C++ (which is 30x-50x faster)

We currently use 64-bit integers in C++, but big nums are a better semantic. Some trivia about bad shell semantics here:

Integers - Don't do whatever Python or C++ does - https://www.oilshell.org/blog/2024/03/release-0.21.0.html#in...

This is a very self-contained project: the interface is defined by 200 lines of Python:

https://github.com/oilshell/oil/blob/master/mycpp/mops.py

and the trivial 64-bit overflowing implementation is also about 200 lines:

https://github.com/oilshell/oil/blob/master/mycpp/gc_mops.h

https://github.com/oilshell/oil/blob/master/mycpp/gc_mops.cc

(We have a fast Ninja-based build system, so you can probably iterate on this in 100 milliseconds or less -- it should be fun for the right person)

---

I think the main reason it is specific to Oils is that the big numbers should become GC objects. Details on our GC here:

Pictures of a Working Garbage Collector - https://www.oilshell.org/blog/2023/01/garbage-collector.html

It's been very solid for the last 18 months, basically because it's well tested by ASAN and #ifdef testing modes.

The main thing I'd be concerned with is how to TEST that big number operations are correct. I think there are probably some interesting strategies, which I'd love to discuss.

You're of course welcome to adapt existing open source code, including code you've already written -- I probably even prefer that, i.e. something that has had some real world testing. We want all the operations in that file, and it should be integrated with our GC.

---

We've had ~6 contributors funded by grants from https://nlnet.nl for the past couple years, so you can even be paid (there's a chance it depends on the country you live in, but generally a wide range of situations is OK).

Contact me at andy at oilshell.org or https://oilshell.zulipchat.com/ if interested!

By @amelius - 3 months
The title is a little bit dishonest, because they are optimizing their own bignum library.
By @paldepind2 - 3 months
Speaking of bignum libraries, I recently watched a talk with Rob Pike where he mentioned that one thing he regretted about Go was not making the default integer implementation arbitrary precision. Supposedly the performance overhead for normal numbers is very small, and you avoid the weirdness and complicated semantics of fixed precision integers. I found that to be quite fascinating, especially coming from a "low-level guy" like Rob Pike. Ever since I've been wanting a language with that feature and to understand how bignum implementations work.
By @styczen - 3 months
not + -
By @booleandilemma - 3 months
Except he's not doing it for fun at all. He's doing it for clout and publicity on HN.