November 15th, 2024

1-Bit AI Infrastructure

Recent advancements in 1-bit Large Language Models, particularly BitNet b1.58, improve efficiency, delivering 2.37x to 6.17x speedups on x86 CPUs and enabling deployment on a broader range of devices.

Read original article
1-Bit AI Infrastructure

Recent advancements in 1-bit Large Language Models (LLMs), particularly BitNet and BitNet b1.58, have shown potential for improving the efficiency of LLMs in terms of speed and energy consumption. This paper introduces a specialized software stack designed to maximize the capabilities of 1-bit LLMs, focusing on the development of kernels that facilitate fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs. Experimental results indicate significant performance improvements, with speed enhancements ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, depending on the model size. The findings suggest that this approach not only boosts inference speed but also supports the deployment of LLMs on a wider array of devices, making them more accessible for local use. The code for this implementation is made available for further research and application.

- 1-bit LLMs like BitNet b1.58 enhance efficiency in speed and energy use.

- A new software stack has been developed for fast and lossless inference on CPUs.

- Speed improvements range from 2.37x to 6.17x on x86 CPUs and 1.37x to 5.07x on ARM CPUs.

- The advancements enable broader deployment of LLMs across various devices.

- The code for the implementation is publicly accessible for further exploration.
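For context on what "ternary" means here, BitNet b1.58 constrains weights to {-1, 0, +1}. A minimal C++ sketch of absmean-style quantization as described for b1.58 (names and structure are illustrative, not taken from bitnet.cpp) could look like this:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch of absmean ternary quantization: scale each weight by the mean
// absolute value of the tensor, then round and clip to {-1, 0, +1}.
// Illustrative only; not the actual bitnet.cpp implementation.
struct TernaryTensor {
    std::vector<int8_t> q;  // ternary values in {-1, 0, +1}
    float scale;            // per-tensor scale used to dequantize
};

TernaryTensor quantize_absmean(const std::vector<float>& w) {
    float mean_abs = 0.0f;
    for (float x : w) mean_abs += std::fabs(x);
    mean_abs /= static_cast<float>(w.size());
    if (mean_abs == 0.0f) mean_abs = 1e-8f;  // guard against division by zero

    TernaryTensor t;
    t.scale = mean_abs;
    t.q.reserve(w.size());
    for (float x : w) {
        float v = std::round(x / mean_abs);
        v = std::max(-1.0f, std::min(1.0f, v));  // clip to {-1, 0, +1}
        t.q.push_back(static_cast<int8_t>(v));
    }
    return t;
}
```

With weights restricted to {-1, 0, +1}, the dot products inside matrix multiplication reduce to additions, subtractions, and skips, which is where the CPU kernels get most of their speedup.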

9 comments
By @dailykoder - 1 day
I first read about this quite a few weeks ago and found it very interesting.

Now that I have done more than enough CPU design inside FPGAs, I wanted to try something new, some computation-heavy things that could benefit from an FPGA. Does anyone here know how feasible it'd be to implement something like that on an FPGA? I only have rather small chips (Artix-7 35T and a PolarFire SoC with 95k logic slices), so I know I won't be able to fit a full LLM into that, but something should be possible.

Maybe I should refresh the fundamentals though and start with MNIST. But the question is rather: what is a realistic goal that I could possibly reach with these small FPGAs? Performance might be secondary; I am more interested in what's possible regarding complexity/features on a small device.

Also, has anyone here compiled OpenCL (or GL?) kernels for FPGAs and can give me a starting point? I was wondering if it's possible to have a working backend for something like tinygrad[1]. I think this would be a good way to learn all the different layers of how such frameworks actually work.

- [1] https://github.com/tinygrad/tinygrad
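One reason ternary weights are attractive for FPGA work is that the multiply-accumulate degenerates into conditional add/subtract, so no DSP multipliers are needed. A minimal sketch of that inner loop (plain C++ rather than HDL, with illustrative names):

```cpp
#include <cstdint>
#include <vector>

// Sketch of a ternary dot product: with weights in {-1, 0, +1} there are no
// multiplications, only conditional adds and subtracts. On an FPGA this maps
// to adders and muxes rather than DSP multipliers. Illustrative names only.
int32_t ternary_dot(const std::vector<int8_t>& activations,  // e.g. int8 activations
                    const std::vector<int8_t>& weights) {     // values in {-1, 0, +1}
    int32_t acc = 0;
    for (size_t i = 0; i < weights.size(); ++i) {
        if (weights[i] > 0)      acc += activations[i];
        else if (weights[i] < 0) acc -= activations[i];
        // weight == 0: contributes nothing
    }
    return acc;
}
```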

By @sva_ - 1 day
It seems like arXiv replaced 'bitnet.cpp' with the link 'this http url', even though '.cpp' is clearly not a TLD. Poor regex?
By @ttyprintk - 6 days
By @js8 - about 24 hours
It's technically not 1-bit, but 2-bit.

Anyway, I wonder if there is some HW support in modern CPUs/GPUs for linear algebra (like matrix multiplication) over Z_2^n? I think it would be useful for SAT solving.
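For reference, matrix-vector products over Z_2 reduce to bitwise AND plus the parity of a population count, and POPCNT-class instructions on modern CPUs make that cheap even without dedicated GF(2) matrix hardware. A minimal C++20 sketch, assuming rows and the vector are packed 64 columns per word (layout and names are illustrative):

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>
#include <vector>

// Sketch of y = A*x over Z_2: rows of A and the vector x are packed 64 columns
// per uint64_t. Each dot product over Z_2 is the parity of popcount(row & x):
// AND is multiplication mod 2, XOR of parities is summation mod 2.
std::vector<uint8_t> gf2_matvec(const std::vector<std::vector<uint64_t>>& rows,
                                const std::vector<uint64_t>& x) {
    std::vector<uint8_t> y(rows.size(), 0);
    for (size_t r = 0; r < rows.size(); ++r) {
        int parity = 0;
        for (size_t w = 0; w < x.size(); ++w)
            parity ^= std::popcount(rows[r][w] & x[w]) & 1;
        y[r] = static_cast<uint8_t>(parity);
    }
    return y;
}
```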

By @WiSaGaN - 1 day
I would expect research along these lines to pick up quite a bit if we confirm that the pretraining stage is not scaling as previously expected; the scale and architecture would then be more stable in the near future, especially if the focus shifts to inference-time scaling.
By @yalok - about 18 hours
So basically the idea is to pack 3 ternary weights (-1, 0, 1) into 5 bits instead of 6, but they compare the results with an fp16 model, which would use 48 bits for those 3 weights…

And the speedup comes from reduced memory I/O, offset a bit by the need to unpack these weights before using them…

Did I get this right?
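That reading seems plausible: three ternary weights have 3^3 = 27 combinations, which fit into 5 bits (32 codes) rather than the 6 bits of naive 2-bits-per-weight storage. A sketch of one such base-3 pack/unpack scheme (illustrative only; the actual layout in the paper's kernels may differ):

```cpp
#include <array>
#include <cstdint>

// Sketch: pack 3 ternary weights {-1, 0, +1} into 5 bits via a base-3 code.
// 27 possible combinations fit into 5 bits, vs. 6 bits at 2 bits per weight.
uint8_t pack3(int8_t w0, int8_t w1, int8_t w2) {
    // map each weight from {-1, 0, +1} to {0, 1, 2}, then combine in base 3
    auto u = [](int8_t w) { return static_cast<uint8_t>(w + 1); };
    return static_cast<uint8_t>(u(w0) + 3 * u(w1) + 9 * u(w2));  // 0..26
}

std::array<int8_t, 3> unpack3(uint8_t code) {
    // decode base-3 digits and shift back to {-1, 0, +1}; real kernels would
    // typically use a lookup table indexed by the 5-bit code instead
    return { static_cast<int8_t>(code % 3 - 1),
             static_cast<int8_t>((code / 3) % 3 - 1),
             static_cast<int8_t>((code / 9) % 3 - 1) };
}
```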

By @hidelooktropic - about 18 hours
Does anyone have the actual "this http url"?