1-Bit AI Infrastructure
Recent advancements in 1-bit Large Language Models, particularly BitNet b1.58, improve efficiency, delivering speedups of 2.37x to 6.17x on x86 CPUs and enabling deployment across a broader range of devices.
Recent advancements in 1-bit Large Language Models (LLMs), particularly BitNet and BitNet b1.58, have shown potential for improving the efficiency of LLMs in terms of speed and energy consumption. This paper introduces a specialized software stack designed to maximize the capabilities of 1-bit LLMs, focusing on the development of kernels that enable fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs. Experimental results indicate significant performance improvements, with speedups ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, depending on the model size. The findings suggest that this approach not only boosts inference speed but also supports the deployment of LLMs on a wider array of devices, making them more accessible for local use. The code for this implementation is made available for further research and application. A minimal sketch of the ternary-kernel idea follows the key points below.
- 1-bit LLMs like BitNet b1.58 enhance efficiency in speed and energy use.
- A new software stack has been developed for fast and lossless inference on CPUs.
- Speed improvements range from 2.37x to 6.17x on x86 CPUs and 1.37x to 5.07x on ARM CPUs.
- The advancements enable broader deployment of LLMs across various devices.
- The code for the implementation is publicly accessible for further exploration.
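The core trick behind such kernels is that ternary weights in {-1, 0, +1} pack into 2-bit codes and turn the dot product into additions and subtractions only. The sketch below illustrates that idea in scalar C; it is not the paper's optimized kernels (which are vectorized and far more elaborate), and the packing scheme and function names are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustration of ternary (1.58-bit) inference on CPU.
 * Weights in {-1, 0, +1} are stored as 2-bit codes (00 = 0, 01 = +1, 10 = -1),
 * four per byte, so N weights occupy N/4 bytes instead of 2*N bytes at fp16.
 * The dot product needs no multiplications: each weight either adds,
 * subtracts, or skips the corresponding activation. */

/* Pack n ternary weights (values -1, 0, +1) into 2-bit codes. */
void pack_ternary(const int8_t *w, uint8_t *packed, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        uint8_t b = 0;
        for (size_t j = 0; j < 4 && i + j < n; j++) {
            uint8_t code = (w[i + j] == 0) ? 0u : (w[i + j] > 0 ? 1u : 2u);
            b |= code << (2 * j);
        }
        packed[i / 4] = b;
    }
}

/* Dot product of one packed ternary weight row with int8 activations x. */
int32_t ternary_dot(const uint8_t *packed, const int8_t *x, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t code = (packed[i / 4] >> (2 * (i % 4))) & 0x3;
        if (code == 1)      acc += x[i];   /* weight = +1 */
        else if (code == 2) acc -= x[i];   /* weight = -1 */
        /* code == 0: weight = 0, nothing to do */
    }
    return acc;
}
```

Because four weights fit in one byte, a row of N weights streams N/4 bytes from memory rather than 2N bytes at fp16, which is where most of the speedup comes from when CPU inference is memory-bound.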
Related
Fine-Tuning LLMs to 1.58bit
BitNet introduces extreme quantization for large language models, achieving 1.58 bits per parameter, enhancing efficiency and performance, particularly in fine-tuning Llama3 8B models while integrating into existing frameworks.
LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs
The paper presents an FPGA-based accelerator for large language models, achieving 14.3-15.8 times speedup and 6.1 times power efficiency, enhancing deployment in resource-constrained environments.
Serving 70B-Scale LLMs Efficiently on Low-Resource Edge Devices [pdf]
The paper presents TPI-LLM, a system for efficiently running 70B-scale LLMs on low-resource edge devices, reducing memory requirements by 90% and improving latency through tensor parallelism and local data handling.
Everything I've learned so far about running local LLMs
Local Large Language Models (LLMs) now run on modest hardware, enhancing accessibility. The llama.cpp software simplifies usage, while Hugging Face offers various models. Understanding specifications is vital for optimization.
1-bit architecture is turbocharging LLM efficiency
Microsoft Research's BitNet a4.8 improves one-bit large language models by combining quantization and sparsification, achieving 10-fold memory reduction and 2x speedup, enhancing on-device processing for privacy and security.
Now that I have done more than enough CPU design inside FPGAs, I wanted to try something new: some computation-heavy work that could benefit from an FPGA. Does anyone here know how feasible it would be to implement something like that on an FPGA? I only have rather small chips (an Artix-7 35T and a PolarFire SoC with 95k logic slices), so I know I won't be able to fit a full LLM into them, but something should be possible.
Maybe I should refresh the fundamentals first and start with MNIST. But the real question is: what is a realistic goal that I could reach with these small FPGAs? Performance might be secondary; I am more interested in what's possible in terms of complexity/features on a small device.
Also, has anyone here compiled OpenCL (or OpenGL?) kernels for FPGAs and can give me a starting point? I was wondering if it's possible to have a working backend for something like tinygrad[1]. I think this would be a good way to learn the different layers of how such frameworks actually work.
Anyway, I wonder if there is any hardware support in modern CPUs/GPUs for linear algebra (like matrix multiplication) over Z_2^n? I think it would be useful for SAT solving.
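For what it's worth, over GF(2) (i.e. Z_2) addition is XOR and multiplication is AND, so a bit-packed matrix-vector product needs only AND plus a parity reduction, which POPCNT-style instructions handle well on ordinary CPUs; x86's GFNI extension (GF2P8AFFINEQB) also performs 8x8 bit-matrix transforms over GF(2). A minimal sketch, where the function name and row layout are my own choices for illustration:

```c
#include <stdint.h>

/* Matrix-vector product over GF(2) using ordinary integer instructions.
 * Multiplication in GF(2) is AND and addition is XOR, so the dot product
 * of a 64-bit row with a 64-bit vector is the parity of popcount(row & x).
 * __builtin_parityll is available on GCC/Clang and compiles to POPCNT-based
 * code on CPUs that support it. Limited to at most 64 rows and 64 columns. */
uint64_t gf2_matvec(const uint64_t *A, uint64_t x, int rows) {
    uint64_t y = 0;
    for (int i = 0; i < rows; i++) {
        uint64_t bit = __builtin_parityll(A[i] & x); /* AND, then XOR-reduce */
        y |= bit << i;                               /* result bit i of y = A*x */
    }
    return y;
}
```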
And the speed-up comes from reduced memory I/O, offset a bit by the need to unpack these weights before using them…
Did I get this right?
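A rough back-of-the-envelope of that memory-I/O argument, using made-up model size and bandwidth figures rather than numbers from the paper: when decoding is memory-bound, tokens per second is roughly memory bandwidth divided by the bytes of weights streamed per token, and 2-bit packed ternary weights are 8x smaller than fp16, while unpacking costs only a few shift/mask ALU operations per weight.

```c
#include <stdio.h>

/* Back-of-the-envelope for a memory-bound decode step (illustrative,
 * hypothetical numbers): each generated token streams all weights once,
 * so tokens/s is roughly bandwidth / model size in bytes. */
int main(void) {
    double params = 3e9;                  /* hypothetical 3B-parameter model */
    double bw = 50e9;                     /* hypothetical 50 GB/s DRAM bandwidth */
    double fp16_bytes = params * 2.0;     /* 16 bits per weight */
    double ternary_bytes = params * 0.25; /* 2 bits per weight, packed */
    printf("fp16:    ~%.1f tokens/s\n", bw / fp16_bytes);
    printf("ternary: ~%.1f tokens/s\n", bw / ternary_bytes);
    return 0;
}
```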