Launch HN: Deepsilicon (YC S24) – Software and hardware for ternary transformers
Abhi and Alex from deepsilicon are building software and custom silicon for ternary transformer models to improve performance and efficiency while reducing hardware demands, and they are seeking feedback on their approach and interest in deployment.
Abhi and Alex from deepsilicon are developing software and hardware for training and running inference on ternary transformer models, aimed at the growing hardware demands of large transformer models. Representing weights with ternary values allows substantial weight compression and lower arithmetic intensity, which can already improve performance on existing hardware; however, current hardware is not optimized for low bit-width operations, and that limits how fast their implementations can run. By designing custom silicon tailored to ternary large language models (LLMs), they aim to improve inference efficiency, including VRAM usage and throughput. Their work is inspired by Microsoft's BitNet paper, and they plan to open-source their framework for training and data generation. The ultimate goal is custom silicon that beats existing solutions on compression, throughput, latency, and energy efficiency. They acknowledge the market challenges, particularly how hard it is to persuade companies to switch hardware given entrenched software infrastructure, and they are seeking feedback on their approach as well as interest in deploying these models.
- deepsilicon is focused on ternary transformer models to reduce hardware demands.
- Their method allows for significant weight compression and improved arithmetic efficiency (a code sketch follows below).
- Custom silicon is being developed to enhance inference performance for ternary LLMs.
- They aim to open-source their training framework and address market challenges.
- Feedback and interest in deploying their models are welcomed.
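To make the ternary idea concrete, here is a minimal sketch of absmean ternarization in the style of the BitNet b1.58 paper: weights are scaled by their mean absolute value, then rounded and clipped to {-1, 0, +1}. The function name is an assumption for illustration, not deepsilicon's API.

    import torch

    def ternarize_weights(w, eps=1e-6):
        # Absmean ternarization (BitNet b1.58 style): scale by the mean
        # absolute weight, then round and clip to {-1, 0, +1}.
        scale = w.abs().mean().clamp(min=eps)
        return (w / scale).round().clamp_(-1, 1), scale

    w = torch.randn(4096, 4096)   # full-precision weights
    x = torch.randn(1, 4096)      # one activation vector
    w_t, scale = ternarize_weights(w)

    # On ternary hardware the matmul reduces to additions and sign flips;
    # here it is just emulated numerically.
    y = (x @ w_t.t()) * scale

Since each weight takes one of only three values, it can be packed at about log2(3) ≈ 1.58 bits, versus 16 bits for fp16 — the source of the compression and bandwidth savings described above.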
Related
Researchers run high-performing LLM on the energy needed to power a lightbulb
Researchers at UC Santa Cruz developed an energy-efficient method for large language models. By using custom hardware and ternary numbers, they achieved high performance with minimal power consumption, potentially revolutionizing model power efficiency.
Etched Is Making the Biggest Bet in AI
Etched is betting on AI with Sohu, a chip specialized for transformers rather than traditional architectures like DLRMs and CNNs. Sohu accelerates transformer models like ChatGPT, a bet that transformers will dominate the path to more capable AI.
Intel vs. Samsung vs. TSMC
Competition intensifies among Intel, Samsung, and TSMC in the foundry industry. Focus on 3D transistors, AI/ML applications, and chiplet assemblies drives advancements in chip technology for high-performance, low-power solutions.
The Ternary Computing Manifesto
Douglas W. Jones advocates for ternary computing to boost security and cut data leakage. Ternary logic offers efficient data representation, potentially reducing malware threats and enhancing computer architecture with smaller wiring. Jones explores fast addition, heptavintimal encoding, and ternary data types, proposing Trillium and Tritium architectures for future systems.
Hardware Acceleration of LLMs: A comprehensive survey and comparison
The paper reviews hardware acceleration techniques for Large Language Models, comparing frameworks across platforms like FPGA and GPU, addressing evaluation challenges, and contributing to advancements in natural language processing.
- Performance vs. Existing Solutions: Many commenters question whether the performance improvements will be significant enough to entice developers away from established solutions like CUDA and Nvidia.
- Technical Challenges: There are concerns about the technical hurdles of implementing ternary models efficiently, particularly regarding non-linear layers and the potential need for more ternary weights.
- Market Viability: Commenters express skepticism about breaking into competitive markets, especially in sectors like defense and automotive, where precision is critical.
- Power Efficiency: The potential for low power consumption in edge applications is highlighted as a significant advantage of the technology.
- Feedback on Presentation: Some users suggest improvements to the video demo, noting issues with cropping and usability.
From a software development standpoint, usability looks great, requiring only one import,

    import deepsilicon as ds

and then, later on, a single line of Python,

    model = ds.convert(model)

which takes care of converting all possible layers (e.g., nn.Linear layers) in the model to use ternary values. Very nice!

The question for which I don't have a good answer is whether the improvement in real-world performance, using your hardware, will be sufficient to entice developers to leave the comfortable garden of CUDA and Nvidia, given that the latter is continually improving the performance of its hardware.
I, for one, hope you guys are hugely successful.
---
[a] At the moment, the YouTube video demo has some cropping issues, but that can be easily fixed.
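As an aside on what a convert-style call like ds.convert(model) might do under the hood: a common pattern is to walk the module tree and swap each supported layer for a quantized equivalent. The sketch below is purely illustrative — TernaryLinear and convert are hypothetical names, not deepsilicon's actual implementation — and reuses the absmean ternarization shown earlier.

    import torch
    import torch.nn as nn

    class TernaryLinear(nn.Module):
        # Hypothetical drop-in replacement for nn.Linear holding ternary
        # weights plus one per-tensor scale (absmean ternarization).
        def __init__(self, linear: nn.Linear):
            super().__init__()
            w = linear.weight.data
            scale = w.abs().mean().clamp(min=1e-6)
            self.register_buffer("weight", (w / scale).round().clamp_(-1, 1))
            self.register_buffer("scale", scale)
            self.bias = linear.bias

        def forward(self, x):
            y = x @ self.weight.t() * self.scale
            return y if self.bias is None else y + self.bias

    def convert(model: nn.Module) -> nn.Module:
        # Recursively swap every nn.Linear for a TernaryLinear; other
        # layer types are left untouched in this sketch.
        for name, child in model.named_children():
            if isinstance(child, nn.Linear):
                setattr(model, name, TernaryLinear(child))
            else:
                convert(child)
        return model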
This could have insane implications for edge capabilities, robots with massively better swarm dynamics, smart glasses with super low latency speech to text, etc.
I think the biggest technical hurdle would be simulating the non-linear layers in an efficient way, but you can also solve that since you already re-train your models and could use custom activation functions that better approximate a HW-efficient non-linear layer.
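As one concrete (and purely illustrative) option along those lines: during retraining, smooth activations such as GELU or SiLU can be swapped for the piecewise-linear Hardswish, which has a similar shape but needs only comparisons, additions, and constant multiplies rather than exp/erf.

    import torch
    import torch.nn as nn

    x = torch.linspace(-4.0, 4.0, steps=9)

    gelu = nn.GELU()        # smooth, needs erf/exp
    hard = nn.Hardswish()   # piecewise linear: x * clamp(x + 3, 0, 6) / 6

    print(gelu(x))
    print(hard(x))          # similar shape, far cheaper in fixed-function hardware

Because the model is re-trained anyway, it can adapt to whatever hardware-friendly approximation is chosen.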
For instance, if you wanted to train a multimodal transformer to do inference on CCTV footage, I think this will have a big advantage over a Jetson. And I think there are a lot of potentially novel use cases for a technology like that (e.g., if I'm looking for a suspect wearing a red hoodie, I'm not training a new classifier to identify all possible candidates).
But for sectors like automotive and defense, is the accuracy loss from quantization tolerable? If you're investing so much money in putting together a model, even considering procuring custom hardware and software, is the loss in precision worth it?
https://intapi.sciendo.com/pdf/10.2478/ijanmc-2022-0036#:~:t...
Surely you’d need more ternary weights, though, to achieve the same performance outcome?
A bit like how a Q4 quant is smaller than a Q8 but also tangibly worse, so the “compression” isn’t really like for like.
Either way, excited about more ternary progress.
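For a back-of-the-envelope sense of the headroom (weights only, an illustrative 7B-parameter model, ~1.58 bits per ternary weight since log2(3) ≈ 1.585, packing overhead ignored):

    params = 7e9                                              # illustrative 7B model
    bits = {"fp16": 16, "Q8": 8, "Q4": 4, "ternary": 1.58}    # bits per weight

    for name, b in bits.items():
        print(f"{name:>8}: {params * b / 8 / 1e9:.1f} GB")
    # fp16 ~14.0 GB, Q8 ~7.0 GB, Q4 ~3.5 GB, ternary ~1.4 GB

So even if a ternary model needs noticeably more parameters to match a Q4 model of comparable quality, there is substantial headroom before the memory advantage disappears.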
I will be archiving the full report with more results soon.
1. They are everywhere and aren't going anywhere.
2. Network infrastructure to ingest and analyze thousands of cameras producing video footage is very demanding.
3. Low power and low latency scream ASIC to me.
Also FYI, your mail server seems to be down.