September 9th, 2024

Launch HN: Deepsilicon (YC S24) – Software and hardware for ternary transformers

Abhi and Alex from deepsilicon are developing software and custom silicon for ternary transformer models to improve performance and efficiency and reduce hardware demands, and they are seeking feedback on their approach and on interest in deploying such models.

Abhi and Alex from deepsilicon are developing software and hardware for training and running inference on ternary transformer models, which are designed to tame the growing hardware demands of large transformers. Restricting weights to ternary values allows significant compression and cheaper arithmetic, which can improve performance even on existing hardware. They have found, however, that current hardware is not optimized for low bit-width operations, which limits the speed of their implementations. By building custom silicon tailored to ternary large language models (LLMs), they aim to improve inference efficiency, including VRAM usage and throughput. Their work is inspired by Microsoft's BitNet paper, and they plan to open-source their framework for training and data generation. The ultimate goal is custom silicon that offers better compression, throughput, latency, and energy efficiency than existing solutions. They acknowledge the market challenges, particularly the difficulty of persuading companies to switch hardware away from established software infrastructures, and they are seeking feedback on their approach as well as gauging interest in deploying these models.

- deepsilicon is focused on ternary transformer models to reduce hardware demands.

- Their method allows for significant weight compression and improved arithmetic efficiency.

- Custom silicon is being developed to enhance inference performance for ternary LLMs.

- They aim to open-source their training framework and address market challenges.

- Feedback and interest in deploying their models are welcomed.
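
As a rough illustration of where the weight compression comes from, here is a minimal NumPy sketch of BitNet-style absmean quantization plus 2-bit packing (a generic illustration, not deepsilicon's code): each fp16 weight collapses to one of {-1, 0, +1}, and four such values pack into one byte.

  import numpy as np

  def ternarize(w: np.ndarray):
      # Absmean quantization (as in BitNet b1.58): scale by the mean |w|,
      # then round each weight to -1, 0, or +1.
      scale = np.abs(w).mean() + 1e-8
      q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
      return q, scale

  def pack_2bit(q: np.ndarray) -> np.ndarray:
      # Map {-1, 0, +1} -> {0, 1, 2} and pack four values per byte.
      u = (q.reshape(-1, 4) + 1).astype(np.uint8)
      return u[:, 0] | (u[:, 1] << 2) | (u[:, 2] << 4) | (u[:, 3] << 6)

  w = np.random.randn(4096, 4096).astype(np.float16)
  q, scale = ternarize(w.astype(np.float32))
  packed = pack_2bit(q)
  print(w.nbytes / packed.nbytes)  # ~8x (fp16 -> 2 bits per weight)

Five trits would actually fit in a byte (3^5 = 243 < 256), so denser packings are possible, but 2 bits per weight is the simple scheme and already lands near the roughly 8x figure discussed in the thread.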

Related

Researchers run high-performing LLM on the energy needed to power a lightbulb

Researchers at UC Santa Cruz developed an energy-efficient method for large language models. By using custom hardware and ternary numbers, they achieved high performance with minimal power consumption, potentially revolutionizing model power efficiency.

Etched Is Making the Biggest Bet in AI

Etched is making a concentrated bet on transformers with Sohu, a chip specialized for transformer inference that gives up support for architectures like DLRMs and CNNs. By committing to a single architecture, Sohu aims to run transformer models like ChatGPT far more efficiently than general-purpose hardware.

Intel vs. Samsung vs. TSMC

Competition intensifies among Intel, Samsung, and TSMC in the foundry industry. Focus on 3D transistors, AI/ML applications, and chiplet assemblies drives advancements in chip technology for high-performance, low-power solutions.

The Ternary Computing Manifesto

Douglas W. Jones advocates for ternary computing to boost security and cut data leakage. Ternary logic offers efficient data representation, potentially reducing malware threats and enhancing computer architecture with smaller wiring. Jones explores fast addition, heptavintimal encoding, and ternary data types, proposing Trillium and Tritium architectures for future systems.

Hardware Acceleration of LLMs: A comprehensive survey and comparison

The paper reviews hardware acceleration techniques for Large Language Models, comparing frameworks across platforms like FPGA and GPU, addressing evaluation challenges, and contributing to advancements in natural language processing.

AI: What people are saying
The discussion around Abhi and Alex's custom silicon for ternary transformer models reveals several key themes and concerns.
  • Performance vs. Existing Solutions: Many commenters question whether the performance improvements will be significant enough to entice developers away from established solutions like CUDA and Nvidia.
  • Technical Challenges: There are concerns about the technical hurdles of implementing ternary models efficiently, particularly regarding non-linear layers and the potential need for more ternary weights.
  • Market Viability: Commenters express skepticism about breaking into competitive markets, especially in sectors like defense and automotive, where precision is critical.
  • Power Efficiency: The potential for low power consumption in edge applications is highlighted as a significant advantage of the technology.
  • Feedback on Presentation: Some users suggest improvements to the video demo, noting issues with cropping and usability.
28 comments
By @danjl - 3 months
In my experience trying to switch VFX companies from CPU-based rendering to GPU-based rendering 10+ years ago, a 2-5x performance improvement wasn't enough. We even provided a compatible renderer that accepted Renderman files and generated matching images. Given the rate of improvement of standard hardware (CPUs in our case, and GPU-based inference in yours), a 2-5x improvement will only last a few years, and the effort to get there is large (even larger in your case). Plus, I doubt you'll be able to get your HW everywhere (i.e. mobile) where inference is important, which means they'll need to support both their existing SW stack and your new one.

The other issue is entirely non-technical, and may be an even bigger blocker -- switching the infrastructure of a major LLM provider to a new upstart is just plain risky. If you do a fantastic job, though, you should get acquihired, probably with a small individual bonus, not enough to pay off your investors.
By @cs702 - 3 months
Watching the video demo was key for me. I highly recommend everyone else here watch it.[a]

From a software development standpoint, usability looks great, requiring only one import,

  import deepsilicon as ds
and then, later on, a single line of Python,

  model = ds.convert(model)
which takes care of converting all possible layers (e.g., nn.Linear layers) in the model to use ternary values. Very nice!
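
For the curious, here is a rough guess at what a conversion pass like this could be doing under the hood: walk the module tree and swap each nn.Linear for a ternary replacement (the TernaryLinear module below is hypothetical, not deepsilicon's actual implementation).

  import torch
  import torch.nn as nn

  class TernaryLinear(nn.Module):
      # Hypothetical stand-in: absmean-quantize an existing nn.Linear's
      # weights to {-1, 0, +1}. Real hardware would keep the packed form;
      # here we just dequantize on the fly for the matmul.
      def __init__(self, linear: nn.Linear):
          super().__init__()
          w = linear.weight.data
          self.scale = w.abs().mean() + 1e-8
          self.register_buffer("wq", torch.clamp(torch.round(w / self.scale), -1, 1))
          self.bias = linear.bias

      def forward(self, x):
          return nn.functional.linear(x, self.wq * self.scale, self.bias)

  def convert(model: nn.Module) -> nn.Module:
      # Walk the module tree, replacing every nn.Linear in place.
      for name, child in model.named_children():
          if isinstance(child, nn.Linear):
              setattr(model, name, TernaryLinear(child))
          else:
              convert(child)
      return model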

The question for which I don't have a good answer is whether the improvement in real-world performance, using your hardware, will be sufficient to entice developers to leave the comfortable garden of CUDA and Nvidia, given that the latter is continually improving the performance of its hardware.

I, for one, hope you guys are hugely successful.

---

[a] At the moment, the YouTube video demo has some cropping issues, but that can be easily fixed.

By @0xDA7A - 3 months
I think the part I find most interesting about this is the potential power implications. Ternary models may perform better in terms of RAM and that's great, but if you manage to build a multiplication-free accelerator in silicon, you can start thinking about running things like vision models in < 0.1W of power.

This could have insane implications for edge capabilities, robots with massively better swarm dynamics, smart glasses with super low latency speech to text, etc.

I think the biggest technical hurdle would be simulating the non-linear layers in an efficient way, but you can also solve that since you already re-train your models and could use custom activation functions that better approximate a HW-efficient non-linear layer.
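
To make the multiplication-free point concrete, here is a minimal NumPy sketch (an illustration, not anyone's shipping kernel): with weights restricted to {-1, 0, +1}, a matrix-vector product needs only selective adds and subtracts, exactly the kind of datapath that gets cheap and low-power in silicon.

  import numpy as np

  def ternary_matvec(Wq, x, scale):
      # Wq holds only -1, 0, +1, so each output element is just a sum of
      # selected +x[i] and -x[i] terms -- sign flips and adds, no multiplier.
      pos = np.where(Wq == 1, x, 0.0).sum(axis=1)
      neg = np.where(Wq == -1, x, 0.0).sum(axis=1)
      return scale * (pos - neg)

  Wq = np.random.choice([-1, 0, 1], size=(8, 16)).astype(np.int8)
  x = np.random.randn(16).astype(np.float32)
  assert np.allclose(ternary_matvec(Wq, x, 1.0), Wq.astype(np.float32) @ x, atol=1e-5)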

By @jacobgorm - 3 months
I was part of a startup called Grazper that did the same thing for CNNs in 2016, using FPGAs. I left to found my own thing after realizing that new, better architectures, SqueezeNet followed by MobileNets, could run even faster than our ternary nets on off-the-shelf hardware. I'd worry that a similar development might happen in the LLM space.
By @nicoty - 3 months
Could the compression efficiency you're seeing somehow be related to 3 being the closest natural number to the number e, which also happens to be the optimal radix choice (https://en.wikipedia.org/wiki/Optimal_radix_choice) for storage efficiency?
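
For reference, the figure of merit in that radix-economy article is b/ln(b): e minimizes it over the reals and 3 over the integers. A quick check:

  import math

  # Representing N values in base b takes ~log_b(N) digits of b states each,
  # so the "cost" scales as b * log_b(N) = (b / ln b) * ln N.
  for b in (2, 3, 4, math.e):
      print(f"{b:.3f}  {b / math.log(b):.3f}")
  # base 3 (~2.731) beats bases 2 and 4 (~2.885); e itself is ~2.718
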
By @nostrebored - 3 months
What do you think about the tension between inference accuracy and the types of edge applications used today?

For instance, if you wanted to train a multimodal transformer to do inference on CCTV footage, I think this will have a big advantage over a Jetson. And I think there are a lot of potentially novel use cases for a technology like that (e.g., if I'm looking for a suspect wearing a red hoodie, I'm not training a new classifier to identify all possible candidates).

But for sectors like automotive and defense, is the accuracy loss from quantization tolerable? If you're investing so much money in putting together a model, even considering procuring custom hardware and software, is the loss in precision worth it?

By @henning - 3 months
I applaud the chutzpah of doing a company where you develop both hardware and software for the hardware. If you execute well, you could build yourself a moat that is very difficult for would-be competitors to breach.
By @sidcool - 3 months
Congrats on launching. This is inspiring.
By @transfire - 3 months
Combine it with TOC, and then you’d really be off to the races!

https://intapi.sciendo.com/pdf/10.2478/ijanmc-2022-0036#:~:t...

By @Havoc - 3 months
> This represents an almost 8x compression ratio for every weight matrix in the transformer model

Surely you’d need more ternary weights, though, to achieve the same performance outcome?

A bit like how a Q4 quant is smaller than a Q8 but also tangibly worse, so the “compression” isn’t really like for like.

Either way, excited about more ternary progress.

By @stephen_cagle - 3 months
Is one expectation of moving from a 2^16-state (16-bit) parameter to a tri-state one that the tri-state model will only need to learn those of the 2^16 states that were actually significant? I.e., can we prune the "extra" bits from the 2^16 that did not really affect the result?
By @mikewarot - 3 months
Since you're flexible on the silicon side, perhaps consider designing things so that the ternary weights are loaded from an external configuration ROM into a shift-register chain, instead of being fixed. This would allow updating the weights without having to go through the whole production chain again.
By @tejasvaidhya - 3 months
There’s more to it. https://x.com/NolanoOrg/status/1813969329308021167

I will be archiving the full report with more results soon.

By @99112000 - 3 months
An area worth exploring is IP cameras, imho:

1. They are everywhere and aren't going anywhere.
2. The network infrastructure needed to ingest and analyze video from thousands of cameras is very demanding.
3. Low power and low latency scream ASIC to me.

By @bjornsing - 3 months
Have you tried implementing your ternary transformers on AVX(-512)? I think it fits relatively well with the hardware philosophy, and being able to run inference without a GPU would be a big plus.
By @marmaduke - 3 months
What kind of code did you try on the CPU for, say, ternary GEMM? I imagine ternary values map nicely to vectorized mask instructions, and that much of the tiling etc. from a usual GEMM carries over.
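
For what it's worth, here is a sketch of how that mask intuition could play out (NumPy boolean-mask accumulates standing in for roughly what AVX-512 masked add/subtract instructions would do over 64-lane tiles; an illustration, not a tuned kernel):

  import numpy as np

  def ternary_gemm(Wp, Wm, X, scale):
      # Wp / Wm: boolean masks of shape (M, K) marking weights equal to +1 / -1.
      # X: (K, N) integer activations. Accumulate in int64, tiling along K.
      M, K = Wp.shape
      acc = np.zeros((M, X.shape[1]), dtype=np.int64)
      TK = 64  # one mask register's worth of lanes, conceptually
      for k0 in range(0, K, TK):
          xs = X[k0:k0 + TK]
          for i in range(M):
              acc[i] += xs[Wp[i, k0:k0 + TK]].sum(axis=0)  # masked add
              acc[i] -= xs[Wm[i, k0:k0 + TK]].sum(axis=0)  # masked subtract
      return scale * acc

  Wq = np.random.choice([-1, 0, 1], size=(16, 128))
  X = np.random.randint(-8, 8, size=(128, 4))
  assert np.allclose(ternary_gemm(Wq == 1, Wq == -1, X, 1.0), Wq @ X)
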
By @dnnssl2 - 3 months
What is the upper bound on the level of improvement (high performance networking, memory and compute) you can achieve with ternary weights?
By @maratc - 3 months
Is there a possibility where this can run on a specialized hardware which is neither a CPU nor GPU, e.g. NextSilicon Maverick chips?
By @lappa - 3 months
Great project, looking forward to seeing more as this develops.

Also FYI, your mail server seems to be down.

By @ccamrobertson - 3 months
Congrats, always cool to see YC founders working on silicon!
By @luke-stanley - 3 months
The most popular interfaces (human, API and network) I can imagine are ChatGPT, OpenAI compatible HTTP API, Transformers HuggingFace API and models, Llama.cpp / Ollama / Llamafile, Pytorch. USB C, USB A, RJ45, HDMI/video(?) If you can run a frontier model or a comparable model with the ChatGPT clone like Open UI, with a USB or LAN interface, that can work on private data quickly, securely and competitively to a used 3090 it would be super badass. It should be easy to plug in and be used for running chat or API use or fine-tune or use with raw primitives via Pytorch or a very similar compatible API. I've thought about this a bit. There's more I could say but I've got to sleep soon... Good luck, it's an awesome opportunity.
By @anirudhrahul - 3 months
Can this run crysis?
By @Taniwha - 3 months
Yeah, I've been thinking about this problem for a while at the gate-design level. The problem essentially breaks down to a couple of pop counts and a subtract, and it's eminently pipelineable.
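
One concrete reading of the two-pop-counts-and-a-subtract idea, sketched in Python rather than gates, assuming sign-bit (+/-1) activations and weights kept as separate +1/-1 bitmasks (the activation encoding is an assumption, since the comment doesn't specify one):

  def popcount(x: int) -> int:
      return bin(x).count("1")

  def ternary_dot(P: int, M: int, A: int) -> int:
      # P: bitmask of weights equal to +1; M: bitmask of weights equal to -1.
      # A: bitmask of activations equal to +1 (a 0 bit means -1).
      #   sum over w=+1 of x  =  2*popcount(P & A) - popcount(P)
      #   sum over w=-1 of x  =  2*popcount(M & A) - popcount(M)
      # popcount(P) and popcount(M) are constants of the weights, so the
      # per-input work is essentially two popcounts and a subtract.
      return 2 * (popcount(P & A) - popcount(M & A)) - (popcount(P) - popcount(M))

  # w = [+1, -1, 0, +1], x = [+1, +1, -1, -1]  ->  1 - 1 + 0 - 1 = -1
  assert ternary_dot(P=0b1001, M=0b0010, A=0b0011) == -1
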
By @hy3na - 3 months
Ternary transformers have existed for a long time before you guys (TerDit, the vision ones, etc.). Competing in the edge inference space is likely going to require a lot of capex and opex, plus breaking into markets like defense that are hard asf without connections and a strong team. Neither of you is a chip architect either, and taping out silicon requires a lot of foresight about changing market demands. Good luck, hopefully it works out.
By @felarof - 3 months
Very interesting!
By @_zoltan_ - 3 months
you might want to redo the video as it's cropped too much, and maybe it's only me but it's _really_ annoying to watch like this.