June 20th, 2024

Optimizing AI Inference at Character.ai

Character.AI optimizes AI inference for LLMs, handling more than 20,000 queries per second globally. Innovations such as Multi-Query Attention, stateful KV caching, and int8 quantization have reduced serving costs by a factor of 33 since late 2022, part of the company's aim to make LLM-based experiences available worldwide.

Character.AI is focused on optimizing AI inference so that large language models (LLMs) can become part of daily life. Serving a global audience efficiently, they currently handle more than 20,000 queries per second. Memory-efficient architecture techniques such as Multi-Query Attention and hybrid attention horizons (interleaving local, sliding-window attention layers with global ones) significantly reduce KV cache size without compromising quality. A stateful caching system keeps attention KV tensors in host memory between chat turns and achieves a 95% cache hit rate. They also quantize to int8 for both training and serving, improving efficiency and reducing costs. Together, these innovations have reduced serving costs by a factor of 33 since late 2022. Character.AI envisions a future where LLMs drive innovation and improve experiences globally, and invites others to join them in advancing the capabilities of AI systems.
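The post itself doesn't include code, but a minimal PyTorch sketch can illustrate why Multi-Query Attention shrinks the KV cache: all query heads share a single key/value head, so the cache holds one K/V pair per token instead of one per head. The dimensions below are hypothetical, not Character.AI's actual configuration.

```python
import torch
import torch.nn.functional as F

# Hypothetical dimensions, not Character.AI's real config.
batch, seq_len, n_heads, head_dim = 1, 512, 16, 128

# Standard multi-head attention caches K and V for every head:
mha_cache = torch.zeros(2, batch, n_heads, seq_len, head_dim)

# Multi-Query Attention caches a single shared K/V head:
mqa_cache = torch.zeros(2, batch, 1, seq_len, head_dim)

print(mha_cache.numel() // mqa_cache.numel())  # 16: an n_heads-fold reduction

def mqa_attention(q, k, v):
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, 1, seq, head_dim).
    # The single K/V head broadcasts across all query heads.
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(batch, n_heads, seq_len, head_dim)
k, v = mqa_cache[0], mqa_cache[1]
out = mqa_attention(q, k, v)  # (batch, n_heads, seq_len, head_dim)
```

Hybrid attention horizons attack the same cache from the other direction: layers that use a local, sliding-window attention only need to cache keys and values for the window rather than the full sequence.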

Related

We no longer use LangChain for building our AI agents

Octomind switched from LangChain due to its inflexibility and excessive abstractions, opting for modular building blocks instead. This change simplified their codebase, increased productivity, and emphasized the importance of well-designed abstractions in AI development.

GitHub – Karpathy/LLM101n: LLM101n: Let's Build a Storyteller

The GitHub repository "LLM101n: Let's build a Storyteller" offers a course on creating a Storyteller AI Large Language Model using Python, C, and CUDA. It caters to beginners, covering language modeling, deployment, programming, data types, deep learning, and neural nets. Additional chapters and appendices are available for further exploration.

LibreChat: Enhanced ChatGPT clone for self-hosting

LibreChat introduces a new Resources Hub, featuring a customizable AI chat platform supporting various providers and services. It aims to streamline AI interactions, offering documentation, blogs, and demos for users.

Lessons About the Human Mind from Artificial Intelligence

In 2022, a Google engineer claimed AI chatbot LaMDA was self-aware, but further scrutiny revealed it mimicked human-like responses without true understanding. This incident underscores AI limitations in comprehension and originality.

Francois Chollet – LLMs won't lead to AGI – $1M Prize to find solution [video]

The video discusses the limitations of large language models, arguing that genuine understanding and problem-solving are what matter; a $1M prize incentivizes AI systems that demonstrate these abilities. Adaptability and efficient knowledge acquisition are highlighted as crucial for true intelligence.

4 comments
By @hackernewds - 4 months
Noam has been cooking at character.ai in stealth. Their model is impressively engaging.
By @eachro - 4 months
Training in int8 is notable (to me). I've been out of date with ML research for a bit now, but last I recall people were mostly training at full precision, quantizing after training, and then fine-tuning a bit on the quantized model.
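The distinction the commenter draws is between post-training quantization (train in fp32/bf16, round the weights to int8 afterwards) and training with int8 in the loop. The blog post describes native int8 training but gives no recipe; the sketch below shows one common technique in that family, quantization-aware training with a straight-through estimator, purely as an illustration and not Character.AI's actual method.

```python
import torch

def quantize_int8(x):
    # Symmetric per-tensor quantization onto the int8 range [-127, 127].
    scale = x.abs().max() / 127.0
    return torch.clamp(torch.round(x / scale), -127, 127), scale

def fake_quant(x):
    # The forward pass sees the quantize->dequantize round trip; the
    # backward pass treats it as identity (straight-through estimator),
    # so training "feels" int8 rounding error while gradients still flow.
    q, scale = quantize_int8(x.detach())
    return x + (q * scale - x).detach()

w = torch.randn(4, 4, requires_grad=True)
loss = fake_quant(w).square().sum()
loss.backward()
print(w.grad is not None)  # True: gradients pass through the rounding
```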
By @janalsncm - 4 months
> we implemented customized int8 kernels for matrix multiplications and attention

I would be curious how this differs from [1], which is supported in Hugging Face's transformers library.

[1] https://arxiv.org/abs/2208.07339
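[1] is LLM.int8(), which quantizes activations row-wise and weights column-wise ("vector-wise" quantization) and routes outlier feature dimensions through a separate fp16 matmul. As a point of comparison only, here is a sketch of the plain vector-wise int8 matmul path; it is an assumption-laden illustration, not Character.AI's kernel or the paper's full method.

```python
import torch

def int8_matmul(a, b):
    # a: (m, k) float activations; b: (k, n) float weights.
    # Per-row scales for activations, per-column scales for weights
    # (the vector-wise scheme; epsilon guards all-zero vectors).
    a_scale = a.abs().amax(dim=1, keepdim=True) / 127.0 + 1e-8
    b_scale = b.abs().amax(dim=0, keepdim=True) / 127.0 + 1e-8
    a_q = torch.round(a / a_scale).to(torch.int8)
    b_q = torch.round(b / b_scale).to(torch.int8)
    # Accumulate in int32, then rescale back to float.
    acc = a_q.to(torch.int32) @ b_q.to(torch.int32)
    return acc.float() * a_scale * b_scale

a, b = torch.randn(8, 16), torch.randn(16, 4)
print((int8_matmul(a, b) - a @ b).abs().max())  # small quantization error
```

LLM.int8() additionally pulls activation columns containing large-magnitude outliers out of this path and multiplies them in fp16; the blog post doesn't say whether Character.AI's custom kernels do anything similar.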