March 5th, 2025

Writing an LLM from scratch, part 8 – trainable self-attention

The blog details the implementation of self-attention in large language models, emphasizing trainable weights, tokenization, context vectors, and scaled dot product attention, while reflecting on the author's learning journey.

The attention mechanism is effective because of its ability to weigh the importance of different tokens in a sequence based on their contextual relevance. In this eighth installment of the blog series on building a large language model (LLM) from scratch, the author discusses the implementation of self-attention with trainable weights, a crucial component for understanding how LLMs process language. The author outlines the steps involved in creating a self-attention mechanism, starting from tokenization and ending with context vectors that encapsulate the meaning of each token in relation to the others. The process involves mapping tokens to embeddings, generating position embeddings, and calculating attention scores through scaled dot product attention. This method allows the model to determine how much focus to place on each token when interpreting the input sequence. The author emphasizes the importance of understanding the mechanics behind these calculations, as they form the foundation for the model's ability to predict subsequent tokens. The post serves as both a personal reflection on the learning process and a guide for others navigating similar challenges in LLM development.
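
To make those steps concrete, here is a minimal PyTorch sketch of that pipeline; the dimensions, the toy token IDs, and names such as W_query, W_key, and W_value are illustrative assumptions rather than the author's exact code.

    # A minimal sketch of trainable self-attention (assumed layout, not the author's exact code)
    import torch

    torch.manual_seed(123)
    vocab_size, context_length, d_in, d_out = 50_000, 4, 8, 8

    # Step 1: map token IDs to token embeddings and add position embeddings
    token_ids = torch.tensor([[11, 42, 7, 99]])                     # (batch=1, seq_len=4), made-up IDs
    tok_emb = torch.nn.Embedding(vocab_size, d_in)
    pos_emb = torch.nn.Embedding(context_length, d_in)
    x = tok_emb(token_ids) + pos_emb(torch.arange(context_length))  # (1, 4, d_in)

    # Step 2: trainable projections producing queries, keys, and values
    W_query = torch.nn.Linear(d_in, d_out, bias=False)
    W_key = torch.nn.Linear(d_in, d_out, bias=False)
    W_value = torch.nn.Linear(d_in, d_out, bias=False)
    queries, keys, values = W_query(x), W_key(x), W_value(x)

    # Step 3: scaled dot product attention, softmax(Q K^T / sqrt(d_k)) V
    attn_scores = queries @ keys.transpose(-2, -1)                  # (1, 4, 4)
    attn_weights = torch.softmax(attn_scores / d_out ** 0.5, dim=-1)
    context_vectors = attn_weights @ values                         # (1, 4, d_out)
    print(attn_weights.shape, context_vectors.shape)

The printed shapes match the intuition above: one attention weight per pair of tokens, and one context vector per token.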

- The blog discusses the implementation of self-attention in LLMs, focusing on trainable weights.

- It outlines the steps from tokenization to generating context vectors for understanding language.

- Scaled dot product attention is highlighted as a key method for calculating attention scores (a toy calculation follows this list).

- The author reflects on the learning process and aims to clarify complex concepts for readers.
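
As a follow-up to the scaled dot product attention bullet, here is a toy calculation showing how raw dot products become attention weights; the vectors are invented purely for illustration.

    import torch

    # One query attending over three keys, with d_k = 4 (made-up vectors)
    q = torch.tensor([1.0, 0.0, 1.0, 0.0])
    K = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                      [0.0, 1.0, 0.0, 1.0],
                      [1.0, 1.0, 0.0, 0.0]])

    scores = K @ q                                      # raw dot products: [2., 0., 1.]
    weights = torch.softmax(scores / K.shape[-1] ** 0.5, dim=0)
    print(weights)                                      # ~[0.51, 0.19, 0.31]: most weight on the most similar key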

6 comments
By @andsoitis - about 1 month
> For me, it took eight re-reads of Raschka's (eminently clear and readable) explanation to get to a level where I felt I understood it.

It’s interesting to observe in oneself how repetition can result in internalizing new concepts. It is less about rote memorization and more about becoming aware of nuance and letting our minds “see” things from different angles, integrating them with existing world models through augmentation, replacement, or adjustment. The same holds for practicing activities that require some form of motor ability.

Some concepts are internalized less explicitly, like when we “learn” through role-modeling behaviors or feedback loops through interaction with people, objects, and ideas (like how to fit into a society).

By @penguin_booze - about 1 month
One sees "from scratch", and then one also sees

  from fancy_module import magic_functions
I'm semi-serious here, of course. To me, for something to be called 'from scratch', the requisite knowledge should be built from the ground up. To wit, I'd want to write the tokenizer myself, but I don't want to derive the laws of quantum physics that make the computation happen.

By @kureikain - about 1 month
In case anyone wants to read this book and lives in the Bay Area: you can also access O'Reilly Media online through your local library, and this book is available there.

By @ForOldHack - about 1 month
Part 8? Wait... Is this a story that wrote itself?

1) I am kidding. 2) At what point does it become self replicating? 3) skynet. 4) kidding - not kidding.