March 5th, 2025

Writing an LLM from scratch, part 8 – trainable self-attention

The blog details the implementation of self-attention in large language models, emphasizing trainable weights, tokenization, context vectors, and scaled dot product attention, while reflecting on the author's learning journey.

The attention mechanism is effective because of its ability to weigh the importance of different tokens in a sequence based on their contextual relevance. In this eighth installment of the blog series on building a large language model (LLM) from scratch, the author discusses the implementation of self-attention with trainable weights, a crucial component for understanding how LLMs process language. The author outlines the steps involved in creating a self-attention mechanism, starting from tokenization and ending with context vectors that encapsulate the meaning of each token in relation to the others. The process involves mapping tokens to embeddings, generating position embeddings, and calculating attention scores through scaled dot product attention. This method allows the model to determine how much focus to place on each token when interpreting the input sequence. The author emphasizes the importance of understanding the mechanics behind these calculations, as they form the foundation for the model's ability to predict subsequent tokens. The post serves as both a personal reflection on the learning process and a guide for others navigating similar challenges in LLM development.
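
To make those steps concrete, here is a minimal PyTorch sketch of that pipeline; the dimensions, the toy token IDs, and names such as W_query, W_key, and W_value are illustrative assumptions rather than the author's exact code.

    # A minimal sketch of trainable self-attention (assumed layout, not the author's exact code)
    import torch

    torch.manual_seed(123)
    vocab_size, context_length, d_in, d_out = 50_000, 4, 8, 8

    # Step 1: map token IDs to token embeddings and add position embeddings
    token_ids = torch.tensor([[11, 42, 7, 99]])                     # (batch=1, seq_len=4), made-up IDs
    tok_emb = torch.nn.Embedding(vocab_size, d_in)
    pos_emb = torch.nn.Embedding(context_length, d_in)
    x = tok_emb(token_ids) + pos_emb(torch.arange(context_length))  # (1, 4, d_in)

    # Step 2: trainable projections producing queries, keys, and values
    W_query = torch.nn.Linear(d_in, d_out, bias=False)
    W_key = torch.nn.Linear(d_in, d_out, bias=False)
    W_value = torch.nn.Linear(d_in, d_out, bias=False)
    queries, keys, values = W_query(x), W_key(x), W_value(x)

    # Step 3: scaled dot product attention, softmax(Q K^T / sqrt(d_k)) V
    attn_scores = queries @ keys.transpose(-2, -1)                  # (1, 4, 4)
    attn_weights = torch.softmax(attn_scores / d_out ** 0.5, dim=-1)
    context_vectors = attn_weights @ values                         # (1, 4, d_out)
    print(attn_weights.shape, context_vectors.shape)

The printed shapes match the intuition above: one attention weight per pair of tokens, and one context vector per token.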

- The blog discusses the implementation of self-attention in LLMs, focusing on trainable weights.

- It outlines the steps from tokenization to generating context vectors for understanding language.

- Scaled dot product attention is highlighted as a key method for calculating attention scores (a toy calculation follows this list).

- The author reflects on the learning process and aims to clarify complex concepts for readers.
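
As a follow-up to the scaled dot product attention bullet, here is a toy calculation showing how raw dot products become attention weights; the vectors are invented purely for illustration.

    import torch

    # One query attending over three keys, with d_k = 4 (made-up vectors)
    q = torch.tensor([1.0, 0.0, 1.0, 0.0])
    K = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                      [0.0, 1.0, 0.0, 1.0],
                      [1.0, 1.0, 0.0, 0.0]])

    scores = K @ q                                      # raw dot products: [2., 0., 1.]
    weights = torch.softmax(scores / K.shape[-1] ** 0.5, dim=0)
    print(weights)                                      # ~[0.51, 0.19, 0.31]: most weight on the most similar key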

6 comments
By @andsoitis - about 1 month
> For me, it took eight re-reads of Raschka's (eminently clear and readable) explanation to get to a level where I felt I understood it.

It’s interesting to observe in oneself how repetition can result in internalizing new concepts. It is less about rote memorization and more about becoming aware of nuance and letting our minds “see” things from different angles, integrating them with existing world models through augmentation, replacement, or adjustment. The same holds for practicing activities that require some form of motor ability.

Some concepts are internalized less explicitly, like when we “learn” through role-modeling behaviors or feedback loops through interaction with people, objects, and ideas (like how to fit into a society).

By @penguin_booze - about 1 month
One sees "from scratch", and then one also sees

  from fancy_module import magic_functions
I'm semi-serious here, of course. To me, for something to be called 'from scratch', the requisite knowledge should be built from the ground up. To wit, I'd want to write the tokenizer myself, but I don't want to derive the laws of quantum physics that make the computation happen.

By @kureikain - about 1 month
In case anyone wants to read this book and lives in the Bay Area: you can also access O'Reilly Media online through your local library, and this book is available there.

By @ForOldHack - about 1 month
Part 8? Wait... Is this a story that wrote itself?

1) I am kidding. 2) At what point does it become self replicating? 3) skynet. 4) kidding - not kidding.