July 6th, 2024

Here’s how you can build and train GPT-2 from scratch using PyTorch

A guide on building a GPT-2 language model from scratch using PyTorch. Emphasizes simplicity, suitable for various levels of expertise. Involves training on a dataset of Taylor Swift and Ed Sheeran songs. Includes code snippets and references.

This article is a step-by-step guide to building and training a GPT-2 language model from scratch using PyTorch, starting with a custom tokenizer and ending with a trained model. The author emphasizes simplicity, keeping the material accessible to readers with varying levels of Python or machine learning experience. The project builds a GPT-2-style model and trains it on a dataset of Taylor Swift and Ed Sheeran songs. The article includes code snippets, explanations, and references to external resources, and covers the model architecture, the data loading process, and the training loop. The goal is to empower readers to construct their own language model and get started with natural language processing. The article hints at a continuation in Part 2 for further exploration.
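
To make those pieces concrete, here is a minimal sketch of the kind of components the article describes: a tokenizer, a tiny GPT-2-style transformer, and a basic training loop in PyTorch. This is not the article's actual code; the hyperparameters, the TinyGPT/Block class names, the lyrics.txt filename, the character-level encoding, and the use of nn.MultiheadAttention are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the article's code): character-level tokenizer,
# tiny GPT-2-style model, and a basic training loop in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

text = open("lyrics.txt").read()            # hypothetical file of song lyrics
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

block_size, n_embd, n_head, n_layer = 64, 128, 4, 4   # illustrative hyperparameters
vocab_size = len(chars)
data = torch.tensor(encode(text), dtype=torch.long)

def get_batch(batch_size=32):
    # Sample random contiguous chunks; targets are the inputs shifted by one token.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

class Block(nn.Module):
    """One pre-norm transformer block: causal self-attention followed by an MLP."""
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                                 nn.Linear(4 * n_embd, n_embd))

    def forward(self, x):
        T = x.size(1)
        # Boolean causal mask: True positions are blocked from attending.
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a
        return x + self.mlp(self.ln2(x))

class TinyGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block() for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx):
        T = idx.size(1)
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        return self.head(self.ln_f(self.blocks(x)))

model = TinyGPT()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(1000):
    xb, yb = get_batch()
    logits = model(xb)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(step, loss.item())
```

The article itself builds a custom tokenizer rather than the character-level one used here, so treat this only as a rough outline of the overall data-to-training flow it walks through.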

8 comments
By @rty32 - 3 months
Andrej Karpathy's video is probably much better than this:

https://youtu.be/l8pRSuU81PU

By @cjtrowbridge - 3 months
Also check out Andrej's new llm.c library which includes a script to do this from scratch with fineweb.
By @omerhac - 3 months
Cool blog, thanks!

I did a similar project a couple of years ago for a university course, only I also added style transfer, and it turned out pretty cool. I scraped a bunch of news data together with its news section and trained a self-attention language model from scratch; the results were pretty hilarious. The data was in Hebrew, which is a challenge to tokenize because of the morphology. I posted it on arXiv if anyone's interested in the style transfer and tokenization process: https://arxiv.org/abs/2212.03019

By @moffkalast - 3 months
That's cool as a learning experience, but if you're gonna build a language transformer, why not learn a more established open architecture like Llama instead of ClosedAI's outdated nonsense, so that whatever you end up training is plug-and-play compatible with every LLM tool in the universe once converted to a GGUF?

Otherwise it's like learning to build a website and stopping short of actually doing the final bit where you put it on a webserver and run it live.

By @aziis98 - 3 months
The link to the repo looks broken

https://github.com/ajeetkharel/gpt2-from-scratch/

By @KTibow - 3 months
Reminds me of TinyStories. I wonder if this architecture is better or worse than the ones it tested.