February 10th, 2025

TL;DR of Deep Dive into LLMs Like ChatGPT by Andrej Karpathy

Andrej Karpathy's video on large language models covers their architecture, training, and applications, emphasizing data collection, tokenization, hallucinations, and the importance of structured prompts and ongoing research for improvement.

Andrej Karpathy's recent video, "Deep dive into LLMs like ChatGPT," provides an extensive overview of large language models (LLMs), focusing on their architecture, training processes, and practical applications. The video, lasting over three hours, covers essential topics such as pre-training data collection, tokenization, neural network operations, and the distinction between the pre-training and post-training phases. Training begins with crawling the internet to gather vast datasets, which are then filtered and tokenized for processing. The model learns to predict the next token based on patterns in the data, adjusting its parameters through backpropagation. The stochastic nature of LLM outputs leads to variability in responses, which can result in hallucinations: instances where the model generates incorrect information. To mitigate this, techniques like supervised fine-tuning and reinforcement learning are employed, allowing models to learn from human interactions and improve their conversational abilities. The video also discusses the importance of structured prompts and the use of tools to enhance the model's accuracy and reduce hallucinations. Overall, Karpathy emphasizes the need for ongoing research and development in LLMs to refine their reasoning capabilities and practical utility.
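To make the next-token objective concrete, here is a minimal PyTorch-style sketch of the training step described above; the tiny model, random "tokens", and hyperparameters are illustrative placeholders, not the architecture or data the video discusses.

    # Minimal sketch of next-token prediction (illustrative only; real
    # pre-training uses a transformer, a learned tokenizer, and web-scale data).
    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 1000, 64
    model = nn.Sequential(                      # stand-in for a transformer
        nn.Embedding(vocab_size, embed_dim),
        nn.Linear(embed_dim, vocab_size),
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    tokens = torch.randint(0, vocab_size, (8, 33))    # pretend tokenized text
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict the next token from the current one

    logits = model(inputs)                            # (batch, seq, vocab) scores
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )
    loss.backward()       # backpropagation adjusts the parameters
    optimizer.step()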

- Andrej Karpathy's video provides a comprehensive overview of LLMs, focusing on their training and operational mechanisms.

- LLMs utilize vast datasets from the internet, which undergo extensive filtering and tokenization before training.

- The stochastic nature of LLM outputs can lead to hallucinations, which can be mitigated through reinforcement learning and fine-tuning.

- Structured prompts and tool usage are essential for improving the accuracy of LLM responses.

- Ongoing research is crucial for enhancing the reasoning capabilities of LLMs.

15 comments
By @albert_e - 2 months
OT: What is a good place to discuss the original video -- once it has dropped out of the HN front-page?

I am going through the video myself -- roughly halfway through -- and have a few things to bring up.

Here they are now that we have a fresh opportunity to discuss:

1 - MATH and LLMs

I am curious why many of the examples Andrej chose to pose to the LLM were "computational" questions -- for instance "what is 2+2" or numerical puzzles that needed algebraic thinking and then some addition/subtraction/multiplication (example at the 1:50 mark about buying apples and oranges).

I can understand that these abilities of LLMs are becoming powerful and useful too -- but in my mind these are not the "basic" abilities of a next-token predictor.

I would have appreciated a clearer distinction of prompts that showcase the core LLM ability -- to generate text that is generally grammatically correct and grounded in facts and context, without necessarily needing a working memory / assigning values to algebraic variables / doing arithmetic etc.

If there are any good references to discussion on the mathematical abilities of LLMs and the wisdom of trying to make them do math -- versus simply recognizing when math is needed, generating the necessary Python/expressions, and letting the tools handle it -- I would appreciate them.
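For what it's worth, a minimal sketch of that tool-use flow, under the assumption that the model can be asked to emit an arithmetic expression instead of a final number; `llm_generate` here is a hypothetical stand-in for whatever model call is actually used, not a real API.

    # Hypothetical sketch: the LLM only decides *that* math is needed and emits
    # an expression; ordinary code does the arithmetic. llm_generate is a
    # placeholder, not a real API.
    import ast
    import operator

    SAFE_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
                ast.Mult: operator.mul, ast.Div: operator.truediv}

    def eval_expr(node):
        """Evaluate a small arithmetic expression tree safely."""
        if isinstance(node, ast.Expression):
            return eval_expr(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in SAFE_OPS:
            return SAFE_OPS[type(node.op)](eval_expr(node.left), eval_expr(node.right))
        raise ValueError("unsupported expression")

    def answer_with_tool(question, llm_generate):
        # Ask the model for an expression, not a final number.
        expr = llm_generate(f"Write one arithmetic expression for: {question}")
        return eval_expr(ast.parse(expr, mode="eval"))

    # e.g. if the model returns "3*25 + 2*10" for an apples-and-oranges question,
    # the interpreter computes 95 instead of the model doing mental arithmetic.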

2 - META

While Andrej briefly acknowledges the "meta" situation where LLMs are being used to create training data for newer LLMs and to judge their outputs ... there is not much discussion of that here.

There are just many more examples of how LLMs are used to prepare mitigations for hallucinations, for example by preparing Q&A training sets with "correct" answers.

I am curious to know more about the limitations / perils of using LLMs to train/evaluate other LLMs.

I kind of feel that this is a bit like the Manhattan project and atomic weapons -- in that early results and advances are being looped back immediately into the development of more powerful technology. (A smaller fission charge at the core of a larger fusion weapon -- to be very loose with analogies)

<I am sure I will have a few more questions as I go through the rest of the video and digest it>

By @thomasahle - 2 months
I find Meta’s approach to hallucinations delightfully counterintuitive. Basically they (and presumably OpenAI and others):

   - Extract a snippet of training data.
   - Generate a factual question about it using Llama 3.
   - Have Llama 3 generate an answer.
   - Score the response against the original data.
   - If incorrect, train the model to recognize and refuse incorrect responses.
In a way this is obvious in hindsight, but it goes against ML engineers' natural tendency when detecting a wrong answer: teaching the model the right answer.

Instead of teaching the model to recognize what it doesn't know, why not teach it the right answers using those same examples? Of course the idea is to "connect the unused uncertainty neuron", which makes sense for out-of-context generalization. But we can at least appreciate why this wasn't an obvious thing to do for generation-1 LLMs.
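A minimal sketch of that recipe, where `generate_question`, `generate_answer`, and `score_answer` are made-up placeholders for the Llama 3 calls rather than Meta's actual tooling:

    # Hypothetical sketch of the hallucination-mitigation recipe described above.
    REFUSAL = "I'm not sure; I don't have reliable information about that."

    def build_refusal_examples(snippets, generate_question, generate_answer, score_answer):
        training_examples = []
        for snippet in snippets:
            question = generate_question(snippet)      # factual Q from the snippet
            answer = generate_answer(question)         # model's unassisted answer
            if score_answer(answer, snippet) < 0.5:    # wrong w.r.t. the source text
                # Instead of teaching the right answer, teach the model to refuse
                # on questions it evidently cannot answer from its own weights.
                training_examples.append({"prompt": question, "target": REFUSAL})
        return training_examples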

By @quantumspandex - 2 months
Andrej's video is great, but the explanation of the RL part is a bit vague to me. How exactly do we train on the right answers? Do we collect the reasoning traces and train on them like supervised learning, or do we compute some scores and use them as a loss function? Isn't the reward then very sparse? What if LLMs can't generate any right answers because the problems are too hard?

Also, how can the training of LLMs be parallelized when parameter updates are sequential? Sure, we can train on several samples simultaneously, but the parameter updates are still made with respect to the same step.
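For what it's worth, one common recipe (not necessarily the one the video has in mind) is to sample several attempts per problem, score them with a verifier, and then increase the likelihood of only the correct samples; a rough sketch, where `sample_answer`, `is_correct`, and `model.log_prob` are hypothetical placeholders:

    # Sketch of a reward-weighted-likelihood / REINFORCE-style step on a single
    # problem. All model/verifier calls are placeholders, not a specific API.
    import torch

    def rl_step(model, optimizer, problem, sample_answer, is_correct, num_samples=8):
        losses = []
        for _ in range(num_samples):
            tokens = sample_answer(model, problem)        # sampled reasoning + answer
            reward = 1.0 if is_correct(tokens) else 0.0   # sparse, verifier-based reward
            if reward > 0:
                # Treat the model's own correct samples as supervised targets:
                # maximize their log-likelihood (minimize the negative of it).
                logprob = model.log_prob(tokens, context=problem)
                losses.append(-reward * logprob)
        if losses:   # if no sample is correct, this problem contributes no gradient
            loss = torch.stack(losses).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()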

By @p0w3n3d - 2 months
At around 53 minutes into the original video, he shows how exactly an LLM can quote the text it was trained on. I wonder how big tech convinced the courts that this is not copyright violation (especially when ChatGPT was quoting GPL code). I can imagine the same thing happening in reverse: if I trained a model to draw a Disney character, my ass would be sued in a fraction of a second.

By @dzogchen - 2 months
For a model to be ‘fully’ open source you need more than the model itself and a way to run it. You also need the data and the program that can be used to train it.

See The Open Source AI Definition from OSI: https://opensource.org/ai

By @est - 2 months
I have read many articles about LLMs and understand how they work in general, but one thing always bothers me: why didn't other models work as well as the SOTA ones? What is the history and reasoning behind the current model architecture?

By @khazhoux - 2 months
I'm still seeking an answer to what DeepSeek really is, especially in the context of their $5M versus ChatGPT's >$1B (source: internet). What did they do versus not do?

By @sylware - 2 months
It is sad to see so much attention given to LLMs in comparison to the other types of AI, like those doing maths (strapped to a formal solver), folding proteins, etc.

We had a talk about those physics AIs using those maths AIs to design hard mathematical models that fit fundamental physics data.

By @miletus - 2 months
https://x.com/0xmetaschool/status/1888873667624661455

By @bluelightning2k - 2 months
Great write-up of what is presumably a truly great lecture. Debating whether to try to follow the original now.

By @9999_points - 2 months
It's a shame his LLM in C was just a launch pad for his course.

By @wolfhumble - 2 months
I haven't watched the video, but was wondering about the Tokenization part from the TL;DR:

"|" "View" "ing" "Single"

Just looking at the text being tokenized in the linked article, it looked (to me) like the text was "I View", but the "I" is actually a pipe "|".

From Step 3 in the link that @miletus posted in the Hacker News comments (https://x.com/0xmetaschool/status/1888873667624661455), the text being tokenized is:

|Viewing Single (Post From) . . .

The capitals used (View, Single) also make more sense when seeing this part of the sentence.
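For anyone who wants to check the splits locally, a small sketch using the tiktoken library (assuming it is installed; the exact token boundaries depend on which vocabulary you load):

    # Tokenize the snippet and print the text piece behind each token id.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # a GPT-4-style BPE vocabulary
    token_ids = enc.encode("|Viewing Single")
    print([enc.decode([t]) for t in token_ids])  # pieces along the lines of "|", "View", "ing", " Single"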

By @EncomLab - 2 months
It would be great if the hardware issues were discussed more - too little is made of the distinction between silicon-substrate, fixed-threshold, voltage-moderated, brittle networks of solid-state switches and protein-substrate, variable-threshold, chemically moderated, plastic networks of biological switches.

To be clear, neither possesses any "woo" outside of physics that gives one or the other some secret magical properties - but these are not arbitrary, meaningless distinctions in the way they are often discussed.