TL;DR of Deep Dive into LLMs Like ChatGPT by Andrej Karpathy
Andrej Karpathy's video on large language models covers their architecture, training, and applications, emphasizing data collection, tokenization, hallucinations, and the importance of structured prompts and ongoing research for improvement.
Andrej Karpathy's recent video, "Deep dive into LLMs like ChatGPT," provides an extensive overview of large language models (LLMs), focusing on their architecture, training processes, and practical applications. The video, lasting over three hours, covers essential topics such as pretraining data collection, tokenization, neural network operations, and the distinction between pre-training and post-training phases. Training begins with crawling the internet to gather vast datasets, which are then filtered and tokenized for processing. The model learns to predict the next token based on patterns in the data, adjusting its parameters through backpropagation. The stochastic nature of LLM outputs leads to variability in responses, which can result in hallucinations—instances where the model generates incorrect information. To mitigate this, techniques like reinforcement learning and supervised fine-tuning are employed, allowing models to learn from human interactions and improve their conversational abilities. The video also discusses the importance of structured prompts and the use of tools to enhance the model's accuracy and reduce hallucinations. Overall, Karpathy emphasizes the need for ongoing research and development in LLMs to refine their reasoning capabilities and practical utility.
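As a rough illustration of the next-token-prediction and backpropagation loop described above, here is a minimal sketch in PyTorch; the tiny vocabulary, the dimensions, and the embedding-plus-linear stand-in for a real transformer are all assumptions made purely for the example:

    import torch
    import torch.nn as nn

    # Toy stand-in for an LLM: an embedding plus a linear head instead of a
    # transformer. Vocabulary size and dimensions are arbitrary.
    vocab_size, d_model = 100, 32
    model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                          nn.Linear(d_model, vocab_size))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    tokens = torch.randint(0, vocab_size, (1, 9))    # a toy "document" of 9 token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from token t

    logits = model(inputs)                           # (1, 8, vocab_size) scores over the vocabulary
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                       targets.reshape(-1))
    loss.backward()                                  # backpropagation computes parameter gradients
    optimizer.step()                                 # parameters are adjusted to make the true next token more likely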
- Andrej Karpathy's video provides a comprehensive overview of LLMs, focusing on their training and operational mechanisms.
- LLMs utilize vast datasets from the internet, which undergo extensive filtering and tokenization before training.
- The stochastic nature of LLM outputs can lead to hallucinations, which can be mitigated through reinforcement learning and fine-tuning.
- Structured prompts and tool usage are essential for improving the accuracy of LLM responses.
- Ongoing research is crucial for enhancing the reasoning capabilities of LLMs.
Related
Overcoming the Limits of Large Language Models
Large language models (LLMs) like chatbots face challenges such as hallucinations and a lack of confidence estimates and citations. MIT researchers suggest strategies like curated training data and diverse worldviews to enhance LLM performance.
Has LLM killed traditional NLP?
Large Language Models (LLMs) streamline Natural Language Processing by using zero-shot prompts, reducing the need for extensive training data and retraining, potentially challenging traditional NLP methods' relevance and efficiency.
I am going through the video myself -- roughly halfway through -- and have a few things to bring up.
Here they are, now that we have a fresh opportunity to discuss them:
1 - MATH and LLMs
I am curious why many of the examples Andrej chose to pose to the LLM were "computational" questions -- for instance "what is 2+2", or numerical puzzles that needed algebraic thinking and then some addition/subtraction/multiplication (the example at 1:50 about buying apples and oranges).
I can understand that these abilities of LLMs are becoming powerful and useful too -- but in my mind these are not the "basic" abilities of a next-token predictor.
I would have appreciated a clearer distinction for prompts that showcase the core LLM ability: generating text that is generally grammatically correct and grounded in facts and context, without necessarily needing a working memory, the ability to assign values to algebraic variables, arithmetic, etc.
If there are any good references to discussions of the mathematical abilities of LLMs -- and the wisdom of trying to make them do math, versus simply recognizing when math is needed, generating the necessary Python/expressions, and letting the tools handle it -- I would appreciate them.
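For what it's worth, the tool route usually looks something like the sketch below: the model emits an expression and the harness evaluates it, rather than trusting the model's arithmetic. This is only an illustration -- call_llm and the <<python>> delimiters are hypothetical placeholders, not any particular API:

    import re

    def call_llm(prompt: str) -> str:
        # Placeholder: a real call would go to an LLM; here we hard-code a plausible reply.
        return "To total the fruit: <<python>>3 * 2 + 5 * 4<</python>>"

    def answer_with_tool(prompt: str) -> str:
        reply = call_llm(prompt)
        match = re.search(r"<<python>>(.*?)<</python>>", reply, re.S)
        if match:
            # Evaluate the emitted expression; real sandboxing is much more involved.
            result = eval(match.group(1), {"__builtins__": {}})
            return str(result)
        return reply

    print(answer_with_tool("Apples cost $3 and oranges cost $5. I buy 2 apples and 4 oranges. Total?"))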
2 - META
While Andrej briefly acknowledges the "meta" situation where LLMs are being used to create training data for, and to judge the outputs of, newer LLMs ... there is not much discussion of it here.
There are just many more examples of how LLMs are used to prepare mitigations for hallucinations, e.g. by preparing Q&A training sets with "correct" answers.
I am curious to know more about the limitations / perils of using LLMs to train/evaluate other LLMs.
I kind of feel that this is a bit like the Manhattan Project and atomic weapons -- in that early results and advances are immediately looped back into the development of more powerful technology. (A smaller fission charge at the core of a larger fusion weapon -- to be very loose with analogies.)
<I am sure I will have a few more questions as I go through the rest of the video and digest it>
- Extract a snippet of training data.
- Generate a factual question about it using Llama 3.
- Have Llama 3 generate an answer.
- Score the response against the original data.
- If incorrect, train the model to recognize and refuse incorrect responses.
In a way this is obvious in hindsight, but it goes against ML engineers' natural tendency when detecting a wrong answer: teaching the model the right answer. Instead of teaching the model to recognize what it doesn't know, why not teach it using those same examples? Of course the idea is to "connect the unused uncertainty neuron", which makes sense for out-of-context generalization. But we can at least appreciate why this wasn't an obvious thing to do for generation-1 LLMs.
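A rough sketch of that probing pipeline, just to make the control flow concrete. Every function here is a hypothetical placeholder (in the video it is Llama 3 doing the generating and judging); the point is only that consistently wrong answers become refusal-style training targets rather than corrected answers:

    def build_refusal_examples(corpus, ask_model, judge, n_attempts=3):
        # corpus: snippets of training data; ask_model / judge: placeholder callables.
        new_training_examples = []
        for snippet in corpus:                                                    # 1. a snippet of training data
            question = ask_model(f"Write a factual question about:\n{snippet}")   # 2. generate a question
            answers = [ask_model(question) for _ in range(n_attempts)]            # 3. let the model answer
            correct = [judge(snippet, question, a) for a in answers]              # 4. score against the source
            if not any(correct):                                                  # 5. consistently wrong -> teach refusal
                new_training_examples.append(
                    {"prompt": question, "response": "I'm sorry, I don't know."})
        return new_training_examples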
Also, how can the training of LLMs be parallelized when parameter updates are sequential? Sure, we can train on several samples simultaneously, but the parameter updates are all with respect to the first step.
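As far as I understand it, within a single step the per-sample gradients are independent of each other, so they can be computed in parallel (across GPUs, or vectorized on one) and averaged into one update; only the steps themselves are sequential. A minimal sketch, with a toy linear model and random data assumed just for illustration:

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    x, y = torch.randn(64, 16), torch.randn(64, 1)   # 64 samples processed "simultaneously"
    loss = nn.functional.mse_loss(model(x), y)       # mean loss over the batch
    loss.backward()                                  # gradients for all 64 samples, computed at the same parameters
    optimizer.step()                                 # one sequential update, shared by the whole batch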
See The Open Source AI Definition from OSI: https://opensource.org/ai
We had a talk about physics AIs using math AIs to design hard mathematical models that fit fundamental physics data.
"|" "View" "ing" "Single"
Just looking at the text being tokenized in the linked article, it looked (to me) like the text was "I View", but the "I" is actually a pipe "|".
From Step 3 in the link that @miletus posted in the Hacker News comment (https://x.com/0xmetaschool/status/1888873667624661455), the text being tokenized is:
|Viewing Single (Post From) . . .
The capitals used (View, Single) also make more sense when seeing this part of the sentence.
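If anyone wants to check, you can inspect a tokenizer directly. A small sketch using OpenAI's tiktoken -- assuming the cl100k_base encoding here, which may not be the tokenizer the screenshot used, so treat the exact split as illustrative:

    import tiktoken

    # Encode the string from the screenshot and print each token piece separately,
    # to see whether the first piece is a standalone "|" or part of a longer token.
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("|Viewing Single (Post From)")
    print([enc.decode([i]) for i in ids])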
To be clear, neither possesses any magical "woo" outside of physics that gives one or the other some secret magical properties - but these are not arbitrary meaningless distinctions in the way they are often discussed.