What Happened to BERT and T5?
Yi Tay analyzes how transformer models have evolved, explaining why BERT-style encoders gave way to models that fold denoising objectives into more flexible autoregressive and encoder-decoder architectures. He discusses encoder-decoder structures, bidirectional attention, and the value of denoising objectives for efficient language modeling.
Yi Tay discusses the evolution of transformer models like BERT and T5 in the context of encoder-decoder architectures, PrefixLM, and denoising objectives. He explains the shift from single-task finetuning to multi-task models, and the waning popularity of BERT-like models as more flexible ways to train with denoising objectives on autoregressive models emerged. The denoising objective, framed as a "fill in the blank" task, is compared to regular language modeling, and its limitations in loss exposure and sample efficiency are highlighted: only the corrupted spans receive a training signal, so fewer tokens per example contribute to the loss. The discussion also covers the value of bidirectional attention, the pros and cons of encoder-decoder architectures, and the potential of denoising objectives as a complement to regular language modeling. The post stresses that understanding these architectures still matters in the era of large language models for effective downstream task performance.
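To make the loss-exposure point concrete, here is a minimal illustrative sketch (not code from the post; the sentinel tokens follow T5's convention, everything else is made up) of how a span-corruption denoising example differs from a causal language-modeling example:

```python
# Illustrative only: how the two pretraining objectives see the same sentence.

sentence = "the quick brown fox jumps over the lazy dog"

# T5-style span corruption ("fill in the blank"): a few spans are replaced with
# sentinel tokens, and the model is trained to emit only the missing spans.
# The loss is computed on the short target, so most of the sentence's tokens
# contribute no training signal for this example.
denoise_input  = "the quick <extra_id_0> jumps over the <extra_id_1> dog"
denoise_target = "<extra_id_0> brown fox <extra_id_1> lazy"

# Causal language modeling: the model predicts every next token,
# so every position in the sequence contributes to the loss.
clm_input  = "the quick brown fox jumps over the lazy"
clm_target = "quick brown fox jumps over the lazy dog"
```

The per-example denoising target is much shorter than the full sequence, which is the sample-efficiency trade-off the post describes.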
Related
Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs
The study presents a method to boost Large Language Models' retrieval and reasoning abilities for long-context inputs by fine-tuning on a synthetic dataset. Results show significant improvements in information retrieval and reasoning skills.
The Illustrated Transformer
Jay Alammar's blog explains the Transformer model, highlighting the attention mechanism that makes its training faster and more parallelizable and lets it outperform Google's NMT model on some tasks. The blog breaks down components like self-attention and multi-headed attention for easier understanding.
Math Behind Transformers and LLMs
This post introduces transformers and large language models, focusing on OpenGPT-X and transformer architecture. It explains language models, training processes, computational demands, GPU usage, and the superiority of transformers in NLP.
Transformer Layers as Painters
The study "Transformer Layers as Painters" by Qi Sun et al. delves into transformer models, showcasing layer impact variations and potential for model optimization through strategic layer adjustments.
Electra: Pre-Training Text Encoders as Discriminators Rather Than Generators
The paper introduces ELECTRA, a text encoder pre-training method using replaced token detection instead of masked language modeling like BERT. ELECTRA outperforms BERT in contextual representation learning, especially for small models, with superior efficiency and effectiveness.
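For a rough picture of what replaced-token detection means, here is a toy illustration (not the paper's code):

```python
# Conceptual sketch of ELECTRA-style replaced-token detection.
original  = ["the", "chef", "cooked", "the", "meal"]

# A small generator fills in masked positions, sometimes with plausible but wrong tokens.
corrupted = ["the", "chef", "ate", "the", "meal"]

# The discriminator is trained on every position: 1 = token was replaced, 0 = original.
labels    = [0, 0, 1, 0, 0]
```

Because the discriminator receives a label at every position rather than only at masked ones, each example yields more training signal, which is where ELECTRA's efficiency claim comes from.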
- Many users highlight that BERT models are still widely used and effective, especially for tasks requiring fast and cheap execution.
- There is interest in more efficient BERT variants (e.g., ALBERT, DistilBERT) and better-trained ones like RoBERTa.
- Some users express confusion about the differences between encoder, decoder, and encoder-decoder models.
- Comments mention that large language models (LLMs) have overshadowed BERT due to their scalability and performance in zero-shot tasks.
- Despite the rise of LLMs, BERT models remain competitive in many tasks and are still heavily downloaded and used in various applications.
I have a small Swiss-army collection of custom BERT fine-tunes that are as good as or better than the best LLM and execute document classification tasks in 2.4ms. Find me an LLM that can do anything in 2.4ms.
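For readers unfamiliar with this kind of setup, a single-document classification call with a fine-tuned BERT-family model looks roughly like the sketch below (the checkpoint is a public example from the Hugging Face hub, not the commenter's fine-tune, and latency will depend on hardware and batch size):

```python
# Generic sketch of BERT-style document classification with Hugging Face transformers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # example public checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

inputs = tokenizer("This quarter's invoices are attached.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                  # shape: [1, num_labels]

label = model.config.id2label[logits.argmax(dim=-1).item()]
print(label)
```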
I have worked less with T5, but I would be curious about its head-to-head performance against decoder-only models these days. My guess is that the earlier downsides (context-window limitations) are less of a factor than they used to be.
Is the input/output of these models any different? Are they all just "text context goes in, scores for all tokens in the vocabulary come out"? Is the difference only in how they achieve this output?
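At the level of raw outputs they do look similar. A sketch with the Hugging Face transformers API (the checkpoints are public examples, not anything from the thread): both an encoder-only masked LM and a decoder-only causal LM return logits of shape [batch, sequence_length, vocab_size]; the differences are which positions you read them at and which attention pattern produced them.

```python
# Sketch: both model families emit per-position vocabulary logits.
import torch
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer

# Encoder-only masked LM (bidirectional attention): read logits at the [MASK] position.
mlm_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm_inputs = mlm_tok("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    mlm_logits = mlm(**mlm_inputs).logits   # [1, seq_len, vocab_size]

# Decoder-only causal LM (left-to-right attention): read logits at the last position.
clm_tok = AutoTokenizer.from_pretrained("gpt2")
clm = AutoModelForCausalLM.from_pretrained("gpt2")
clm_inputs = clm_tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    clm_logits = clm(**clm_inputs).logits   # [1, seq_len, vocab_size]

print(mlm_logits.shape, clm_logits.shape)
```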
When you have hundreds or thousands of examples, BERT works great. But that is very restricting.
In the end, LLMs trained with causal language modeling (CLM) or masked language modeling (MLM) learn to solve their objectives by building an efficient global model of language patterns. CLM is actually the harder problem, since MLM can leak information through surrounding context, and given the transformer scaling-law research that followed BERT and GPT, it is not a surprise that CLM won out in the long run.
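The leakage point shows up directly in the attention masks. A toy sketch (an assumed illustration, not from the thread): a causal LM restricts each position to its left context, while a masked LM lets the prediction at a masked position attend to tokens on both sides.

```python
import torch

seq_len = 5

# Causal LM: position i may only attend to positions <= i (lower-triangular mask).
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Masked LM: every position may attend to every other position, including
# tokens to the right of the one being predicted, so right-side context "leaks" in.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

print(causal_mask.int())
print(bidirectional_mask.int())
```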
Can someone explain this to me? I'm not sure how the compute costs end up the same between the 2N-parameter encoder-decoder and the N-parameter decoder-only model.
I mean, the scaling already happened in 2019 with RoBERTa. My guess is that these models are already good enough at what they need to do (creating meaningful text embeddings), and making them extremely large wasn't feasible for deployment.
Luckily, it is now trivial to drop the post into Claude and say "Re-write this without <list of things that bother me>"
So, just in case you also felt like you were driving over a road full of potholes while trying to read this post, don't just click away; have your handy LLM take a pass at it. There's good stuff to be found.