July 19th, 2024

What Happened to Bert and T5?

Yi Tay analyzes transformer model evolution, emphasizing denoising methods over BERT-like models. He discusses encoder-decoder structures, bidirectional attention, and the value of denoising objectives for efficient language modeling.


Yi Tay discusses the evolution of transformer models like BERT and T5 in the context of encoder-decoder architectures, PrefixLM, and denoising objectives. He explains the shift from single-task finetuning to multi-task models and the diminishing popularity of BERT-like models due to the emergence of more efficient denoising methods using autoregressive models. The denoising objective, focusing on "fill in the blank" tasks, is compared to regular language modeling, highlighting its limitations in terms of loss exposure and sample efficiency. The discussion also touches on the value of bidirectional attention, the pros and cons of encoder-decoder architectures, and the potential of denoising objectives as complementary to regular language modeling. The post emphasizes the importance of understanding these model architectures in the era of large language models for effective downstream task performance.

AI: What people are saying
The comments discuss the relevance and performance of BERT models in the context of modern NLP research and applications.
  • Many users highlight that BERT models are still widely used and effective, especially for tasks requiring fast and cheap execution.
  • There is a trend towards scaling down BERT models (e.g., RoBERTa, ALBERT, DistilBERT) to improve efficiency.
  • Some users express confusion about the differences between encoder, decoder, and encoder-decoder models.
  • Comments mention that large language models (LLMs) have overshadowed BERT due to their scalability and performance in zero-shot tasks.
  • Despite the rise of LLMs, BERT models remain competitive in many tasks and are still heavily downloaded and used in various applications.
16 comments
By @hdhshdhshdjd - 8 months
Maybe in SOTA ml/nlp research, but in the world of building useful tools and products, BERT models are dead simple to tune, work great if you have decent training data, and most importantly are very very fast and very very cheap to run.

I have a small Swiss army collection of custom BERT fine-tunes that are equal to or better than the best LLMs and execute document classification tasks in 2.4ms. Find me an LLM that can do anything in 2.4ms.

By @janalsncm - 8 months
BERT didn’t go anywhere, and I have seen fine-tuned BERT backbones everywhere. They are useful for generating embeddings to be used downstream, and small enough to be handled on consumer (pre-Ampere) hardware. One of the trends I have seen is scaling BERT down rather than up: since BERT already gave good performance, we want to be able to do it faster and cheaper. That gave rise to RoBERTa, ALBERT, and DistilBERT.

I have worked less with T5, but I would be curious about its head-to-head performance against decoder-only models these days. My guess is the downsides from before (context window limitations) are less of a factor than they used to be.
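The "embeddings to be used downstream" pattern usually means mean-pooling the encoder's final hidden states over non-padding tokens. A minimal NumPy sketch of just the pooling step, with random arrays standing in for real BERT outputs:

```python
import numpy as np

# Stand-in for BERT's final hidden states: (batch=2, seq_len=4, hidden=8).
# In practice these would come from a fine-tuned BERT backbone.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))

# Attention mask: 1 = real token, 0 = padding.
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 0]])

# Mean-pool only over real tokens to get one embedding per sentence.
m = mask[:, :, None]
emb = (hidden * m).sum(axis=1) / m.sum(axis=1)
print(emb.shape)  # (2, 8)
```

The resulting fixed-size vectors can then feed any downstream classifier or nearest-neighbor index.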

By @vintermann - 8 months
For people like me who gave up trying to follow arXiv ML papers 3+ years ago, articles like these are gold. I would love a YouTube channel or blog that does retrospectives on "big" papers of the last decade (those that everyone paid attention to at the time) and looks at where the ideas are today.
By @empiko - 8 months
BERT is still the most downloaded LM at huggingface with 46M downloads last month. XLM Roberta has 24M and Distilbert is at 15M. I feel like BERTs are doing okay.
By @andy_xor_andrew - 8 months
I'm a bit embarrassed to admit, but I still don't understand decoder vs encoder vs decoder/encoder models.

Is the input/output of these models any different? Are they all just "text context goes in, scores for all tokens in the vocabulary come out" ? Is the difference only in how they achieve this output?
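The input/output shapes are largely the same; the key mechanical difference is the attention mask. Encoders (BERT) let every token attend to every other token, while decoders (GPT) restrict each position to its predecessors. A minimal sketch of the two masks (1 = may attend, 0 = blocked):

```python
import numpy as np

seq_len = 4

# Encoder-style (BERT): bidirectional attention — every token sees every token.
bidirectional = np.ones((seq_len, seq_len), dtype=int)

# Decoder-style (GPT): causal attention — token i sees only tokens 0..i.
causal = np.tril(np.ones((seq_len, seq_len), dtype=int))

print(causal)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

An encoder-decoder (T5) combines both: bidirectional attention over the input in the encoder, causal attention over the output in the decoder, plus cross-attention from decoder to encoder.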

By @lalaland1125 - 8 months
I think the big reason why BERT and T5 have fallen out of favor is the lack of zero-shot (or few-shot) ability.

When you have hundreds or thousands of examples, BERT works great. But that is very restricting.

By @minimaxir - 8 months
What happened is that "transformers go whrrrrrr." (yes, that's the academic term)

In the end, LLMs using causal language modeling (CLM) or masked language modeling (MLM) learn to best solve their objectives by building an efficient global model of language patterns. CLM is actually a harder problem to solve, since MLM can leak information through surrounding context, and with the transformer scaling-law research post-BERT/GPT it's not a surprise CLM won out in the long run.
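The two objectives on the same sentence, as a sketch (tokens are illustrative; real models operate on subword ids). Note that MLM takes loss only on the masked positions, while CLM takes loss on every position, which relates to the sample-efficiency point in the article summary:

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# MLM (BERT-style): mask some tokens, predict them from the full
# bidirectional context. Loss applies only at the masked positions.
mlm_input = ["the", "[MASK]", "sat", "on", "the", "[MASK]"]
mlm_targets = {1: "cat", 5: "mat"}

# CLM (GPT-style): predict every next token from the left context.
# Loss applies at every position.
clm_inputs = tokens[:-1]   # ["the", "cat", "sat", "on", "the"]
clm_targets = tokens[1:]   # ["cat", "sat", "on", "the", "mat"]
```

The "leak" mentioned above: in MLM, unmasked neighbors on both sides of a blank carry strong hints, so each prediction is easier than predicting the next token from left context alone.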

By @k8si - 8 months
I believe many high-quality embedding models are still based on BERT, even recent ones, so I don't think it's entirely fair to characterize it as "deprecated".
By @a_bonobo - 8 months
DNABERT-S came out half a year ago: seems like xBERT is still useful in genomics/DNA? https://arxiv.org/abs/2402.08777
By @htrp - 8 months
feels like large language models sucked all the air out of the room because it was a lot easier to scale compute and data, and after RoBERTa, no one was willing to continue exploring.
By @jszymborski - 8 months
> It is also worth to note that, generally speaking, an Encoder-Decoders of 2N parameters has the same compute cost as a decoder-only model of N parameters which gives it a different FLOP to parameter count ratio.

Can someone explain this to me? I'm not sure how the compute costs are the same between the 2N and N nets.
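One way to see it (a back-of-envelope sketch, using the common rule of thumb that a forward pass costs roughly 2 × active parameters × tokens, and ignoring cross-attention): in an encoder-decoder of 2N parameters, input tokens pass only through the encoder's N parameters and output tokens only through the decoder's N parameters, whereas a decoder-only model of N parameters runs every token through all N:

```python
# Illustrative numbers, not from the article.
N = 1_000_000_000            # decoder-only params; the enc-dec has N + N = 2N
input_toks, output_toks = 512, 512

# Encoder-decoder: each token only touches the half it belongs to.
enc_dec_flops = 2 * N * input_toks + 2 * N * output_toks

# Decoder-only: all tokens pass through the same N params.
dec_only_flops = 2 * N * (input_toks + output_toks)

print(enc_dec_flops == dec_only_flops)  # True
```

So per forward pass the compute matches, even though the encoder-decoder holds twice the parameters, which is the "different FLOP to parameter count ratio" in the quote.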

By @bugglebeetle - 8 months
Wasn’t there a recent paper that demonstrated BERT models are still competitive or beat LLMs in many tasks?
By @caprock - 8 months
Yi is a good source in this area, and a good follow on Twitter.
By @IAmBurger - 8 months
IMO GenAI gets all the hype, but in industry, the robustness (i.e. it does not hallucinate) of extractive models is much appreciated.
By @GaggiX - 8 months
>If BERT worked so well, why not scale it?

I mean, the scaling already happened in 2019 with RoBERTa, my guess is that these models are already good enough at what they need to do (creating meaningful text embeddings), and making them extremely large wasn't feasible for deployment.

By @iandanforth - 8 months
nit: I find the writing in this post very distracting. (Grammar and style pet peeves)

Luckily, it is now trivial to drop the post into Claude and say "Re-write this without <list of things that bother me>"

So, just in case you also felt like you were driving over a road filled with potholes trying to read this post, don't just click away, have your handy LLM take a pass at it. There's good stuff to be found.