ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
The paper introduces ELECTRA, a text-encoder pre-training method that uses replaced token detection instead of BERT's masked language modeling. ELECTRA learns contextual representations more efficiently than BERT, with especially large gains for small models.
The paper titled "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators" introduces a new pre-training method for text encoders. Unlike traditional methods such as BERT that use masked language modeling (MLM), ELECTRA proposes a more efficient approach called replaced token detection. Instead of masking input tokens, ELECTRA corrupts the input by replacing some tokens with plausible alternatives sampled from a small generator network. A discriminative model is then trained to predict, for every token, whether it was replaced by a generator sample. Experiments show that ELECTRA learns contextual representations more effectively than BERT, especially for small models, and reaches strong results with far less compute than models like GPT, RoBERTa, and XLNet. Notably, an ELECTRA model trained on one GPU for four days surpasses GPT on the GLUE benchmark, even though GPT used about 30 times more compute. At scale, ELECTRA performs comparably to RoBERTa and XLNet while using substantially less compute, demonstrating its efficiency in pre-training text encoders.
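To make the replaced-token-detection objective concrete, here is a minimal PyTorch-style sketch of one training step. The TinyEncoder modules are toy stand-ins for the paper's Transformer generator and discriminator, and the vocabulary size, masking rate, and loss weight are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of ELECTRA-style replaced token detection (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, MASK_ID = 1000, 64, 0  # toy sizes; 0 is reserved as the [MASK] id here

class TinyEncoder(nn.Module):
    """Stand-in for a Transformer encoder: embeds tokens and projects per position."""
    def __init__(self, out_dim):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.proj = nn.Linear(HIDDEN, out_dim)

    def forward(self, ids):
        return self.proj(self.embed(ids))

generator = TinyEncoder(out_dim=VOCAB)   # small MLM generator
discriminator = TinyEncoder(out_dim=1)   # per-token "was this replaced?" score

def electra_step(input_ids, mask_prob=0.15):
    # 1) Mask a random subset of positions, as in MLM.
    mask = torch.rand(input_ids.shape) < mask_prob
    masked = input_ids.masked_fill(mask, MASK_ID)

    # 2) The generator proposes replacements; samples (not argmax) fill the masked slots.
    gen_logits = generator(masked)
    samples = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask, samples, input_ids)

    # 3) The discriminator labels every token: original or generator replacement?
    is_replaced = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted).squeeze(-1)

    gen_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
    # The paper weights the discriminator loss much more heavily (lambda ~ 50).
    return gen_loss + 50.0 * disc_loss

loss = electra_step(torch.randint(1, VOCAB, (2, 16)))
loss.backward()
```

The key point the sketch shows is that the discriminator gets a learning signal from every input position, whereas MLM only learns from the ~15% of masked tokens, which is where the compute efficiency comes from.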
Related
Researchers run high-performing LLM on the energy needed to power a lightbulb
Researchers at UC Santa Cruz developed an energy-efficient method for large language models. By using custom hardware and ternary numbers, they achieved high performance with minimal power consumption, potentially revolutionizing model power efficiency.
Math Behind Transformers and LLMs
This post introduces transformers and large language models, focusing on OpenGPT-X and transformer architecture. It explains language models, training processes, computational demands, GPU usage, and the superiority of transformers in NLP.
New AI Training Technique Is Drastically Faster, Says Google
Google's DeepMind introduces JEST, a new AI training technique speeding up training by 13 times and boosting efficiency by 10 times. JEST optimizes data selection, reducing energy consumption and improving model effectiveness.
Image Self Supervised Learning on a Shoestring
A new cost-effective approach in machine learning, IJEPA, enhances image encoder training by predicting missing parts internally. Released on GitHub, it optimizes image embeddings, reducing computational demands for researchers.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019)
The study by Colin Raffel et al. presents a unified text-to-text transformer for transfer learning in NLP. It introduces new techniques, achieves top results in various tasks, and provides resources for future research.
Unfortunately, BERT models are dead. Even the cross between BERT and GPT, the T5 architecture (encoder-decoder), is rarely used.
The issue with BERT is that you need to modify the network to adapt it to any task by adding a prediction head, while decoder models (GPT-style) handle every task with tokens and never need to modify the network (see the sketch below). Their advantage is a single format for everything. BERT's advantage is bidirectional attention, but apparently large decoders don't suffer much from unidirectionality.
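A hedged sketch of that contrast, using the Hugging Face transformers API: the model names and the sentiment example are just common illustrations, and a small GPT-2 won't actually do this task well zero-shot, but it shows the difference in format.

```python
# Encoder (BERT-style): a new, task-specific head must be bolted onto the network.
# Decoder (GPT-style): the same task is phrased entirely in tokens; no new parameters.
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# BERT-style: classify via the [CLS] position with a freshly initialized head.
encoder = AutoModel.from_pretrained("bert-base-uncased")
clf_head = nn.Linear(encoder.config.hidden_size, 2)  # new task-specific parameters to train
enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = enc_tok("This movie was great", return_tensors="pt")
logits = clf_head(encoder(**batch).last_hidden_state[:, 0])

# GPT-style: the task is expressed as a prompt and answered as the next token.
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
dec_tok = AutoTokenizer.from_pretrained("gpt2")
prompt = dec_tok("Review: This movie was great\nSentiment:", return_tensors="pt")
out = decoder.generate(**prompt, max_new_tokens=1)
print(dec_tok.decode(out[0, -1:]))  # the "answer" is just another generated token
```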
The gains in training efficiency and compute cost relative to widely used text encoders like RoBERTa and XLNet are significant.
Thank you for sharing this on HN!