July 16th, 2024

Electra: Pre-Training Text Encoders as Discriminators Rather Than Generators

The paper introduces ELECTRA, a text-encoder pre-training method that uses replaced token detection in place of BERT's masked language modeling. ELECTRA outperforms BERT at contextual representation learning, especially for small models, while being substantially more compute-efficient.

Read original article

The paper titled "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators" introduces a new pre-training method for text encoders. Unlike traditional methods such as BERT that use masked language modeling (MLM), ELECTRA proposes a more sample-efficient approach called replaced token detection. Instead of masking input tokens, ELECTRA corrupts the input by replacing some tokens with alternatives sampled from a small generator network. A discriminative model is then trained to predict, for every token, whether it was replaced by a generator sample. Experiments show that ELECTRA outperforms BERT at contextual representation learning, especially for small models, and achieves strong results with less compute than models like GPT, RoBERTa, and XLNet. Notably, an ELECTRA model trained on one GPU for 4 days surpasses GPT's performance on the GLUE benchmark, despite GPT using 30 times more compute. At scale, ELECTRA performs comparably to RoBERTa and XLNet while using significantly less compute, demonstrating its efficiency and effectiveness in pre-training text encoders.
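The replaced token detection objective is compact enough to sketch. Below is a minimal, illustrative PyTorch sketch of the joint generator/discriminator training step described above; `TinyEncoder`, the batch and sequence sizes, and the [MASK] token id are stand-ins (the real models are full transformer encoders), though the ~15% mask rate and the discriminator loss weight of 50 follow the paper.

```python
# Minimal sketch of ELECTRA's replaced token detection objective (illustrative only).
# A small generator fills in masked positions; the discriminator then classifies
# every token of the corrupted sequence as "original" or "replaced".
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, seq_len, batch = 30522, 256, 128, 8

class TinyEncoder(nn.Module):
    """Stand-in for a transformer encoder: embeddings plus a linear projection."""
    def __init__(self, out_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(hidden, out_dim)
    def forward(self, ids):
        return self.proj(self.embed(ids))          # (batch, seq_len, out_dim)

generator = TinyEncoder(vocab_size)                # predicts a token for each masked slot
discriminator = TinyEncoder(1)                     # one logit per token: replaced or not

tokens = torch.randint(0, vocab_size, (batch, seq_len))
mask = torch.rand(batch, seq_len) < 0.15           # ~15% of positions are masked
masked_tokens = tokens.masked_fill(mask, 103)      # 103 ~ [MASK] id in BERT-style vocabs

# 1) Generator: standard MLM loss on the masked positions.
gen_logits = generator(masked_tokens)
gen_loss = F.cross_entropy(gen_logits[mask], tokens[mask])

# 2) Build the corrupted input by sampling the generator's predictions.
with torch.no_grad():
    sampled = torch.distributions.Categorical(logits=gen_logits[mask]).sample()
corrupted = tokens.clone()
corrupted[mask] = sampled

# 3) Discriminator: per-token binary classification (was this token replaced?).
#    Positions where the generator happened to sample the original count as "original".
is_replaced = (corrupted != tokens).float()
disc_logits = discriminator(corrupted).squeeze(-1)
disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

# Both models are trained jointly, with the discriminator term up-weighted.
loss = gen_loss + 50.0 * disc_loss
```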

6 comments
By @visarga - 8 months
LOL, I was reading the abstract and remembering there was a paper like that. Then I looked at the title and saw it was from 2020. For a moment I thought someone had plagiarised the original paper.

Unfortunately BERT models are dead. Even the cross between BERT and GPT - the T5 architecture (encoder-decoder) - is rarely used.

The issue with BERT is that you need to modify the network to adapt it to any task by creating a prediction head, while decoder models (GPT style) do every task with tokens and never need to modify the network. Their advantage is that they have a single format for everything. BERT's advantage is the bidirectional attention, but apparently large decoder models don't have an issue with unidirectionality.
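To make the prediction-head point concrete, here is a minimal, hypothetical PyTorch sketch: the encoder output is faked with random tensors, and the head names and sizes are purely illustrative, not any particular library's API.

```python
# Illustrative sketch: encoder-style models need a new head per task,
# while decoder-style models express the task entirely in tokens.
import torch
import torch.nn as nn

vocab_size, hidden, seq_len, batch = 30522, 768, 64, 4

# Stand-in for the hidden states a BERT-style encoder would produce.
encoder_states = torch.randn(batch, seq_len, hidden)

# Encoder route: each new task requires bolting extra parameters onto the network.
sentiment_head = nn.Linear(hidden, 2)                      # 2-way sentiment classifier
qa_span_head = nn.Linear(hidden, 2)                        # start/end logits for span QA

sentiment_logits = sentiment_head(encoder_states[:, 0])    # read the [CLS]-position vector
span_logits = qa_span_head(encoder_states)                 # per-token start/end scores

# Decoder route: no new parameters; the task lives in the token stream itself,
# e.g. prompting "Review: great movie. Sentiment:" and reading the generated token.
```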

By @cs702 - 8 months
Good work by well-known reputable authors.

The gains in training efficiency and compute cost versus widely used text-encoding models like RoBERTa and XLNet are significant.

Thank you for sharing this on HN!

By @adw - 8 months
(2020)
By @trhway - 8 months
Reminds me of a parallel from classic expert systems - human experts shine at discrimination, and that is one of the most efficient methods of eliciting knowledge from them.