July 16th, 2024

Electra: Pre-Training Text Encoders as Discriminators Rather Than Generators

The paper introduces ELECTRA, a text-encoder pre-training method that uses replaced token detection in place of BERT's masked language modeling. ELECTRA outperforms BERT at contextual representation learning, especially for small models, while being substantially more compute-efficient.

Read original article

The paper titled "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators" introduces a new pre-training method for text encoders. Unlike traditional methods such as BERT that use masked language modeling (MLM), ELECTRA proposes a more sample-efficient approach called replaced token detection. Instead of masking input tokens, ELECTRA corrupts the input by replacing some tokens with alternatives sampled from a small generator network. A discriminative model is then trained to predict, for every token, whether it was replaced by a generator sample. Experiments show that ELECTRA outperforms BERT at contextual representation learning, especially for small models, and achieves strong results with less compute than models like GPT, RoBERTa, and XLNet. Notably, an ELECTRA model trained on one GPU for 4 days surpasses GPT's performance on the GLUE benchmark, despite GPT using 30 times more compute. At scale, ELECTRA performs comparably to RoBERTa and XLNet while using significantly less compute, demonstrating its efficiency and effectiveness in pre-training text encoders.
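The replaced token detection objective is compact enough to sketch. Below is a minimal, illustrative PyTorch sketch of the joint generator/discriminator training step described above; `TinyEncoder`, the batch and sequence sizes, and the [MASK] token id are stand-ins (the real models are full transformer encoders), though the ~15% mask rate and the discriminator loss weight of 50 follow the paper.

```python
# Minimal sketch of ELECTRA's replaced token detection objective (illustrative only).
# A small generator fills in masked positions; the discriminator then classifies
# every token of the corrupted sequence as "original" or "replaced".
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, seq_len, batch = 30522, 256, 128, 8

class TinyEncoder(nn.Module):
    """Stand-in for a transformer encoder: embeddings plus a linear projection."""
    def __init__(self, out_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(hidden, out_dim)
    def forward(self, ids):
        return self.proj(self.embed(ids))          # (batch, seq_len, out_dim)

generator = TinyEncoder(vocab_size)                # predicts a token for each masked slot
discriminator = TinyEncoder(1)                     # one logit per token: replaced or not

tokens = torch.randint(0, vocab_size, (batch, seq_len))
mask = torch.rand(batch, seq_len) < 0.15           # ~15% of positions are masked
masked_tokens = tokens.masked_fill(mask, 103)      # 103 ~ [MASK] id in BERT-style vocabs

# 1) Generator: standard MLM loss on the masked positions.
gen_logits = generator(masked_tokens)
gen_loss = F.cross_entropy(gen_logits[mask], tokens[mask])

# 2) Build the corrupted input by sampling the generator's predictions.
with torch.no_grad():
    sampled = torch.distributions.Categorical(logits=gen_logits[mask]).sample()
corrupted = tokens.clone()
corrupted[mask] = sampled

# 3) Discriminator: per-token binary classification (was this token replaced?).
#    Positions where the generator happened to sample the original count as "original".
is_replaced = (corrupted != tokens).float()
disc_logits = discriminator(corrupted).squeeze(-1)
disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

# Both models are trained jointly, with the discriminator term up-weighted.
loss = gen_loss + 50.0 * disc_loss
```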

6 comments
By @visarga - 8 months
LOL, I was reading the abstract and remembering there was a paper like that. Then I looked at the title and saw it was from 2020. For a moment I thought someone had plagiarised the original paper.

Unfortunately BERT models are dead. Even the cross between BERT and GPT - the T5 architecture (encoder-decoder) - is rarely used.

The issue with BERT is that you need to modify the network to adapt it to any task by creating a prediction head, while decoder models (GPT style) do every task with tokens and never need to modify the network. Their advantage is that they have a single format for everything. BERT's advantage is the bidirectional attention, but apparently large decoder models don't have an issue with unidirectionality.
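To make the prediction-head point concrete, here is a minimal, hypothetical PyTorch sketch: the encoder output is faked with random tensors, and the head names and sizes are purely illustrative, not any particular library's API.

```python
# Illustrative sketch: encoder-style models need a new head per task,
# while decoder-style models express the task entirely in tokens.
import torch
import torch.nn as nn

vocab_size, hidden, seq_len, batch = 30522, 768, 64, 4

# Stand-in for the hidden states a BERT-style encoder would produce.
encoder_states = torch.randn(batch, seq_len, hidden)

# Encoder route: each new task requires bolting extra parameters onto the network.
sentiment_head = nn.Linear(hidden, 2)                      # 2-way sentiment classifier
qa_span_head = nn.Linear(hidden, 2)                        # start/end logits for span QA

sentiment_logits = sentiment_head(encoder_states[:, 0])    # read the [CLS]-position vector
span_logits = qa_span_head(encoder_states)                 # per-token start/end scores

# Decoder route: no new parameters; the task lives in the token stream itself,
# e.g. prompting "Review: great movie. Sentiment:" and reading the generated token.
```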

By @cs702 - 8 months
Good work by well-known reputable authors.

The gains in training efficiency and compute cost versus widely used text-encoding models like RoBERTa and XLNet are significant.

Thank you for sharing this on HN!

By @adw - 8 months
(2020)
By @trhway - 8 months
Reminds me of a parallel from classic expert systems - human experts shine at discrimination, and that is one of the most efficient methods of eliciting knowledge from them.