July 8th, 2024

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

The paper introduces Test-Time Training (TTT) layers for sequence modeling, which combine linear complexity with a hidden state that is updated by self-supervised learning, even on test sequences. TTT-Linear outperforms the Transformer baseline, while TTT-MLP shows further potential for long contexts.

Read original article

The paper presents a new class of sequence modeling layers called Test-Time Training (TTT) layers, which have linear complexity and an expressive hidden state. The key idea is that the hidden state is itself a machine learning model and the update rule is a step of self-supervised learning, so the hidden state continues to be trained even on test sequences. Two instantiations, TTT-Linear and TTT-MLP, are introduced, with the latter showing particular potential in long contexts. Both are evaluated against Transformer and Mamba baselines; TTT-Linear is already faster than the Transformer at 8k context. TTT-MLP still faces challenges with memory I/O but indicates promise for future research. The proposed layers aim to improve long-context performance compared to existing RNN and Transformer models.
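
To make the mechanism concrete, below is a minimal sketch (in plain NumPy, not the authors' code) of the TTT-Linear idea: the hidden state is the weight matrix of a small inner linear model, and each incoming token triggers one gradient step on a self-supervised reconstruction loss before the token is projected through the freshly updated state. The class name TTTLinearSketch, the learning rate inner_lr, and the projection matrices theta_k, theta_v, theta_q are illustrative assumptions.

    import numpy as np

    class TTTLinearSketch:
        """Illustrative sketch: the hidden state is the weight matrix W of an
        inner linear model, updated by one self-supervised gradient step per token."""
        def __init__(self, dim, inner_lr=0.1, seed=0):
            rng = np.random.default_rng(seed)
            self.W = np.zeros((dim, dim))  # hidden state = weights of the inner model
            self.theta_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # "corrupted view" projection
            self.theta_v = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # reconstruction-target projection
            self.theta_q = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # output-view projection
            self.inner_lr = inner_lr

        def step(self, x):
            # Self-supervised inner loss: reconstruct theta_v @ x from theta_k @ x.
            k, v, q = self.theta_k @ x, self.theta_v @ x, self.theta_q @ x
            err = self.W @ k - v                         # gradient of 0.5*||W k - v||^2 w.r.t. W is err k^T
            self.W -= self.inner_lr * np.outer(err, k)   # one gradient step = the "RNN" state update
            return self.W @ q                            # output uses the freshly updated state

    # Usage: process a sequence token by token, even at test time.
    layer = TTTLinearSketch(dim=16)
    outputs = [layer.step(x) for x in np.random.default_rng(1).normal(size=(128, 16))]

Because the update is a single gradient step per token, the cost grows linearly with sequence length, which is where the claimed linear complexity comes from.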

Related

What's better: Neural nets wider with less layers or thinner with more layers

Experiments compared Transformer models with varying layer depths and widths. Optimal performance was achieved with a model featuring four layers and an embedding dimension of 1024. Balancing layer depth and width is crucial for efficiency and performance improvement.
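
As a rough illustration of that depth-versus-width trade-off, the sketch below compares a shallow-wide and a deep-narrow configuration under the commonly used approximation of about 12 * d_model^2 parameters per Transformer block; both the formula and the second configuration are simplifying assumptions rather than figures from the article.

    # Rough parameter count per Transformer block: ~12 * d_model^2
    # (4*d^2 for the attention projections + 8*d^2 for a 4x-wide MLP); an approximation, not exact.
    def block_params(d_model: int) -> int:
        return 12 * d_model * d_model

    def model_params(n_layers: int, d_model: int) -> int:
        return n_layers * block_params(d_model)

    wide_shallow = model_params(n_layers=4, d_model=1024)   # ~50.3M block parameters
    deep_narrow  = model_params(n_layers=16, d_model=512)   # ~50.3M block parameters
    print(wide_shallow, deep_narrow)  # same budget, very different depth/width split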

Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs

The study presents a method to boost Large Language Models' retrieval and reasoning abilities for long-context inputs by fine-tuning on a synthetic dataset. Results show significant improvements in information retrieval and reasoning skills.
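
For a sense of what such a synthetic retrieval dataset can look like, here is a minimal, hypothetical sample generator: a long list of key-value pairs followed by a question about one of them. The format and the helper make_retrieval_sample are assumptions for illustration, not the dataset used in the paper.

    import random

    def make_retrieval_sample(n_pairs: int = 50, seed: int = 0) -> dict:
        """Build one synthetic key-value retrieval example (illustrative format)."""
        rng = random.Random(seed)
        pairs = {f"key_{i}": rng.randint(0, 9999) for i in range(n_pairs)}
        needle = rng.choice(list(pairs))                 # the "needle" to retrieve
        context = "\n".join(f"{k}: {v}" for k, v in pairs.items())  # the "haystack"
        return {
            "prompt": f"{context}\n\nWhat is the value of {needle}?",
            "answer": str(pairs[needle]),
        }

    sample = make_retrieval_sample()
    print(sample["prompt"][:80], "...", sample["answer"])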

xLSTM Explained in Detail

Maximilian Beck's YouTube video delves into xLSTM as a Transformer alternative for language modeling. xLSTM combines the classic LSTM with modern techniques such as exponential gating and revised memory structures to address its limited storage capacity and its difficulty revising stored information, aiming to rival Transformers in predictive tasks.
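
To make those "modern techniques" slightly more concrete, here is a simplified NumPy sketch of the exponential-gating sLSTM cell from the xLSTM paper; it omits the stabilizer state and the matrix-memory (mLSTM) variant, and the parameter layout and toy usage are assumptions rather than code from the video.

    import numpy as np

    def slstm_step(x, h_prev, c_prev, n_prev, params):
        """One step of a simplified sLSTM-style cell with exponential gating.
        The full xLSTM also adds a stabilizer state to keep the exponentials numerically safe."""
        Wz, Wi, Wf, Wo, Rz, Ri, Rf, Ro = params
        z = np.tanh(Wz @ x + Rz @ h_prev)                    # candidate cell input
        i = np.exp(Wi @ x + Ri @ h_prev)                     # exponential input gate
        f = np.exp(Wf @ x + Rf @ h_prev)                     # exponential forget gate
        o = 1.0 / (1.0 + np.exp(-(Wo @ x + Ro @ h_prev)))    # sigmoid output gate
        c = f * c_prev + i * z                               # cell state
        n = f * n_prev + i                                   # normalizer state
        h = o * (c / n)                                      # normalized hidden state
        return h, c, n

    # Toy usage with 4-dimensional states and small random parameters (illustrative only).
    rng = np.random.default_rng(0)
    params = tuple(0.1 * rng.normal(size=(4, 4)) for _ in range(8))
    h = c = n = np.zeros(4)
    for x in rng.normal(size=(10, 4)):
        h, c, n = slstm_step(x, h, c, n, params)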

The Illustrated Transformer

Jay Alammar's blog post explores the Transformer model, highlighting how its attention mechanism enables faster, more parallelizable training than recurrent approaches, and noting that it outperformed Google's Neural Machine Translation system on specific tasks. The post breaks down components such as self-attention and multi-headed attention to make them easier to understand.
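
As a concrete companion to the self-attention walkthrough, here is a minimal NumPy sketch of scaled dot-product attention, the operation at the heart of the Transformer; the random matrices W_q, W_k, and W_v stand in for learned projections.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """softmax(Q K^T / sqrt(d_k)) V -- the core of Transformer self-attention."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                      # pairwise similarity of queries and keys
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
        return weights @ V                                   # weighted sum of values

    # Toy usage: 5 tokens with 8-dimensional embeddings.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)   # shape (5, 8)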

Math Behind Transformers and LLMs

This post introduces transformers and large language models, focusing on OpenGPT-X and the transformer architecture. It explains what language models are, how they are trained, their computational demands and GPU usage, and why transformers have become the dominant architecture in NLP.
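
To give a feel for those computational demands, the sketch below applies the widely cited rule of thumb of roughly 6 FLOPs per parameter per training token; the model size, token count, and per-GPU throughput are illustrative assumptions, not figures from the post.

    def training_flops(n_params: float, n_tokens: float) -> float:
        """Rule-of-thumb estimate: ~6 FLOPs per parameter per training token."""
        return 6.0 * n_params * n_tokens

    # Illustrative example: a 7e9-parameter model trained on 1e12 tokens.
    flops = training_flops(7e9, 1e12)          # ~4.2e22 FLOPs
    gpu_flops_per_s = 300e12                   # assume ~300 TFLOP/s sustained per GPU (illustrative)
    gpu_seconds = flops / gpu_flops_per_s
    print(f"{flops:.2e} FLOPs, about {gpu_seconds / 86400:.0f} GPU-days at the assumed throughput")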
