August 15th, 2024

Symmetric Power Transformers

Symmetric Power Transformers improve linear transformer performance by embedding keys and queries in a higher-dimensional space, with a hyperparameter \(p\) controlling the state size; experiments show improved capability and compatibility with rotary embeddings.

The article discusses Symmetric Power Transformers, a new variant of linear transformers designed to improve performance while keeping computational costs manageable. Traditional linear transformers are theoretically attractive for long contexts, but their performance often degrades because their state is small relative to their weights. The authors address this by embedding keys and queries in a higher-dimensional space, which enlarges the state and improves performance. Symmetric Power Transformers introduce a hyperparameter \(p\) that controls the state size, enabling configurations that outperform standard transformers when \(p\) is set to 4 or higher. The article also shows that the architecture is compatible with rotary embeddings, which broadens its utility. To validate the architecture's learning capabilities, the authors ran experiments using an attention formulation of linear transformers, and they plan to release an efficient implementation in future work. The findings suggest that by adjusting the embedding function and state size, competitive performance is achievable with a manageable computational footprint.
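
The attention formulation mentioned above can be sketched in a few lines. The following is a minimal NumPy toy, not the authors' implementation: it replaces softmax's \(\exp(q \cdot k)\) with a power kernel \((q \cdot k)^p\), which is non-negative for even \(p\), so row normalization can stand in for the softmax denominator. The function name, shapes, and epsilon are illustrative assumptions.

```python
import numpy as np

def power_attention(Q, K, V, p=4):
    """Causal attention with a degree-p power kernel in place of softmax.

    Scores are (q . k)^p; for even p they are non-negative, so
    row-normalizing them plays the role of the softmax denominator.
    Q, K: (T, d) arrays; V: (T, d_v) array.
    """
    T = Q.shape[0]
    scores = (Q @ K.T) ** p                      # (T, T) kernel scores
    scores = scores * np.tril(np.ones((T, T)))   # causal mask
    denom = scores.sum(axis=1, keepdims=True) + 1e-9
    return (scores / denom) @ V

# Tiny usage example with random inputs.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 16)) for _ in range(3))
out = power_attention(Q, K, V, p=4)  # shape (8, 16)
```

Because the kernel acts on \(Q\) and \(K\) directly, rotary embeddings can be applied to both before computing the scores, which is presumably the compatibility the article refers to.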

- Symmetric Power Transformers improve performance of linear transformers by embedding keys and queries in higher-dimensional spaces.

- A hyperparameter \(p\) controls the state size, allowing configurations that outperform standard transformers (see the state-size sketch after this list).

- The architecture is compatible with rotary embeddings, enhancing its effectiveness.

- Experiments validate the learning capabilities of the proposed architecture.

- An efficient implementation of the chunked algorithm for these transformers is planned for future release.
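
To see concretely how \(p\) controls the state size: a linear transformer's state is the running sum of outer products \(\phi(k)\,v^\top\), so it has \(\dim\phi(k) \times d_v\) entries. Assuming, as the name suggests, that \(\phi\) is the symmetric \(p\)-th tensor power, its dimension is the number of degree-\(p\) monomials in \(d\) variables, \(\binom{d+p-1}{p}\). A back-of-the-envelope sketch follows; the per-head dimensions are assumptions for illustration, not figures from the article.

```python
from math import comb

def sym_power_dim(d: int, p: int) -> int:
    """Dimension of the symmetric p-th tensor power of R^d,
    i.e. the number of degree-p monomials in d variables."""
    return comb(d + p - 1, p)

# Hypothetical per-head sizes: key dim d = 64, value dim d_v = 64.
d, d_v = 64, 64
for p in (1, 2, 4):
    phi_dim = sym_power_dim(d, p)
    print(f"p={p}: embedded key dim = {phi_dim:,}, "
          f"state entries per head = {phi_dim * d_v:,}")
```

At \(p=4\) this already yields a state with tens of millions of entries per head, which is why exploiting the symmetric structure matters for an efficient implementation.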

4 comments
By @brrrrrm - 8 months
So, noticing that linearized models have tiny KV caches (ahem, I mean state spaces), this approach tries to increase their size along the embedding dimension. Increasing this enormously by applying a different softmax (one compatible with the expanding tensor product) yields a very symmetric mathematical structure that can be exploited to recover some efficiency.

Is that right?

By @userbinator - 8 months
Glanced at the title and clicked, expecting this to be EE-related.
By @d110af5ccf - 8 months
Formatted like a formal academic publication. No way (that I can tell) to grab a PDF. Comes across as a blog masquerading as academic literature to me. Am I wrong? Did I miss something, and is there an offline version available?

Pages served up over HTTP are ephemeral. An absolutely essential part of formal academic literature is the archival aspect: self-contained, immutable, and referenceable in an unambiguous manner.

There's also an immediate practical aspect for me: I will likely never get around to reading this, because my "reading list" consists of a pile of PDF files and I will forget this exists.

By @kazinator - 8 months
I almost clicked on this, thinking it would be an electrical engineering topic; good thing I read the domain name.