January 31st, 2025

Theoretical limitations of multi-layer Transformer

The paper presents the first unconditional lower bound for multi-layer decoder-only Transformers, revealing a depth-width trade-off, a separation between encoder and decoder capabilities, and a provable advantage of chain-of-thought reasoning.


The paper "Theoretical limitations of multi-layer Transformer" by Lijie Chen, Binghui Peng, and Hongxun Wu studies the expressive power of multi-layer decoder-only Transformers, the architecture underlying modern large language models. The authors note that, beyond the single-layer case, little is understood about what these models can and cannot compute. They present the first unconditional lower bound for multi-layer decoder-only Transformers, showing that any L-layer model needs a model dimension polynomial in n to perform the sequential composition of L functions over an input of n tokens.

This lower bound yields several consequences: a depth-width trade-off, showing that the L-step composition task is exponentially harder for L-layer models than for (L+1)-layer models; an unconditional separation between encoder and decoder capabilities, exhibiting a task solvable by a much smaller encoder that is provably hard for decoders; and a provable advantage of chain-of-thought reasoning, exhibiting a task that becomes exponentially easier when intermediate steps may be generated. On the technical side, the authors introduce a multi-party autoregressive communication model that captures the computation of decoder-only Transformers, together with a new proof technique for establishing lower bounds in it. The work aims to deepen the understanding of the computational power of Transformers.
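To make the central task concrete, here is a minimal sketch in Python of L-step sequential function composition over n tokens, with each function represented as a lookup table on {0, ..., n-1}. This encoding is an assumption for illustration and need not match the paper's exact formulation of the task.

```python
import random


def make_composition_instance(n, L, seed=0):
    """Build a toy instance of L-step sequential function composition.

    Each f_i is an arbitrary function from {0, ..., n-1} to itself,
    represented here as a lookup table. The instance is (f_1, ..., f_L, x0)
    and the target output is f_L(...f_2(f_1(x0))...).
    """
    rng = random.Random(seed)
    functions = [[rng.randrange(n) for _ in range(n)] for _ in range(L)]
    x0 = rng.randrange(n)
    return functions, x0


def compose(functions, x0):
    """Reference answer: apply the functions sequentially (L lookups)."""
    x = x0
    for table in functions:
        x = table[x]
    return x


if __name__ == "__main__":
    functions, x0 = make_composition_instance(n=8, L=3)
    print("start token:", x0)
    print("composition result:", compose(functions, x0))
```

The sequential computation itself is trivial (L table lookups); the lower bound concerns how wide an L-layer decoder-only Transformer must be to carry out a task of this flavor within a single forward pass.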

- The paper establishes the first unconditional lower bound for multi-layer decoder-only Transformers.

- It reveals a depth-width trade-off: the L-step composition task is exponentially harder for L-layer models than for (L+1)-layer models.

- The research highlights a separation between encoder and decoder capabilities.

- It demonstrates the advantages of chain-of-thought reasoning in simplifying certain tasks (a toy sketch of this intuition follows the list).

- A new communication model and proof technique are introduced to aid in understanding Transformers' computational power.
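As a rough illustration of the chain-of-thought point above (an intuition sketch, not the paper's construction), the snippet below reuses the lookup-table representation from the earlier sketch and contrasts producing the final composed value in one shot with emitting each intermediate value as a token, where every step conditions on the previous output.

```python
def answer_one_shot(functions, x0):
    """No chain-of-thought: the final value f_L(...f_1(x0)...) must be
    produced directly, folding all L compositions into one pass."""
    x = x0
    for table in functions:
        x = table[x]
    return x


def answer_with_cot(functions, x0):
    """Chain-of-thought: each intermediate value is emitted as a token,
    so each step conditions on the previous output and performs only a
    single lookup."""
    trace = [x0]
    for table in functions:
        trace.append(table[trace[-1]])
    return trace  # the last entry is the final answer
```

Both routines return the same final value; the difference is that with intermediate tokens each step is a depth-one subproblem, which matches the intuition for why chain-of-thought can make composition exponentially easier for shallow models.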

5 comments
By @thesz - 2 months

  > ...our results give: ... (3) a provable advantage of chain-of-thought, exhibiting a task that becomes exponentially easier with chain-of-thought.
It would be good to also prove that there is no task that becomes exponentially harder with chain-of-thought.
By @cubefox - 2 months
Loosely related thought: A year ago, there was a lot of talk about the Mamba SSM architecture replacing transformers. Apparently that hasn't happened so far.
By @hochstenbach - 2 months
Quanta magazine has an article that explains in plain words what the researchers were trying to do : https://www.quantamagazine.org/chatbot-software-begins-to-fa...
By @byyoung3 - 2 months
those lemmas are wild
By @cs702 - 2 months
Huh. I just skimmed this and quickly concluded that it's definitely not light reading.

It sure looks and smells like good work, so I've added it to my reading list.

Nowadays I feel like my reading list is growing faster than I can go through it.