Transformers: Attention, Architecture, Training, and Scaling

1. The Sequence Modeling Bottleneck
Before we talk about attention, it is worth naming the problem it was designed to solve. A sequence model is not merely a machine that consumes tokens in order; it is a machine that must route information between positions. If token $x_i$ matters for predicting something at position $j$, the architecture needs a reliable computational path from $i$ to $j$. The central question is: how long, fragile, and sequential is that path?
Many important tasks can be written abstractly as sequence-to-sequence mappings,
$$(x_1, \dots, x_n) \mapsto (y_1, \dots, y_m),$$
where the input and output lengths may differ. Translation maps a sentence in one language to a sentence in another. Summarization maps a long document to a shorter text. Code generation maps a prompt or partial program to a completed program. Even when the output is not explicitly a separate sequence, language modeling has the same flavor: at each position $t$, the model predicts the next token from the previous context,
$$p(x_t \mid x_1, \dots, x_{t-1}).$$
This notation hides the hard part. The conditioning set may be large, but not every previous token is equally relevant. A model predicting the verb in a sentence may need to find the true subject many tokens earlier. A model completing code may need to remember an opening bracket, variable declaration, or function signature hundreds or thousands of tokens back. A translation model may need to align a word near the end of the source sentence with a word near the beginning of the target sentence. In all cases, sequence modeling requires selective communication between positions.
Classical recurrent neural networks handle this by passing information through a hidden state:
$$h_t = f(h_{t-1}, x_t).$$
This is elegant because it respects temporal order and can in principle summarize everything seen so far. But it creates a narrow communication channel. If information from position $i$ is needed at position $j > i$, it must survive repeated transformations through $h_i, h_{i+1}, \dots, h_j$.
The number of computational steps between the two positions grows with their distance, roughly $O(j - i)$. Gradients must also travel through this same chain during training. Gating mechanisms such as LSTMs and GRUs reduce the damage, but they do not remove the fundamental bottleneck: distant tokens communicate through a long sequential path.
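To make the bottleneck concrete, here is a minimal sketch of a vanilla RNN rollout (in NumPy, with hypothetical sizes and randomly initialized weights, not any particular trained model). The only way information about the first token can influence the final state is by surviving every intermediate transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 8, 16, 100  # hypothetical sizes

# Randomly initialized vanilla RNN parameters.
W_xh = rng.normal(scale=0.1, size=(d_in, d_hidden))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b_h = np.zeros(d_hidden)

x = rng.normal(size=(seq_len, d_in))  # toy input sequence
h = np.zeros(d_hidden)

# h_t = f(h_{t-1}, x_t): every step rewrites the single hidden state,
# so information from x[0] must pass through seq_len - 1 more
# transformations before it can affect the last position.
for t in range(seq_len):
    h = np.tanh(x[t] @ W_xh + h @ W_hh + b_h)

print(h.shape)  # (16,) — one fixed-size vector summarizes all 100 tokens
```

The loop is inherently sequential: step $t$ cannot begin until step $t-1$ has finished, which is exactly the parallelism problem named above.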
Convolutional sequence models improve parallelism because all positions in a layer can be processed simultaneously. However, local convolutions have their own routing problem. A kernel of small width only mixes nearby tokens in one layer, so long-range interaction requires stacking many layers. Dilated convolutions shorten the path, but the architecture still imposes a predefined communication pattern. Whether two positions can exchange information efficiently depends on the convolutional design rather than on the content of the sequence itself.
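As a rough illustration (my own back-of-the-envelope arithmetic, not a formula from the original text), the sketch below counts how many stacked layers a convolutional model needs before two positions a given distance apart fall inside one receptive field, under a simplified model where plain layers grow the reach linearly and dilations double per layer.

```python
def layers_to_connect(distance: int, kernel: int = 3, dilated: bool = False) -> int:
    """Stacked conv layers needed before two positions `distance` apart
    can interact (a simplified proxy for path length)."""
    reach, layers, dilation = kernel // 2, 1, 1
    while reach < distance:
        dilation = dilation * 2 if dilated else 1  # dilations double per layer
        reach += dilation * (kernel // 2)
        layers += 1
    return layers

# Plain width-3 convolutions need roughly `distance` layers;
# dilated convolutions need roughly log2(distance) layers.
print(layers_to_connect(512))                # 512
print(layers_to_connect(512, dilated=True))  # 10
```

Even the logarithmic case fixes the communication pattern in advance: which positions can talk to each other is decided by the architecture, not by the sequence.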
This suggests three desiderata for a strong sequence architecture:
- Relevant conditioning: each output should be able to depend on the input or context positions that matter.
- Parallel training: computations across positions should not be forced into a strict left-to-right loop when the training targets are already known.
- Short path length: the number of computational steps between any two positions should be small, ideally constant or close to constant.
The key insight behind Transformers is to replace sequential recurrence with learned content-based communication. Instead of requiring information to move one step at a time through a hidden state chain, each position can directly ask: which other positions contain information useful for me? Attention implements this as a differentiable retrieval mechanism. Positions produce queries, keys, and values; similarity between queries and keys determines where information flows. The route is not hard-coded by distance or adjacency. It is learned from content.
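Here is a minimal sketch of this retrieval view as single-head scaled dot-product attention, the standard formulation, written in NumPy with hypothetical sizes and random weights. Each position emits a query, scores it against every key, and takes a softmax-weighted average of the values, so any two positions are a single computational step apart.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 32, 16  # hypothetical sizes

X = rng.normal(size=(seq_len, d_model))          # one embedding per position
W_q = rng.normal(scale=0.1, size=(d_model, d_k))
W_k = rng.normal(scale=0.1, size=(d_model, d_k))
W_v = rng.normal(scale=0.1, size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v              # queries, keys, values

# Content-based routing: query-key similarity decides where information
# flows, regardless of how far apart the positions are.
scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V                             # one retrieval step per row

print(weights.shape, output.shape)               # (6, 6) (6, 16)
```

Note that nothing in `scores` depends on distance or adjacency: a position can attend to its neighbor or to a token a thousand steps away at identical cost, which is what the prose above means by learned, content-based communication.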
This matters because many sequence dependencies are sparse but not local. A token may need its immediate neighbors for syntax, a faraway noun for agreement, and an even farther definition for semantic interpretation. Architectures based only on local or sequential propagation must repeatedly carry all potentially useful information forward. Attention instead allows the model to create direct edges between relevant positions, making the effective path length between $i$ and $j$ very short.
The visual below condenses this bottleneck into a single picture: an input sequence $x$, an output sequence $y$, and the central challenge of connecting distant but relevant positions. The faint recurrent chain represents the older strategy: information moves through many local transitions, producing a long path of length $O(j - i)$. The highlighted long-range arrow represents the dependency we actually care about.
The same visual also previews the Transformer solution. Rather than relying only on neighboring steps, positions exchange information through learned, content-based links. Those direct communication paths are the conceptual bridge from traditional sequence models to attention: the model still respects sequence structure, but it no longer forces all information to travel through a narrow sequential corridor.
[Figure: the sequence modeling bottleneck — a long recurrent chain of local transitions from $i$ to $j$ versus a direct, content-based attention link between the same positions.]