Transformers: Attention, Architecture, Training, and Scaling - FeynmanWiki

MACHINE LEARNING, TRANSFORMERS, ATTENTION - 84 MIN READ

Transformers: Attention, Architecture, Training, and Scaling

1. The Sequence Modeling Bottleneck

Before we talk about attention, it is worth naming the problem it was designed to solve. A sequence model is not merely a machine that consumes tokens in order; it is a machine that must route information between positions. If token $x_i$ matters for predicting something at position $j$, the architecture needs a reliable computational path from $i$ to $j$. The central question is: how long, fragile, and sequential is that path?
Many important tasks can be written abstractly as sequence-to-sequence mappings,
$$x_{1:n} \mapsto y_{1:m},$$
where the input and output lengths may differ. Translation maps a sentence in one language to a sentence in another. Summarization maps a long document to a shorter text. Code generation maps a prompt or partial program to a completed program. Even when the output is not explicitly a separate sequence, language modeling has the same flavor: at each position $t$, the model predicts the next token from the previous context,
$$p_\theta(x_t \mid x_{<t}).$$
This notation hides the hard part. The conditioning set $x_{<t}$ may be large, but not every previous token is equally relevant. A model predicting the verb in a sentence may need to find the true subject many tokens earlier. A model completing code may need to remember an opening bracket, variable declaration, or function signature hundreds or thousands of tokens back. A translation model may need to align a word near the end of the source sentence with a word near the beginning of the target sentence. In all cases, sequence modeling requires selective communication between positions.
Classical recurrent neural networks handle this by passing information through a hidden state:
$$h_t = f_\theta(h_{t-1}, x_t).$$
This is elegant because it respects temporal order and can in principle summarize everything seen so far. But it creates a narrow communication channel. If information from $x_i$ is needed at position $j$, it must survive repeated transformations through
$$h_i \rightarrow h_{i+1} \rightarrow \cdots \rightarrow h_j.$$
The number of computational steps between the two positions grows with their distance, roughly $L = |j-i|$. Gradients must also travel through this same chain during training. Gated architectures such as LSTMs and GRUs reduce the damage, but they do not remove the fundamental bottleneck: distant tokens communicate through a long sequential path.
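The recurrence above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the choice of $f_\theta$ as a single `tanh` layer, the weight names `W_h` and `W_x`, and all sizes are assumptions made only for the example. What it is meant to show is that the loop is strictly sequential and that the path from $h_i$ to $h_j$ passes through $|j-i|$ state updates.

```python
import numpy as np

def rnn_forward(x, W_h, W_x, h0):
    """Run h_t = tanh(W_h h_{t-1} + W_x x_t) over a sequence, one step at a time."""
    h = h0
    states = []
    for x_t in x:                        # strictly sequential: step t needs h_{t-1}
        h = np.tanh(W_h @ h + W_x @ x_t)
        states.append(h)
    return np.stack(states)              # shape (n, d_h)

rng = np.random.default_rng(0)
n, d_in, d_h = 12, 4, 8                  # illustrative sizes (assumed)
x = rng.normal(size=(n, d_in))
W_h = rng.normal(scale=0.5, size=(d_h, d_h))
W_x = rng.normal(scale=0.5, size=(d_h, d_in))
states = rnn_forward(x, W_h, W_x, np.zeros(d_h))

i, j = 2, 10                             # 1-indexed positions, as in the text
path_length = abs(j - i)                 # state updates between h_i and h_j
print(states.shape, path_length)
```

Nothing in the loop can be parallelized across positions, because every iteration consumes the state produced by the previous one.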
Convolutional sequence models improve parallelism because all positions in a layer can be processed simultaneously. However, local convolutions have their own routing problem. A kernel of small width only mixes nearby tokens in one layer, so long-range interaction requires stacking many layers. Dilated convolutions shorten the path, but the architecture still imposes a predefined communication pattern. Whether two positions can exchange information efficiently depends on the convolutional design rather than on the content of the sequence itself.
This suggests three desiderata for a strong sequence architecture:
Relevant conditioning: each output should be able to depend on the input or context positions that matter.
Parallel training: computations across positions should not be forced into a strict left-to-right loop when the training targets are already known.
Short path length: the number of computational steps $L$ between any two positions should be small, ideally constant or close to constant.
The key insight behind Transformers is to replace sequential recurrence with learned content-based communication. Instead of requiring information to move one step at a time through a hidden state chain, each position can directly ask: which other positions contain information useful for me? Attention implements this as a differentiable retrieval mechanism. Positions produce queries, keys, and values; similarity between queries and keys determines where information flows. The route is not hard-coded by distance or adjacency. It is learned from content.
This matters because many sequence dependencies are sparse but not local. A token may need its immediate neighbors for syntax, a faraway noun for agreement, and an even farther definition for semantic interpretation. Architectures based only on local or sequential propagation must repeatedly carry all potentially useful information forward. Attention instead allows the model to create direct edges between relevant positions, making the effective path length between $i$ and $j$ very short.
The visual below condenses this bottleneck into a single picture: an input sequence $x_{1:n}$, an output sequence $y_{1:m}$, and the central challenge of connecting distant but relevant positions. The faint recurrent chain represents the older strategy: information moves through many local transitions, producing a long path $L$. The highlighted long-range arrow represents the dependency we actually care about.
The same visual also previews the Transformer solution. Rather than relying only on neighboring steps, positions exchange information through learned, content-based links. Those direct communication paths are the conceptual bridge from traditional sequence models to attention: the model still respects sequence structure, but it no longer forces all information to travel through a narrow sequential corridor.

2. Failure Case: A Long-Range Agreement Trap

The bottleneck is easiest to dismiss when we talk about it abstractly: “long-range dependency” sounds like a rare linguistic edge case. But the problem appears in one of the most ordinary tasks a language model performs: choosing the next word. Even a short sentence prefix can force the model to decide whether to trust nearby evidence or route information from a more distant but grammatically relevant token.
Consider the prefix
$$x_{<t}=\text{``The keys to the cabinet near the door''}.$$
The next word should be “are”, not “is”:
$$\text{``The keys to the cabinet near the door are ...''}$$
The grammatical subject is keys, which is plural. But by the time the model reaches the prediction position, the most recent nouns are cabinet and door, both singular. A model that overweights local context may be tempted by the nearby phrase “near the door” and predict a singular verb:
$$\text{``The keys to the cabinet near the door is ...''}$$
This is the long-range agreement trap. The correct prediction depends not on the closest noun, but on the noun that structurally controls the verb. In the prefix,
$$x_2=\text{``keys''}$$
is the controller, while
$$x_5=\text{``cabinet''}, \qquad x_8=\text{``door''}$$
are distractors. The model must learn a conditional preference of the form
$$p_\theta(x_t=\text{``are''}\mid x_{<t}) > p_\theta(x_t=\text{``is''}\mid x_{<t}).$$
The important point is that this inequality is not merely about memorizing that “keys” often goes with “are.” It requires the model to identify which earlier token is relevant for this prediction. The phrase contains multiple nouns, and the nearest ones are misleading. Sequential distance and grammatical relevance have come apart.
This exposes a weakness of purely local prediction. If the model primarily summarizes recent tokens, then the final phrase “near the door” dominates the representation near the prediction point. But the word door should not control the verb. It is embedded inside a prepositional phrase modifying cabinet, which itself is embedded inside another prepositional phrase modifying keys. The relevant dependency skips over these intervening tokens.
A good sequence model therefore needs a mechanism for content-based routing. Instead of asking, “Which token is closest to the current position?”, it should ask something more like:
Which previous token is syntactically or semantically relevant?
Which token supplies the feature needed for this prediction?
Which apparent cues are merely distractors because of their local proximity?
For subject–verb agreement, the feature being routed is number: plural versus singular. In other examples, it might be entity identity, coreference, topic, tense, quotation state, or a variable binding. The general pattern is the same: the model must retrieve information by relevance, not by position alone.
This is one of the motivations for attention. Attention will eventually give us a differentiable way to compare the current prediction context against earlier token representations and assign high weight to the tokens whose content matters. The long-range token does not need to be compressed through every intermediate step with equal fidelity; it can be selected directly when it becomes useful.
The visual below condenses this failure case into a single next-token decision. The prefix is laid out as token boxes, with the blank prediction position at the end. The plural noun keys is far away but is the true controller of the verb, while cabinet and door are closer distractors. The central mistake to avoid is treating proximity as a proxy for relevance.
The probability inequality on the right states the learning target: the model should assign higher probability to “are” than to “is” given the whole prefix. That small inequality captures the larger architectural lesson: long-range dependencies require mechanisms that can route information by content relevance, not merely by sequential neighborhood.
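The trap can be made concrete with a deliberately tiny sketch. Everything here is hand-labelled for illustration: the noun list, the number features, and the assumption that the controller's position is given are all hypothetical stand-ins, not a parser or a language model. The point is only that a rule based on proximity and a rule based on the controlling noun disagree on this prefix.

```python
# Toy illustration only: hand-labelled nouns and number features, not a real parser.
prefix = ["The", "keys", "to", "the", "cabinet", "near", "the", "door"]
noun_number = {"keys": "plural", "cabinet": "singular", "door": "singular"}
verb_for = {"plural": "are", "singular": "is"}

def predict_by_proximity(tokens):
    """Agree with the closest preceding noun (the purely local heuristic)."""
    for tok in reversed(tokens):
        if tok in noun_number:
            return verb_for[noun_number[tok]]
    return None

def predict_by_controller(tokens, controller_index):
    """Agree with the structurally relevant noun, given its 1-indexed position."""
    return verb_for[noun_number[tokens[controller_index - 1]]]

print(predict_by_proximity(prefix))       # follows "door"
print(predict_by_controller(prefix, 2))   # follows "keys" (x_2 in the text)
```

A real model is not handed the controller index; learning to find it from content is exactly the routing problem described above.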

3. Why RNNs and CNNs Struggle

The agreement trap from the previous section is not just a quirky linguistic example; it exposes a more general systems problem. A model must move information from some earlier position $i$, where the relevant subject, entity, or condition appears, to a later position $j$, where that information is needed to make a prediction. The intervening tokens may be syntactically plausible distractors, but semantically irrelevant. The question is: how many computational steps must the signal traverse before position $j$ can use what position $i$ knew?
For a recurrent model, the route is built into the architecture. Information at $x_i$ is first absorbed into a hidden state $H_i$ (the $h_i$ of Section 1), then passed forward one position at a time:
$$x_i \rightarrow H_i \rightarrow H_{i+1}\rightarrow \cdots \rightarrow H_j, \qquad L=\mathcal{O}(|j-i|).$$
This gives RNNs a useful inductive bias: nearby temporal continuity is natural, and the hidden state acts like a running summary. But the same design becomes a bottleneck for long-range dependencies. If $i$ and $j$ are far apart, then the representation of $x_i$ must survive many state updates, each of which may overwrite, compress, or distort it. Even if the model has gates, such as in an LSTM or GRU, the route is still sequential. The architecture can learn to preserve information, but it cannot avoid the fact that the information must pass through many intermediate states.
The training signal suffers from the same geometry. When a loss at position $j$ assigns credit to something that happened near $i$, the relevant derivative contains a long product of Jacobians:
$$\frac{\partial H_j}{\partial H_i} = \prod_{t=i+1}^{j} \frac{\partial H_t}{\partial H_{t-1}}.$$
This product is the mathematical heart of the vanishing and exploding gradient problem. If the typical singular values of these Jacobians are smaller than one, the gradient decays exponentially with distance; if they are larger than one, it can blow up. Gating, normalization, careful initialization, and gradient clipping can make this more manageable, but they do not remove the long credit-assignment path. The model is still being asked to propagate both information and gradients through $\mathcal{O}(|j-i|)$ sequential transformations.
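The exponential behavior of the Jacobian product can be seen numerically. The sketch below makes a strong simplifying assumption: each Jacobian is a random orthogonal matrix multiplied by a constant `scale`, so all of its singular values equal `scale` exactly. Real recurrent Jacobians are not this uniform; the function name and sizes are illustrative.

```python
import numpy as np

def gradient_norm_through_chain(scale, steps, dim=16, seed=0):
    """Spectral norm of a product of `steps` Jacobians, each `scale` times an orthogonal matrix."""
    rng = np.random.default_rng(seed)
    product = np.eye(dim)
    for _ in range(steps):
        q, _r = np.linalg.qr(rng.normal(size=(dim, dim)))  # q is a random orthogonal matrix
        product = (scale * q) @ product                     # idealized Jacobian: singular values == scale
    return np.linalg.norm(product, ord=2)

for scale in (0.9, 1.0, 1.1):
    norms = [gradient_norm_through_chain(scale, steps) for steps in (1, 10, 50)]
    print(scale, norms)
```

Under this idealization the norm is `scale ** steps`, so a factor only slightly below one shrinks the signal by orders of magnitude over 50 steps, and a factor slightly above one inflates it just as quickly.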
Convolutional sequence models attack the problem differently. Instead of carrying a hidden state from left to right, they update all positions in parallel using local windows. This is excellent for parallel training: every position in a layer can be computed at the same time. But locality introduces a different limitation. A token can only influence positions within the receptive field of the convolution, and that receptive field grows layer by layer. With a kernel of fixed width, distant positions require many layers before they can interact.
So CNNs trade the RNN’s sequential time bottleneck for a depth bottleneck. A sufficiently deep convolutional network can connect distant positions, and dilated convolutions can expand the receptive field faster, but the architecture still imposes a structured route through intermediate neighborhoods. Distant communication is possible only after repeated mixing. In practice, this means either many layers, large kernels, carefully chosen dilation schedules, or some combination of these. The model’s ability to relate $x_i$ and $x_j$ is mediated by architectural distance.
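The depth bottleneck can be quantified with a short calculation. The sketch assumes causal (left-aligned) 1-D convolutions with stride 1, where a layer with kernel width $k$ and dilation $d$ extends the receptive field by $(k-1)d$ positions, and it assumes a doubling dilation schedule (1, 2, 4, ...) for the dilated case. The function names are hypothetical.

```python
def receptive_field(kernel_width, dilations):
    """Receptive field, in positions, of stacked stride-1 convolutions with the given dilations."""
    field = 1
    for d in dilations:
        field += (kernel_width - 1) * d
    return field

def layers_needed(distance, kernel_width, dilated):
    """Smallest depth whose receptive field covers two positions `distance` apart."""
    dilations = []
    while receptive_field(kernel_width, dilations) < distance + 1:
        next_dilation = 2 ** len(dilations) if dilated else 1   # assumed doubling schedule
        dilations.append(next_dilation)
    return len(dilations)

for distance in (8, 64, 512):
    plain = layers_needed(distance, kernel_width=3, dilated=False)
    dil = layers_needed(distance, kernel_width=3, dilated=True)
    print(distance, plain, dil)
```

With kernel width 3, the plain stack needs depth proportional to the distance, while the doubling schedule needs depth that grows only logarithmically. Either way the route is fixed by the architecture rather than chosen from content.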
This is the motivation for attention as a new primitive. Instead of forcing information to travel through every intermediate hidden state, or to diffuse through local convolutional neighborhoods, we would like position $j$ to directly retrieve information from position $i$ when $i$ is relevant. In the idealized path-length sense, attention offers:
$$L = \mathcal{O}(1).$$
That statement does not mean attention is computationally free. Full self-attention over a sequence of length $n$ compares all $n^2$ pairs of positions, so its cost grows quadratically with sequence length. Rather, the point is about communication distance: once the attention scores are computed, any position can incorporate information from any other position in a single layer. The burden shifts from “can the information survive the route?” to “can the model learn which positions matter?”
This shift is subtle but crucial. RNNs and CNNs bake in a strong notion of locality: information moves through adjacent time steps or nearby windows. Attention weakens that assumption and replaces it with content-based routing. If a verb at position $j$ needs the subject at position $i$, the model can learn to assign high weight to that subject directly, even across many distractors. The resulting architecture is often easier to optimize for long-range dependencies because the forward information path and the backward credit-assignment path are both shorter.
The comparison can be summarized as follows:
RNNs have long sequential paths and long gradient products.
CNNs parallelize well, but distant interactions require depth or dilation.
Attention permits direct pairwise communication, making long-range dependency modeling a question of learned retrieval rather than repeated propagation.
The visual below condenses this argument into a side-by-side comparison. The recurrent row emphasizes the chain from $x_i$ to $H_j$, while the gradient row highlights why that chain is also an optimization problem. The convolutional row captures receptive-field growth through stacked local windows. The attention row contrasts these with a direct connection from $i$ to $j$, foregrounding the key design goal: reduce the path length for relevant interactions to $\mathcal{O}(1)$.
Read the table not as saying that attention is universally cheaper or always better, but as isolating the architectural reason Transformers became compelling. They replace forced sequential or local communication with a differentiable mechanism for deciding who should talk to whom. That mechanism—attention as learned lookup—is the next object we will derive.

4. Desired Primitive: Differentiable Lookup

The previous discussion identified a common bottleneck behind both recurrence and convolution: routing information. If a token near the end of a sequence needs evidence from a token near the beginning, an RNN must carry that evidence through many recurrent steps, while a CNN must propagate it through many local receptive fields unless the network is made very deep or uses large kernels. The issue is not merely that long paths are inconvenient; long paths make learning fragile. Gradients, intermediate states, and local transformations all become part of the communication channel.
A more direct primitive would let each position ask: which other positions contain information useful for me right now? Instead of forcing information to move step by step through a fixed computational graph, we want every position to be able to retrieve relevant content from anywhere in the sequence in one operation. This is the motivation behind attention.
The simplest abstraction is a content-based lookup. Imagine that each position $j$ in a sequence stores some information in a vector $v_j$, called a value vector. Now suppose position $i$ wants to update its representation. Rather than reading only its neighbor or the previous hidden state, position $i$ forms a weighted combination of all available values:
$$\text{retrieved information for position } i = \sum_{j=1}^{n} a_{ij}v_j.$$
The coefficients $a_{ij}$ are the attention weights. They say how much position $i$ retrieves from position $j$. To make this retrieval behave like a soft selection, the weights are constrained to be nonnegative and sum to one:
$$a_{ij}\ge 0, \qquad \sum_{j=1}^{n} a_{ij}=1.$$
These constraints make the retrieved vector a convex combination of the value vectors. Intuitively, position $i$ is not copying a single source exactly; it is averaging information from multiple sources, with more relevant positions receiving larger weights. This convex-mixture view is important because it gives attention a stable numerical interpretation: the output stays inside the convex hull of the value vectors, so its norm can never exceed that of the largest value being mixed.
A hard lookup would choose one position $j$ and return $v_j$. That is useful in classical data structures, but it is awkward for gradient-based learning: a discrete choice is not smoothly differentiable with respect to the scores that produced it. Attention replaces that hard decision with a soft lookup. If the model is uncertain between two relevant tokens, it can assign weight to both. If training later reveals that one source was more useful, gradients can continuously shift probability mass toward it.
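The difference between the two kinds of lookup fits in a few lines. In this sketch the weights are hand-picked rather than learned, and the function names `soft_lookup` and `hard_lookup` are illustrative; how the weights are actually computed is a later topic.

```python
import numpy as np

def soft_lookup(weights, values):
    """Return sum_j a_ij v_j for one query position i (a convex combination of the rows)."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to one")
    return weights @ values

def hard_lookup(scores, values):
    """Return the single highest-scoring value; the argmax is not differentiable in the scores."""
    return values[int(np.argmax(scores))]

values = np.array([[1.0, 0.0],      # v_1
                   [0.0, 1.0],      # v_2
                   [1.0, 1.0]])     # v_3
a_i = [0.7, 0.2, 0.1]               # hand-picked attention weights for position i

print(soft_lookup(a_i, values))     # a blend of all three values
print(hard_lookup(a_i, values))     # exactly v_1
```

Nudging `a_i` slightly changes the soft result slightly, which is what lets gradients move weight between sources; the hard result stays fixed until the argmax flips.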
This also explains why attention is more than a memory trick. The weights should not be fixed by distance or position alone. They should depend on content compatibility: what the current position needs and what each candidate source offers. For example, in a translation model, a decoder position producing a verb may need to retrieve the subject from far away. In a language model, a pronoun may need to retrieve a compatible antecedent. The useful source is determined by meaning and context, not simply by being nearby.
There are a few subtle assumptions hidden in this primitive. First, the value vectors $v_j$ must contain information worth retrieving; attention can route information, but it cannot recover content that was never represented. Second, the weights $a_{ij}$ must be produced by a learnable scoring mechanism that can compare position $i$'s needs with position $j$'s contents. Third, because the retrieved vector is a mixture, attention can sometimes blur information when many incompatible sources receive nontrivial weight. Later, scaled dot-product attention will address how to compute these weights effectively and how to keep the scoring distribution numerically well behaved.
The key conceptual shift is therefore:
RNN/CNN view: information moves through a sequence by fixed local computation paths.
Attention view: each position directly retrieves a weighted mixture from all positions.
Learning problem: choose the weights $a_{ij}$ from content, so useful tokens communicate strongly even when distant.
The visual below condenses this idea into a single routing picture. The sequence positions each hold a value vector $v_j$. For a target position $i$, arrows from all positions represent possible communication channels, and their thickness represents the learned attention weights $a_{ij}$. The output at position $i$ is not one copied vector, but the soft mixture $\sum_j a_{ij}v_j$.
This compact diagram is the bridge from motivation to mechanism. We have not yet specified how the weights are computed—that will require embeddings, queries, keys, values, and scaling—but we have specified the primitive we want: differentiable, content-based retrieval over a sequence.

5. Tokens Become Vectors: Embeddings and Positions

Now that we have framed attention as a kind of differentiable lookup, we need to answer a deceptively basic question: what exactly are we looking up with? A Transformer cannot operate directly on raw symbols like "cat", "sat", or ".". Those symbols are discrete vocabulary items; they have no geometry, no dot products, no notion of similarity that a neural network can manipulate smoothly. Before attention can compare one token to another, every token must be represented as a vector in a shared continuous space.
The first ingredient is the token embedding table. If the vocabulary is $\mathcal{V}$ and the model width is $d_{\mathrm{model}}$, we learn a matrix
$$E\in\mathbb{R}^{|\mathcal{V}|\times d_{\mathrm{model}}}.$$
Each row of $E$ is the learned vector representation of one vocabulary item. For a token $x_i$, the embedding lookup returns
$$e(x_i)\in\mathbb{R}^{d_{\mathrm{model}}}.$$
This is often described as a “lookup,” but it is still part of the differentiable model: the selected embedding vector participates in the forward computation, receives gradients during backpropagation, and is updated during training. Over time, the model learns an embedding geometry in which useful distinctions for prediction become linearly accessible to later layers.
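A minimal sketch of the lookup, assuming a hypothetical five-item vocabulary and a randomly initialized table (a trained model would have learned these rows). It also shows the equivalent one-hot matrix product, which makes explicit why the operation is differentiable in $E$.

```python
import numpy as np

vocab = ["<pad>", "the", "cat", "sat", "."]        # hypothetical toy vocabulary
token_to_id = {tok: idx for idx, tok in enumerate(vocab)}
d_model = 4                                        # illustrative model width

rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d_model))         # embedding table: one row per vocabulary item

ids = np.array([token_to_id[t] for t in ["the", "cat", "sat", "."]])
embedded = E[ids]                                  # row lookup, shape (n, d_model)

# The same lookup written as a one-hot matrix product. The output is linear
# in E, and only the selected rows of E influence it.
one_hot = np.eye(len(vocab))[ids]
print(embedded.shape, np.allclose(embedded, one_hot @ E))
```

Frameworks implement the indexing form because it avoids materializing the one-hot matrix, but the gradient that reaches `E` is the same in both views.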
However, token identity alone is not enough. The sequence
$$\text{``dog bites man''}$$
does not mean the same thing as
$$\text{``man bites dog''}.$$
If we gave self-attention only the multiset of token embeddings $\{e(x_1),\ldots,e(x_n)\}$, then the model would know which tokens appeared, but not where they appeared. This is especially important because vanilla self-attention, unlike recurrence or convolution, does not inherently process tokens left-to-right or through local neighborhoods. Its core operation compares rows of a matrix to other rows of the same matrix. Without additional positional information, that operation is naturally permutation-equivariant: reorder the input rows, and the corresponding outputs reorder in the same way.
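Permutation equivariance is easy to check numerically. The sketch below uses a stand-in for self-attention with no learned projections: weights come from a softmax over raw row-to-row dot products. That simplification is an assumption made for the demonstration; the function names are illustrative.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def toy_self_attention(X):
    """Each row retrieves a weighted mixture of all rows; weights come from row-row dot products."""
    return softmax_rows(X @ X.T) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))            # 5 token vectors with no position information added
perm = np.array([3, 0, 4, 1, 2])       # an arbitrary reordering of the rows

out_then_permute = toy_self_attention(X)[perm]
permute_then_out = toy_self_attention(X[perm])
print(np.allclose(out_then_permute, permute_then_out))
```

Shuffling the inputs merely shuffles the outputs, so nothing in the computation can tell "dog bites man" from "man bites dog" unless order is injected into the rows themselves.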
So Transformers add a second vector to each token representation: a position vector. For a maximum sequence length $n$, we can represent these position vectors as rows of a matrix
$$P\in\mathbb{R}^{n\times d_{\mathrm{model}}}, \qquad p_i\in\mathbb{R}^{d_{\mathrm{model}}}.$$
The vector $p_i$ tells the model that this row corresponds to position $i$. In the original Transformer, these were sinusoidal encodings; in many modern models, they are learned or replaced with relative/rotary variants. But the conceptual role is the same: position information must enter the representation somehow, because attention by itself only sees a set of content vectors.
The simplest absolute-position construction is additive. For each token $x_i$, we combine “what token is here” with “where it is”:
$$x_i \quad\longmapsto\quad e(x_i)+p_i.$$
Stacking these row vectors gives the Transformer’s input matrix
$$X=
\begin{bmatrix}
e(x_1)+p_1\\
e(x_2)+p_2\\
\vdots\\
e(x_n)+p_n
\end{bmatrix}
\in\mathbb{R}^{n\times d_{\mathrm{model}}}.$$
This matrix $X$ is the object that will be projected into queries, keys, and values in the next step. Each row is one token position, and each row lives in the same $d_{\mathrm{model}}$-dimensional space.
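The construction of $X$ can be sketched as follows. The position matrix uses the sinusoidal scheme from the original Transformer (sine on even coordinates, cosine on odd ones, with base 10000); the embedding table is random, and the vocabulary size, width, and function names are assumptions for the example. Positions are 0-indexed in the code, so row `P[0]` plays the role of $p_1$.

```python
import numpy as np

def sinusoidal_positions(n, d_model):
    """Sinusoidal position matrix P of shape (n, d_model); d_model is assumed even."""
    positions = np.arange(n)[:, None]                                   # (n, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))       # (d_model / 2,)
    P = np.zeros((n, d_model))
    P[:, 0::2] = np.sin(positions * freqs)
    P[:, 1::2] = np.cos(positions * freqs)
    return P

def build_input(token_ids, E, P):
    """Stack the rows e(x_i) + p_i into the input matrix X."""
    token_ids = np.asarray(token_ids)
    return E[token_ids] + P[: len(token_ids)]

rng = np.random.default_rng(0)
vocab_size, d_model, max_len = 10, 8, 16       # illustrative sizes
E = rng.normal(size=(vocab_size, d_model))     # stand-in for a learned embedding table
P = sinusoidal_positions(max_len, d_model)

X = build_input([3, 7, 3], E, P)               # token id 3 occurs at the first and third positions
print(X.shape)
print(np.allclose(X[0], X[2]))                 # same token, different rows of P
```

The two occurrences of token 3 share an embedding but receive different position rows, so their initial vectors differ even though the symbol is identical.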
The addition $e(x_i)+p_i$ is worth pausing over. We are not concatenating token and position into a larger vector; we are superimposing them in the same model dimension. That means the model must learn to use the shared coordinates to encode both lexical/content information and positional information. This works because the subsequent linear projections can learn directions that respond to content, position, or mixtures of the two. But it also encodes an assumption: the model width $d_{\mathrm{model}}$ must be large enough to carry all the information the network needs.
A useful way to think about a row of $X$ is:
Embedding component: “what symbol or subword is present?”
Position component: “where in the sequence does it occur?”
Combined representation: “what is at this location?”
That last phrase is crucial. Attention will not retrieve information from isolated word types; it will retrieve from contextualizable slots in a sequence. The same token can appear in two positions and begin with the same embedding, but after adding different $p_i$'s, the initial vectors are no longer identical. This allows later attention layers to distinguish, for example, the first occurrence of a word from the second.
There is also an important failure mode hiding here. If we omitted $p_i$, then a self-attention layer with shared projections would not know whether a token came first, last, or somewhere in the middle. It could still compare content, but it would lack sequence order. For tasks where order matters (and almost all language tasks do), that is a severe limitation. Positional information is therefore not a decorative add-on; it is what turns a bag of token vectors into a sequence representation.
The visual below compactly summarizes this construction as a pipeline: raw tokens are mapped through the embedding table $E$, position rows are supplied from $P$, and corresponding vectors are added row by row to form $X\in\mathbb{R}^{n\times d_{\mathrm{model}}}$. The key idea to look for is that the Transformer input is not “just embeddings,” but embeddings plus positions.
It also foreshadows the next step. Once $X$ has been assembled, attention can create queries, keys, and values by learned linear projections. In other words, the differentiable retrieval mechanism we want is built on top of this matrix: each row of $X$ is now a content-and-position-aware record that attention can compare, weight, and combine.
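To make the construction of $X$ concrete, here is a minimal NumPy sketch. The table sizes, the random initialization, and the helper name `embed` are illustrative assumptions, not values taken from any particular model, and positions are 0-indexed in code whereas the text counts from 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the text).
vocab_size, max_len, d_model = 10, 8, 4
E = rng.normal(size=(vocab_size, d_model))  # token embedding table
P = rng.normal(size=(max_len, d_model))     # position table, one row per position


def embed(token_ids):
    """Return X whose i-th row is e(x_i) + p_i."""
    token_ids = np.asarray(token_ids)
    n = token_ids.shape[0]
    return E[token_ids] + P[:n]  # row-by-row addition, not concatenation


token_ids = [3, 7, 3]  # token 3 occurs at positions 0 and 2
X = embed(token_ids)

# Same token, same embedding row, but different input rows once positions are added.
same_embedding = np.allclose(E[token_ids[0]], E[token_ids[2]])
same_input_row = np.allclose(X[0], X[2])
print(X.shape, same_embedding, same_input_row)
```

The two occurrences of token 3 share an embedding but end up as different rows of `X`, which is exactly what lets later layers tell them apart.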

6. Attention as Weighted Retrieval

Once tokens have been turned into vectors, we have a useful representation at every sequence position—but each position is still mostly carrying local information: “what token am I?” and “where am I?” The next problem is how one position can use information stored at other positions. If the word “it” appears late in a sentence, its representation may need to borrow meaning from a noun many tokens earlier. If a code variable is used after several lines, its current representation should be able to retrieve the earlier definition.
The key idea of attention is to make this retrieval differentiable. Instead of choosing exactly one previous token or one memory slot, a position forms a weighted average over many stored vectors. For a particular query position $i$, suppose every position $j \in \{1,\dots,n\}$ stores some vector $v_j$, called a value. The output representation at position $i$ is
$$z_i = \sum_{j=1}^{n} a_{ij} v_j .$$
This is the central retrieval equation. The output $z_i$ is not copied from a single location; it is blended from all available values. The coefficient $a_{ij}$ tells us how much position $i$ retrieves from position $j$.
For this weighted average to behave like retrieval, the weights should be nonnegative and should sum to one:
$$a_{ij} \ge 0, \qquad \sum_{j=1}^{n} a_{ij}=1.$$
So $z_i$ lies in the convex hull of the value vectors. Intuitively, attention says: “construct the new representation by mixing stored content, with mixture proportions determined by relevance.” This is a softer and more trainable version of lookup in a dictionary or memory table.
The weights themselves come from compatibility scores $s_{ij}$, where $s_{ij}$ measures how relevant position $j$ is to position $i$. We convert these arbitrary real-valued scores into normalized weights using a softmax over the retrievable positions:
$$a_{ij} = \frac{\exp(s_{ij})}{\sum_{u=1}^{n}\exp(s_{iu})}.$$
The denominator is important: for a fixed retrieving position $i$, the model compares all candidate positions $j$ against one another. Attention is therefore relative: a token receives high weight not merely because its score is large in isolation, but because its score is large compared with the alternatives in the same row.
This row-wise normalization has several consequences. First, attention weights are easy to interpret as a distribution over source positions. Second, the operation is differentiable end-to-end, so learning can adjust both the scoring mechanism and the stored value vectors. Third, the softmax introduces competition: increasing $a_{ij}$ tends to decrease the mass assigned elsewhere. That competition is part of what makes attention behave like selective retrieval rather than an unstructured sum.
There is also a subtle but crucial separation here: scoring and content are conceptually different. The score $s_{ij}$ determines where to look; the value $v_j$ determines what information is retrieved once we look there. This distinction will become central when we introduce queries, keys, and values. For now, it is enough to notice that the vector used to decide relevance need not be identical to the vector whose information is ultimately copied into the output.
Stacking the outputs for all positions gives the matrix form. Let $V \in \mathbb{R}^{n \times d_v}$ contain the value vectors row by row, and let $A \in \mathbb{R}^{n \times n}$ contain the attention weights, with row $i$ holding the distribution $(a_{i1},\dots,a_{in})$. Then
$$Z = A V.$$
This equation is compact but powerful. Each row of $Z$ is a weighted average of the rows of $V$. The matrix $A$ is not just a generic linear map; it is row-stochastic, meaning each row is a probability distribution. Attention is therefore a structured, data-dependent linear combination of stored representations.
The visual below condenses this idea into two complementary views. On the algebraic side, it emphasizes the chain from scores $s_{ij}$, to softmax-normalized weights $a_{ij}$, to retrieved output $z_i$. On the retrieval side, one position $i$ sends different-strength connections to several stored values $v_j$, and those weighted contributions merge into the output vector.
The most important takeaway is that attention is not yet “magic Transformer machinery.” At this stage, it is simply weighted differentiable retrieval. The model computes relevance scores, normalizes them into a distribution, and uses that distribution to average content vectors. The next step is to specify how the scores $s_{ij}$ are produced, and that is where queries, keys, and values enter.
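The chain from scores to weights to outputs can be sketched in a few lines of NumPy. The scores here are random placeholders, since this section has not yet said how they are produced; the helper name `softmax_rows` and the sizes are assumptions made for illustration.

```python
import numpy as np


def softmax_rows(S):
    """Row-wise softmax. Subtracting the row max leaves the result unchanged
    but avoids overflow in exp."""
    S = S - S.max(axis=-1, keepdims=True)
    expS = np.exp(S)
    return expS / expS.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
n, d_v = 4, 3
S = rng.normal(size=(n, n))    # placeholder scores s_ij
V = rng.normal(size=(n, d_v))  # value vectors v_j, one per row

A = softmax_rows(S)  # each row is a distribution over source positions
Z = A @ V            # each row of Z is a weighted average of rows of V

# The matrix product agrees with the explicit sum over j of a_ij * v_j.
z0 = sum(A[0, j] * V[j] for j in range(n))
print(np.allclose(Z[0], z0), A.sum(axis=1))
```

Because every row of `A` is nonnegative and sums to one, every coordinate of every output row stays between the smallest and largest corresponding coordinate among the values.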

7. Queries, Keys, and Values

The weighted-retrieval view gives us a useful abstraction: once we have an attention matrix $A$, producing outputs is just
$$Z = AV,$$
a weighted average of “retrievable” vectors. But this leaves an important question unresolved: where do the weights in $A$ come from, and what exactly are we averaging? If the same representation of a token is used both to decide whether it is relevant and to supply the content returned, the model is forced to entangle two different roles. Transformers avoid this by separating matching from retrieval.
The key move is to project each input position into three learned spaces:
$$Q=XW_Q,\qquad K=XW_K,\qquad V=XW_V.$$
Here $X\in\mathbb{R}^{n\times d_{\mathrm{model}}}$ is the sequence of contextual token representations: $n$ positions, each represented by a $d_{\mathrm{model}}$-dimensional vector. The learned projection matrices have shapes
$$W_Q,W_K\in\mathbb{R}^{d_{\mathrm{model}}\times d_k}, \qquad W_V\in\mathbb{R}^{d_{\mathrm{model}}\times d_v},$$
so the resulting matrices are
$$Q,K\in\mathbb{R}^{n\times d_k}, \qquad V\in\mathbb{R}^{n\times d_v}.$$
Conceptually, each row of these matrices plays a different role. The row $q_i$ of $Q$ is the query for position $i$: it represents what that position is looking for. The row $k_j$ of $K$ is the key for position $j$: it represents how position $j$ advertises itself for matching. The row $v_j$ of $V$ is the value for position $j$: it is the information returned if some position (possibly $j$ itself) decides to attend to $j$.
This distinction is subtle but central. A word may need to be matched according to one set of features, while returning a different set of features once selected. For example, a pronoun might look for a compatible noun phrase using syntactic and semantic cues encoded in queries and keys, but the information retrieved from the noun phrase might include number, gender, entity identity, or broader contextual meaning encoded in the value vector. The model should not have to use the same coordinates for all of these purposes.
In self-attention, all three matrices are projected from the same input $X$. That is why the mechanism is “self”: every position can attend to other positions in the same sequence. But the projections are different, so “same source” does not mean “same representation.” The model learns three views of each token:
Query view: what this position needs.
Key view: how this position can be found.
Value view: what this position contributes if selected.
This separation also explains why attention is more flexible than a fixed similarity computation on raw embeddings. If we compared rows of $X$ directly, the notion of relevance would be tied to whatever features happen to be present in the model representation. By learning $W_Q$ and $W_K$, the Transformer learns a task-specific compatibility space. By learning $W_V$, it separately learns what content should flow forward after compatibility has been determined.
There are also important dimensional choices hidden in these equations. The query and key dimensions must match, because they will be compared to produce attention scores; hence both live in $\mathbb{R}^{d_k}$. The value dimension $d_v$, however, need not equal $d_k$, because values are not used for matching in the same way. They are aggregated after the attention weights have already been computed. In practice, architectures often choose convenient equal dimensions, especially inside multi-head attention, but the mathematical roles remain distinct.
A useful way to think about the full computation is:
$$\text{queries and keys determine } A, \qquad \text{values determine what } A \text{ averages.}$$
So $Q$ and $K$ answer the question, “Which positions should interact, and how strongly?” while $V$ answers, “What information should be passed along once that interaction is chosen?” This is the bridge from abstract weighted retrieval to the concrete attention mechanism used in Transformers.
The visual below compactly summarizes this decomposition. A single input matrix $X$ branches into three learned projections: $Q$, $K$, and $V$. The parallel arrows emphasize that these are not three different input sequences, but three learned views of the same sequence. Highlighting individual rows $q_i$, $k_j$, and $v_j$ reinforces the position-wise interpretation: one position asks, another position matches, and the matched position returns content.
The faded continuation toward $Z=AV$ is also important. It reminds us that values are still the objects being averaged, as in weighted retrieval, but the weights $A$ will now be produced by comparing queries and keys. This separation is the conceptual step that makes scaled dot-product attention possible: first learn what to match, then learn what to return.
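A short shape check makes the dimensional bookkeeping concrete. This is only a sketch: the sizes are arbitrary assumptions, and the projection matrices are random stand-ins for learned parameters. The value width is deliberately chosen different from the query/key width to show that the two need not agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative sizes; d_v differs from d_k on purpose.
n, d_model, d_k, d_v = 5, 8, 4, 6

X = rng.normal(size=(n, d_model))      # one row per position
W_Q = rng.normal(size=(d_model, d_k))  # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

# Three learned views of the same input sequence.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

print(Q.shape, K.shape, V.shape)  # (5, 4) (5, 4) (5, 6)
print((Q @ K.T).shape)            # (5, 5): one score per (query, key) pair
```

Queries and keys must share the width `d_k` so that `Q @ K.T` is defined, while the width of `V` only determines the width of the output.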

8. Dot-Product Compatibility

With queries, keys, and values in place, the next question is: how should a query decide which keys are relevant? We have already separated “what I am looking for” ($q_i$) from “what each position offers as an address” ($k_j$) and “what content I can retrieve” ($v_j$). Attention now needs a compatibility function that turns each query-key pair into a scalar score.
The simplest and most important choice is the dot product:
$$s_{ij}=q_i^\top k_j.$$
This score is large when $q_i$ and $k_j$ point in compatible directions. Geometrically, the dot product rewards alignment: if two vectors point in similar directions, their inner product is positive and grows with their norms; if they are orthogonal, it is zero; if they point in opposing directions, it is negative. In attention, this means position $i$ assigns a high raw score to position $j$ when the learned query at $i$ matches the learned key at $j$.
There is a subtle but important assumption here: the learned projections that produce $q_i$ and $k_j$ are free to shape the space in which “matching” happens. We are not comparing raw token embeddings directly. Instead, the model learns a coordinate system where certain directions correspond to useful retrieval patterns: syntactic agreement, coreference, local continuation, delimiter matching, or any other relation that helps the task. The dot product is simple, but the learned projections make it expressive.
For a sequence of length $n$, every query position compares itself to every key position. If we stack the query vectors into a matrix
$$Q \in \mathbb{R}^{n \times d_k}$$
and the key vectors into
$$K \in \mathbb{R}^{n \times d_k},$$
then all $n^2$ pairwise dot products appear at once in the matrix product
$$[\,s_{ij}\,]_{ij}=QK^\top, \qquad (QK^\top)_{ij}=q_i^\top k_j.$$
This is one of the central computational facts behind Transformers: attention is vectorized all-pairs comparison. Instead of iterating through the sequence recurrently, each position can compare against every other position in a single batched matrix multiplication. That is why attention is so well matched to modern accelerators: the communication pattern is dense, but it is expressed as large linear algebra operations.
The raw score matrix $QK^\top$ is not yet a retrieval distribution. Each row contains the scores for one query position $i$ against all possible source positions $j$. To turn these scores into weights, we apply a row-wise softmax:
$$A=\operatorname{softmax}(QK^\top).$$
Here $A_{ij}$ can be read as “how much position $i$ attends to position $j$.” The rows of $A$ sum to one, so each query forms a weighted average over values. The retrieved output vectors are then
$$Z=AV = \operatorname{softmax}(QK^\top)V.$$
This is the full dot-product attention pattern before scaling and masking: compare queries to keys, normalize scores into attention weights, then use those weights to average values.
It is worth emphasizing what this buys us. A recurrent model must pass information through a chain of hidden states, so distant positions interact through many sequential steps. A convolutional model needs either large kernels or many layers to connect distant tokens. Dot-product attention, by contrast, gives every position a direct path to every other position in one layer. The cost is that the score matrix has $n^2$ entries, so dense self-attention is powerful but expensive for long sequences.
There are also important failure modes hidden in this compact formula:
If dot products become too large in magnitude, the softmax can become extremely peaked, causing gradients through most positions to shrink.
If all dot products are similar, the attention weights become nearly uniform, and retrieval loses selectivity.
If masking is required but omitted, a decoder can “look into the future,” violating causality.
If sequence length grows very large, the $QK^\top$ matrix becomes the main memory and compute bottleneck.
The next section will address the first of these issues directly: why the raw dot product is divided by $\sqrt{d_k}$. For now, the key idea is that dot-product attention is not mysterious. It is a learned content-addressable lookup system implemented as matrix multiplication.
The visual below consolidates this flow. On the left, the equations isolate the three conceptual steps: define a pairwise compatibility score, vectorize all scores as $QK^\top$, and retrieve values using softmax-normalized weights. On the right, the matrix pipeline makes the same idea operational: rows of $Q$ meet rows of $K$ through $K^\top$, producing a score grid whose $(i,j)$ cell is exactly $q_i^\top k_j$.
The important thing to notice is the dense all-pairs communication pattern. Every query row has access to every key column before the softmax chooses how much value information to retrieve. That single pattern,
$$Z=\operatorname{softmax}(QK^\top)V,$$
is the algebraic heart of attention.
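The unscaled, unmasked pattern above can be sketched as follows. The function name `dot_product_attention`, the helper `softmax_rows`, and the random inputs are assumptions for illustration; the double loop is included only to confirm that the single matrix product really computes every pairwise score.

```python
import numpy as np


def softmax_rows(S):
    """Row-wise softmax with the usual max-subtraction for stability."""
    S = S - S.max(axis=-1, keepdims=True)
    expS = np.exp(S)
    return expS / expS.sum(axis=-1, keepdims=True)


def dot_product_attention(Q, K, V):
    """Unscaled, unmasked attention: softmax(Q K^T) V."""
    S = Q @ K.T          # all n*n pairwise scores in one matrix product
    A = softmax_rows(S)  # row i: how much position i attends to each j
    return A @ V, A, S


rng = np.random.default_rng(0)
n, d_k, d_v = 4, 3, 5
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

Z, A, S = dot_product_attention(Q, K, V)

# Explicit all-pairs loop: entry (i, j) is the dot product of q_i and k_j.
S_loop = np.array([[Q[i] @ K[j] for j in range(n)] for i in range(n)])
print(np.allclose(S, S_loop), Z.shape)
```

The score matrix has `n * n` entries regardless of `d_k`, which is the quadratic cost mentioned above.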

9. Why Scale by $\sqrt{d_k}$?

Having introduced dot products as a natural compatibility score between a query and a key, there is one more detail that looks like a harmless implementation trick but is actually crucial for stable learning: Transformer attention does not use $q_i^\top k_j$ directly. It uses a scaled dot product.
The raw score between query $q_i$ and key $k_j$ is
$$s_{ij}=q_i^\top k_j =\sum_{\ell=1}^{d_k} q_{i\ell}k_{j\ell}.$$
At first glance, this seems perfectly reasonable. If the query and key point in similar directions, the dot product is large; if they are unrelated or opposed, it is small or negative. But the magnitude of this score depends not only on semantic alignment. It also depends on the key/query dimension $d_k$. As $d_k$ grows, the dot product accumulates more random terms, so even unrelated vectors can produce scores with increasingly large variance.
A simple initialization-style calculation makes the issue visible. Suppose the components of $q_i$ and $k_j$ are independent, centered, and normalized:
$$\mathbb{E}[q_{i\ell}]=\mathbb{E}[k_{j\ell}]=0, \qquad \operatorname{Var}(q_{i\ell})=\operatorname{Var}(k_{j\ell})=1.$$
These assumptions are not meant to describe every trained Transformer exactly. They are a controlled approximation: before learning has shaped the representations too much, and under common normalization/initialization schemes, it is reasonable to ask what scale the scores would have if the components behaved like independent unit-variance random variables. The goal is to prevent the architecture itself from injecting an undesirable scale factor.
For one coordinate product, independence gives
$$\mathbb{E}[q_{i\ell}k_{j\ell}]=\mathbb{E}[q_{i\ell}]\,\mathbb{E}[k_{j\ell}]=0,$$
and
$$\operatorname{Var}(q_{i\ell}k_{j\ell}) = \mathbb{E}[q_{i\ell}^2k_{j\ell}^2] - \mathbb{E}[q_{i\ell}k_{j\ell}]^2 = \mathbb{E}[q_{i\ell}^2]\,\mathbb{E}[k_{j\ell}^2] - 0 = 1.$$
So each coordinate contributes a unit-variance random term to the dot product. Since the full score sums $d_k$ such terms, and the terms are independent across coordinates, the variance grows linearly:
$$\operatorname{Var}(s_{ij}) = \operatorname{Var}\!\left(\sum_{\ell=1}^{d_k}q_{i\ell}k_{j\ell}\right) = \sum_{\ell=1}^{d_k}\operatorname{Var}(q_{i\ell}k_{j\ell}) = d_k.$$
Equivalently, the standard deviation of the raw dot product grows like $\sqrt{d_k}$. This is the key point: increasing the representation dimension makes the scores larger in magnitude even when there is no stronger evidence of relevance.
That matters because attention does not use the scores directly; it passes them through a softmax. The softmax is sensitive to scale. If its inputs are small or moderate, it can assign a graded distribution over many keys. But if one logit is much larger than the others, the output becomes nearly one-hot:
$$\operatorname{softmax}(s)_j = \frac{e^{s_j}}{\sum_m e^{s_m}}.$$
Large score variance therefore pushes attention into a saturated regime. One key receives almost all the probability mass, the others receive almost none, and the gradients through the softmax become less informative. The model may still train, but optimization becomes unnecessarily brittle: early random score differences can dominate attention before the network has learned meaningful retrieval patterns.
The fix is to normalize the score by its typical standard deviation. Since
$$\operatorname{Std}(q_i^\top k_j)\approx \sqrt{d_k},$$
we compute attention using
$$\operatorname{Attention}(Q,K,V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$
This does not change the basic content-based retrieval story. Queries still compare themselves to keys, and values are still averaged according to the resulting attention weights. The scaling only keeps the logits in a numerically and statistically reasonable range as the key/query dimension changes.
A useful way to read the result is:
Unscaled dot product: variance grows with $d_k$, so logits become large.
Large logits: softmax becomes sharp or saturated.
Scaled dot product: dividing by $\sqrt{d_k}$ restores roughly unit variance.
Stable softmax: attention weights remain trainable and less dominated by accidental magnitude.
The visual below condenses this argument into two complementary views. On one side, the variance derivation tracks how the score $s_{ij}$ accumulates $d_k$ independent unit-variance products, ending in the highlighted conclusion $\operatorname{Var}(s_{ij})=d_k$. On the other side, the same phenomenon is shown operationally: unscaled scores become tall and uneven, producing a sharply peaked attention row, while scaled scores produce a smoother, more stable distribution.
The important takeaway is not that attention should always be diffuse. A trained Transformer can and often should place highly concentrated attention when the data calls for it. The point is that this concentration should be learned, not forced by the dimensionality of the dot product. Scaling by $\sqrt{d_k}$ makes dot-product attention behave consistently across dimensions, giving the softmax a well-conditioned set of logits to work with.
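The variance argument is easy to check empirically. The sketch below draws query and key components as independent standard normals, which is one way to satisfy the zero-mean, unit-variance assumption of the derivation (the derivation itself does not require Gaussianity). The sample count and the particular values of `d_k` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
num_pairs = 10_000  # number of independent (q, k) pairs per dimension

variances = {}
for d_k in (16, 64, 256):
    q = rng.standard_normal((num_pairs, d_k))  # zero mean, unit variance
    k = rng.standard_normal((num_pairs, d_k))
    s = np.sum(q * k, axis=1)                  # raw dot products q . k
    # Empirical variance before and after dividing by sqrt(d_k).
    variances[d_k] = (s.var(), (s / np.sqrt(d_k)).var())

for d_k, (raw, scaled) in variances.items():
    print(f"d_k={d_k:4d}  Var(raw)={raw:8.2f}  Var(scaled)={scaled:5.2f}")
```

Up to sampling noise, the raw variance should track `d_k` while the scaled variance stays near one, matching the derivation.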

10. Masks: Turning Attention Links On and Off

After controlling the scale of the dot products, the next issue is not how strongly one token should attend to another, but whether that link should exist at all. Attention, by default, is completely content-driven: every query compares itself with every key, and the softmax turns those comparisons into a probability distribution over all value vectors. That is powerful, but it is also too permissive. Some attention links are structurally invalid regardless of content.
The mechanism for enforcing these hard constraints is an attention mask. Starting from the scaled attention logits,
$$\frac{QK^\top}{\sqrt{d_k}},$$
we add a mask matrix $M$ before applying the row-wise softmax:
$$A=\operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}+M\right), \qquad Z=AV.$$
Here $M$ has the same row-column structure as the attention score matrix: rows correspond to query positions, columns correspond to key/value positions. The key idea is that masking happens at the logit level, before normalization. This matters because the softmax exponentiates each logit before normalizing: setting a logit to $-\infty$ makes its exponential exactly zero, while the remaining weights in the row still sum to one.
Concretely, for query position $i$ and key/value position $j$,
$$M_{ij}=0$$
means the link is allowed. The original scaled dot-product score is unchanged, so the model may assign attention mass to $v_j$ if the content match is strong. In contrast,
$$M_{ij}=-\infty$$
means the link is forbidden. Since
$$\exp(-\infty)=0,$$
the corresponding softmax weight becomes
$$a_{ij}=0.$$
Therefore $v_j$ contributes nothing to the output vector at query position $i$, no matter how compatible $q_i$ and $k_j$ might have been.
This is a hard structural constraint, not a learned preference. The model is not being encouraged to avoid certain links; it is made mathematically impossible for those links to carry information. In actual implementations, $-\infty$ is often represented by a very large negative number for numerical reasons, but the intended operation is the same: after softmax, the masked probability is zero or effectively zero.
The most important example is causal self-attention, used in autoregressive language modeling. The representation at position $t$, which is used to predict the next token, may use tokens at positions $u\le t$, but it must not look ahead to future positions $u>t$. Otherwise, training would leak the answer: the representation for an earlier token could directly depend on later tokens that should not yet be known during generation.
The causal mask is therefore
$$(M_{\mathrm{causal}})_{tu} = \begin{cases} 0, & u\le t,\\ -\infty, & u>t. \end{cases}$$
This gives a lower-triangular pattern, including the diagonal. Position $1$ can attend only to itself, position $2$ can attend to positions $1$ and $2$, and so on. The diagonal is usually allowed because the representation at a position may use the token currently being processed when computing hidden states for next-token prediction; what is forbidden is access to future positions.
The same additive-mask abstraction covers several different situations:
Padding masks prevent real tokens from attending to padding tokens inserted for batching.
Decoder self-attention masks enforce left-to-right causality.
Autoregressive generation masks ensure that generation-time computation matches the information pattern assumed during training.
Custom structural masks can restrict attention to windows, blocks, segments, or other task-specific connectivity patterns.
A subtle but important point is that masks operate independently of the learned parameters. The matrices $Q$, $K$, and $V$ are still produced by learned projections, and the model still decides among allowed links using content similarity. The mask simply defines the set of possible routes through which information may flow. In graph terms, attention learns edge weights, while the mask determines which edges are present.
The visual below condenses this into two complementary views. On the left, the mask appears exactly where it belongs mathematically: added to the scaled dot-product logits before the softmax. The two cases $M_{ij}=0$ and $M_{ij}=-\infty$ then become easy to interpret as “keep this candidate link” versus “force its attention weight to zero.”
On the right, the causal mask is represented as a triangular matrix. The allowed region contains zeros, while the forbidden future region contains $-\infty$. After the row-wise softmax, that forbidden upper triangle disappears from the attention matrix $A$: future value vectors cannot contribute to earlier query positions. This is the small algebraic trick that makes Transformer attention compatible with autoregressive sequence modeling.
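The additive causal mask can be sketched directly in NumPy. The function names `causal_mask` and `masked_attention` and the random inputs are illustrative assumptions. This version uses a literal `-np.inf`, which is safe here because every row keeps at least its diagonal entry; a row that is masked everywhere would produce NaNs and needs separate handling.

```python
import numpy as np


def softmax_rows(S):
    """Row-wise softmax; exp(-inf) evaluates to exactly 0.0."""
    S = S - S.max(axis=-1, keepdims=True)
    expS = np.exp(S)
    return expS / expS.sum(axis=-1, keepdims=True)


def causal_mask(n):
    """M[t, u] = 0 if u <= t, and -inf if u > t (strict upper triangle)."""
    M = np.zeros((n, n))
    M[np.triu_indices(n, k=1)] = -np.inf
    return M


def masked_attention(Q, K, V, M):
    """softmax(Q K^T / sqrt(d_k) + M) V, with the mask added to the logits."""
    d_k = Q.shape[-1]
    A = softmax_rows(Q @ K.T / np.sqrt(d_k) + M)
    return A @ V, A


rng = np.random.default_rng(0)
n, d_k, d_v = 5, 4, 3
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

M = causal_mask(n)
Z, A = masked_attention(Q, K, V, M)

# Perturb the key and value at the last position only.
K2, V2 = K.copy(), V.copy()
K2[-1] += 10.0
V2[-1] += 100.0
Z2, _ = masked_attention(Q, K2, V2, M)

print(np.round(A, 2))                # lower-triangular attention weights
print(np.allclose(Z[:-1], Z2[:-1]))  # earlier outputs are unaffected
```

Changing the last position's key and value leaves every earlier output row untouched, which is the operational meaning of "no access to future positions."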

11. Theorem: Self-Attention Without Positions Is Permutation Equivariant

After introducing masks, it is tempting to think of attention as a graph over sequence positions: some token positions may attend to others, and masking removes selected edges. But before we add masks or positional encodings, there is a deeper fact hiding in plain sight: content-only self-attention does not know what a sequence order is. It sees an $n \times d_{\mathrm{model}}$ matrix $X$ as a collection of row vectors and computes interactions among those rows, but nothing in the vanilla attention formula says that row $3$ comes before row $4$, or that row $1$ is special because it is first.
Let $X \in \mathbb{R}^{n \times d_{\mathrm{model}}}$, where each row is a token representation. A single-head self-attention layer without positional information computes
$$Q = XW_Q,\qquad K = XW_K,\qquad V = XW_V,$$
and then
$$\operatorname{Attn}(X) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$
The softmax is applied row-wise: each query row produces a probability distribution over all key rows, and then forms a weighted average of value rows. This is content-based retrieval: each token asks, “which other token vectors are relevant to me?” The important subtlety is that relevance is computed only through dot products of learned projections. There is no term involving the integer index $i$, no sinusoidal or learned positional vector $P_i$, and no causal or padding mask $M$ that distinguishes allowed from disallowed positions.
Now consider a permutation matrix $\Pi \in \mathcal{P}_n$. Multiplying $X$ on the left by $\Pi$ simply reorders the rows of $X$. For example, if $X$ contains token vectors in one order, then $\Pi X$ contains the same token vectors in a different order. The theorem says that self-attention commutes with this reordering:
$$\operatorname{Attn}(\Pi X) = \Pi \operatorname{Attn}(X).$$
That property is called permutation equivariance. It is not invariance: the output does change when the input is permuted. But it changes in exactly the same way: the output rows are permuted by the same $\Pi$. In other words, content-only self-attention treats the input as an unordered set of token vectors, while still returning one output vector per input token.
The algebra is short but instructive. If we permute the input rows, the projected queries, keys, and values become
$$Q' = \Pi XW_Q = \Pi Q,\qquad K' = \Pi XW_K = \Pi K,\qquad V' = \Pi XW_V = \Pi V.$$
The new attention score matrix is
$$Q'{K'}^\top = (\Pi Q)(\Pi K)^\top = \Pi QK^\top \Pi^\top.$$
So the score matrix is not arbitrary; it is the original score matrix with both rows and columns permuted. Rows are permuted because the queries have been reordered, and columns are permuted because the keys have been reordered. The scaling by $\sqrt{d_k}$ does not affect this symmetry.
The only slightly delicate step is the row-wise softmax. For any score matrix $S$,
$$\operatorname{softmax}(\Pi S \Pi^\top) = \Pi \operatorname{softmax}(S)\Pi^\top.$$
This holds because permuting a row before applying softmax simply permutes the resulting probabilities in the same way, and permuting the collection of rows also just reorders the row-wise outputs. Applying this to
$$S = \frac{QK^\top}{\sqrt{d_k}},$$
we get
$$\operatorname{softmax}\!\left(\frac{Q'{K'}^\top}{\sqrt{d_k}}\right) = \Pi \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) \Pi^\top.$$
Multiplying by the permuted values $V' = \Pi V$ cancels the inner permutation, because $\Pi^\top \Pi = I$ for any permutation matrix:
$$\operatorname{softmax}\!\left(\frac{Q'{K'}^\top}{\sqrt{d_k}}\right)V' = \Pi \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) \Pi^\top \Pi V = \Pi \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$
Substituting back $Q=XW_Q$, $K=XW_K$, and $V=XW_V$, the theorem is exactly
$$\operatorname{softmax}\!\left(\frac{(\Pi X W_Q)(\Pi X W_K)^\top}{\sqrt{d_k}}\right)(\Pi X W_V) = \Pi\operatorname{softmax}\!\left(\frac{(XW_Q)(XW_K)^\top}{\sqrt{d_k}}\right)(XW_V).$$
This result matters because language is not merely a bag of token embeddings. The sentences “dog bites man” and “man bites dog” contain the same token set but mean different things. A Transformer without positional information cannot distinguish these two sequences by order alone. If the token embeddings are identical and only their rows are rearranged, the layer can only rearrange its outputs correspondingly. It has no intrinsic mechanism for representing “before,” “after,” “nearby,” or “first.”
There are also useful boundary cases to keep in mind. The equivariance statement assumes no positional encoding PPP, no mask MMM, and no position-dependent biases. Adding absolute positional embeddings breaks the symmetry because row iii receives information tied to index iii. Adding a causal mask also breaks full permutation equivariance because the mask privileges the left-to-right order. By contrast, operations applied identically to every row—such as a shared feed-forward network, residual addition, or layer normalization over features—typically preserve permutation equivariance. The symmetry is broken only when the model is given some information that distinguishes positions.
The visual below compactly summarizes the theorem as a commuting diagram: one path permutes the input rows first and then applies self-attention, while the other applies self-attention first and then permutes the output rows. The equality says these paths arrive at the same result. That is the operational meaning of permutation equivariance.
The equation in the theorem box is the algebraic version of the same story. The orange $\Pi$ terms track row permutations, the blue softmax block tracks how attention weights reorder consistently, and the green value block reminds us that the final weighted sums are attached to the same permuted rows. The important takeaway is simple but profound: without positions or masks, self-attention is a powerful content-retrieval mechanism, but not yet a sequence model in the ordered sense.

12.

The equivariance result is useful precisely because it tells us what self-attention cannot know by itself. If the inputs are just token embeddings, self-attention treats the sequence as a set of content vectors with indices attached only externally. It can route information based on what appears, but not intrinsically on where it appears. That is elegant from a symmetry perspective, but disastrous for language, programs, music, genomes, and essentially every sequence domain where order changes meaning.
For example, the two strings “the dog bit the man” and “the man bit the dog” contain exactly the same multiset of tokens. A content-only self-attention layer can produce correspondingly permuted representations, but it has no built-in reason to distinguish subject position from object position. The problem is not that attention is weak; the problem is that attention is too symmetric. We must deliberately break permutation symmetry by injecting positional information.
The standard Transformer does this by replacing each token embedding $e_i$ with a position-aware input representation such as
$$
x_i = e_i + p_i,
$$
where $p_i$ is a vector associated with position $i$. Now the attention mechanism no longer receives just “the embedding for this word”; it receives “the embedding for this word at this location.” The dot products used to compute attention can depend on token identity, position, and interactions between the two. In other words, attention remains content-based retrieval, but the content being retrieved has been enriched with location.
There are several ways to choose the positional signal. The original Transformer used sinusoidal positional encodings, where each coordinate varies periodically with position at a different frequency. Learned absolute embeddings are also common: the model simply learns a table $p_1, p_2, \dots, p_n$. More recent architectures often use relative position biases or rotary positional embeddings, which modify the attention scores or query/key geometry so that the model reasons more directly about offsets like “three tokens ago” rather than absolute coordinates like “position 57.”
These choices differ in what generalization they encourage:
Absolute learned positions are simple and expressive, but may extrapolate poorly beyond the training context length.
Sinusoidal positions provide a fixed geometric structure and make some extrapolation possible, though not automatically robust.
Relative position methods often fit language modeling well because many linguistic dependencies are local or distance-sensitive.
Rotary embeddings encode relative displacement through rotations in query-key space, preserving the dot-product attention framework while adding position-dependent phase structure.
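To make the last item concrete, here is a minimal sketch of the rotary idea in NumPy. It assumes one common formulation: adjacent feature coordinates are paired and each pair is rotated by an angle proportional to the position, with a frequency base of 10000. The function name `rotate` and these specifics are illustrative assumptions, not taken from any particular implementation.

```python
import numpy as np

def rotate(x, pos, base=10000.0):
    """Rotate consecutive coordinate pairs of a 1-D vector x by angles pos * freq."""
    d = x.shape[-1]                               # assumed even
    freqs = base ** (-np.arange(0, d, 2) / d)     # one frequency per 2-D pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_a = rotate(q, 10) @ rotate(k, 7)      # query at 10, key at 7: offset 3
score_b = rotate(q, 110) @ rotate(k, 107)   # both shifted by 100: offset still 3
score_c = rotate(q, 10) @ rotate(k, 3)      # offset 7

print(np.isclose(score_a, score_b), np.isclose(score_a, score_c))
```

Because rotating the query by one angle and the key by another leaves a dot product that depends only on the difference of the angles, `score_a` and `score_b` agree while `score_c` generally does not. That is the sense in which the rotation encodes relative displacement while keeping the dot-product form of attention.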
A subtle point is that masking is not a full substitute for positional encoding. A causal mask does introduce an ordering constraint: token $i$ cannot attend to future tokens $j > i$. But without positional information, the model still has limited ability to distinguish different permutations of the visible prefix. The mask says “you may look backward,” but it does not fully tell the model which earlier token was first, second, or adjacent in a content-independent way. For sequence modeling, causality and position solve different problems: causality prevents information leakage, while positional encoding gives the model a coordinate system.
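A small experiment illustrates the point for a single layer. The sketch below (NumPy, with hypothetical helper names and random weights) applies one causally masked attention layer to content-only embeddings, then reorders the earlier tokens while keeping the final token fixed. It is a single-layer check only and says nothing about what deeper stacks can infer.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(X, W_q, W_k, W_v):
    """One masked attention layer; row t may attend only to rows u <= t."""
    n = X.shape[0]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]) + mask) @ V

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))                      # content-only embeddings, no positions
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))

original = E[[0, 1, 2, 3]]
shuffled_prefix = E[[2, 0, 1, 3]]                # same final token, earlier tokens reordered

out_a = causal_self_attention(original, W_q, W_k, W_v)
out_b = causal_self_attention(shuffled_prefix, W_q, W_k, W_v)

last_row_unchanged = np.allclose(out_a[-1], out_b[-1])
first_row_changed = not np.allclose(out_a[0], out_b[0])
print(last_row_unchanged, first_row_changed)
```

The final position sees the same set of keys and values in both cases, so its output is unchanged even though the order of the prefix differs. The mask constrains what is visible, but by itself it does not tell this layer how the visible tokens were arranged.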
This also explains why positional information matters even in encoder-only models, where there is no causal mask. In a bidirectional encoder, every token can attend to every other token. Without positions, the representation of a sentence is equivariant to arbitrary reordering. A downstream classifier might collapse those token representations into something nearly permutation-invariant, making word order even harder to recover. Positional encodings give the encoder a way to represent syntactic roles, local neighborhoods, phrase boundaries, and long-range dependencies as structured relations rather than unordered co-occurrences.
The visual summary for this idea should be read as a symmetry-breaking story: content-only self-attention preserves permutation structure, while adding positional signals turns a bag-like collection of embeddings into an ordered sequence. Once token and position information are combined, attention can still retrieve by similarity, but similarity is now computed in a space where “same word in a different place” can mean something different.
This sets up the next architectural refinement. After we give the model a notion of position, a single attention operation still represents only one retrieval pattern at a time. In practice, different relationships matter simultaneously: nearby syntax, long-range agreement, delimiter matching, coreference, copying, and positional offsets. Multi-head attention will let the model learn several such retrieval subspaces in parallel.

13. Multi-Head Attention: Multiple Retrieval Subspaces

Before assembling a full Transformer block, it is worth pausing on the attention mechanism itself. A single scaled dot-product attention layer already gives us a powerful content-addressable retrieval operation: each token forms a query, compares it against all keys, and uses the resulting weights to average the corresponding values. But one attention distribution is still only one way of asking, “What information should this token retrieve from the sequence?”
The central idea of multi-head attention is that a token may need to retrieve several different kinds of evidence at once. In language, for example, a word might need nearby syntactic context, a long-range subject for agreement, a previous mention for coreference, and a delimiter or boundary token for structure. These are not necessarily well represented by a single similarity function over one query-key space. Multi-head attention addresses this by running several attention mechanisms in parallel, each with its own learned projections.
Given an input sequence representation $X$, each head $r \in \{1,\ldots,h\}$ learns separate projection matrices
$$
W_Q^{(r)}, \qquad W_K^{(r)}, \qquad W_V^{(r)}.
$$
These produce head-specific queries, keys, and values:
$$
Q^{(r)} = XW_Q^{(r)}, \qquad
K^{(r)} = XW_K^{(r)}, \qquad
V^{(r)} = XW_V^{(r)}.
$$
The head then performs the same scaled dot-product attention operation we have already developed:
$$
\operatorname{head}_r
=
\operatorname{softmax}\!\left(
\frac{(XW_Q^{(r)})(XW_K^{(r)})^\top}{\sqrt{d_k}} + M
\right)
(XW_V^{(r)}).
$$
The mask $M$ plays the same role as before: it can forbid attention to certain positions, such as future tokens in causal decoding or padding tokens in batched training. The scaling by $\sqrt{d_k}$ also remains essential, because each head computes dot products in its own key/query dimension $d_k$. Without scaling, the logits can grow too large in magnitude as $d_k$ increases, causing the softmax to become overly peaked and gradients to become less useful.
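The effect of the scaling can be seen in a quick simulation. The sketch below assumes, purely for illustration, that query and key components are independent with zero mean and unit variance; under that assumption a raw dot product has standard deviation close to $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to roughly one. The function name `dot_product_std` is hypothetical.

```python
import numpy as np

def dot_product_std(d_k, n_samples=2000, seed=0):
    """Empirical std of q.k before and after dividing by sqrt(d_k)."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n_samples, d_k))   # assumed: i.i.d. standard normal components
    k = rng.normal(size=(n_samples, d_k))
    dots = np.sum(q * k, axis=1)
    return dots.std(), (dots / np.sqrt(d_k)).std()

results = {d_k: dot_product_std(d_k) for d_k in (4, 64, 256)}
for d_k, (raw, scaled) in results.items():
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:4.2f}")
```

The raw spread grows with the head dimension while the scaled spread stays near one, which is why the softmax inputs remain in a comparable range as $d_k$ changes.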
The important change is not the formula inside one head; it is the fact that each head has its own learned retrieval subspace. One head might learn projections where query-key similarity emphasizes syntactic adjacency. Another might emphasize semantic similarity. Another might specialize in positional or delimiter-like patterns. This specialization is not manually assigned; it emerges because the model can reduce training loss by distributing different retrieval behaviors across heads.
After computing all heads in parallel, their outputs are concatenated:
$$
\operatorname{Concat}(\operatorname{head}_1,\ldots,\operatorname{head}_h).
$$
This concatenated representation contains multiple retrieved views of the sequence at each token position. A final learned output projection $W_O$ then mixes these views back into the model dimension:
$$
\operatorname{MHA}(X)
=
\operatorname{Concat}(\operatorname{head}_1,\ldots,\operatorname{head}_h)W_O.
$$
This final projection is easy to underestimate. Concatenation alone would merely place the heads side by side. The output matrix $W_O$ lets the model form learned combinations across heads: it can amplify, suppress, or blend information retrieved by different attention patterns. In other words, multi-head attention is not just “several attentions in parallel”; it is parallel retrieval followed by a learned recombination step.
A common implementation choice is to keep the total compute roughly comparable to a single large attention layer by splitting the model dimension across heads. For example, if the model width is $d_{\text{model}}$ and there are $h$ heads, one often uses
$$
d_k = d_v = \frac{d_{\text{model}}}{h}.
$$
Then each head is narrower, but there are more of them. This gives the model multiple attention patterns without multiplying the representation size before the output projection. The trade-off is that each individual head has lower dimensional capacity, while the collection of heads has greater diversity in possible retrieval behavior.
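The whole computation fits in a few lines of NumPy. This is a minimal sketch under the split $d_k = d_v = d_{\text{model}}/h$ described above; the function names, the stacked weight layout of shape `(h, d_model, d_k)`, and the random weights are illustrative assumptions, and biases and dropout are omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, mask=None):
    """W_q, W_k, W_v have shape (h, d_model, d_k); W_o has shape (h * d_k, d_model)."""
    d_k = W_q.shape[-1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                 # each (h, n, d_k) by broadcasting
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (h, n, n), one score grid per head
    if mask is not None:
        scores = scores + mask                          # the same mask is shared by all heads
    heads = softmax(scores) @ V                         # (h, n, d_k)
    concat = np.concatenate(list(heads), axis=-1)       # (n, h * d_k): heads side by side
    return concat @ W_o                                 # learned recombination across heads

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)
```

Stacking the per-head projections along a leading axis lets one batched matrix product compute all heads at once; looping over heads and concatenating gives the same result.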
There are also subtle failure modes. Heads are not guaranteed to become neatly interpretable modules such as “syntax head” or “coreference head.” Some heads may be redundant, diffuse, or useful only in combination with others. In practice, attention patterns are informative but not always faithful explanations of model behavior. Still, the architectural bias matters: by giving the model several independent query-key-value projections, multi-head attention makes it easier to represent multiple relational structures simultaneously.
The visual below compactly organizes this computation from left to right. The same input $X$ fans out into several parallel lanes, one per head. Each lane applies its own $Q$, $K$, and $V$ projections, performs masked scaled dot-product attention, and emits a head-specific retrieved representation. The different colors emphasize that these heads are not copies of one another; they are separate learned retrieval mechanisms operating in distinct subspaces.
On the right side, the heads are gathered by concatenation and passed through $W_O$, which mixes them back into a single output representation $\operatorname{MHA}(X)$. This is the key structural pattern to remember before moving to the rest of the Transformer block: parallel attention heads create multiple retrieved views, and the output projection integrates those views into the next token representation.

14. Position-Wise Feed-Forward Networks

After multi-head attention has gathered information from different positions and different representation subspaces, the Transformer still needs a way to compute new features from the resulting token vectors. Attention is excellent at routing and mixing information across the sequence, but the weighted sums it produces are still largely linear combinations of value vectors. To make each token representation more expressive, every Transformer block follows attention with a position-wise feed-forward network.
The phrase position-wise is important. Suppose the block input is a matrix
$$
X \in \mathbb{R}^{n \times d_{\mathrm{model}}},
$$
where row $X_i$ is the representation of token position $i$, already combining token identity and positional information. After multi-head attention, each row has had the opportunity to receive information from other rows. The feed-forward layer then applies the same nonlinear map to each row independently:
$$
\operatorname{FFN}(X)_i
=
\phi(X_i W_1 + b_1) W_2 + b_2,
\quad i=1,\ldots,n.
$$
Here $W_1, b_1, W_2, b_2$ are shared across all positions, and $\phi$ is a pointwise nonlinearity such as ReLU or GELU. In modern Transformers, GELU-like activations are common, but the architectural idea is not tied to one particular choice.
A useful mental model is that attention answers the question: Which other positions should this token read from? The feed-forward network answers a different question: Given the information now stored in this token vector, how should we transform its features? These are complementary operations:
Attention: mixes information between rows of $X$.
FFN: transforms information within each row of $X$.
Parameter sharing: the same transformation is reused at every sequence position.
This independence means there is no communication between positions inside the FFN itself. If position $i$ affects position $j$, that influence must have already been routed through attention or must happen in a later attention layer. The FFN is therefore not a replacement for attention; it is the nonlinear feature processor that acts after attention has assembled a useful local representation at each position.
The standard Transformer FFN has a characteristic expand-and-compress shape:
$$
d_{\mathrm{model}}
\rightarrow
d_{\mathrm{ff}}
\rightarrow
d_{\mathrm{model}},
\quad
d_{\mathrm{ff}} > d_{\mathrm{model}}.
$$
The first linear map projects each token vector into a wider hidden space of dimension $d_{\mathrm{ff}}$. The activation $\phi$ introduces nonlinearity, allowing the model to form feature interactions that cannot be represented by attention’s weighted averaging alone. The second linear map compresses the representation back to $d_{\mathrm{model}}$, so the output can be passed cleanly to the next sublayer in the block.
This expansion matters. If the FFN were only a single linear map from $d_{\mathrm{model}}$ to $d_{\mathrm{model}}$, then it would add limited expressive power, especially when surrounded by other linear projections. The intermediate width $d_{\mathrm{ff}}$ gives the model a larger workspace for computing token-wise features: detecting patterns, gating dimensions, composing semantic attributes, or re-encoding information gathered from attention. In many Transformer configurations, $d_{\mathrm{ff}}$ is several times larger than $d_{\mathrm{model}}$, making the FFN a major contributor to both parameter count and computation.
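A short calculation shows how large that contribution can be. The sketch below counts weights for one block under stated assumptions: the attention projections $W_Q$, $W_K$, $W_V$ (all heads together) and $W_O$ are each $d_{\mathrm{model}} \times d_{\mathrm{model}}$ with biases ignored, and the widths $d_{\mathrm{model}} = 512$, $d_{\mathrm{ff}} = 2048$ follow the base configuration of the original Transformer. The function names are hypothetical.

```python
def attention_params(d_model):
    """W_Q, W_K, W_V (all heads combined) and W_O, each d_model x d_model; biases ignored."""
    return 4 * d_model * d_model

def ffn_params(d_model, d_ff):
    """W_1 and b_1 for the expansion, W_2 and b_2 for the compression."""
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

d_model, d_ff = 512, 2048          # assumed widths, with d_ff = 4 * d_model
attn = attention_params(d_model)
ffn = ffn_params(d_model, d_ff)
print(attn, ffn, round(ffn / attn, 2))
```

With a fourfold expansion, the feed-forward sublayer holds about twice as many weights as the attention sublayer in the same block.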
There is a subtle but important symmetry here. Because the same FFN parameters are applied to every row, the operation is shared over sequence length. The model does not learn one feed-forward map for the first token, another for the second token, and so on. Instead, positional differences are represented in the input vectors themselves, while the transformation rule remains the same everywhere. This sharing is one reason Transformers can process variable-length sequences: the FFN does not depend on a fixed sequence length $n$.
Equivalently, we can view the FFN as a tiny multilayer perceptron applied in parallel to all token positions. In matrix form, ignoring broadcasting details for the biases, this is
$$
\operatorname{FFN}(X)
=
\phi(XW_1 + b_1)W_2 + b_2,
$$
but this compact expression can hide the crucial fact that row $i$ is transformed without directly reading row $j$. The matrix notation is efficient; the row-wise interpretation is the architectural insight.
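The equivalence between the matrix form and the row-wise view is easy to verify. The following NumPy sketch uses the tanh approximation of GELU for $\phi$ and random weights, both of which are illustrative choices; it checks that the matrix form equals applying the map one row at a time, and that perturbing one row leaves every other output row untouched.

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU (one possible choice of phi)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(X, W1, b1, W2, b2):
    """Works on a full (n, d_model) matrix or on a single row of length d_model."""
    return gelu(X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 8, 32
X = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)

Y = ffn(X, W1, b1, W2, b2)
Y_rows = np.stack([ffn(X[i], W1, b1, W2, b2) for i in range(n)])

X_mod = X.copy()
X_mod[3] += rng.normal(size=d_model)      # perturb only row 3
Y_mod = ffn(X_mod, W1, b1, W2, b2)

print(np.allclose(Y, Y_rows))                        # matrix form equals row-by-row form
print(np.allclose(np.delete(Y, 3, axis=0),
                  np.delete(Y_mod, 3, axis=0)))      # other rows are unaffected
```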
The visual below condenses this idea into a left-to-right pipeline. Each row vector $X_1,\ldots,X_n$ enters an identical two-layer nonlinear transformation: expansion by $W_1, b_1$, activation by $\phi$, and compression by $W_2, b_2$. The parallel lanes emphasize that there are no arrows between positions inside the FFN.
The shared-weight annotation is just as important as the lanes themselves. It reminds us that the FFN is independent across rows but not separately parameterized across rows. The same learned map is reused at every position, turning attention’s cross-token communication into richer per-token features before the block moves on to residual connections and normalization.

15. Residual Connections and Layer Normalization

After adding the position-wise feed-forward network, we now have the two computational ingredients that make up a Transformer block: multi-head attention for token-to-token interaction, and an MLP/FFN for per-token nonlinear transformation. But simply stacking these transformations naively is usually unstable. Deep networks need a way to preserve information, keep gradients healthy, and prevent activations from drifting into poorly scaled regimes.
This is where the Transformer block becomes more than “attention followed by an MLP.” Each major sublayer is wrapped with three stabilizing mechanisms:
a residual connection, which adds the sublayer input back to its output;
dropout, which regularizes the sublayer output during training;
layer normalization, which rescales each token representation to a controlled feature distribution.
In the original Transformer formulation, these are arranged in a post-normalization pattern. For an input sequence representation $X^{(\ell)} \in \mathbb{R}^{n \times d_{\text{model}}}$, the attention sublayer produces an intermediate representation
$$
Z=\operatorname{LN}\!\left(
X^{(\ell)}
+
\operatorname{Dropout}(\operatorname{MHA}(X^{(\ell)}))
\right).
$$
Then the feed-forward sublayer is applied with the same wrapper:
$$
X^{(\ell+1)}
=
\operatorname{LN}\!\left(
Z
+
\operatorname{Dropout}(\operatorname{FFN}(Z))
\right).
$$
The residual addition is not just a convenience. It gives the block a short identity route through depth. If the attention or feed-forward transformation is initially unhelpful, the model can still pass forward something close to the original representation. This matters because deep Transformers are trained by gradient descent: without residual paths, every layer would have to learn both how to preserve useful information and how to modify it. With residual paths, a sublayer can instead learn a correction or update to the current representation.
A useful way to think about one wrapped sublayer is
$$
\text{new representation}
\approx
\text{old representation}
+
\text{learned update}.
$$
Attention contributes a context-dependent update, while the FFN contributes a token-wise nonlinear update. The residual path ensures that these updates are added to an existing representation rather than replacing it entirely. This “incremental refinement” viewpoint is one reason very deep residual architectures are trainable.
Dropout is placed on the sublayer output before the residual addition. During training, this randomly removes parts of the proposed update. The identity path remains intact, so dropout regularizes the transformation without fully corrupting the information stream. In other words, the model is discouraged from relying too heavily on any one attention head, hidden feature, or feed-forward activation, while still preserving a stable baseline signal through the residual branch.
Layer normalization then controls the scale of the resulting token representations. For each token independently, layer normalization computes statistics across the feature dimension and normalizes the vector. Abstractly, for a token vector $x \in \mathbb{R}^{d}$,
$$
\operatorname{LN}(x)
=
\gamma \odot \frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta,
$$
where $\mu$ and $\sigma^2$ are computed over the features of that token, and $\gamma, \beta$ are learned scale and shift parameters. This differs from batch normalization: the normalization does not depend on other examples in the batch or other positions in the sequence. That property is especially important for variable-length sequence models and autoregressive decoding.
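As a concrete reading of the formula, here is a minimal NumPy sketch. The function name `layer_norm`, the value of `eps`, and the use of $\gamma = 1$, $\beta = 0$ in the demonstration are assumptions made for illustration.

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """Normalize each row of X over its features, then apply a learned scale and shift."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return gamma * (X - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d = 16
X = 3.0 * rng.normal(size=(4, d)) + 2.0       # rows with arbitrary offset and scale
gamma, beta = np.ones(d), np.zeros(d)

Y = layer_norm(X, gamma, beta)
print(Y.mean(axis=-1).round(6), Y.var(axis=-1).round(3))

# Changing one row leaves the normalization of every other row untouched.
X_mod = X.copy()
X_mod[0] *= 100.0
Y_mod = layer_norm(X_mod, gamma, beta)
print(np.allclose(Y[1:], Y_mod[1:]))
```

Each row comes out with mean close to zero and variance close to one, and the statistics of one position never enter the normalization of another.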
There is a subtle but important architectural variation here. The equations above describe post-normalization, where normalization happens after the residual addition. Many modern large language models instead use pre-normalization, where the input is normalized before the attention or FFN sublayer, and the residual addition happens afterward. The ingredients are the same, but the order changes:
Post-norm: sublayer $\rightarrow$ dropout $\rightarrow$ residual add $\rightarrow$ layer norm.
Pre-norm: layer norm $\rightarrow$ sublayer $\rightarrow$ dropout $\rightarrow$ residual add.
Pre-normalization often improves optimization stability for very deep Transformers because gradients can flow more directly through the residual stream. Post-normalization was used in the original Transformer and remains conceptually clean, but it can become harder to train as depth increases unless additional care is taken with initialization, learning-rate schedules, or normalization variants.
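The difference in ordering is easiest to see side by side. The sketch below is a simplified illustration: dropout is omitted, layer normalization has no learned scale or shift, and `mha` and `ffn` are stand-in callables rather than real sublayers. Feeding in sublayers that output zero isolates what each arrangement does to the residual stream itself.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Row-wise normalization without learned scale and shift (a simplification)."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def post_norm_block(X, mha, ffn):
    Z = layer_norm(X + mha(X))          # sublayer -> residual add -> layer norm
    return layer_norm(Z + ffn(Z))

def pre_norm_block(X, mha, ffn):
    Z = X + mha(layer_norm(X))          # layer norm -> sublayer -> residual add
    return Z + ffn(layer_norm(Z))

rng = np.random.default_rng(0)
X = 5.0 * rng.normal(size=(4, 8))

def zero_sublayer(H):
    """Stand-in sublayer that proposes no update at all."""
    return np.zeros_like(H)

pre_out = pre_norm_block(X, zero_sublayer, zero_sublayer)
post_out = post_norm_block(X, zero_sublayer, zero_sublayer)

print(np.array_equal(pre_out, X))    # pre-norm: the residual stream passes through untouched
print(np.allclose(post_out, X))      # post-norm: the stream itself is renormalized
```

In the pre-norm arrangement the identity path is never transformed, which is one way to picture the more direct gradient flow described above. In the post-norm arrangement every block rescales the stream, even when the sublayers contribute nothing.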
The visual below should now read as a compact assembly diagram for the post-normalization Transformer block. The main vertical path applies MHA and then FFN, while the curved bypass arrows represent the residual identity routes. Each learned update passes through Dropout, is added back to the incoming representation, and is then stabilized by LN.
The key idea is that the block is not a plain stack of transformations. It is a repeated pattern of propose an update, regularize it, add it to the residual stream, normalize the result. That pattern is what lets attention and feed-forward layers be composed deeply enough to form the backbone of modern Transformer models.

16. Positional Information: Absolute and Relative

With residual connections and layer normalization in place, a Transformer block has a stable way to transform and refine representations. But there is still a surprisingly fundamental problem: self-attention itself does not know what order the tokens came in.
The reason is that vanilla self-attention is a content-based retrieval mechanism. Each token produces a query, key, and value; attention compares queries to keys and mixes values according to similarity. If we permute the rows of the input matrix $X$, then the queries, keys, values, attention weights, and outputs are permuted in the same way. In other words, self-attention is permutation equivariant: reordering the input sequence merely reorders the output sequence. That is a useful symmetry for sets, but language is not a set. The sentences “dog bites man” and “man bites dog” contain the same words, but their meanings are not interchangeable.
So the Transformer must break this symmetry deliberately. It does not do so by recurrence, where position is implicit in the order of computation, nor by convolution, where locality is built into the kernel geometry. Instead, it injects position information into an otherwise order-agnostic attention mechanism. Broadly, there are two families of solutions:
Absolute positional information: attach a representation of position $i$ directly to the token representation.
Relative positional information: modify attention scores so that position $i$ attending to position $j$ depends on their relative offset $i-j$.
The original Transformer used absolute positional encodings. If $e(x_i)\in\mathbb{R}^{d_{\mathrm{model}}}$ is the token embedding for token $x_i$, then we add a position vector $p_i$ before the first attention layer:
$$
X=
\begin{bmatrix}
e(x_1)+p_1 \\
\vdots \\
e(x_n)+p_n
\end{bmatrix}
\in\mathbb{R}^{n\times d_{\mathrm{model}}}.
$$
This changes the meaning of the row representation. A row no longer says only “this is the word bank”; it says something closer to “this is the word bank at position $i$.” Once that information is inside the representation, the attention mechanism can learn position-sensitive behavior through ordinary dot products. For example, a head may learn that a token near the beginning of a sentence behaves differently from the same token near the end, or that certain syntactic patterns depend on approximate position.
There are two common variants of absolute positions. In learned absolute positions, each $p_i$ is a trainable vector, just like a word embedding. This is simple and flexible, but it ties the model to the range of positions seen during training unless special care is taken. In fixed sinusoidal positions, $p_i$ is a deterministic function of $i$, using sine and cosine waves at multiple frequencies. The motivation is that different dimensions encode position at different resolutions, and relative offsets can be expressed through linear relationships among these periodic features. Fixed encodings are not learned from data, but they provide a structured notion of position that can extrapolate more gracefully in some settings.
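The sinusoidal variant can be written down directly. The sketch below follows the formulation of the original Transformer, with sine on even coordinates, cosine on odd coordinates, and a frequency base of 10000; it assumes an even $d_{\mathrm{model}}$. The helper `shift_by` is a hypothetical function added only to demonstrate the linear relationship between positions a fixed offset apart.

```python
import numpy as np

def frequencies(d_model, base=10000.0):
    """One angular frequency per (sin, cos) coordinate pair; d_model assumed even."""
    return base ** (-np.arange(0, d_model, 2) / d_model)

def sinusoidal_positions(n, d_model, base=10000.0):
    """Row i holds sin(i * w) on even coordinates and cos(i * w) on odd coordinates."""
    angles = np.arange(n)[:, None] * frequencies(d_model, base)   # (n, d_model / 2)
    P = np.empty((n, d_model))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

def shift_by(p, k, base=10000.0):
    """Map the encoding of position i to that of position i + k with a fixed rotation."""
    ang = k * frequencies(p.shape[-1], base)
    s, c = p[0::2], p[1::2]
    out = np.empty_like(p)
    out[0::2] = s * np.cos(ang) + c * np.sin(ang)    # sin(a + b)
    out[1::2] = c * np.cos(ang) - s * np.sin(ang)    # cos(a + b)
    return out

P = sinusoidal_positions(50, 16)
print(P.shape)
print(np.allclose(shift_by(P[5], 3), P[8]))    # the same linear map works for any start position
```

The rotation in `shift_by` depends only on the offset `k`, not on the starting position, which is the sense in which relative offsets are linearly accessible from these features. Because the encoding is a formula rather than a table, it is also defined for positions beyond any particular training length, although that alone does not guarantee good behavior there.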
Absolute positions are intuitive, but they have a subtle limitation: they identify where a token is in the sequence, not directly how far apart two tokens are. Many linguistic and sequential patterns are naturally relative. A word may care about the previous token, the next token, the nearest verb, or another symbol three positions back. For these cases, it is often more natural to inject order into the attention score itself.
In standard scaled dot-product attention, the score from query position $i$ to key position $j$ is
$$
s_{ij}=\frac{q_i^\top k_j}{\sqrt{d_k}}.
$$
Relative positional methods modify this idea so that the score also depends on the displacement between the two positions:
$$
s_{ij}
=
\frac{q_i^\top k_j}{\sqrt{d_k}}
\quad\longrightarrow\quad
s_{ij}\text{ also depends on }i-j.
$$
This is a different way of breaking permutation symmetry. Instead of saying “token $x_i$ carries position vector $p_i$,” the model says “when position $i$ attends to position $j$, the interaction depends on their distance and direction.” The sign of $i-j$ matters: attending three tokens to the left is not the same as attending three tokens to the right. In practice, this can be implemented using additive biases, relative key/value embeddings, rotary transformations, or other mechanisms, but the conceptual move is the same: make attention pairwise position-aware.
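Of the mechanisms just listed, the additive bias is the simplest to sketch. The code below assumes one scalar bias per signed offset, stored in a hypothetical table `bias_by_offset` that would be learned in a real model and is random here, and it clips offsets beyond a maximum distance, which is a common but optional design choice.

```python
import numpy as np

def relative_bias(bias_by_offset, n):
    """Build B with B[i, j] = bias for the (clipped) signed offset i - j."""
    max_dist = (len(bias_by_offset) - 1) // 2
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    offsets = np.clip(i - j, -max_dist, max_dist)
    return bias_by_offset[offsets + max_dist]      # shift offsets to valid table indices

rng = np.random.default_rng(0)
n, d_k, max_dist = 6, 4, 3
q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))
bias_by_offset = rng.normal(size=2 * max_dist + 1)  # stand-in for learned parameters

B = relative_bias(bias_by_offset, n)
scores = q @ k.T / np.sqrt(d_k) + B                 # content term plus offset term

print(B[3, 1] == B[4, 2])    # same offset +2, same bias
print(B[3, 1] == B[1, 3])    # offsets +2 and -2 use different table entries
```

The bias matrix is constant along each diagonal, so the position-dependent part of the score is the same wherever in the sequence a given offset occurs.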
The distinction matters because each approach gives the model a different inductive bias. Absolute positions are simple and global: every token knows its address. Relative positions are relational: every attention edge knows its offset. Absolute positions can be enough for many tasks, especially when sequence lengths are bounded and consistent. Relative schemes often work better when patterns depend on local displacement, when length extrapolation matters, or when the model benefits from treating “nearby” and “far away” interactions differently regardless of absolute location.
The visual below consolidates these two routes into the same conceptual frame. On the left, order enters before attention: token embeddings $e(x_i)$ are combined with position vectors $p_i$, producing the matrix $X\in\mathbb{R}^{n\times d_{\mathrm{model}}}$. On the right, order enters inside attention: the score grid is no longer determined only by content similarity $q_i^\top k_j$, but also by diagonal bands corresponding to relative offsets $i-j$.
The key takeaway is that Transformers do not obtain order “for free.” Self-attention gives them flexible content-based communication, residual paths preserve and refine representations, and layer normalization stabilizes the computation, but positional information is what turns a permutation-equivariant set processor into a sequence model. Order enters either through the representation $p_i$, or through the attention interaction $s_{ij}$, rather than through recurrence.

17. Theorem: Causal Masking Gives Autoregressive Dependence

After adding positional information, we have fixed one major ambiguity of self-attention: a token representation can now know where it sits. But in the decoder there is a second, equally important constraint: it must not know what comes next. If a model is trained to predict the next token while its internal representation can already attend to future tokens, the learning problem becomes contaminated by future-token leakage. The model may appear to achieve excellent training loss, but it is solving the wrong conditional distribution.
The autoregressive modeling assumption is that a sequence distribution factors as
$$
p_\theta(x_1,\ldots,x_n)
=
\prod_{t=1}^n p_\theta(x_t\mid x_{<t}).
$$
So when the model produces the conditional distribution for position $t$, its computation may depend on $x_1,\ldots,x_{t-1}$, but not on $x_t, x_{t+1},\ldots,x_n$ (exactly which index is the first excluded one depends on the indexing convention). Equivalently, if we feed the decoder a prefix ending at position $t$, the representation at that position may depend only on the prefix. In the common teacher-forcing setup, inputs are shifted so that the row at one position is used to predict the next token; the same structural requirement remains: no representation used for prediction may incorporate information from tokens to its right.
Causal masking enforces this directly inside self-attention. Recall that attention weights are computed from logits of the form
$$
\frac{q_t^\top k_u}{\sqrt{d_k}} + M_{tu},
$$
followed by a row-wise softmax over key/value positions $u$. In decoder self-attention, the additive mask is
$$
(M_{\mathrm{causal}})_{tu}=0\ \text{for }u\le t,\qquad
(M_{\mathrm{causal}})_{tu}=-\infty\ \text{for }u>t.
$$
Thus, for query position $t$, keys and values at positions $u>t$ receive logit $-\infty$. After softmax, their probability mass is exactly zero:
$$
a_{tu}=0\qquad \text{for all }u>t.
$$
This is the key mechanism. The mask does not merely discourage looking ahead; in the mathematical idealization, it removes those edges from the computation graph.
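A few lines of NumPy make the mechanism tangible. This is a minimal sketch with illustrative helper names and random logits standing in for $q_t^\top k_u / \sqrt{d_k}$.

```python
import numpy as np

def causal_mask(n):
    """Additive mask: 0 where u <= t, -inf where u > t (rows index t, columns index u)."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n = 5
logits = rng.normal(size=(n, n))          # stand-in for scaled query-key scores

A = softmax(logits + causal_mask(n))      # mask is added before the softmax
print(np.round(A, 3))
```

Every entry above the diagonal is exactly zero, each row still sums to one, and the first row puts all of its weight on position one because nothing else is visible to it.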
Now consider a full stack of masked Transformer decoder blocks. The input at position $i$ is
$$
X_i^{(0)}=e(x_i)+p_i,
$$
where $e(x_i)$ is the token embedding and $p_i$ is positional information. The theorem says that for every layer $\ell$ and every position $t$, the row $X_t^{(\ell)}$ is a function only of
$$
(x_1,\ldots,x_t)
\quad\text{and}\quad
(p_1,\ldots,p_t),
$$
not of any future token $x_u$ with $u>t$. This is what makes decoder-only Transformers valid autoregressive models: their internal states respect the same left-to-right conditional structure as the probability distribution they are trained to represent.
The proof is a simple but important induction over layers. At layer $0$, the row $X_t^{(0)}$ depends only on $x_t$ and $p_t$, so it certainly does not depend on future tokens. Assume that at layer $\ell$, every row $X_u^{(\ell)}$ depends only on tokens up to position $u$. When computing masked self-attention for row $t$ at the next layer, the query at $t$ can attend only to rows $u\le t$. By the induction hypothesis, each such row $X_u^{(\ell)}$ depends only on tokens $x_1,\ldots,x_u$, and since $u\le t$, all of those are contained in $x_1,\ldots,x_t$. Therefore the attention output at position $t$ depends only on the prefix up to $t$.
The remaining parts of the Transformer block preserve this property. The feed-forward network is applied position-wise, so it cannot mix information across sequence positions. Residual connections add together quantities from the same row. Layer normalization, in the usual Transformer form, normalizes across feature dimensions within a row, not across time positions. Therefore residual-plus-normalization also cannot introduce dependence on future tokens. The induction closes: every layer preserves the causal dependency structure.
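The induction can also be checked numerically. The sketch below is a toy single-head decoder stack with random weights (all sizes, names, and the simplified layer normalization without learned gain or bias are assumptions made for illustration). It changes only the tokens after a chosen position and compares the resulting hidden states.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalizes over features within each row, never across positions.
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def decoder_stack(tokens, E, P, layers):
    n = len(tokens)
    X = E[tokens] + P[:n]                                  # X^(0) = e(x_i) + p_i
    mask = np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)
    for Wq, Wk, Wv, W1, W2 in layers:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]) + mask)  # masked self-attention
        X = layer_norm(X + A @ V)                           # residual + row-wise norm
        X = layer_norm(X + np.maximum(X @ W1, 0.0) @ W2)    # position-wise FFN
    return X

rng = np.random.default_rng(0)
vocab, d, n, num_layers = 20, 16, 8, 3
E = rng.normal(size=(vocab, d))
P = rng.normal(size=(n, d))
shapes = [(d, d)] * 3 + [(d, 4 * d), (4 * d, d)]
layers = [tuple(rng.normal(size=s) * 0.1 for s in shapes) for _ in range(num_layers)]

tokens = rng.integers(0, vocab, size=n)
t = 4                                              # last zero-based position kept fixed
changed = tokens.copy()
changed[t + 1:] = (changed[t + 1:] + 1) % vocab    # alter only the future tokens

X_a = decoder_stack(tokens, E, P, layers)
X_b = decoder_stack(changed, E, P, layers)
prefix_unchanged = np.allclose(X_a[: t + 1], X_b[: t + 1], rtol=0.0, atol=1e-12)
future_changed = not np.allclose(X_a[t + 1:], X_b[t + 1:])
```

Rows up to `t` agree between the two runs while later rows differ, which is the dependency structure the induction predicts. A check of this kind is also a cheap regression test for masking bugs.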
There are a few subtle assumptions hiding inside this clean theorem:
The mask must be applied before the softmax, and future logits must be treated as truly impossible, ideally $-\infty$, not merely a small penalty.
Normalization must be row-wise. A normalization operation that aggregates statistics across time could leak future information.
Positional information must not encode future token content. Absolute position vectors $p_i$ are fine because they reveal where a token is, not what future tokens are.
Relative positional biases are also fine as long as they respect the causal mask and do not reintroduce attention to $u>t$.
In real implementations, $-\infty$ is often approximated by a very large negative number. This is usually safe in floating-point arithmetic, but conceptually the theorem relies on the masked attention weights being exactly zero for future positions. Bugs in masking, off-by-one indexing errors, or applying the mask with the wrong tensor shape can produce silent leakage. Such leakage is especially dangerous because training loss may improve while generation quality or evaluation validity becomes compromised.
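To make the difference between a true mask and a mere penalty concrete, here is a small float32 experiment. The specific values (`-1e9` as the finite stand-in, `-10.0` as a penalty that is too weak) are illustrative choices, not recommendations from the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[0.3, 1.2, -0.7]], dtype=np.float32)   # last key is "future"

hard = logits + np.array([[0.0, 0.0, -np.inf]], dtype=np.float32)  # idealized mask
soft = logits + np.array([[0.0, 0.0, -1e9]], dtype=np.float32)     # finite stand-in
weak = logits + np.array([[0.0, 0.0, -10.0]], dtype=np.float32)    # only a penalty

a_hard, a_soft, a_weak = softmax(hard), softmax(soft), softmax(weak)
```

With `-1e9` the exponential underflows to zero, so the weights match the idealized mask; with `-10.0` a small but nonzero weight remains on the masked key, which is exactly the kind of leak the theorem rules out.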
The practical payoff is that we can train on full sequences in parallel while still modeling left-to-right conditionals. During training, the model sees the entire length-$n$ sequence as a tensor, but the causal mask ensures that the representation used for each prediction only has access to the appropriate prefix. This is the core reason Transformer decoders avoid the sequential computation bottleneck of RNNs while still parameterizing distributions like
$p_\theta(x_t\mid x_{<t}).$
The visual below compactly summarizes the theorem’s two perspectives. On the left, the causal mask is a lower-triangular attention pattern: positions may attend backward and to themselves, but the strict upper triangle is removed. A highlighted query row $t$ makes the statement concrete: all entries with $u>t$ are blocked, which corresponds exactly to $a_{tu}=0$.
On the right, the same idea is lifted from one attention matrix to an entire stack of decoder blocks. Information from positions $1,\ldots,t$ can flow upward into $X_t^{(\ell)}$, while arrows from future positions are stopped before reaching it. That is the theorem in computational-graph form: after any number of masked self-attention layers, the row used for autoregressive prediction still depends only on the prefix, giving next-token modeling with no leakage.

18.

The causal masking theorem gives us more than a correctness condition for language modeling. It tells us that attention is a programmable dependency pattern: by changing which positions are allowed to communicate, we change the computational graph of the model without changing the basic attention mechanism itself.
That is the key architectural leap. A Transformer is not “an attention layer” repeated many times in the abstract. It is a stack of modules where each module answers three separate questions:
What information is available? Full sequence, prefix only, or another sequence entirely?
How is information selected? By content-based attention weights.
How is each token state updated? Through residual connections, normalization, and position-wise nonlinear transformations.
The previous result focused on the first question for autoregressive models. If token $i$ is only allowed to attend to positions $\le i$, then its representation cannot depend on future tokens. That makes next-token prediction valid: the model cannot “cheat” by reading the answer from the right-hand side of the sequence.
But causal masking is only one possible attention topology. The same attention operation can support very different modeling regimes depending on the allowed communication pattern:
In an encoder, every token may attend to every other token. This is useful when the whole input is known in advance, as in classification, retrieval, or the source side of translation.
In a decoder, each token may attend only to itself and earlier tokens. This is what makes autoregressive generation well-defined.
In cross-attention, decoder states attend to encoder states. This lets a generated target sequence condition on a separate input sequence.
A useful way to think about this is that attention defines a soft message-passing graph over token representations. The mask determines which edges exist; the attention scores determine how strongly each available edge is used. The model does not merely copy from neighboring positions. It learns, at every layer and every head, which previous or external states are relevant to the current computation.
This separation matters because many Transformer properties come from structural constraints, not from learned parameters alone. If we remove the causal mask from a language-model decoder during training, the model may achieve an artificially low loss by using future tokens. If we impose a causal mask inside an encoder, we unnecessarily prevent tokens from using right context. If we omit positional information, self-attention remains largely insensitive to token order except through whatever asymmetries are introduced elsewhere. The architecture works because these design choices are aligned with the task.
There is also a computational reason to keep these pieces modular. During training, even causal self-attention can be evaluated in parallel across all positions because the mask is known in advance. The model computes all token representations simultaneously while enforcing the same dependency pattern that will hold during generation. During decoding, however, generation is sequential: after predicting one token, the model appends it to the prefix and runs the next step. This is why training and inference have different bottlenecks even though they use the same learned layers.
The next architectural step is therefore to assemble these attention patterns into reusable blocks. Each block takes a sequence of hidden states, routes information through an attention sublayer, applies a token-wise feed-forward transformation, and preserves trainability through residual and normalization structure. The distinction between encoder, decoder, and encoder-decoder models is mostly a distinction in which attention sublayers are present and what they are allowed to see.
The visual below serves as a compact transition from the theorem to the architecture. Instead of treating masking as a small implementation detail, it places masking and attention access patterns at the center of the design. Once that picture is clear, encoder self-attention, decoder causal self-attention, and cross-attention become variations of the same underlying operation rather than separate mechanisms.
It is worth carrying this mental model forward: attention computes content-based retrieval; masks define legal information flow; Transformer blocks package that retrieval into stable, trainable layers. With that in place, we can now build the encoder, decoder, and encoder-decoder Transformer architectures systematically.

19. Encoder, Decoder, and Cross-Attention

A useful way to organize the Transformer design space is to stop thinking in terms of “different architectures” and instead ask two more primitive questions: which tokens are allowed to attend to which other tokens, and where do the queries, keys, and values come from? Once causal masking is understood, the distinction between encoder-only, decoder-only, and encoder–decoder Transformers becomes much less mysterious. They are mostly the same attention operation, wired with different masks and different source streams.
Recall the core attention computation:
$\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}+M\right)V.$
The matrix $QK^\top$ scores how much each query position wants to retrieve information from each key position. The mask $M$ then changes which retrievals are legal. Typically, allowed entries receive $0$, while forbidden entries receive a very negative number, effectively making their softmax probability zero. So the mask is not a minor implementation detail; it defines the information flow graph of the model.
In an encoder-only Transformer, all positions in the input sequence $x_{1:n}$ can usually attend to all other non-padding positions. There is no causal restriction because the goal is not to generate the sequence left-to-right. Instead, the model builds a contextual representation $Z$, where each token representation may depend on tokens both to its left and to its right. This is appropriate for tasks like classification, retrieval, tagging, and masked-token prediction, where the entire input is available at once.
The subtle assumption here is that bidirectional context is legal. If the downstream task requires predicting the future from the past, an encoder-only model would leak information unless we carefully modify the objective or mask. But when the whole input is genuinely observed, full self-attention is a strength: every token can be interpreted in light of the complete sequence.
In a decoder-only Transformer, the same sequence $x_{1:n}$ is treated as a language-modeling sequence. Position $t$ is trained to predict the next or current token using only earlier tokens. The model therefore uses causal self-attention, meaning token $t$ may attend to positions $\le t$, but not to positions $>t$. This gives the familiar autoregressive factorization:
$p_\theta(x_{1:n})=\prod_{t=1}^n p_\theta(x_t \mid x_{<t}).$
This architecture is natural for open-ended generation because the model’s training-time information pattern matches its test-time use: when generating token $x_t$, the future tokens do not yet exist. A failure mode appears when this alignment is broken. If a decoder accidentally receives unmasked future tokens during training, it can learn a shortcut that disappears at inference time. Causal masking is therefore what makes parallel training compatible with left-to-right generation.
An encoder–decoder Transformer separates the problem into two streams. The encoder first reads the source sequence $x_{1:n}$ and produces contextual representations $Z$. The decoder then generates a target sequence $y_{1:m}$ autoregressively. Within the decoder, causal self-attention ensures that position $t$ can only depend on $y_{<t}$. But after that, the decoder also performs cross-attention over the encoder output. This gives the conditional prediction form
$p_\theta(y_{1:m}\mid x_{1:n})=\prod_{t=1}^m p_\theta(y_t \mid y_{<t},x_{1:n}).$
Cross-attention is not a new mathematical primitive. It is the same content-based retrieval operation, but with a different source for $Q,K,V$. The decoder state supplies the queries $Q$: “given what I have generated so far, what source information do I need?” The encoder output supplies the keys and values $K,V$: “here are the source-side memories available for retrieval.” In compact form,
$Q \leftarrow \text{decoder states},\qquad K,V \leftarrow Z.$
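The difference in provenance shows up directly in the shapes. The sketch below uses one generic `attention` helper for both cases; the random `Z` and `H` stand in for encoder outputs and decoder states, and sharing one set of projection matrices between the two calls is a simplification for brevity (real models typically learn separate projections for each sublayer).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, M=None):
    # Scaled dot-product attention with an optional additive mask M.
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    if M is not None:
        logits = logits + M
    A = softmax(logits)
    return A @ V, A

rng = np.random.default_rng(0)
n, m, d = 6, 4, 8                       # source length, target length, width
Z = rng.normal(size=(n, d))             # stand-in encoder output
H = rng.normal(size=(m, d))             # stand-in decoder states
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Decoder self-attention: Q, K, V all come from H, under a causal mask.
causal = np.where(np.tril(np.ones((m, m), dtype=bool)), 0.0, -np.inf)
self_out, A_self = attention(H @ Wq, H @ Wk, H @ Wv, causal)

# Cross-attention: Q from the decoder, K and V from the encoder output Z.
cross_out, A_cross = attention(H @ Wq, Z @ Wk, Z @ Wv)
```

The self-attention weights form an `m × m` lower-triangular matrix, while the cross-attention weights are `m × n` and unmasked here, since every source position is already observed. In both cases the output keeps one row per decoder position.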
This distinction matters because it separates two kinds of dependency. Decoder self-attention models dependencies among generated target tokens, while cross-attention conditions those target tokens on the input. For translation, summarization, speech recognition, and other sequence-to-sequence tasks, this is exactly what we want: the output should be fluent in its own sequence while remaining grounded in the source.
The three families can therefore be compared by a small set of choices:
Encoder-only: bidirectional self-attention over an observed input; produces contextual representations.
Decoder-only: causal self-attention over a generated sequence; models $p_\theta(x_t \mid x_{<t})$.
Encoder–decoder: causal decoder self-attention plus cross-attention to an encoded source; models $p_\theta(y_t \mid y_{<t},x_{1:n})$.
The visual below condenses this taxonomy into a table: the rows are not fundamentally different attention formulas, but different choices about masks and streams. The important thing to notice is that “self-attention” means $Q,K,V$ come from the same sequence, whereas “cross-attention” means the decoder provides $Q$ and the encoder provides $K,V$.
The equation at the bottom reinforces the main point: all three cases still use scaled dot-product attention. What changes is the mask $M$ and the provenance of the matrices. Once that is clear, the transition to training objectives becomes straightforward: each architecture defines which conditional distribution it is allowed to model, and the next step is to train those conditionals by maximum likelihood.

20. Training Objective: Teacher-Forced Maximum Likelihood

Having separated the Transformer into encoder, decoder, and cross-attention components, the next question is: what exactly do we optimize during training? The architecture gives us a conditional distribution over tokens, but training turns that distribution into a supervised learning problem repeated across every position in a sequence.
For an encoder-decoder Transformer, the target sequence $y_{1:m}$ is generated autoregressively conditioned on the source sequence $x_{1:n}$. That means we model the joint conditional probability by the chain rule:
$p_\theta(y_{1:m}\mid x_{1:n})=\prod_{t=1}^{m}p_\theta(y_t\mid y_{<t},x_{1:n})$
This factorization is not an approximation by itself; it is just the probability chain rule. The modeling assumption enters through the Transformer parameterization of each conditional distribution $p_\theta(y_t\mid y_{<t},x_{1:n})$. At position $t$, the decoder is supposed to use the source representation from the encoder and the already generated target prefix $y_{<t}$, but not the future target tokens $y_{>t}$.
This is where teacher forcing enters. During training, we do not ask the model to generate its own prefix token by token and then learn from the resulting rollout. Instead, for every position $t$, we feed the decoder the ground-truth prefix. The model sees the correct previous tokens and is trained to predict the next one:
$y_{<t} \longrightarrow y_t$
The subtle but crucial detail is that the model may receive the full target sequence as a tensor during the forward pass, but the causal mask ensures that the prediction of $y_t$ can only draw on target tokens at positions $<t$. Without this mask, the decoder could leak information from $y_t$ or $y_{>t}$, making the training loss artificially low and destroying the intended autoregressive semantics.
The maximum-likelihood objective asks us to maximize the probability of the observed target sequences under this factorization. Equivalently, we minimize the negative log-likelihood:
$\mathcal{L}(\theta)=-\sum_{(x_{1:n},y_{1:m})\in\mathcal{D}}\sum_{t=1}^{m}\log p_\theta(y_t\mid y_{<t},x_{1:n})$
In practice, this is exactly the usual cross-entropy loss over the vocabulary at every decoder position. The decoder outputs a vector of logits at each position; after a softmax, the probability assigned to the correct next token is selected, logged, negated, and summed. Because the correct prefix is supplied everywhere, every target position contributes a supervised training signal in one forward pass.
This is one of the key computational advantages of Transformer training. Although the probability model is autoregressive, the training computation is parallel over positions. We do not have to run the decoder once for $t=1$, then again for $t=2$, and so on. Instead, the causal mask gives each position the right information boundary, allowing all next-token predictions to be trained simultaneously:
position $1$ predicts $y_1$ from the start context,
position $2$ predicts $y_2$ from $y_1$,
position $t$ predicts $y_t$ from $y_{<t}$,
position $m$ predicts $y_m$ from $y_{<m}$.
For decoder-only language modeling, the same idea applies after removing the source sequence. A language model learns to predict each token from its left context:
$\mathcal{L}(\theta)=-\sum_{x_{1:n}\in\mathcal{D}}\sum_{t=1}^{n}\log p_\theta(x_t\mid x_{<t})$
This is the objective behind standard autoregressive pretraining. The “input” and “target” are shifted versions of the same sequence: the model consumes previous tokens and predicts the next token. Again, the causal mask is what makes it legitimate to process the entire sequence at once while preserving the left-to-right conditional structure.
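A minimal sketch of the shift and the resulting loss follows. The token ids, the vocabulary size, and the all-zero logits (a stand-in for a model that knows nothing) are assumptions for illustration. Without a start token, this particular shift scores every token except the first one.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def next_token_nll(logits, targets):
    # logits: (T, V) scores; targets: (T,) token ids. Returns the summed NLL.
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].sum()

x = np.array([3, 1, 4, 1, 5])      # one training sequence of token ids
inputs, targets = x[:-1], x[1:]    # the row that consumes x[t] is scored on x[t+1]

V = 8                               # hypothetical vocabulary size
uniform_logits = np.zeros((len(inputs), V))
loss = next_token_nll(uniform_logits, targets)
```

A uniform predictor pays $\log V$ per position, so the summed loss here equals `4 * log(8)`; a trained model should do better than this baseline.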
There is an important distinction between training and decoding here. During training, teacher forcing conditions on the true prefix, so all positions are supervised in parallel. During inference, the true prefix is unavailable; the model must condition on tokens it has already generated. This exposes the model to its own mistakes, and it also makes decoding sequential in the output length. Thus, Transformers train in parallel across positions but still generate autoregressively unless we change the modeling assumptions.
The visual summary below compresses this objective into three pieces: the autoregressive factorization, the encoder-decoder negative log-likelihood, and the decoder-only language modeling loss. The highlighted prefixes $y_{<t}$ and $x_{<t}$ are the conditioning contexts supplied under teacher forcing, while the causal mask marks the boundary that prevents a position from seeing future tokens.
The small token timeline reinforces the operational meaning of the equations: every position is trained as a next-token prediction problem, but the red barrier imposed by the causal mask keeps each prediction honest. This is the central training recipe for sequence Transformers: sum cross-entropy over examples and positions, with ground-truth prefixes and no future-token leakage.

21. Algorithm: Transformer Encoder Forward Pass

Having defined the teacher-forced likelihood objective, we now need to be precise about what the model actually computes before that objective is evaluated. For an encoder-only Transformer, the forward pass is conceptually simple: convert a sequence of tokens into a sequence of contextual vectors, repeatedly allowing each position to retrieve information from other positions and then transform its own representation through a shared nonlinear map.
The input is a token sequence $x_{1:n}$. Since attention by itself is permutation-equivariant, the encoder must be given some representation of order. The usual first step is therefore to add a learned or fixed positional vector $p_i$ to each token embedding $e(x_i)$:
$X \leftarrow [\,e(x_1)+p_1;\ldots;e(x_n)+p_n\,] \in \mathbb{R}^{n\times d_{\mathrm{model}}}.$
Here each row of $X$ is the representation of one token position. The matrix shape matters: the encoder preserves the sequence length $n$ throughout the stack, while updating the $d_{\mathrm{model}}$-dimensional representation at each position. This is one reason encoders are reusable: their output is still a sequence, not a single collapsed vector.
Each encoder layer then applies two sublayers. The first is multi-head self-attention, where every position forms queries, keys, and values from the current sequence representation. Within each head, attention weights are computed as
$A=\operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}+M\right).$
The mask $M$ is especially important in batched training. For a standard bidirectional encoder, we usually do not use a causal mask, because each token is allowed to attend to tokens on both its left and right. But if examples have been padded to a common length, padded positions must not participate as real content. This is handled by adding a padding mask before the row-wise softmax, typically using large negative values so that attention probability on padded keys becomes essentially zero.
After attention, the encoder does not simply replace $X$ with $\operatorname{MHA}(X)$. Instead it uses a residual connection, dropout, and layer normalization:
$Z \leftarrow \operatorname{LN}\bigl(X + \operatorname{Dropout}(\operatorname{MHA}(X))\bigr).$
The residual path is not just an implementation detail. It gives the layer an easy way to preserve existing information, improves gradient flow through deep stacks, and lets attention learn corrections rather than recomputing the entire representation from scratch. Layer normalization then stabilizes the scale of activations at each position, which becomes increasingly important as $L$, the number of layers, grows.
The second sublayer is the position-wise feed-forward network:
$X \leftarrow \operatorname{LN}\bigl(Z + \operatorname{Dropout}(\operatorname{FFN}(Z))\bigr).$
“Position-wise” means the same multilayer perceptron is applied independently to each row of $Z$. Attention is the mechanism that mixes information across positions; the feed-forward network is the mechanism that applies a richer nonlinear transformation within each position. This separation is one of the clean design principles of the Transformer block:
Self-attention: communicates across tokens.
FFN: transforms each token representation independently.
Residual connections: preserve information and ease optimization.
Layer normalization: keeps activations numerically stable.
Dropout: regularizes both sublayers during training.
A subtle but useful way to view the encoder is as a repeated refinement process. At layer $1$, a token’s representation may mostly encode lexical identity and position. After several layers, that same row can encode syntactic role, semantic relationships, discourse context, or task-relevant features, while still occupying the same sequence slot. The algorithm returns the final matrix $X$, whose rows are contextual embeddings. Depending on the task, this matrix might feed a classifier, a retrieval head, a token-level predictor, or the cross-attention module of a decoder.
The main failure mode to watch for is confusing the encoder mask with the decoder mask. In an encoder, the mask is usually about padding, not causality. If we accidentally apply a causal mask inside a bidirectional encoder, we restrict the model unnecessarily and change its equivariance properties. Conversely, if we forget the padding mask, real tokens may attend to padding embeddings, allowing meaningless positions to contaminate the contextual representations.
The visual below packages this forward pass as an algorithm: initialize token-plus-position representations, loop through $L$ identical encoder blocks, apply masked multi-head self-attention with residual normalization, then apply the shared position-wise feed-forward update with another residual normalization. The right-hand stack view is a useful mental model: the sequence enters at the bottom, is lifted into $X$, passes through repeated encoder layers, and exits as a contextual sequence representation.
Read the pseudocode not as a low-level implementation prescription, but as the mathematical skeleton of the encoder. Actual implementations may choose pre-norm instead of post-norm, fuse projections for efficiency, or batch many sequences together, but the invariant structure remains the same: self-attention mixes tokens, the FFN transforms positions, and the stack preserves sequence shape while enriching representation quality.
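For readers who prefer code to pseudocode, the following NumPy sketch implements the post-norm encoder skeleton described above for a single padded sequence. It is a simplified illustration under stated assumptions: dropout is omitted (as at evaluation time), layer normalization has no learned gain or bias, the weights are random, and all names and sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def init_layer(rng, d_model, d_ff, num_heads):
    w = lambda *shape: rng.normal(size=shape) * 0.1
    return {"h": num_heads,
            "Wq": w(d_model, d_model), "Wk": w(d_model, d_model),
            "Wv": w(d_model, d_model), "Wo": w(d_model, d_model),
            "W1": w(d_model, d_ff), "b1": np.zeros(d_ff),
            "W2": w(d_ff, d_model), "b2": np.zeros(d_model)}

def encoder_layer(X, key_is_pad, p):
    # X: (n, d_model); key_is_pad: (n,) bool, True where the position is padding.
    n, d_model = X.shape
    h, d_k = p["h"], d_model // p["h"]
    M = np.where(key_is_pad, -np.inf, 0.0)[None, None, :]        # padding mask on keys
    split = lambda Y: Y.reshape(n, h, d_k).transpose(1, 0, 2)     # (h, n, d_k)
    Q, K, V = split(X @ p["Wq"]), split(X @ p["Wk"]), split(X @ p["Wv"])
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k) + M)      # (h, n, n)
    mha = (A @ V).transpose(1, 0, 2).reshape(n, d_model) @ p["Wo"]
    Z = layer_norm(X + mha)                                       # first sublayer
    ffn = np.maximum(Z @ p["W1"] + p["b1"], 0.0) @ p["W2"] + p["b2"]
    return layer_norm(Z + ffn), A                                 # second sublayer

def encoder(tokens, key_is_pad, E, P, layers):
    X = E[tokens] + P[: len(tokens)]          # token embedding plus position vector
    for p in layers:
        X, A = encoder_layer(X, key_is_pad, p)
    return X, A

rng = np.random.default_rng(0)
vocab, d_model, d_ff, num_heads, L = 12, 16, 32, 4, 2
E, P = rng.normal(size=(vocab, d_model)), rng.normal(size=(10, d_model))
layers = [init_layer(rng, d_model, d_ff, num_heads) for _ in range(L)]

tokens = np.array([5, 3, 9, 2, 0, 0])         # id 0 is used as padding in this example
key_is_pad = tokens == 0
X_out, A_last = encoder(tokens, key_is_pad, E, P, layers)
```

The output keeps the input's sequence length, and the attention weights on padded keys are exactly zero, so the rows for real tokens do not depend on what sits in the padded slots.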

22. Algorithm: Minibatch Training by Cross-Entropy

Once we know how a Transformer produces hidden states through a forward pass, training becomes almost surprisingly ordinary. The architectural details may differ—encoder-only classification, decoder-only language modeling, encoder-decoder translation—but the core optimization loop is the familiar one: sample a minibatch, run the model, compare predicted token distributions to the correct next or target tokens, and update parameters by gradient descent.
For sequence generation, the key supervision signal is cross-entropy over vocabulary logits. At each target position $t$, the final Transformer representation $X_t^{(L)}$ is projected into vocabulary space:
$\ell_t = W_{\mathrm{vocab}}X_t^{(L)} + b_{\mathrm{vocab}}.$
Here $\ell_t \in \mathbb{R}^{|\mathcal{V}|}$ is not yet a probability distribution; it is a vector of unnormalized scores, one per vocabulary item. Applying softmax converts those scores into a categorical distribution over the next token, and the loss penalizes the model when it assigns low probability to the correct token $y_t$:
$\mathcal{L}(\theta)=-\sum_{\text{non-padding }t}\log \operatorname{softmax}(\ell_t)_{y_t}.$
The phrase non-padding is more important than it may look. In minibatch training, sequences are usually padded to a common length so they can be represented as a dense tensor. Padding tokens are not real targets; if we included them in the loss, the model would waste capacity learning to predict artificial batch-formatting artifacts. Thus padding positions are masked both in attention, where they should not be read as content, and in the loss, where they should not contribute gradients.
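A padding-aware loss can be sketched as follows. The padding id `PAD = 0`, the batch contents, and the random logits are assumptions chosen for illustration; the function returns the summed loss together with the number of valid targets so that a caller can average if desired.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def masked_cross_entropy(logits, targets, pad_id):
    # logits: (B, T, V); targets: (B, T) with pad_id at padded positions.
    logp = log_softmax(logits)
    B, T = targets.shape
    token_logp = logp[np.arange(B)[:, None], np.arange(T)[None, :], targets]
    valid = targets != pad_id                  # True only at real target positions
    return -(token_logp * valid).sum(), int(valid.sum())

PAD = 0
targets = np.array([[5, 2, 7, PAD, PAD],
                    [3, 3, 1, 4, 6]])
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5, 10))
loss, n_valid = masked_cross_entropy(logits, targets, PAD)

# Overwriting the logits at padded positions must leave the loss unchanged.
perturbed = logits.copy()
perturbed[0, 3:] = rng.normal(size=(2, 10)) * 5.0
loss_perturbed, _ = masked_cross_entropy(perturbed, targets, PAD)
```

Because padded positions are multiplied by zero before summation, they contribute neither loss nor gradient, whatever the model outputs there.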
For decoder training, the central trick is teacher forcing. Instead of generating tokens one at a time during training, we feed the model the ground-truth prefix and ask it to predict each next token in parallel. For a target sequence $(y_1,\dots,y_T)$, the decoder receives shifted inputs such as $(\texttt{BOS}, y_1,\dots,y_{T-1})$ and is trained to predict $(y_1,\dots,y_T)$. Because causal masks prevent position $t$ from seeing future target tokens, this parallel computation is still faithful to the autoregressive factorization:
$p_\theta(y_1,\dots,y_T)=\prod_{t=1}^T p_\theta(y_t \mid y_{<t}).$
This is one of the major computational advantages of Transformers over recurrent sequence models. At inference time, autoregressive decoding must still proceed token by token, because each generated token becomes part of the next input. But during training, the entire target sequence can be processed simultaneously under a causal mask. The model is therefore trained on all positions in a minibatch with a single parallel forward pass.
The precise interpretation of the target token depends on the model family:
In an encoder-decoder model, $y_t$ is the target-side token at position $t$, conditioned on the source sequence and previous target tokens.
In a decoder-only language model, the same idea appears as next-token prediction on a single stream; the loss target is the next token from the input sequence, often described informally as replacing $y_t$ by the appropriately shifted $x_t$.
In an encoder-only model, cross-entropy may be applied to selected positions or pooled representations, depending on the task, though the minibatch-gradient pattern is unchanged.
Once the loss is computed, training updates all parameters $\theta$ by differentiating through the entire computation graph: embeddings, attention projections, feed-forward layers, normalization parameters, and the vocabulary projection. In its simplest gradient-descent form, the update is
$\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta),$
where $\alpha$ is the learning rate. In practical Transformer training, this step is usually performed by Adam or AdamW rather than plain gradient descent, often with learning-rate warmup, weight decay, gradient clipping, mixed precision, and distributed data parallelism. But those engineering choices refine the same mathematical loop: minimize token-level negative log-likelihood over minibatches.
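The update rule can be illustrated on the last piece of the model alone. The sketch below applies plain gradient descent to a vocabulary projection while treating the final hidden states as fixed random inputs, which is a deliberate simplification: a real training step would backpropagate through every layer. It uses the row convention `X @ W` with `W` of shape `(d, V)`, averages the loss over positions rather than summing, and all sizes and the learning rate are arbitrary assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loss_and_grads(W, b, X, y):
    # X: (T, d) hidden states held fixed here; y: (T,) target token ids.
    T = len(y)
    P = softmax(X @ W + b)                        # (T, V) predicted distributions
    loss = -np.log(P[np.arange(T), y]).mean()     # mean cross-entropy
    G = P.copy()
    G[np.arange(T), y] -= 1.0                     # d loss / d logits = softmax - one_hot
    G /= T
    return loss, X.T @ G, G.sum(axis=0)           # loss, dL/dW, dL/db

rng = np.random.default_rng(0)
T, d, V = 8, 4, 6
X = rng.normal(size=(T, d))
y = rng.integers(0, V, size=T)
W, b = np.zeros((d, V)), np.zeros(V)
alpha = 0.1                                       # learning rate

losses = []
for _ in range(50):
    loss, dW, db = loss_and_grads(W, b, X, y)
    losses.append(loss)
    W -= alpha * dW                               # theta <- theta - alpha * gradient
    b -= alpha * db
```

Starting from zero weights the model is uniform over the vocabulary, so the first recorded loss is $\log V$, and the loss falls from there as the projection fits the targets.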
A useful way to view the whole algorithm is as a tension between parallelism during training and causality in the model definition. Teacher forcing exposes every target position at once, but causal masking ensures that the representation at position $t$ cannot depend on target tokens after $t$. Padding masks remove fake tokens introduced by batching. Cross-entropy then turns each valid target position into a supervised classification problem over the vocabulary.
The visual summary condenses this into the training loop you would actually implement: initialize parameters, sample a minibatch, run the relevant Transformer forward pass, compute logits, accumulate masked cross-entropy over non-padding targets, and update $\theta$. The highlighted lines emphasize the three mathematical operations that matter most: vocabulary projection, likelihood loss, and gradient update.
The callouts also mark the two assumptions that are easy to forget when reading pseudocode too quickly. First, the forward pass depends on which Transformer family is being trained—encoder, decoder-only, or encoder-decoder. Second, the loss is teacher-forced and padding-aware: all valid target positions contribute in parallel, while masked or padded positions do not.

23. Algorithm: Autoregressive Decoding

After training with cross-entropy, it is tempting to think of the Transformer decoder as producing an entire output sequence in one forward pass. During teacher forcing, that is almost true computationally: we feed the ground-truth prefix $y_{<t}$ at every position, apply the causal mask, and evaluate all next-token losses in parallel. But at test time the ground-truth prefix is gone. The model must condition on its own previous predictions, so generation becomes an explicitly sequential process.

The model defines a conditional distribution for the next token,

$p_\theta(y_t \mid \hat{y}_{<t}, x_{1:n}),$

where $x_{1:n}$ is the source/input sequence and $\hat{y}_{<t}$ is the prefix already generated. The hats matter: these are not gold tokens anymore. They are decisions made by the model at earlier decoding steps. Once the model chooses $\hat{y}_1$, that token becomes part of the context used to choose $\hat{y}_2$, and so on.

In greedy decoding, the decision rule is the simplest possible one: at each step, choose the most likely next token under the current model distribution,

$\hat{y}_t \leftarrow \arg\max_{y_t \in \mathcal{V}} p_\theta(y_t \mid \hat{y}_{<t}, x_{1:n}).$

This looks locally optimal, but it is not globally optimal in general. A token that is best at time $t$ may lead to a poor continuation later, while a slightly less likely token might open up a much better full sequence. Greedy decoding is therefore fast and deterministic, but it can be shortsighted.

The causal mask remains essential during decoding. Even though we generate left-to-right, each forward pass still computes attention over the current prefix positions. The mask enforces the autoregressive factorization: token $t$ may attend to $\hat{y}_{<t}$, but not to future tokens that have not yet been produced. Conceptually, decoding constructs the sequence according to

$p_\theta(\hat{y}_{1:T}\mid x_{1:n}) = \prod_{t=1}^{T} p_\theta(\hat{y}_t \mid \hat{y}_{<t}, x_{1:n}),$

until a maximum length is reached or a special end-of-sequence token is emitted.

A useful way to write greedy decoding is:

function GREEDY_DECODE(x_{1:n}, T)
    encode x_{1:n} if using encoder-decoder
    initialize \hat{y}_{<1} as the required start context

    for t = 1 to T do
        compute p_\theta(y_t | \hat{y}_{<t}, x_{1:n}) with the causal mask
        \hat{y}_t <- argmax_{y_t in V} p_\theta(y_t | \hat{y}_{<t}, x_{1:n})
        stop if an end token is produced
    end for

    return \hat{y}_{1:t}
end function
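As a rough runnable counterpart to the pseudocode, the sketch below replaces the Transformer with a hypothetical lookup table, `TOY_MODEL`, in which the next-token distribution depends only on the last token. The table, the token names, and the helper `next_token_probs` are assumptions made for illustration; a real decoder would run a masked forward pass at that point.

```python
BOS, EOS = "<bos>", "<eos>"

# Hypothetical toy model: next-token probabilities given the last token only.
TOY_MODEL = {
    BOS: {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.5, EOS: 0.4},
    "b": {"a": 0.2, "b": 0.1, EOS: 0.7},
}

def next_token_probs(prefix, source=None):
    """Stand-in for p_theta(y_t | y_hat_{<t}, x_{1:n}); ignores the source."""
    return TOY_MODEL[prefix[-1]]

def greedy_decode(source, max_len):
    """Pick the most likely token at each step until EOS or max_len."""
    prefix = [BOS]
    for _ in range(max_len):
        probs = next_token_probs(prefix, source)
        token = max(probs, key=probs.get)  # argmax over the vocabulary
        prefix.append(token)               # the choice becomes context
        if token == EOS:
            break
    return prefix[1:]

print(greedy_decode(None, max_len=10))  # ['a', 'b', '<eos>']
```

Under this toy table the greedy output has probability 0.6 × 0.5 × 0.7 = 0.21, while the sequence "b" followed by the end token has probability 0.4 × 0.7 = 0.28. The locally best first token therefore leads to a less likely full sequence, which is the shortsightedness described above.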

The key train-test contrast is that training parallelizes over positions, while decoding does not. During training, the model already has the full target sequence shifted right, so it can compute all conditional distributions $p_\theta(y_t \mid y_{<t}, x_{1:n})$ in one masked pass. During decoding, however, $\hat{y}_t$ must be chosen before $\hat{y}_{t+1}$ can even be conditioned on. This dependency chain is fundamental to autoregressive generation.

Beam search relaxes greedy decoding by retaining multiple candidate prefixes instead of only one. At time $t$, it keeps a beam $\mathcal{B}_t$ containing the top $K_{\mathrm{beam}}$ prefixes, usually ranked by cumulative log-probability:

$\sum_{u=1}^{t} \log p_\theta(\hat{y}_u \mid \hat{y}_{<u},x_{1:n}).$

When $K_{\mathrm{beam}}=1$, beam search reduces to greedy decoding. Larger beams explore more alternatives, often improving sequence quality, but they increase computation and memory. Beam search is still an approximation: it does not enumerate the exponentially large space of possible sequences, and it may favor short outputs unless length normalization or other scoring adjustments are used.
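A minimal beam search sketch follows, again using a hypothetical toy table (`TOY_MODEL`) in place of a real model and ranking hypotheses by cumulative log-probability without length normalization. The function name `beam_search` and the choice to carry finished hypotheses forward unchanged are illustrative assumptions, not the only reasonable design.

```python
import math

BOS, EOS = "<bos>", "<eos>"

# Hypothetical toy model: next-token probabilities given the last token only.
TOY_MODEL = {
    BOS: {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.5, EOS: 0.4},
    "b": {"a": 0.2, "b": 0.1, EOS: 0.7},
}

def beam_search(max_len, beam_size):
    """Return (tokens, cumulative log-probability) of the best hypothesis found."""
    beams = [([BOS], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == EOS:
                candidates.append((tokens, score))  # finished: carry forward
                continue
            for token, p in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(p)))
        # Keep the beam_size highest-scoring prefixes.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        if all(tokens[-1] == EOS for tokens, _ in beams):
            break
    tokens, score = beams[0]
    return tokens[1:], score

print(beam_search(max_len=10, beam_size=1)[0])  # ['a', 'b', '<eos>']
print(beam_search(max_len=10, beam_size=2)[0])  # ['b', '<eos>']
```

With a beam of one the result matches greedy decoding on this table. With a beam of two the search keeps the initially less likely prefix "b" alive long enough to find the higher-probability sequence. The winner is also the shorter output, a small instance of the length bias that unnormalized log-probability scoring can introduce.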

This decoding process also exposes a subtle failure mode: error accumulation. If the model makes an early mistake, all later predictions condition on that mistake. This differs from teacher-forced training, where the model is usually conditioned on correct prefixes. The mismatch is one reason decoding behavior can be worse than validation cross-entropy alone might suggest.

The visual below condenses the algorithmic structure: initialize a prefix, repeatedly compute the next-token distribution under the causal mask, select a token, append it, and stop on an end token or length limit. The highlighted assignment line is the decisive greedy step—the place where a full probability distribution collapses into one chosen symbol.

It also emphasizes the broader contrast: beam search changes only the number of retained prefixes, not the left-to-right nature of decoding. Whether we keep one prefix or $K_{\mathrm{beam}}$ prefixes, generation proceeds token by token because each new prediction becomes part of the next conditioning context.

24. Complexity and Path Length

After seeing autoregressive decoding, it is tempting to think of Transformer cost mainly in terms of generation: one token at a time, with a growing key-value cache. But the deeper architectural trade-off is already present inside each layer. A Transformer layer gives every position a direct communication channel to every other position. That is the source of its remarkable parallelism and short dependency paths—but it is also exactly where the quadratic cost enters.
Recall the core operation:
$\operatorname{Attention}(Q,K,V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$
For a sequence of length $n$, the query and key matrices contain one vector per position, so the score matrix satisfies
$QK^\top \in \mathbb{R}^{n\times n}.$
This matrix is the all-pairs comparison table: position $i$ scores position $j$ for every pair $(i,j)$. In self-attention, the model is not restricted to neighboring tokens or to information carried through a recurrent hidden state. It can ask, in one layer, “which positions in the entire sequence are relevant to this position?”
That global access gives self-attention a constant path length between positions. If token $x_i$ needs information from token $x_j$, then one attention layer can create a direct edge from $j$ to $i$. In graph terms, the self-attention layer behaves like a dense directed graph over sequence positions. The number of computational layers required for information to travel from one token to another is therefore
$L = \mathcal{O}(1).$
This is a major contrast with recurrence. In a left-to-right recurrent model, information from an early token must be repeatedly compressed and passed through hidden states before reaching a later token. Even if each recurrent update is powerful, the dependency path between distant positions grows with sequence length:
$L = \mathcal{O}(n).$
That long path creates two related problems. First, optimization becomes harder because gradients and information must survive many transformations. Second, training is less parallel across time, since hidden state $h_t$ depends on $h_{t-1}$. Recurrence has attractive linear sequence scaling, but it pays for that with sequential computation and long communication paths.
Convolution sits somewhere in between. A local convolution can process all positions in parallel, which is good for hardware utilization, but each layer only mixes information within a fixed neighborhood. To connect distant tokens, we must stack many layers, use dilation, increase kernel width, or combine these strategies. Thus the effective path length grows with the number of layers needed to cover the distance. Locality is computationally efficient, but global communication is not immediate.
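To put rough numbers on the depth needed for long-range communication, the sketch below counts stacked layers under simplifying assumptions: one-sided (causal) 1-D convolutions, a fixed kernel size, and, in the dilated case, a dilation that doubles at every layer. The function name `layers_to_cover` and these particular choices are illustrative only.

```python
def layers_to_cover(distance, kernel_size=3, dilated=False):
    """Count stacked causal conv layers needed before a position can see a
    token `distance` steps earlier.

    Each layer widens the receptive field by (kernel_size - 1) * dilation,
    where dilation is 1 (plain) or 2**layer_index (assumed doubling schedule).
    """
    receptive_field, layers = 1, 0
    while receptive_field < distance + 1:
        dilation = 2 ** layers if dilated else 1
        receptive_field += (kernel_size - 1) * dilation
        layers += 1
    return layers

print(layers_to_cover(1024))                # 512 plain layers
print(layers_to_cover(1024, dilated=True))  # 10 dilated layers
```

Under these assumptions the plain stack grows linearly with distance and the dilated stack grows logarithmically, whereas a single self-attention layer connects the two positions directly.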
The Transformer chooses the opposite bargain. Full self-attention spends compute and memory to make global communication cheap in depth. Computing the attention scores and applying them to values gives the familiar per-layer scaling
$\text{self-attention compute} = \mathcal{O}(n^2 d_{\mathrm{model}}), \qquad \text{attention memory} = \mathcal{O}(n^2).$
The $\mathcal{O}(n^2)$ term comes from storing or materializing the attention matrix $A$, whose entries correspond to pairwise token interactions. The $d_{\mathrm{model}}$ factor appears because those interactions are used to combine vector-valued representations. Exact constants depend on the number of heads, projections, implementation details, and whether intermediate attention matrices are materialized, but the core asymptotic point remains: dense attention scales quadratically in sequence length.
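The $n \times n$ score matrix is easy to see in a minimal pure-Python sketch of dense attention for a single head. This is a didactic version on nested lists, not an efficient implementation; the helper names `softmax` and `attention` and the tiny input `X` are assumptions for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def attention(Q, K, V):
    """Dense scaled dot-product attention on lists of vectors.

    Returns the outputs and the n x n matrix of attention weights.
    """
    d_k = len(K[0])
    # One score per (query, key) pair: this is the quadratic part.
    scores = [[sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d_k)
               for k_j in K] for q_i in Q]
    weights = [softmax(row) for row in scores]
    out = [[sum(w * v_j[c] for w, v_j in zip(row, V))
            for c in range(len(V[0]))] for row in weights]
    return out, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # n = 4 positions, d = 2
out, weights = attention(X, X, X)
print(len(weights), len(weights[0]))  # 4 4
```

Every row of `weights` is a distribution over all four positions, so both the number of stored entries and the number of dot products grow with the square of the sequence length.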
By comparison, a recurrent layer is often summarized as
$\text{recurrence compute} = \mathcal{O}(n d_{\mathrm{model}}^2), \qquad L=\mathcal{O}(n),$
because each of the $n$ steps applies a transformation to a $d_{\mathrm{model}}$-dimensional state. Local convolution has similar linear dependence on $n$ per layer, assuming fixed kernel size, but may require many stacked layers for long-range interaction. So the relevant comparison is not merely “which is cheaper?” but rather:
Self-attention: expensive all-pairs interaction, excellent parallelism, constant path length.
Recurrence: linear sequence cost, limited parallelism, long path length.
Local convolution: parallel and local, but long-range communication requires depth.
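The two leading-order cost expressions can be compared numerically with a small sketch. The functions below keep only the dominant terms and drop all constants, so they are order-of-magnitude guides under that assumption rather than FLOP counts for any real implementation; the function names and the value `d_model = 512` are chosen for illustration.

```python
def attention_cost(n, d_model):
    """Leading-order cost of dense self-attention: n^2 * d_model."""
    return n * n * d_model

def recurrence_cost(n, d_model):
    """Leading-order cost of a recurrent layer: n * d_model^2."""
    return n * d_model * d_model

d_model = 512
for n in (128, 512, 4096):
    ratio = attention_cost(n, d_model) / recurrence_cost(n, d_model)
    print(f"n={n:5d}  attention/recurrence = {ratio:.2f}")
```

With constants ignored, the ratio is simply $n / d_{\mathrm{model}}$: attention is the cheaper term when the sequence is shorter than the model width and the more expensive one when it is longer. Doubling $n$ quadruples the attention term but only doubles the recurrent term.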
This trade-off is one of the central reasons Transformers became so effective. They are not efficient because they avoid expensive operations; they are effective because they spend computation in a way that modern accelerators can exploit. A dense $n\times n$ attention matrix is costly, but it is also highly parallelizable. During training, all token representations in a layer can be computed simultaneously, unlike recurrent models that must advance through time.
The visual below condenses this argument into a comparison table. The highlighted self-attention row emphasizes both sides of the bargain: the orange $\mathcal{O}(n^2)$ terms mark the quadratic cost, while the green $\mathcal{O}(1)$ path length marks the architectural benefit. The small equation callout ties the cost directly to $QK^\top$, reminding us that the all-pairs score matrix is not an implementation accident; it is the defining mechanism of full attention.
The accompanying icons reinforce the same intuition geometrically. Self-attention resembles a fully connected graph over positions; recurrence resembles a chain; local convolution resembles stacked short-range windows. The key takeaway is therefore compact but important: Transformers buy parallel, global communication by paying quadratic scaling in sequence length.

25. Empirical Anchor: Original Transformer on Machine Translation

The previous discussion gave us a clean theoretical prediction: if every token can communicate with every other token in one self-attention layer, then the maximum information path length between positions becomes constant,
$L=\mathcal{O}(1),$
instead of growing linearly as in recurrence,
$L=\mathcal{O}(n),$
or logarithmically under stacked convolutions with expanding receptive fields,
$L=\mathcal{O}(\log n).$
That argument is elegant, but by itself it is not enough. A shorter path length is only useful if the model can exploit it in a real learning problem, under realistic optimization and hardware constraints. The original Transformer paper mattered because it turned this architectural claim into an empirical result: on large-scale machine translation, replacing recurrence and convolution with attention was not merely conceptually simpler—it was faster to train and more accurate.
The canonical benchmark was WMT 2014 machine translation, especially English-to-German and English-to-French. These tasks were a natural testbed for sequence-to-sequence models because they require both local phrase modeling and long-range dependency handling: agreement, word reordering, dropped pronouns, clause structure, and context-dependent lexical choices. The Transformer was evaluated as an encoder-decoder model: the encoder builds contextual representations of the source sentence, the decoder generates the target sentence autoregressively, and cross-attention lets each target-side position retrieve relevant source-side information.
Training used the now-standard maximum-likelihood setup with teacher forcing. Given a source sentence $x$ and a target sequence $y_1,\dots,y_T$, the model is trained to maximize
$\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x),$
where the decoder receives the true previous target tokens during training. This is important because the Transformer’s reported advantage was not based on an exotic objective or a different task formulation. It was tested in the same basic supervised translation regime as the recurrent and convolutional systems it replaced.
The main reported quality metric was BLEU, a corpus-level measure based on modified $n$-gram precision with a brevity penalty. BLEU is imperfect: it rewards surface overlap with reference translations and can miss semantic equivalence, discourse quality, or stylistic appropriateness. But as a historical comparison point for WMT machine translation systems, it was the standard scoreboard. So when Transformer-base reached
$\mathrm{BLEU}=27.3$
on WMT14 English-German, and Transformer-big reached
$\mathrm{BLEU}=28.4,$
those numbers were meaningful because they exceeded strong recurrent and convolutional baselines on the same benchmark family.
The efficiency result was just as important as the accuracy result. The Transformer-base model achieved its English-German score with a reported training cost of about $3.3\times 10^{18}$ FLOPs, substantially below many competitive recurrent encoder-decoder systems with attention. Transformer-big used more compute (about $2.3\times 10^{19}$ FLOPs) but still remained competitive with or better than much more sequential alternatives, and also achieved
$\mathrm{BLEU}=41.8$
on WMT14 English-French.
The subtle point is that this was not simply “a bigger model wins.” In fact, Transformer-base was not the largest or most expensive system in the comparison. Its advantage came from changing the computational geometry of the sequence model. Recurrence has a strong inductive bias for ordered processing, but it also imposes a hard sequential bottleneck: hidden state $h_t$ depends on $h_{t-1}$, which depends on $h_{t-2}$, and so on. This makes both optimization and hardware utilization harder for long sequences. Convolutions improve parallelism, but distant positions still require multiple layers to interact unless the convolutional kernel is very wide or dilated.
Self-attention makes a different trade-off. Each layer pays a quadratic pairwise interaction cost in sequence length, but in exchange it allows content-dependent global communication immediately. A source token near the beginning of a sentence can influence a token near the end through a single attention operation, not through a chain of recurrent updates or a tower of convolutional neighborhoods. That constant path length is not just a theoretical convenience; it changes how gradients, alignments, and contextual evidence move through the network.
There are still caveats. The original translation setting used sequence lengths where quadratic attention was affordable, and the comparison depends on implementation details, hardware, batching, and the exact baselines chosen. BLEU also does not fully capture translation quality. But the empirical lesson survived these caveats: the Transformer converted a structural advantage—parallel global communication—into a practical training advantage. It improved quality while reducing sequential dependence, which is precisely what the path-length analysis suggested should happen.
The visual below condenses this empirical anchor into a compact comparison. The recurrent and convolutional rows represent the pre-Transformer alternatives: respectively stronger sequential dependence with $L=\mathcal{O}(n)$, and improved but still multi-hop communication with $L=\mathcal{O}(\log n)$. The Transformer rows highlight the key outcome: higher BLEU scores paired with $L=\mathcal{O}(1)$, showing that constant-path self-attention was not just an architectural novelty but a measurable advantage on a demanding benchmark.
Read the table less as a leaderboard and more as evidence for the trade-off we have been building toward. The important pattern is the alignment between shorter communication paths, greater parallelism, and better translation accuracy per reported training cost. This is why the original machine translation result became the empirical anchor for the Transformer architecture.

26. Worked Example: What Does One Attention Head Compute?

After seeing that the original Transformer worked surprisingly well in machine translation, it is worth slowing down and asking what one of its smallest moving parts is actually doing. The full model is large, multi-layered, and multi-headed, but a single attention head has a very concrete interpretation: it performs a content-based lookup. Given a token position, it asks, “Which other positions contain information useful for updating this representation?” Then it returns a weighted mixture of those positions’ value vectors.
Consider the toy sentence:
“The animal did not cross because it was tired”
Focus on position $i=7$, the token “it”. In a language understanding setting, a useful head might learn that “it” refers back to “animal”. Importantly, the model is not explicitly given a symbolic coreference rule. Instead, the head computes this relationship through learned vector projections. The hidden state at each position is projected into a query, key, and value:
the query $q_7$ represents what position 7 is looking for;
each key $k_j$ represents what position $j$ offers as retrievable content;
each value $v_j$ is the information that will be copied, blended, or routed forward if position $j$ is attended to.
For the query position $7$, the head scores every candidate position $j$ using a scaled dot product:
$s_{7j}=\frac{q_7^\top k_j}{\sqrt{d_k}}.$
The dot product $q_7^\top k_j$ is a compatibility score: it is large when the query and key point in similar directions in the learned representation space. The division by $\sqrt{d_k}$ is not cosmetic. If the components of queries and keys are independent with zero mean and roughly unit variance, then the variance of an unscaled dot product grows in proportion to $d_k$. Large raw scores can push the softmax into saturation, where gradients become extremely small early in training. Scaling keeps the logits in a numerically and statistically healthier range.
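The variance claim can be checked with a short derivation. It rests on an idealized assumption that is not part of the model itself: all components $q_c$ and $k_c$ of the query and key are mutually independent with mean $0$ and variance $1$.

$q^\top k = \sum_{c=1}^{d_k} q_c k_c, \qquad \mathbb{E}[q_c k_c] = \mathbb{E}[q_c]\,\mathbb{E}[k_c] = 0, \qquad \operatorname{Var}(q_c k_c) = \mathbb{E}[q_c^2]\,\mathbb{E}[k_c^2] = 1,$

$\operatorname{Var}(q^\top k) = \sum_{c=1}^{d_k} \operatorname{Var}(q_c k_c) = d_k, \qquad \operatorname{Var}\!\left(\frac{q^\top k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1.$

So under this assumption the raw score has standard deviation $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to unit scale regardless of the head dimension.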
Suppose this attention head has learned a roughly coreference-like pattern. For the query token “it”, the score against “animal” might be high, while scores against nearby but less relevant words are lower:
$\text{score}(\text{animal}) = 2.8,\qquad \text{score}(\text{tired}) = 0.4,\qquad \text{score}(\text{cross}) = 0.1.$
These scores are not yet attention weights. They are unnormalized retrieval logits. The head converts them into a probability distribution over positions using a row-wise softmax:
$a_{7j}=\operatorname{softmax}_j\!\left(s_{7,1},\dots,s_{7,9}\right)=\frac{\exp(s_{7j})}{\sum_{j'=1}^{9}\exp(s_{7j'})}.$
After normalization, the largest score receives most of the mass. In this illustrative example, the token “animal” might receive weight $a_{7,2}\approx 0.73$, while “tired” and “cross” receive much smaller weights, say $a_{7,9}\approx 0.07$ and $a_{7,5}\approx 0.05$. The output of the head at position 7 is then the weighted sum of value vectors:
$z_7=\sum_{j=1}^{9} a_{7j}v_j \approx 0.73v_2 + \text{small mixtures of other } v_j.$
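The illustrative weights can be reproduced with a few lines of arithmetic. The scores for “animal”, “tired”, and “cross” come from the example; the text does not give scores for the other six positions, so the value $-0.55$ used for them below is purely an assumption, chosen so that the normalized weights match the quoted numbers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

tokens = ["The", "animal", "did", "not", "cross", "because", "it", "was", "tired"]
# 2.8 (animal), 0.1 (cross), 0.4 (tired) are from the example;
# -0.55 for the remaining positions is an assumed filler value.
scores = [-0.55, 2.8, -0.55, -0.55, 0.1, -0.55, -0.55, -0.55, 0.4]

weights = softmax(scores)
for token, w in zip(tokens, weights):
    print(f"{token:>8s}  {w:.2f}")
```

Under that assumption, “animal” receives about 0.73, “tired” about 0.07, and “cross” about 0.05, with the remaining mass spread thinly over the other positions. The head output is then the same weighted sum applied to the value vectors.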
This is the key operational idea: the representation of “it” is updated by directly mixing in information from “animal”. Unlike an RNN, this path does not require information to be carried step-by-step through the intervening tokens. Unlike a fixed-width convolution, it does not require many stacked layers to connect distant positions. A single attention head can create a direct, data-dependent communication channel between any two positions in the sequence.
There are two subtle points worth keeping in mind. First, the attention weights are content-dependent, not fixed by distance or position alone. A different sentence containing the same word “it” could produce a very different attention pattern. Second, attention weights are not always a faithful human explanation of what the model “believes.” One head may look interpretable, another may spread mass broadly, and another may implement a feature routing pattern that has no simple linguistic label. The computation is still meaningful, but the meaning lives in the learned representation space, not necessarily in our preferred grammatical categories.
The visual below compresses this worked example into the mechanics of one head. The query token “it” sends compatibility scores to all keys, the softmax turns those scores into retrieval weights, and the resulting output vector is dominated by the value at “animal”. The thick arrow represents the high-weight content path; the thinner arrows remind us that attention is usually a mixture, not a hard pointer.
The accompanying bar chart is a useful sanity check: attention for position 7 is a distribution over source positions. If one bar dominates, the head behaves almost like a soft lookup. If the bars are flatter, the head is aggregating broader context. In either case, the computation is the same: score keys against a query, normalize the scores, and return a weighted mixture of values.

27. Limitations and Common Failure Modes

After seeing a concrete attention head in action, it is tempting to think of attention as a nearly ideal communication primitive: every token can look directly at every other token, choose what matters, and aggregate the relevant information in one differentiable step. That intuition is mostly right—and it is exactly why Transformers became so dominant. But the same mechanism that gives self-attention its strength also creates several recurring limitations. The model has global content-based access, not free reasoning, not unlimited memory, and not a guarantee of faithful explanations.
The most basic trade-off is computational. In full self-attention, each of the $n$ tokens forms a query and compares it against all $n$ keys. This produces an $n \times n$ attention score matrix. Even before thinking about values or feed-forward layers, the model has committed to representing all pairwise token-token interactions. Thus attention memory scales as
$\mathcal{O}(n^2),$
and the main attention computation scales roughly as
$\mathcal{O}(n^2 d_{\mathrm{model}}).$
This quadratic dependence is not a small implementation detail; it is a structural property of dense attention. Doubling the sequence length roughly quadruples the number of pairwise scores. For short and medium contexts, this cost is often worth paying because global communication is extremely expressive. For very long contexts, however, the attention matrix becomes a bottleneck in memory, compute, latency, and training batch size. Many efficient-attention variants can be understood as different compromises: sparsify the pairs, compress the memory, chunk the sequence, use recurrence-like state, or approximate the attention kernel. Each saves something, but usually gives up exact dense global access.
A second failure mode is subtler: masking enforces information flow constraints, not correctness. In decoder-only models and in autoregressive decoding generally, the causal mask $M_{\mathrm{causal}}$ prevents position $t$ from attending to future positions $>t$. This is essential. Without it, the model could leak information from the target future during training and learn a distribution that cannot be used honestly at generation time. But a causal mask only says, “do not look ahead.” It does not say, “remain globally consistent,” “do not contradict yourself,” or “plan the entire answer before producing the first token.”
This distinction matters because autoregressive generation is sequentially conditioned on the model’s own previous outputs. During teacher-forced training, the model commonly learns next-token prediction under ground-truth prefixes $y_{<t}$. At inference time, the prefix is instead made of sampled or selected predictions $\hat{y}_{<t}$. These are not the same conditioning events:
$p_\theta(y_t \mid \hat{y}_{<t},x_{1:n}) \neq p_\theta(y_t \mid y_{<t},x_{1:n}).$
A small early error can shift the model into a region of prefix space that was less common during training. Later predictions then condition on that altered history, so mistakes can compound. This is one reason decoding strategy matters: greedy decoding, beam search, sampling temperature, nucleus sampling, length penalties, and reranking all shape how the model moves through its own distribution. None of them removes the underlying mismatch completely.
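Two of the decoding controls named above, sampling temperature and nucleus (top-p) truncation, can be sketched as simple transformations of the next-token distribution. The function names and example numbers are illustrative assumptions; production decoders typically apply the same ideas to tensors of logits.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax of logits / temperature; lower values sharpen the distribution."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def nucleus_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches
    top_p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

probs = [0.5, 0.3, 0.15, 0.05]
print(nucleus_filter(probs, top_p=0.9))        # last token is removed
print(apply_temperature([2.0, 1.0, 0.0], 0.5)) # sharper than temperature 1.0
```

Both controls reshape which prefixes the model is likely to visit. Neither changes the fact that later steps condition on whatever was actually selected.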
Position information introduces another important caveat. Self-attention by itself is permutation-equivariant: without positional signals, the model has no inherent notion that token 3 came before token 17. Positional encodings or position-dependent biases break this symmetry in useful ways, allowing the network to represent order, distance, locality, and sequence structure. But those position mechanisms are learned or designed under some training distribution. If the model is evaluated on much longer sequences, unfamiliar spacing patterns, or tasks requiring sharper extrapolation than training demanded, the positional representation $p_i$ may not generalize reliably.
This is not merely about “absolute versus relative” positions in a superficial sense. The deeper issue is whether the model has learned a rule that extrapolates, or only an interpolation pattern over observed contexts. For example:
Absolute learned embeddings can be brittle beyond trained indices.
Sinusoidal encodings provide a deterministic continuation, but the learned network still must use them extrapolatively.
Relative or rotary schemes often improve length behavior, yet can still fail when the test regime differs sharply from training.
So positional design helps, but it does not magically grant algorithmic length generalization.
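The phrase “deterministic continuation” can be illustrated with the sinusoidal scheme from the original Transformer, in which even dimensions use a sine and odd dimensions a cosine of the position divided by $10000^{2i/d_{\mathrm{model}}}$. The sketch below is a minimal pure-Python version; the function name and the example positions are assumptions for illustration.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sinusoidal encoding for a single position.

    Dimension pair (i, i + 1) uses sin and cos of position / 10000**(i / d_model),
    with i running over the even indices 0, 2, 4, ...
    """
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

print(sinusoidal_encoding(0, 4))            # [0.0, 1.0, 0.0, 1.0]
far = sinusoidal_encoding(100_000, 8)       # a position far beyond typical training lengths
print(all(-1.0 <= v <= 1.0 for v in far))   # True
```

The formula yields a well-defined, bounded vector for any position, including ones never seen in training. What it cannot guarantee is that the learned attention and feed-forward weights respond sensibly to those unseen vectors, which is the extrapolation gap described above.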
Finally, attention weights invite an interpretability trap. A large attention coefficient $a_{ij}$ tells us that, in that layer and head, token $i$ placed substantial weight on token $j$’s value vector. That can be a useful diagnostic. It may reveal copying behavior, syntactic alignment, retrieval from a prompt, or dependence on a particular context span. But it is not, by itself, a causal explanation of the final output. The value vector may encode many features; later residual streams, MLPs, layer normalizations, and other heads can transform or override the contribution; and changing the attention weight alone may not produce the intuitive change we expect.
A good way to summarize these limitations is to separate what Transformers guarantee structurally from what they merely encourage statistically. Dense attention guarantees direct pairwise communication, but at quadratic cost. A causal mask guarantees no future-token leakage, but not globally coherent generation. Positional encodings provide order information, but not necessarily robust extrapolation. Autoregressive training gives a powerful conditional model, but inference conditions on the model’s own imperfect history. Attention maps expose part of the computation, but not a complete causal proof.
The visual below condenses these points into a comparison of common failure modes. The left side groups the main engineering and modeling limitations: the $n \times n$ attention grid for quadratic scaling, the causal triangle for masking, the fading position ruler for extrapolation risk, and the prediction chain where an early error propagates forward. The central equation highlights the training-inference conditioning mismatch that drives autoregressive error accumulation.
The attention heatmap callout is intentionally separated from the others because it is not primarily a performance failure; it is an interpretation failure. A highlighted cell with large $a_{ij}$ may be evidence worth investigating, but the warning is that attention is a diagnostic signal, not a complete explanation. Together, these caveats set up the final unifying summary: Transformers are extraordinarily flexible sequence models, but their guarantees come from precise architectural constraints, and their weaknesses appear exactly where those constraints stop.

28. Unifying Summary: Transformer Forms and Equations

After looking at the limitations and failure modes, it is worth ending by stepping back from the many names Transformer models have acquired. “BERT-style,” “GPT-style,” “T5-style,” encoder-only, decoder-only, encoder-decoder: these are not fundamentally different mathematical species. They are different ways of wiring the same core operation, imposing different visibility constraints, and training against different probability factorizations.
The shared core is still scaled dot-product attention:
$\operatorname{Attention}(Q,K,V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}+M\right)V.$
This equation is the common language of the architecture. Queries $Q$ ask what information is needed, keys $K$ advertise what each position contains, and values $V$ carry the information to be mixed. The dot product $QK^\top$ implements content-based retrieval; the scale factor $1/\sqrt{d_k}$ keeps logits from growing too large as the key dimension increases; and the additive mask $M$ determines which token-to-token communications are legal.
That last term, $M$, is deceptively important. Much of the difference between Transformer variants comes not from changing attention itself, but from changing who is allowed to attend to whom. A padding mask prevents the model from treating artificial padding tokens as real content. A causal mask prevents position $t$ from looking at positions $>t$, preserving autoregressive generation. Cross-attention uses queries from one sequence and keys/values from another, allowing a decoder to retrieve information from an encoded source sequence.
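The role of the additive mask can be sketched directly. The code below builds a causal mask and a padding mask as matrices of $0$ and $-\infty$, adds them to a score matrix, and applies a row-wise softmax. The function names, and the assumption that padding sits at the end of the sequence so that no row is fully masked, are illustrative choices.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """Additive mask: 0 where j <= i, -inf where j > i (future positions)."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def padding_mask(is_pad):
    """Additive mask that hides every key position flagged as padding."""
    n = len(is_pad)
    return [[NEG_INF if is_pad[j] else 0.0 for j in range(n)] for _ in range(n)]

def combine(a, b):
    """Element-wise sum of two additive masks."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def masked_softmax(scores, mask):
    """Row-wise softmax of scores + mask; assumes no row is fully masked."""
    out = []
    for s_row, m_row in zip(scores, mask):
        logits = [s + m for s, m in zip(s_row, m_row)]
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]  # exp(-inf) is 0.0
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

scores = [[0.0] * 4 for _ in range(4)]  # uniform scores, for clarity
causal = masked_softmax(scores, causal_mask(4))
both = masked_softmax(scores, combine(causal_mask(4), padding_mask([False, False, False, True])))
print(causal[2])  # weight only on positions 0..2
print(both[3])    # the padded key at position 3 receives zero weight
```

Swapping the mask while leaving the rest of the computation untouched is what distinguishes the bidirectional, causal, and padded variants of the same attention equation.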
Around this attention core, the standard Transformer block adds the same supporting machinery again and again: multi-head attention, a position-wise feed-forward network, residual connections, layer normalization, and some form of position information. Multi-head attention lets different subspaces implement different retrieval patterns. The feed-forward network transforms each token representation locally after communication. Residual paths stabilize optimization and preserve information across depth. Normalization controls activation scale and makes very deep stacks trainable. Positional encodings or embeddings break the permutation symmetry that pure self-attention would otherwise have.
So the most useful distinction among Transformer families is probabilistic rather than architectural.
An encoder-only Transformer uses bidirectional self-attention. Each token can attend to tokens on both its left and right, subject only to padding constraints. This makes it natural for classification, tagging, retrieval, and masked-token-style objectives where the model is allowed to build a contextual representation of the entire input. It does not directly define a left-to-right generative factorization unless one is added through the task design.
A decoder-only Transformer uses causal self-attention. Token $x_t$ may depend only on $x_{<t}$, so the model defines an autoregressive distribution
$p_\theta(x_{1:n}) = \prod_{t=1}^{n} p_\theta(x_t \mid x_{<t}).$
The usual maximum-likelihood training objective is
$\mathcal{L}(\theta) = -\sum_{x_{1:n}\in\mathcal{D}} \sum_t \log p_\theta(x_t\mid x_{<t}).$
This factorization is what makes decoder-only models natural language generators: at inference time, we repeatedly sample or select the next token, append it to the context, and run the same conditional distribution again.
An encoder-decoder Transformer combines both patterns. The encoder reads the full source sequence $x_{1:n}$ bidirectionally. The decoder generates the target sequence $y_{1:m}$ causally, while also using cross-attention to retrieve source-side information from the encoder. Its probability model is

$$
p_\theta(y_{1:m} \mid x_{1:n}) = \prod_{t=1}^{m} p_\theta(y_t \mid y_{<t}, x_{1:n}),
$$
with objective
$$
\mathcal{L}(\theta) = -\sum_{(x_{1:n},y_{1:m})\in\mathcal{D}} \sum_{t=1}^{m} \log p_\theta(y_t \mid y_{<t}, x_{1:n}).
$$
This is the classical sequence-to-sequence setting: translation, summarization, speech recognition, structured generation, and any task where an output sequence is generated conditionally on an input sequence.
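To make the three visibility patterns of an encoder-decoder concrete, here is a rough sketch of how the attention calls could be wired together. It is deliberately stripped down and rests on several assumptions: identity projections, a single head, no feed-forward layers, residuals, or normalization, and made-up state vectors. The `attention` helper and its boolean `visible` argument are illustrative, not a library API.

```python
import math

def attention(Q, K, V, visible):
    """softmax(Q K^T / sqrt(d)) V over the pairs where visible[i][j] is True."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        idx = [j for j in range(len(K)) if visible[i][j]]
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d) for j in idx]
        top = max(scores)
        e = [math.exp(s - top) for s in scores]
        total = sum(e)
        out.append([sum(w / total * V[j][c] for w, j in zip(e, idx))
                    for c in range(len(V[0]))])
    return out

n, m = 3, 2
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n source-side states
target = [[0.5, 0.5], [1.0, -1.0]]             # m target-side states

full_nn = [[True] * n for _ in range(n)]                     # encoder: bidirectional
causal_mm = [[j <= i for j in range(m)] for i in range(m)]   # decoder: causal
full_mn = [[True] * n for _ in range(m)]                     # cross: target -> source

memory = attention(source, source, source, full_nn)   # encoder self-attention
dec = attention(target, target, target, causal_mm)    # decoder self-attention
ctx = attention(dec, memory, memory, full_mn)         # cross-attention
```

Cross-attention returns one vector per target position (`ctx` has `m` rows), each a weighted average of source-side states, while the causal decoder mask leaves the first target position able to see only itself.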
The key takeaway is that architecture, mask, and objective must agree. If a model is trained with a causal mask, it can be used autoregressively without leaking future information. If a model has bidirectional attention, it can produce rich contextual embeddings, but it cannot be naively sampled left-to-right as though it had learned $p(x_t \mid x_{<t})$. If a model uses cross-attention, it explicitly separates source representation from target generation. Many practical failures come from confusing these regimes: using the wrong mask, training with one visibility pattern and decoding with another, or assuming that all Transformer outputs correspond to the same kind of probability distribution.
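One way to test for this kind of mismatch is a perturbation check: change a later token and see whether earlier outputs move. The sketch below assumes a toy single-head self-attention with identity projections; `self_attention` and its `causal` flag are illustrative names, not a library API.

```python
import math

def self_attention(X, causal):
    """Identity-projection self-attention; causal=True hides keys j > i."""
    d = len(X[0])
    out = []
    for i, q in enumerate(X):
        keys = X[: i + 1] if causal else X
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)
        e = [math.exp(s - top) for s in scores]
        total = sum(e)
        out.append([sum(w / total * k[c] for w, k in zip(e, keys)) for c in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X_edit = [row[:] for row in X]
X_edit[2] = [-3.0, 2.0]  # change only the last (future) token

# Compare the outputs at the first two positions before and after the edit.
causal_same = self_attention(X, True)[:2] == self_attention(X_edit, True)[:2]
bidir_same = self_attention(X, False)[:2] == self_attention(X_edit, False)[:2]
```

Under the causal mask the earlier outputs are unchanged (`causal_same` is true), while under bidirectional attention they shift (`bidir_same` is false), which is exactly the leak that makes naive left-to-right sampling from a bidirectional model invalid.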
A compact way to remember the whole lecture is:
Same attention equation
Same stackable block template
Different masks
Different probability factorizations
Different maximum-likelihood objectives
The visual summary below consolidates this unification. The attention equation sits at the top because it is the invariant mechanism. Beneath it, the three major Transformer forms differ mainly in their attention pattern, mask, modeled distribution, and loss. Read the table horizontally: each row is a coherent contract between visibility, probability, and training.
The footer slogan is also the right mental model to leave with: Transformers are differentiable content-addressable communication layers plus position information, stacked deeply and trained by maximum likelihood. Once that is clear, the apparent diversity of Transformer architectures becomes much easier to organize.