LoRA Fine-Tuning: Low-Rank Adaptation of Large Neural Networks - FeynmanWiki

LoRA Fine-Tuning: Low-Rank Adaptation of Large Neural Networks

1. The Fine-Tuning Bottleneck

Before we get to LoRA itself, it is worth slowing down on the pressure point that made methods like LoRA necessary. Modern pretrained models are powerful precisely because they concentrate enormous general-purpose knowledge into a parameter vector $\theta_0$. But that same scale turns the most obvious adaptation strategy (just fine-tune everything) into a systems problem as much as a learning problem.
Suppose we start with a pretrained model $f_{\theta_0}$ and a supervised downstream dataset
$$\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{n}.$$
For a task such as sentiment classification, medical extraction, instruction following, or customer-specific generation, adaptation means changing something about the model so that its predictions $\hat{y}_i$ better match the desired outputs $y_i$. Abstractly, we choose some trainable parameters $\phi$ and minimize the empirical loss
$$\mathcal{L}(\phi) = \frac{1}{n}\sum_{i=1}^{n}\ell(\hat{y}_i,y_i), \qquad \hat{y}_i=f_{\theta}(x_i).$$
The important detail is that $\phi$ does not have to be the entire model. It simply denotes whatever parameters we decide are allowed to move during adaptation. In full fine-tuning, we make the maximal choice: initialize $\theta=\theta_0$, set $\phi=\theta$, and optimize every parameter of the pretrained model.
This is a very strong baseline. Full fine-tuning gives the optimizer direct access to every layer, every attention projection, every MLP weight, every embedding table, and every normalization parameter. If the downstream task needs a subtle redistribution of features throughout the network, full fine-tuning can in principle make those changes. That flexibility is one reason it often performs well when we have enough data, compute, and careful regularization.
The bottleneck is that flexibility is expensive. A model with billions of parameters is not just expensive to pretrain; it is expensive to repeatedly copy, optimize, store, serve, version, and audit. During training, full fine-tuning typically requires optimizer state for every trainable parameter. With Adam-like optimizers, that can mean storing the parameter, its gradient, and multiple moment estimates. The memory footprint can become several times larger than the model weights alone.
The deployment cost is even more direct. If we fine-tune a separate full model for each task or customer, then each adaptation produces another full-sized parameter vector:
$$\theta^{(1)},\theta^{(2)},\ldots,\theta^{(T)}.$$
Even if each $\theta^{(t)}$ differs only slightly from the original $\theta_0$, naive full fine-tuning stores every adapted copy independently. For a 7B, 13B, or 70B parameter model, this quickly becomes impractical. The problem is not merely training one model once; the problem is supporting many adapted models.
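To make the scale concrete, the sketch below estimates raw checkpoint storage when every task keeps its own full copy. The figure of 2 bytes per parameter (16-bit weights), the task count of 100, and the helper name `full_copy_storage_gb` are illustrative assumptions, not values taken from the text.

```python
# Rough storage estimate for T independently stored full checkpoints.
# Assumption: 2 bytes per parameter (16-bit weights); real checkpoints vary.
BYTES_PER_PARAM = 2
GB = 10**9  # decimal gigabytes


def full_copy_storage_gb(num_params: int, num_tasks: int) -> float:
    """Storage in GB when each task stores a complete copy of the model."""
    return num_params * BYTES_PER_PARAM * num_tasks / GB


MODEL_SIZES = {"7B": 7 * 10**9, "13B": 13 * 10**9, "70B": 70 * 10**9}
NUM_TASKS = 100  # hypothetical number of tasks or customers

for name, n in MODEL_SIZES.items():
    one = full_copy_storage_gb(n, 1)
    many = full_copy_storage_gb(n, NUM_TASKS)
    print(f"{name}: {one:,.0f} GB per copy, {many:,.0f} GB for {NUM_TASKS} copies")
```

Under these assumptions the cost grows linearly in the number of tasks, regardless of how small each task-specific change actually is.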
There is also a statistical subtlety. Downstream datasets are often much smaller than the pretraining corpus. When $n$ is small and $\theta$ has billions of trainable degrees of freedom, the model has enough capacity to memorize idiosyncrasies of the task dataset. Good fine-tuning practice therefore relies on learning-rate schedules, regularization, early stopping, validation sets, and sometimes freezing parts of the network. Full fine-tuning is powerful, but it is not automatically data-efficient or operationally convenient.
This motivates the central question behind parameter-efficient fine-tuning:
> Can we keep the pretrained weights $\theta_0$ frozen, while letting a much smaller set of parameters $\phi$ carry the task-specific adaptation?
If the answer is yes, then we can reuse one large shared model and store only a small task-specific delta for each downstream use case. Instead of maintaining many full copies of $\theta$, we maintain one base model plus many compact adaptations. LoRA will instantiate this idea by restricting the update to certain weight matrices and forcing that update to have low rank, but the motivation starts here: most of the pretrained model should remain a reusable shared asset.
A useful way to frame the trade-off is:
- Full fine-tuning: maximum flexibility, but every task may require a full model copy.
- Frozen base model: maximum reuse, but no adaptation unless we add trainable task-specific parameters.
- Parameter-efficient adaptation: preserve most of the pretrained model while learning a small, targeted modification.
The visual below compresses this argument into a left-to-right bottleneck: a small task dataset feeds into a huge pretrained model, full fine-tuning produces separate large adapted models, and the cost grows with the number of tasks or customers. The key asymmetry is that the task data and desired behavioral change may be small, while the object being duplicated is enormous.
The small $\phi$ placeholder in the diagram points toward the idea LoRA will develop next. Instead of asking every parameter in $\theta_0$ to move, we will ask whether adaptation can live in a much lower-dimensional space: small enough to store and train cheaply, but expressive enough to recover much of the performance of full fine-tuning.

2. Failure Case: One Full Copy Per Task

The optimization view from the previous section makes full fine-tuning look deceptively simple: start from a pretrained parameter vector $\theta_0$, optimize the downstream loss, and obtain an adapted parameter vector $\theta$. Conceptually, that is clean. Operationally, however, it hides the main scaling problem: after fine-tuning, $\theta$ is no longer a universal shared object. It is now task-specific.
If we adapt the same pretrained model to one downstream task, this may be acceptable. But the whole point of large pretrained models is that they are reused across many tasks: summarization, code generation, retrieval-augmented QA, domain-specific chat, instruction following, classification, extraction, and so on. With full fine-tuning, adapting one base model to 100 downstream tasks gives 100 distinct full model checkpoints. Each checkpoint contains a complete copy of the model parameters, even if the actual task-specific change from $\theta_0$ is small.
For a single dense weight matrix $W_0 \in \mathbb{R}^{d \times k}$, full fine-tuning treats every entry as trainable. The number of trainable parameters in that matrix is therefore
$$p_{\mathrm{full}} = dk.$$
This looks harmless at the matrix level, but modern transformers contain many such matrices: attention projections, MLP projections, embeddings, and output heads, alongside smaller normalization parameters. Full fine-tuning scales with the size of the entire model, not with the amount of information needed to specialize the model to the task.
The training memory cost is even worse than the checkpoint size suggests. During training, we do not only store the parameters themselves. We also store gradients, activations needed for backpropagation, and optimizer state. For Adam-like optimizers, each trainable parameter typically carries additional moment estimates, often the first and second moments. Abstracting away implementation details, the memory footprint associated with trainable parameters can be summarized as
$$M_{\mathrm{train}} \approx (c_{\mathrm{param}} + c_{\mathrm{opt}})\,p_{\mathrm{full}},$$
where $c_{\mathrm{param}}$ accounts for storing trainable weights and related parameter-level quantities, while $c_{\mathrm{opt}}$ accounts for optimizer state. The important point is not the exact constant; it is the scaling. If every parameter is trainable, Adam allocates optimizer state for every parameter.
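The scaling formula can be turned into a back-of-the-envelope calculator. In the sketch below, the constants are assumptions chosen for illustration: 32-bit values, $c_{\mathrm{param}} = 2$ (one weight plus one gradient per parameter), and $c_{\mathrm{opt}} = 2$ (Adam's first and second moments). The matrix size $4096 \times 4096$ is likewise just an example, and activation memory is deliberately left out.

```python
# Parameter-related training memory, following M_train ~ (c_param + c_opt) * p.
# Assumptions (illustrative): 32-bit values, c_param = 2 (weight + gradient),
# c_opt = 2 (Adam first and second moments). Activations are not included.
BYTES_PER_VALUE = 4
C_PARAM = 2
C_OPT = 2


def train_memory_bytes(num_trainable: int) -> int:
    """Bytes needed for weights, gradients, and optimizer state."""
    return (C_PARAM + C_OPT) * num_trainable * BYTES_PER_VALUE


d, k = 4096, 4096   # one dense matrix W0 of shape d x k (example size)
p_full = d * k      # every entry is trainable under full fine-tuning

print(p_full)                                # 16777216
print(train_memory_bytes(p_full) / 2**20)    # 256.0 (MiB for this one matrix)
```

The exact constants matter less than the fact that the result is proportional to the number of trainable parameters.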
This creates a practical failure mode that is easy to underestimate. The base model may be shared in theory, but full fine-tuning destroys that sharing in deployment. Each task now needs its own full checkpoint. Serving many variants requires model routing, checkpoint loading, memory residency decisions, version management, and duplicated storage. If the base model has billions of parameters, even “just one more fine-tuned task” can become expensive.
The key inefficiency is that full fine-tuning assumes the task-specific solution must be represented as an entirely new point $\theta$ in the full parameter space. But empirically, many downstream adaptations appear to require only a comparatively small change in behavior. We would like to capture that change without rewriting and storing the entire model. In other words, the desired pattern is:
- keep the pretrained parameters $\theta_0$ shared and frozen;
- learn a much smaller task-specific object $\phi$;
- combine $\theta_0$ and $\phi$ at inference time to recover task-adapted behavior.
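The three-step pattern above can be sketched with a toy model. Everything here is a stand-in chosen for brevity: the "model" is a three-entry weight vector, each $\phi$ is a sparse additive delta, and the task names and the helpers `effective_params` and `predict` are hypothetical. It is not how any particular PEFT method stores its parameters.

```python
# Toy illustration: one frozen base, many small task-specific objects phi.
THETA_0 = (1.0, -2.0, 0.5)   # shared base; a tuple, so it cannot be mutated

TASK_PHI = {                 # hypothetical tasks, each storing only its delta
    "sentiment": {1: 0.25},
    "extraction": {0: -0.5, 2: 0.125},
}


def effective_params(task: str) -> list:
    """Combine the shared base with one task's phi at inference time."""
    params = list(THETA_0)   # copy; the base itself is never modified
    for index, delta in TASK_PHI[task].items():
        params[index] += delta
    return params


def predict(task: str, x: list) -> float:
    """A linear toy model: dot product of adapted weights and input."""
    return sum(w * xi for w, xi in zip(effective_params(task), x))


x = [1.0, 1.0, 2.0]
print(predict("sentiment", x))    # 0.25
print(predict("extraction", x))   # -0.25
```

Adding a new task means adding one small entry to `TASK_PHI`; the base vector is stored once and shared.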
This is the motivation behind parameter-efficient fine-tuning. Rather than asking, “How do we find a new full model for each task?”, we ask, “Can we represent the task-specific update compactly?” LoRA will answer this by constraining updates to certain weight matrices to have low rank, but before getting there, it is important to see the failure case clearly: full fine-tuning pays the full model cost even when the useful adaptation may be much smaller.
The visual below compresses this contrast into two patterns. On the left is the full fine-tuning regime: one shared pretrained model gives rise to many separate full checkpoints, with storage and Adam state growing with the full parameter count. On the right is the parameter-efficient goal: keep $f_{\theta_0}$ frozen and shared, while attaching small task-specific parameters $\phi$.
The equation at the bottom anchors the intuition mathematically. For a fully trainable matrix, $p_{\mathrm{full}}=dk$, and the training footprint scales like $(c_{\mathrm{param}}+c_{\mathrm{opt}})\,p_{\mathrm{full}}$. The rest of the lecture is about replacing that scaling law with something much smaller, while preserving most of the quality benefits of fine-tuning.

3. Prior PEFT Ideas and Their Trade-Offs

The “one full copy per task” failure mode suggests a natural repair: stop treating every downstream task as a reason to duplicate the whole model. If the pretrained parameters $\theta_0$ already encode broadly useful linguistic, visual, or multimodal structure, then perhaps task adaptation should be a small correction rather than a complete rewrite. This is the core motivation behind parameter-efficient fine-tuning: keep the base model frozen, and learn only a comparatively tiny set of task-specific parameters $\phi$.
In abstract form, most PEFT methods follow the same pattern:
$$\theta_0 \text{ frozen}, \qquad \phi \text{ trainable}, \qquad |\phi| \ll |\theta_0|.$$
This changes the economics of fine-tuning. Instead of storing and serving a separate full model for each task, we store one shared pretrained backbone plus many small task modules. Training also becomes cheaper because gradients, optimizer states, and parameter updates are needed only for $\phi$, not for the full $\theta_0$. For very large models, this distinction is not cosmetic: optimizer states alone can multiply memory requirements several times over the raw parameter count.
But “freeze the model and train something small” is not a complete algorithm. The hard question is where to put the small trainable object so that it has enough influence over the computation. Different PEFT methods answer this question differently, and their trade-offs are mostly about the location and form of $\phi$.
Prompt tuning puts the trainable parameters at the input side. Instead of updating the transformer weights, we prepend learned continuous prompt vectors to the input embeddings. This can be extremely parameter-efficient because the trainable object is just a small collection of vectors. However, it also means the model is adapted indirectly: the only way the task signal can affect the network is by changing the effective input sequence. That can be elegant, but it may be too weak for tasks requiring deeper internal changes, and it consumes part of the model’s context budget.
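A minimal sketch of the input-side mechanism follows. The embedding size, the context limit, the prompt values, and the helper `with_soft_prompt` are all toy assumptions; in practice the prompt vectors would be learned by gradient descent while the model stays frozen.

```python
# Prompt tuning sketch: learned vectors are prepended to the input embeddings.
# All sizes and values below are toy assumptions for illustration.
EMBED_DIM = 4
CONTEXT_LIMIT = 8   # hypothetical maximum sequence length of the frozen model

# Trainable object phi: a small list of continuous prompt vectors.
soft_prompt = [[0.1] * EMBED_DIM, [0.2] * EMBED_DIM, [0.3] * EMBED_DIM]


def with_soft_prompt(token_embeddings: list) -> list:
    """Return the sequence the frozen model actually sees."""
    sequence = soft_prompt + token_embeddings
    if len(sequence) > CONTEXT_LIMIT:
        raise ValueError("soft prompt plus input exceeds the context limit")
    return sequence


tokens = [[1.0] * EMBED_DIM for _ in range(5)]   # embeddings of 5 real tokens
sequence = with_soft_prompt(tokens)

num_trainable = len(soft_prompt) * EMBED_DIM     # size of phi
tokens_left = CONTEXT_LIMIT - len(soft_prompt)   # context left for real input
print(len(sequence), num_trainable, tokens_left)  # 8 12 5
```

The trainable object is tiny (12 numbers here), but three of the eight available positions are now occupied by the prompt rather than by task input.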
Prefix tuning is more targeted. Rather than adding learned vectors only at the input embedding layer, it learns prefix states that are injected into attention, often as additional key/value vectors. This gives the task parameters a more direct handle on attention patterns: the model can attend to learned task-specific memory at each layer. The cost is that these prefixes increase the effective attention context length, which can add computation and memory overhead during inference. The base weights remain frozen, but the runtime attention path is now larger.
Adapters take a different route: they insert small neural modules inside the network, commonly bottleneck MLPs placed between or within transformer sublayers. A typical adapter maps a hidden state down to a low-dimensional space, applies a nonlinearity, maps it back up, and adds the result residually. This is more expressive than input-side prompt methods because it creates new task-specific computation throughout the model. The downside is architectural surgery. The model’s forward graph now contains extra modules, which can introduce latency and complicate deployment, especially when we care about highly optimized inference kernels.
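The down-project, nonlinearity, up-project, residual pattern can be sketched in a few lines. The weights, the ReLU nonlinearity, and the dimensions (hidden size 4, bottleneck size 2) are assumptions chosen so the arithmetic is easy to follow; at this toy size the bottleneck saves no parameters, which only happens when the hidden size is much larger than the bottleneck.

```python
# Bottleneck adapter sketch: h -> h + W_up * relu(W_down * h).
# Weights and sizes are toy assumptions, not trained values.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def relu(v):
    return [max(0.0, x) for x in v]


W_DOWN = [[1.0, 0.0, -1.0, 0.0],
          [0.0, 1.0, 0.0, -1.0]]   # 2 x 4: project hidden state down
W_UP = [[0.5, 0.0],
        [0.0, 0.5],
        [0.5, 0.0],
        [0.0, 0.5]]                # 4 x 2: project back up


def adapter(h):
    """Apply the adapter as an extra residual branch on hidden state h."""
    bottleneck = relu(matvec(W_DOWN, h))
    update = matvec(W_UP, bottleneck)
    return [hi + ui for hi, ui in zip(h, update)]


h = [2.0, 1.0, 1.0, 3.0]
print(adapter(h))   # [2.5, 1.0, 1.5, 3.0]
```

Because of the nonlinearity, this branch cannot be folded into a neighboring weight matrix, so it remains an extra step in the forward pass at inference time.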
LoRA belongs to this PEFT family, but its design choice is more surgical: instead of changing the input sequence, extending attention context, or inserting new nonlinear modules, it modifies the effective weights of existing linear layers. The pretrained weight $W_0$ is frozen, but the layer behaves as though it had received an additive update:
$$z = (W_0 + \Delta W_{\mathrm{LoRA}})h + b_0.$$
The crucial constraint is that this update is not a full dense matrix. LoRA parameterizes it as a low-rank product:
$$\Delta W_{\mathrm{LoRA}} = sBA,$$
where $A$ and $B$ are trainable low-rank factors and $s$ is a scaling factor. If $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, then a full update would require $d_{\text{out}} d_{\text{in}}$ trainable parameters. LoRA instead uses something like
$$A \in \mathbb{R}^{r \times d_{\text{in}}}, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},$$
for a total of
$$r(d_{\text{in}} + d_{\text{out}})$$
trainable parameters. When $r \ll \min(d_{\text{in}}, d_{\text{out}})$, this is dramatically smaller than a dense update.
This design gives LoRA an important deployment advantage over many earlier PEFT methods: after training, the learned update can often be merged into the frozen weight by replacing $W_0$ with $W_0 + \Delta W_{\mathrm{LoRA}}$. At inference time, the layer can remain an ordinary linear transformation. There is no extra prompt length, no additional attention prefix, and no new adapter branch that must be executed separately. The trade-off is that LoRA’s expressiveness is controlled by the rank $r$. If $r$ is too small, the update may be unable to represent the adaptation the task needs; if $r$ is too large, we lose some of the parameter-efficiency and memory benefits.
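The merging claim is easy to check numerically. The sketch below uses small integer matrices (an assumption made so the comparison is exact) and an arbitrary scale $s = 2$; it computes the layer output once with a separate low-rank branch and once with the update folded into a single weight matrix.

```python
# Merging check: (W0 + s*B*A) h + b0 equals W0 h + s*B(A h) + b0.
# Integer toy values are assumptions chosen so the equality is exact.

def matmul(X, Y):
    """Matrix product of X (m x n) and Y (n x p), both lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


W0 = [[1, 0, 2, -1],
      [0, 3, 1, 0],
      [2, -1, 0, 1]]        # frozen weight, d_out = 3, d_in = 4
B = [[1, 0],
     [0, 1],
     [1, 1]]                # d_out x r, with r = 2
A = [[1, 0, 0, 1],
     [0, 1, -1, 0]]         # r x d_in
s = 2                       # scaling factor (arbitrary here)
b0 = [1, 0, -1]
h = [1, 2, 3, 4]

# Unmerged: run the frozen layer and the low-rank branch separately.
low_rank = matvec(B, matvec(A, h))
z_branch = [w + s * u + b for w, u, b in zip(matvec(W0, h), low_rank, b0)]

# Merged: fold s*B*A into the weight once, then apply one ordinary linear layer.
BA = matmul(B, A)
W_merged = [[w + s * d for w, d in zip(rw, rd)] for rw, rd in zip(W0, BA)]
z_merged = [w + b for w, b in zip(matvec(W_merged, h), b0)]

print(z_branch)   # [14, 7, 11]
print(z_merged)   # [14, 7, 11]
```

After merging, inference needs only `W_merged`; the factors `A` and `B` can be kept separately if the update should be removable or swappable later.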
So the PEFT landscape can be summarized by a simple comparison:
- Prompt tuning: smallest and simplest, but influences the model indirectly through inputs.
- Prefix tuning: more direct control over attention, but increases attention length.
- Adapters: expressive internal computation, but add modules and inference overhead.
- LoRA: trainable updates inside existing linear weights, with a mergeable low-rank form.
The visual below consolidates this comparison. The first three methods all preserve the frozen pretrained backbone, but they attach task-specific capacity in different places: before the model, inside attention state, or as inserted modules. LoRA is highlighted because it keeps the computation closest to the original linear layer while still allowing a learned task-specific correction.
The equations underneath the comparison are the key transition into the next idea. LoRA is not merely “another small module”; it is a hypothesis about the geometry of fine-tuning itself: perhaps the useful update $\Delta W$ does not need to occupy the full matrix space. If task adaptation lives in a low-dimensional subspace, then the factorization $\Delta W_{\mathrm{LoRA}} = sBA$ is not just a memory trick; it is a structured constraint on how the pretrained model is allowed to move.

4. LoRA Hypothesis: Updates Live in a Low-Dimensional Subspace

The tension with earlier PEFT methods is that they save parameters by adding something around the pretrained model: prompts, adapters, extra bottlenecks, or task-specific modules. That can work well, but it raises an important question: do we really need to alter the architecture or insert new computation paths, or can we express task adaptation as a small change to the weights that are already there?
LoRA starts from a very direct view of fine-tuning. Suppose a pretrained linear layer has weight matrix $W_0 \in \mathbb{R}^{d \times k}$. In ordinary full fine-tuning, we replace it with a new learned matrix $W$. Equivalently, we can write the final weight as the pretrained matrix plus a task-specific displacement:
$$W = W_0 + \Delta W, \qquad W_0 \in \mathbb{R}^{d \times k}\ \text{frozen}.$$
This decomposition is conceptually useful because it separates two roles. The pretrained matrix $W_0$ already contains broad linguistic, visual, or multimodal structure learned from large-scale pretraining. The downstream task does not usually require relearning all of that structure from scratch. Instead, it may only need to steer the pretrained transformation in a relatively small number of directions.
The central LoRA hypothesis is therefore not that the pretrained weight itself is simple. $W_0$ may be full-rank, dense, and highly expressive. The claim is about the update induced by downstream fine-tuning:
$$\operatorname{rank}(\Delta W) \ll \min(d,k).$$
In words: although $\Delta W$ has the same shape as $W_0$, the meaningful change needed for a task may lie in a much lower-dimensional subspace. This is plausible in large pretrained models because many downstream tasks are not asking the model to invent entirely new representations; they are asking it to recombine, emphasize, suppress, or redirect features that already exist.
A useful geometric picture is to think of full fine-tuning as allowing movement in every direction in the $dk$-dimensional space of $d \times k$ weight matrices. LoRA assumes that the useful displacement from $W_0$ lives near a much smaller manifold or subspace. Rather than learning every entry of $\Delta W$ independently, we restrict $\Delta W$ to have low rank by factorizing it:
$$\Delta W_{\mathrm{LoRA}} = sBA, \qquad A \in \mathbb{R}^{r \times k}, \qquad B \in \mathbb{R}^{d \times r}, \qquad s=\alpha/r.$$
Here $r$ is the LoRA rank, usually chosen much smaller than $d$ or $k$. The matrix $A$ projects the input into an $r$-dimensional adaptation space, and $B$ maps that low-dimensional signal back into the output dimension. The scalar $s=\alpha/r$ controls the scale of the update, making the adaptation strength easier to tune across different ranks.
This factorization immediately gives the rank constraint. Since $BA$ is the product of a $d \times r$ matrix and an $r \times k$ matrix,
$$\operatorname{rank}(BA) \leq \min(\operatorname{rank}(B), \operatorname{rank}(A)) \leq r.$$
So LoRA does not merely hope that the update will be low-rank; it enforces that constraint by construction. The learned update cannot use more than $r$ independent directions, no matter how large the original matrix is.
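The rank bound can be verified on a small example. The sketch below uses exact rational arithmetic from the standard library so that no numerical tolerance is involved; the matrices, the sizes ($d = 5$, $k = 6$, $r = 2$), and the helper `rank` are illustrative assumptions.

```python
# Check rank(BA) <= min(rank(B), rank(A)) <= r on a toy example.
from fractions import Fraction


def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def rank(M):
    """Rank via Gaussian elimination over exact fractions."""
    rows = [[Fraction(x) for x in row] for row in M]
    found = 0                                   # pivots found so far
    for col in range(len(rows[0])):
        # Look for a nonzero entry in this column at or below row `found`.
        pivot = next((i for i in range(found, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for i in range(found + 1, len(rows)):
            factor = rows[i][col] / rows[found][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[found])]
        found += 1
        if found == len(rows):
            break
    return found


r = 2
B = [[1, 0], [0, 1], [1, 1], [2, -1], [3, 2]]   # d x r = 5 x 2
A = [[1, 2, 0, 1, 0, 3],
     [0, 1, 1, 0, 2, 1]]                        # r x k = 2 x 6
BA = matmul(B, A)                               # 5 x 6, same shape as W0

print(len(BA), len(BA[0]))          # 5 6
print(rank(B), rank(A), rank(BA))   # 2 2 2
```

The product has 30 entries, yet it never has more than two independent directions, whatever values `A` and `B` take.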
The parameter savings follow from the same structure. A full update $\Delta W$ would require $dk$ trainable parameters. LoRA learns only $A$ and $B$, requiring
$$rk + dr = r(d+k)$$
parameters. When $r \ll \min(d,k)$, this is dramatically smaller than $dk$. For example, if $d=k=4096$ and $r=8$, a full update has over 16 million parameters, while the LoRA update has only $8(4096+4096)=65{,}536$ parameters for that layer.
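The worked example can be reproduced, and extended to other ranks, with two one-line helpers. The function names and the additional ranks in the loop are assumptions added for illustration; only the $r = 8$ case comes from the text.

```python
# Trainable-parameter counts for one d x k layer: dense update vs. LoRA.

def full_update_params(d: int, k: int) -> int:
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    return r * (d + k)


d = k = 4096
dense = full_update_params(d, k)
for r in (1, 4, 8, 64):   # ranks other than 8 are illustrative
    p = lora_params(d, k, r)
    print(f"r={r}: {p:,} LoRA params vs {dense:,} dense ({p / dense:.4%})")
# For r = 8 this gives 65,536 parameters, exactly 1/256 of the dense update.
```

The count grows linearly in $r$, so doubling the rank doubles the adapter size while the dense count stays fixed.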
The subtle but important point is that LoRA turns fine-tuning into subspace learning around a strong pretrained solution. We are not training a small model instead of a large one. We are using the large model as a fixed base and learning a compact task-specific displacement. This is why LoRA can often preserve the capabilities of the pretrained model while adapting efficiently: the expressive burden remains mostly in $W_0$, and the trainable part only needs to encode the difference.
There are also failure modes hidden inside the hypothesis. If a task genuinely requires many independent changes to a layer, a very small rank $r$ may underfit. If the inserted LoRA modules are placed in layers or projections that are not important for the task, the low-rank update may be parameter-efficient but ineffective. And if the downstream distribution is far from pretraining, the assumption that adaptation is a small low-dimensional displacement may become weaker. LoRA is powerful precisely when the pretrained representation is already close and the task mostly needs targeted steering.
The visual below can be read as a compact summary of this idea. The large frozen matrix $W_0$ represents the stable pretrained transformation. The small low-rank branch, formed by $A$ and $B$, represents the lightweight trainable update $\Delta W_{\mathrm{LoRA}} = sBA$. The adapted weight is not a completely new matrix learned from scratch; it is the frozen base plus this constrained task-specific correction.
The key contrast is visual as much as mathematical: $W_0$ remains large and expressive, while the learned update is deliberately narrow. The annotation $r \ll \min(d,k)$ captures the LoRA bet: most of the useful downstream movement can be expressed through a small number of adaptation directions.

5. The Unit of Adaptation: A Frozen Linear Layer

Having argued that task-specific changes may live in a much smaller subspace than the full parameter space, we now zoom in to the smallest object where LoRA actually acts: one linear layer. This is the right level of abstraction because transformer blocks are built from many affine maps—query, key, value, output projections, MLP projections—and fine-tuning ultimately changes the matrices inside those maps.
Consider a pretrained linear layer receiving an activation vector $h \in \mathbb{R}^{k}$ and producing an output vector $z \in \mathbb{R}^{d}$. Before any adaptation, the layer computes
$$z = W_0 h + b_0,$$
where $W_0 \in \mathbb{R}^{d \times k}$ is the pretrained weight matrix and $b_0 \in \mathbb{R}^{d}$ is the pretrained bias. The subscript $0$ is important: it reminds us that these parameters come from pretraining and are treated as the reference point around which adaptation happens.
Full fine-tuning would simply allow $W_0$ and $b_0$ to move freely during task training. But LoRA begins from a different framing. Instead of saying “we learn a new weight matrix,” we say: the adapted model uses an effective weight
$$W = W_0 + \Delta W, \qquad \Delta W \in \mathbb{R}^{d \times k}.$$
Here, $\Delta W$ is the task-specific correction to the pretrained matrix. If $\Delta W = 0$, we recover the original model exactly. If $\Delta W$ is unrestricted, this is just another way of writing ordinary weight fine-tuning. The key move in LoRA will be to keep $W_0$ fixed and learn only a structured version of $\Delta W$.
This additive view is more than notation. It separates two roles that are entangled in full fine-tuning:
- $W_0$ stores broad pretrained knowledge accumulated from large-scale training.
- $\Delta W$ stores the task-specific displacement needed for adaptation.
That distinction is what makes parameter-efficient fine-tuning possible. We do not need to rewrite the pretrained model from scratch; we need to learn a correction that steers its existing computation toward a new task or distribution.
There is also a subtle but useful assumption here: LoRA treats adaptation as a perturbation around a strong pretrained solution. That assumption is usually reasonable when the downstream task is related to the model’s pretraining distribution, but it can fail when the new task demands behavior far outside the pretrained model’s capabilities. In such cases, a small or low-rank $\Delta W$ may not have enough expressive power, and full fine-tuning, or a larger adaptation mechanism, may be necessary.
At this point, $\Delta W$ is still a full $d \times k$ matrix. So we have not yet saved parameters. We have only changed our perspective from “learn the whole weight” to “learn an update to a frozen weight.” The next step will be to constrain $\Delta W$ to have low rank, but the additive decomposition itself is the core abstraction:
$$\text{adapted computation} \;=\; (W_0 + \Delta W)h + b_0.$$
In LoRA, $W_0$ and typically $b_0$ remain frozen during training. Gradients flow through the layer as usual, but only the parameters used to construct $\Delta W$ are updated. This preserves the pretrained matrix exactly, reduces optimizer state, and allows multiple task-specific adaptations to be stored separately while sharing the same base model.
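One training step under this regime can be written out by hand. The sketch below is a toy under several assumptions not stated in the text: a squared-error loss $\tfrac{1}{2}\lVert z - y\rVert^2$, plain gradient descent rather than Adam, $B$ initialized to zero with a nonzero $A$ (a common choice so that the initial update is zero), and tiny dimensions with $s = 1$. With $g = z - y$, the gradients are $\partial L/\partial B = s\, g\,(Ah)^\top$ and $\partial L/\partial A = s\,(B^\top g)\,h^\top$; no gradient is ever computed for $W_0$ or $b_0$.

```python
# One manual gradient step on the low-rank factors A and B only.
# Loss, optimizer, initialization, and sizes are illustrative assumptions.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def forward(W0, b0, A, B, s, h):
    """z = W0 h + s * B (A h) + b0."""
    low_rank = matvec(B, matvec(A, h))
    return [w + s * u + b for w, u, b in zip(matvec(W0, h), low_rank, b0)]


def loss_and_grads(W0, b0, A, B, s, h, y):
    z = forward(W0, b0, A, B, s, h)
    g = [zi - yi for zi, yi in zip(z, y)]            # dL/dz
    loss = 0.5 * sum(gi * gi for gi in g)
    u = matvec(A, h)                                  # r-dimensional signal
    Bt_g = [sum(B[i][j] * g[i] for i in range(len(B))) for j in range(len(A))]
    grad_B = [[s * gi * uj for uj in u] for gi in g]  # d x r
    grad_A = [[s * bj * hk for hk in h] for bj in Bt_g]  # r x k
    return loss, grad_A, grad_B


def sgd_step(P, G, lr):
    return [[p - lr * g for p, g in zip(rp, rg)] for rp, rg in zip(P, G)]


W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen, d = 2, k = 3
b0 = [0.0, 0.0]                            # frozen
A = [[1.0, 1.0, 0.0]]                      # trainable, r = 1
B = [[0.0], [0.0]]                         # trainable, starts at zero
s, lr = 1.0, 0.125
h, y = [1.0, 2.0, 3.0], [2.0, 2.0]

W0_before = [row[:] for row in W0]
loss_before, grad_A, grad_B = loss_and_grads(W0, b0, A, B, s, h, y)
A, B = sgd_step(A, grad_A, lr), sgd_step(B, grad_B, lr)
loss_after, _, _ = loss_and_grads(W0, b0, A, B, s, h, y)

print(loss_before, loss_after)   # 0.5 0.0078125
print(W0 == W0_before)           # True: the pretrained weight never moved
```

Because $B$ starts at zero, the gradient with respect to $A$ is zero on this first step and only $B$ moves; both factors receive nonzero gradients on later steps.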
The same abstraction applies directly to transformer projection matrices such as $W_Q$, $W_K$, $W_V$, and $W_O$. Each is just a linear map from one activation space to another. LoRA does not require a special transformer-specific derivation at this stage; it only requires identifying which linear maps should receive learned additive updates.
The visual below condenses this idea into one frozen affine layer and its adapted counterpart. The blue components represent the pretrained computation z=W0h+b0z = W_0h + b_0z=W0​h+b0​, while the orange update ΔW\Delta WΔW marks the only part that will later become trainable under LoRA’s low-rank parameterization.
Read it as a bridge between standard fine-tuning and LoRA: first define the effective adapted weight W=W0+ΔWW = W_0 + \Delta WW=W0​+ΔW; then, in the next step, restrict the shape of ΔW\Delta WΔW so that adaptation becomes dramatically cheaper without discarding the pretrained layer.

6. Full Fine-Tuning as an Additive Update

Now that we have isolated a single frozen linear layer, the next step is to ask what full fine-tuning actually changes at that layer. The usual description says that fine-tuning “updates the weights,” but for LoRA the more useful viewpoint is slightly different: full fine-tuning learns an additive correction to the pretrained matrix.
Start from the frozen layer
$$z = W_0 h + b_0,$$
where $W_0 \in \mathbb{R}^{d \times k}$, $h \in \mathbb{R}^k$, and $z \in \mathbb{R}^d$. The matrix $W_0$ is the pretrained weight we inherited from the base model. During full fine-tuning, we replace it by a new adapted matrix $W$. But any such adapted matrix can always be written as
$$W = W_0 + \Delta W, \qquad \Delta W \in \mathbb{R}^{d \times k}.$$
This is not yet an approximation or a LoRA assumption. It is just algebra. Given any final fine-tuned weight $W$, the difference
$$\Delta W = W - W_0$$
is the update that full fine-tuning has learned relative to the pretrained model.
Substituting this decomposition back into the layer gives
$$z = (W_0 + \Delta W)h + b_0.$$
By distributing over $h$, we get
$$z = W_0h + \Delta W h + b_0.$$
This expression is the key abstraction. The adapted layer can be understood as the original pretrained computation $W_0h + b_0$, plus an additional task-specific correction $\Delta W h$. In other words, fine-tuning does not need to be viewed as replacing the pretrained model wholesale. At the level of one linear layer, it is equivalent to keeping the pretrained transformation and adding a learned residual transformation.
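The identity above can be checked numerically. The following is a minimal NumPy sketch, not part of the original derivation: the layer sizes, the random stand-ins for the pretrained and fine-tuned weights, and all variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 4                                  # illustrative layer sizes (assumed)

W0 = rng.normal(size=(d, k))                 # stand-in for the pretrained weight
b0 = rng.normal(size=d)                      # stand-in for the pretrained bias
W = W0 + 0.1 * rng.normal(size=(d, k))       # stand-in for a fully fine-tuned weight

delta_W = W - W0                             # the update fine-tuning has "learned"
h = rng.normal(size=k)                       # one input activation

z_adapted = W @ h + b0                       # fine-tuned layer, computed directly
z_split = (W0 @ h + b0) + delta_W @ h        # pretrained output plus correction

decomposition_holds = np.allclose(z_adapted, z_split)
print(decomposition_holds)
```

Nothing here depends on how `W` was obtained: any fine-tuned weight of the same shape yields a `delta_W` with all $dk$ entries free.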
Under full fine-tuning, however, the update $\Delta W$ is completely unconstrained. Since $\Delta W$ has the same shape as $W_0$, it contains
$$p_{\mathrm{full}} = dk$$
free parameters for this layer alone. Every entry of the $d \times k$ matrix may move independently. If the layer maps a $k$-dimensional activation into a $d$-dimensional output, full fine-tuning allows an arbitrary linear correction from $\mathbb{R}^k$ to $\mathbb{R}^d$.
That flexibility is powerful, but it is also expensive. In modern transformers, the major weight matrices are large, and there are many of them. Updating all of their entries means we must:
store optimizer states for every trainable weight,
compute and store a gradient for every trainable weight,
maintain a separate full-size copy of the model for each fine-tuned task,
and accept a higher risk of overfitting when the downstream dataset is small relative to the number of trainable parameters.
The subtle point is that LoRA does not begin by removing the additive correction $\Delta W h$. It keeps exactly this residual-update interpretation. What LoRA changes is the space of matrices that $\Delta W$ is allowed to occupy. Full fine-tuning says: “learn any matrix $\Delta W \in \mathbb{R}^{d \times k}$.” LoRA will say: “learn a structured, low-rank matrix $\Delta W$ that is much cheaper to represent.”
This distinction matters because it explains why LoRA can be inserted into an existing pretrained model without changing the basic forward computation. The layer still produces the frozen pretrained contribution $W_0h$, and it still adds an adaptation term. The efficiency gain comes from parameterizing that adaptation term more carefully, not from inventing a new kind of layer.
The visual below compresses this logic into three pieces: the original frozen layer, the decomposition $W = W_0 + \Delta W$, and the adapted computation where the correction appears explicitly as $\Delta W h$. The color separation is important: pretrained quantities remain frozen, while the update matrix is the trainable object.
It also highlights the cost of doing this without constraints. The orange $d \times k$ block represents an unconstrained update matrix with $dk$ free parameters. This is the baseline LoRA will improve upon: not by denying that full fine-tuning learns an additive update, but by restricting that update to a much smaller low-rank family.

7. What Does Low Rank Mean Geometrically?

Having written full fine-tuning as a frozen pretrained map plus an additive correction, the natural next question is: how expressive does that correction really need to be? If a layer originally computes
$$z = W_0 h + b_0,$$
then full fine-tuning replaces $W_0$ by $W_0 + \Delta W$, so the adapted layer becomes
$$z = W_0 h + \Delta W h + b_0.$$
The key observation is that, for this layer, all task-specific adaptation enters through the vector $\Delta W h$. We do not directly care about $\Delta W$ as an abstract matrix; we care about the directions in activation space that it can add to the frozen pretrained computation.
A completely unrestricted $\Delta W \in \mathbb{R}^{d \times k}$ can represent any linear correction from the input activation space $\mathbb{R}^k$ to the output activation space $\mathbb{R}^d$. Geometrically, this means the update may use as many as $\min(d,k)$ independent output directions. That is powerful, but expensive: it requires learning $dk$ new degrees of freedom for a single weight matrix. In large transformers, many such matrices are enormous, so repeating this across attention and MLP layers quickly becomes the dominant storage and optimization cost.
LoRA’s central bet is that useful downstream adaptation often lives in a much smaller subspace. Instead of allowing $\Delta W$ to be arbitrary, we constrain it to have low rank:
$$\operatorname{rank}(\Delta W) \le r, \qquad r \ll \min(d,k).$$
This is not merely a parameter-count trick. It is a geometric restriction on what kinds of changes the model can make. A rank-$r$ matrix can only map inputs into an output subspace of dimension at most $r$. So although $\Delta W h$ is still a vector in $\mathbb{R}^d$, as $h$ varies it can only move within at most $r$ learned directions inside that $d$-dimensional output space.
The most useful way to see this is through factorization. Any rank-at-most-$r$ update can be represented as the product of two thinner matrices:
$$\Delta W = BA,$$
where
$$A \in \mathbb{R}^{r \times k}, \qquad B \in \mathbb{R}^{d \times r}.$$
Then the update applied to an activation $h \in \mathbb{R}^k$ becomes
$$\Delta W h = B(Ah).$$
This equation is the geometric heart of LoRA. The matrix $A$ first compresses the original $k$-dimensional activation into $r$ coordinates:
$$h \in \mathbb{R}^k \quad \longmapsto \quad Ah \in \mathbb{R}^r.$$
Then $B$ expands those $r$ coordinates back into a $d$-dimensional update vector:
$$Ah \in \mathbb{R}^r \quad \longmapsto \quad B(Ah) \in \mathbb{R}^d.$$
So the update is forced through an $r$-dimensional bottleneck. The model can still produce a full-sized correction vector, but that correction must be assembled from at most $r$ learned basis directions: the columns of $B$. The compressed vector $Ah$ supplies the coefficients for combining those directions.
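The bottleneck picture can be made concrete with a small numerical experiment. This is only a sketch under assumed sizes ($d=10$, $k=8$, $r=2$) with random factors. It applies the factorized update to a batch of random activations and measures the dimension of the space the resulting correction vectors span.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 10, 8, 2                 # illustrative sizes with r << min(d, k)

A = rng.normal(size=(r, k))        # compresses R^k -> R^r
B = rng.normal(size=(d, r))        # expands   R^r -> R^d
delta_W = B @ A                    # dense d x k view of the same update

# Apply the update to many random activations, stored one per column.
H = rng.normal(size=(k, 50))
coeffs = A @ H                     # shape (r, 50): coordinates inside the bottleneck
updates = B @ coeffs               # shape (d, 50): full-sized correction vectors

same_as_dense = np.allclose(updates, delta_W @ H)
span_dim = np.linalg.matrix_rank(updates)   # dimension spanned by all corrections
print(same_as_dense, span_dim)
```

Although each correction has $d$ entries, the fifty of them together span no more than $r$ directions, because every one is a combination of the columns of `B` weighted by the entries of `coeffs`.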
This also clarifies both the strength and the limitation of LoRA. If the task-specific change really is concentrated in a low-dimensional subspace, then a small $r$ can capture it efficiently. But if adaptation requires many independent directions (for example, if the downstream task demands a broad reorganization of the layer’s representation), then too small a rank may underfit. In that case, the bottleneck is not just saving parameters; it is actively preventing certain updates from being represented.
LoRA usually adds one more scaling factor:
$$\Delta W_{\mathrm{LoRA}} = sBA, \qquad s = \alpha/r.$$
The scalar $s$ controls the magnitude of the low-rank branch relative to the frozen pretrained path. This matters because the matrices $A$ and $B$ are trained from initialization, while $W_0$ already contains a large amount of pretrained structure. The scaling helps make the injected update numerically stable and comparable across different choices of rank $r$.
A compact way to remember the whole construction is:
Full fine-tuning: learn an arbitrary $\Delta W \in \mathbb{R}^{d \times k}$.
Low-rank adaptation: learn $\Delta W = BA$, with a bottleneck dimension $r$.
Geometric effect: updates live in at most $r$ learned output directions.
LoRA form: use the scaled update $\Delta W_{\mathrm{LoRA}} = sBA$.
The visual below consolidates this idea as a flow: the input activation $h$ is first passed through $A$, producing a narrow $r$-dimensional bottleneck $Ah$, and then through $B$, producing the full output update $\Delta W h$. The frozen pretrained computation remains in the background, while the colored low-rank branch shows exactly where adaptation is being inserted.
The important thing to notice is not just that the middle vector is smaller. The bottleneck is what enforces the rank constraint. Since every update must pass through $r$ coordinates before being expanded, the final correction can only span an $r$-dimensional family of output movements. That is the geometric meaning of “low rank” in LoRA: compress to a small learned coordinate system, then expand into a controlled task-specific update.

8. Theorem: Rank and Parameter Count of a Factorized Update

The geometric picture of low rank gives us the right intuition: a low-rank matrix can only move inputs through a small-dimensional subspace before expanding them back out. LoRA turns that intuition into a concrete parameterization for fine-tuning. Instead of learning an arbitrary dense update $\Delta W \in \mathbb{R}^{d \times k}$, we force the update to pass through an $r$-dimensional bottleneck.
For a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, full fine-tuning would replace it by
$$W = W_0 + \Delta W,$$
where $\Delta W$ has the same shape as $W_0$ and contains $dk$ trainable parameters. LoRA keeps $W_0$ frozen and constrains the trainable update to have the factorized form
$$\Delta W_{\mathrm{LoRA}} = sBA, \qquad s = \alpha/r.$$
Here
$$A \in \mathbb{R}^{r \times k}, \qquad B \in \mathbb{R}^{d \times r}, \qquad r \ll \min(d,k).$$
The multiplication order is important. An input vector in $\mathbb{R}^k$ is first projected down by $A$ into $\mathbb{R}^r$, then projected back up by $B$ into $\mathbb{R}^d$. The scalar $s = \alpha/r$ controls the magnitude of the LoRA update; it does not change the rank or the number of trainable parameters. It is best understood as a scaling convention that helps make different choices of $r$ comparable during optimization.
The key theorem is simple but powerful:
$$\operatorname{rank}(\Delta W_{\mathrm{LoRA}}) \le r.$$
This is the formal version of the bottleneck story. No matter how large $d$ and $k$ are, the product $BA$ cannot have rank larger than the intermediate dimension $r$. The update may live inside a huge $d \times k$ matrix, but its degrees of freedom are constrained to flow through an $r$-dimensional channel.
That constraint is exactly what makes LoRA parameter-efficient. Instead of training all $dk$ entries of $\Delta W$, we train only the entries of $A$ and $B$:
$$p_{\mathrm{LoRA}} = rk + dr = r(k+d).$$
By contrast, a full update would require
$$p_{\mathrm{full}} = dk.$$
So the trainable-parameter fraction is
$$\rho = \frac{p_{\mathrm{LoRA}}}{p_{\mathrm{full}}} = \frac{r(k+d)}{dk}.$$
When $r \ll \min(d,k)$, this ratio can be very small. For example, if $d = k = 4096$ and $r = 8$, then full fine-tuning uses over 16 million trainable parameters for that matrix, while LoRA uses only $8(4096+4096) = 65{,}536$. That is about $0.39\%$ of the full parameter count for the layer.
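The arithmetic in this example is easy to reproduce. The helper below is a small illustrative sketch; the function name `lora_param_counts` is hypothetical, not taken from any library. It evaluates $p_{\mathrm{full}}$, $p_{\mathrm{LoRA}}$, and $\rho$ for a single weight matrix.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int, float]:
    """Return (p_full, p_lora, rho) for one d x k weight matrix."""
    p_full = d * k             # every entry of a dense update is trainable
    p_lora = r * (k + d)       # entries of A (r x k) plus entries of B (d x r)
    return p_full, p_lora, p_lora / p_full

p_full, p_lora, rho = lora_param_counts(d=4096, k=4096, r=8)
print(p_full)                  # 16777216
print(p_lora)                  # 65536
print(f"{100 * rho:.2f}%")     # 0.39%
```

Note that the count covers one matrix only; a whole-model figure depends on which matrices receive adapters.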
There is a subtle but important distinction here: LoRA is not saying the weight matrix itself is low rank. The pretrained matrix $W_0$ remains full-size and may be full-rank. LoRA only constrains the change applied during adaptation:
$$W = W_0 + \Delta W_{\mathrm{LoRA}}.$$
This matters because the model can still use the rich representation already stored in $W_0$. LoRA merely assumes that the task-specific correction needed for fine-tuning lies in a much smaller subspace than the original model weights.
This assumption can fail. If a new task requires many independent directions of change in a layer, then a very small $r$ may underfit. Increasing $r$ gives the update more expressive capacity, but also increases memory, optimizer state, and potential overfitting. The practical appeal of LoRA comes from the empirical observation that many adaptation tasks do not require a full-rank update everywhere; a modest bottleneck often captures most of the useful task-specific movement.
The visual below compresses the theorem into one picture: an unconstrained dense update is replaced by two thinner matrices, $A$ and $B$, with a narrow $r$-dimensional passage between them. The green bottleneck is the reason for the rank bound, while the separate parameter counts remind us why the method is attractive computationally.
It is useful to read the diagram from left to right as a computation and from top to bottom as a theorem. Computationally, inputs pass through $A$, then $B$, then the scalar $s$. Mathematically, the same factorization guarantees $\operatorname{rank}(\Delta W_{\mathrm{LoRA}}) \le r$ and reduces the number of trainable parameters from $dk$ to $r(k+d)$.

9. Proof: Rank Bound and Parameter Count

The theorem gives us the punchline; now we should make the proof feel almost inevitable. LoRA’s efficiency is not a mysterious property of transformers or attention layers. It comes from a simple linear-algebra fact: if we force an update matrix to factor through a small $r$-dimensional bottleneck, then the update cannot have rank larger than $r$, and the number of trainable parameters is the size of the two skinny factors rather than the size of the full matrix.
Suppose a pretrained layer contains a frozen weight matrix $W_0 \in \mathbb{R}^{d \times k}$. Full fine-tuning would learn an arbitrary update $\Delta W \in \mathbb{R}^{d \times k}$, so the adapted weight would be
$$W = W_0 + \Delta W.$$
LoRA restricts the update to the factorized form
$$\Delta W_{\mathrm{LoRA}} = sBA, \qquad A \in \mathbb{R}^{r \times k}, \quad B \in \mathbb{R}^{d \times r}.$$
Here $r$ is the LoRA rank, usually chosen much smaller than both $d$ and $k$, and $s$ is a scalar scaling factor. The important structural point is that the update does not map directly from $\mathbb{R}^k$ to $\mathbb{R}^d$ with a fully flexible $d \times k$ matrix. Instead, it first passes through an $r$-dimensional intermediate space via $A$, and then maps back up to $d$ dimensions via $B$.
That bottleneck immediately limits the rank. One way to see this is column-wise. The product $BA$ has $k$ columns, and each column is obtained by multiplying $B$ by the corresponding column of $A$. Therefore every column of $BA$ lies in the column space of $B$. Since $B$ has only $r$ columns, its column space has dimension at most $r$. Thus
$$\operatorname{rank}(BA) \le \operatorname{rank}(B) \le r.$$
Equivalently, we could reason from the other side: $A$ has only $r$ rows, so $\operatorname{rank}(A) \le r$, and the rank of a product cannot exceed the rank of either factor. Both perspectives say the same thing: the factorization confines the update to the set of $d \times k$ matrices of rank at most $r$, a small slice of all possible full updates.
The scalar $s$ does not change this conclusion. If $s \neq 0$, multiplying by $s$ rescales every singular value but does not create new nonzero singular values. If $s = 0$, the update becomes the zero matrix, whose rank is zero. So in all cases,
$$\operatorname{rank}(\Delta W_{\mathrm{LoRA}}) = \operatorname{rank}(sBA) \le \operatorname{rank}(BA) \le r.$$
This is the rank-bound half of the proof: LoRA is not merely encouraging a low-rank update; it parameterizes the update so that low rank is guaranteed.
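As a sanity check on the chain of inequalities, the sketch below computes the relevant ranks numerically with `numpy.linalg.matrix_rank`. The sizes and the value of `alpha` are arbitrary assumptions, and a numerical rank is determined by a tolerance on singular values, so this illustrates the bound rather than proving it.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r = 12, 9, 3                 # illustrative sizes (assumed)
alpha = 6.0                        # hypothetical scaling hyperparameter
s = alpha / r

A = rng.normal(size=(r, k))
B = rng.normal(size=(d, r))

rank_A = np.linalg.matrix_rank(A)
rank_B = np.linalg.matrix_rank(B)
rank_BA = np.linalg.matrix_rank(B @ A)
rank_sBA = np.linalg.matrix_rank(s * (B @ A))      # nonzero s: rank unchanged
rank_zero = np.linalg.matrix_rank(0.0 * (B @ A))   # the s = 0 case

print(rank_A, rank_B, rank_BA, rank_sBA, rank_zero)
```

The product is a full $12 \times 9$ array in memory, yet its rank never exceeds the inner dimension of the factorization.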
The parameter-count argument is just as direct. In full fine-tuning, the update $\Delta W$ has one trainable parameter for every entry of a $d \times k$ matrix:
$$p_{\mathrm{full}} = dk.$$
In LoRA, the original matrix $W_0$ is frozen. The only trainable entries are those in $A$ and $B$. Matrix $A$ contributes $rk$ parameters, and matrix $B$ contributes $dr$ parameters, so
$$p_{\mathrm{LoRA}} = rk + dr = r(k+d).$$
The trainable-parameter fraction is therefore
$$\rho = \frac{p_{\mathrm{LoRA}}}{p_{\mathrm{full}}} = \frac{r(k+d)}{dk}.$$
When $r \ll d, k$, this fraction can be tiny. For square matrices with $d = k = n$, it becomes exactly
$$\rho = \frac{2rn}{n^2} = \frac{2r}{n},$$
so the savings grow as the layer width grows. This is precisely why LoRA becomes more attractive for large models: the full matrix scales quadratically in width, while the factorized update scales linearly in width for fixed $r$.
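To see the quadratic-versus-linear scaling directly, one can tabulate $\rho = 2r/n$ for a fixed rank over several widths. The widths below are hypothetical round numbers chosen for illustration, not measurements of any particular model.

```python
r = 8                                    # fixed LoRA rank (illustrative)
rho_by_width = {}

for n in (512, 1024, 4096, 8192):        # hypothetical square layer widths
    p_full = n * n                       # dense update: quadratic in n
    p_lora = 2 * r * n                   # factors A and B together: linear in n
    rho_by_width[n] = p_lora / p_full    # equals 2r / n
    print(f"n={n:5d}  p_full={p_full:10d}  p_lora={p_lora:7d}  rho={rho_by_width[n]:.4%}")
```

Each doubling of the width doubles the adapter size but quadruples the dense update, so the fraction halves.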
There is a subtle but important trade-off hiding inside this proof. The low-rank constraint reduces memory, optimizer state, checkpoint size, and communication cost, but it also restricts the set of updates the model can represent. LoRA works well when the task-specific adaptation can be captured by a relatively low-dimensional change to the pretrained computation. It may struggle if the desired update genuinely requires high rank, or if LoRA is inserted into layers that are not responsible for the task-relevant behavior. The theorem does not say low rank is always enough; it says that if low rank is enough, LoRA gives a very efficient way to train it.
The visual below compresses the proof into two linked claims. The first claim is geometric: $BA$ must live inside the column space supplied by $B$, so its rank cannot exceed the bottleneck size $r$. The second claim is arithmetic: instead of paying for all $dk$ entries of a dense update, we pay only for the entries of the two thin factors, $rk + dr$.
Read the final highlighted expressions as the reusable facts we will carry forward:
$$\operatorname{rank}(\Delta W_{\mathrm{LoRA}}) \le r, \qquad \rho = \frac{r(k+d)}{dk}.$$
These two statements are the entire motivation for the LoRA parameterization that comes next: freeze the large pretrained matrix, learn only a small low-rank residual, and rely on the pretrained model to provide the high-dimensional representation that the low-rank update gently redirects.

10. Deriving the LoRA Parameterization

Having proved that a product $BA$ cannot have rank larger than its inner dimension $r$, we can now turn that fact into a fine-tuning method. The key move in LoRA is not to invent a new kind of neural layer, but to reinterpret full fine-tuning as learning an additive correction to a pretrained weight matrix, and then restrict that correction to a low-rank family.
For a pretrained linear layer, suppose the original weight is
$$W_0 \in \mathbb{R}^{d \times k},$$
mapping an input activation $h \in \mathbb{R}^k$ to an output preactivation $z \in \mathbb{R}^d$. In ordinary full fine-tuning, we update all entries of $W_0$. Equivalently, after training we can write the adapted weight as
$$W = W_0 + \Delta W.$$
This is a useful conceptual shift. Instead of thinking “we train $W$,” we think “we keep the pretrained solution $W_0$ and learn a task-specific displacement $\Delta W$.” Full fine-tuning places essentially no structural constraint on $\Delta W$: it may be any matrix in $\mathbb{R}^{d \times k}$. That expressiveness is powerful, but expensive, because it requires storing and optimizing $dk$ trainable parameters for each adapted matrix.
LoRA asks whether we really need that much freedom. Empirically, many downstream adaptations appear to live in a much smaller “intrinsic” subspace than the full parameter space suggests. So instead of learning an arbitrary dense $\Delta W$, LoRA constrains the update to be low-rank:
$$\Delta W = \Delta W_{\mathrm{LoRA}} = sBA, \qquad s = \alpha/r.$$
Here the two learned matrices have shapes
$$A \in \mathbb{R}^{r \times k}, \qquad B \in \mathbb{R}^{d \times r},$$
where
$$r \ll \min(d,k).$$
The product $BA$ has the same shape as $W_0$, namely $d \times k$, so it can be added directly to the frozen pretrained weight. But because it factors through an $r$-dimensional bottleneck, its rank is at most $r$. This is precisely the rank bound from the previous section made operational: LoRA does not merely hope for a low-rank update; it parameterizes the update so that low rank is guaranteed.
The scalar $s$ is usually written as
$$s = \alpha/r,$$
where $\alpha$ is a scaling hyperparameter. This scaling is not just cosmetic. Since changing $r$ changes the size and behavior of the low-rank branch, $\alpha/r$ helps keep the magnitude of the adaptation reasonably controlled across different ranks. In practice, $\alpha$ gives the practitioner a knob for the strength of the LoRA update, while $r$ controls its capacity.
With this parameterization, the adapted linear layer becomes
$$z = (W_0 + sBA)h + b_0.$$
By associativity, we can rewrite the same computation as
$$z = W_0h + sB(Ah) + b_0.$$
This second form is the one that reveals the implementation. We do not need to explicitly materialize the full matrix $sBA$ during training. Instead, the input $h$ first passes through the small matrix $A$, producing an $r$-dimensional intermediate representation $Ah$. Then $B$ maps that small vector back to the output dimension $d$. The result is scaled by $s$ and added in parallel to the frozen pretrained computation $W_0h + b_0$.
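The equivalence of the two forms, and the difference in work between them, can be sketched in a few lines of NumPy. All sizes, the value of `alpha`, and the random weights are assumptions for illustration; the multiply-add counts cover the adapter branch only, for a single input vector.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, r = 64, 48, 4                  # illustrative sizes (assumed)
alpha = 8.0                          # hypothetical scaling hyperparameter
s = alpha / r

W0 = rng.normal(size=(d, k))         # frozen weight (random stand-in)
b0 = rng.normal(size=d)              # frozen bias (random stand-in)
A = 0.01 * rng.normal(size=(r, k))   # trainable factor
B = 0.01 * rng.normal(size=(d, r))   # trainable factor
h = rng.normal(size=k)

# Form 1: materialize the dense d x k update, then apply it.
z_dense = (W0 + s * (B @ A)) @ h + b0

# Form 2: keep the branch factored; the only temporary is the r-vector A @ h.
z_factored = W0 @ h + s * (B @ (A @ h)) + b0

forms_agree = np.allclose(z_dense, z_factored)

macs_factored = r * k + d * r        # A @ h, then B @ (A @ h)
macs_dense = d * k                   # applying a materialized d x k update
print(forms_agree, macs_factored, macs_dense)
```

The factored form never allocates a $d \times k$ array for the update, which is the practical reason training code usually keeps `A` and `B` separate.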
The trainable parameters are therefore only
$$\phi = \{A, B\}$$
for each selected weight matrix. The original pretrained parameters $W_0$ and $b_0$ remain frozen. This is the central efficiency gain: instead of training $dk$ parameters for a full dense update, LoRA trains
$$p_{\mathrm{LoRA}} = rk + dr = r(k+d)$$
parameters. When $r$ is small compared with $d$ and $k$, this can be orders of magnitude smaller than full fine-tuning while still allowing a meaningful task-specific displacement.
There are a few subtle assumptions hidden in this elegant construction. LoRA works best when the useful downstream change can be approximated well by a low-rank update in the chosen weight matrices. If the task requires many independent directions of change, a very small $r$ may underfit. Conversely, increasing $r$ improves expressiveness but also increases memory, compute, and the risk of losing the parameter-efficiency advantage. The choice of where to insert LoRA also matters: in transformers, it is commonly applied to attention projections such as query and value matrices, and sometimes to key, output, or MLP projections depending on the task and budget.
The visual that accompanies this derivation condenses the entire idea into one flow: start from the full fine-tuning identity $W = W_0 + \Delta W$, replace the unconstrained update $\Delta W$ with the structured low-rank product $sBA$, and then substitute that replacement into the layer’s forward pass. The important conceptual distinction is also encoded visually: $W_0$ and $b_0$ are frozen, while only $A$ and $B$ are trainable.
The matrix-shape sketch is especially useful because it makes the bottleneck concrete. A wide matrix $A$ maps from $k$ dimensions down to $r$, and a tall matrix $B$ maps from $r$ back up to $d$. Their product has the right $d \times k$ shape to behave like a full update, but it is forced to pass through the narrow rank-$r$ channel. That is LoRA in one sentence: full fine-tuning’s additive update, restricted to a scaled low-rank form, with only the factors trained.

11. LoRA Forward Pass: Two Small Matrices in Parallel

Once we have committed to the LoRA parameterization, the next question is operational: what actually happens during the forward pass? The key point is that LoRA does not replace the pretrained layer with an entirely new trainable layer. Instead, it keeps the original computation intact and adds a small trainable correction in parallel.
Start with a frozen affine layer from the pretrained model. If the input activation is $h \in \mathbb{R}^k$, the original layer computes
$$z = W_0 h + b_0,$$
where $W_0 \in \mathbb{R}^{d \times k}$ and $b_0 \in \mathbb{R}^d$. In full fine-tuning, we would directly update $W_0$, turning it into $W_0 + \Delta W$. LoRA instead freezes $W_0$ and restricts the update to a low-rank form:
$$z = (W_0 + \Delta W_{\mathrm{LoRA}})h + b_0,$$
with
$$\Delta W_{\mathrm{LoRA}} = sBA, \qquad s = \frac{\alpha}{r}.$$
Substituting this into the layer gives the forward computation
$$z = W_0h + b_0 + sB(Ah).$$
This equation is the practical heart of LoRA. The model output is the sum of two paths:
a frozen base path, $h \mapsto W_0h + b_0$, which preserves the pretrained model;
a trainable LoRA path, $h \mapsto Ah \mapsto B(Ah) \mapsto sB(Ah)$, which learns the task-specific adaptation.
The trainable path is deliberately narrow. If $A \in \mathbb{R}^{r \times k}$, then
$$Ah \in \mathbb{R}^r.$$
So the input activation is first projected down into an $r$-dimensional bottleneck, then expanded back to the layer output dimension by $B \in \mathbb{R}^{d \times r}$. When $r \ll \min(d,k)$, this path has far fewer parameters than a dense $d \times k$ update. Instead of learning every entry of $\Delta W$, LoRA learns only
$$rk + dr = r(k+d)$$
parameters for that layer.
The scaling factor $s = \alpha/r$ is also important. Without scaling, changing the rank $r$ would change the typical magnitude of the LoRA branch, making hyperparameters harder to transfer across ranks. The parameter $\alpha$ acts like a LoRA-specific gain, while division by $r$ normalizes the contribution as the bottleneck width changes. In practice, this helps make the low-rank branch behave like a controlled residual update rather than an unstable perturbation to the frozen model.
A useful way to think about the forward pass is that LoRA is not asking the small matrices $A$ and $B$ to re-learn the original layer. The pretrained matrix $W_0$ already contains the general-purpose representation learned during pretraining. The LoRA branch only needs to learn a directional correction in weight space: a low-dimensional adjustment that nudges the frozen model toward the downstream task.
This also explains why LoRA can fail when the required adaptation is not well captured by a low-rank update. If the downstream task requires many independent changes across the full weight matrix, a small rank $r$ may be too restrictive. But when the task-specific shift is concentrated in a low-dimensional subspace, which is often empirically true for adapting large pretrained models, then the bottleneck is a feature, not a bug. It forces parameter efficiency while still allowing meaningful movement in function space.
Many implementations also apply LoRA dropout to the input of the trainable branch:
$$z = W_0h + b_0 + sBA(M_{\mathrm{drop}}h).$$
The dropout mask affects only the LoRA path, not the frozen base computation. This is a subtle but important design choice: the pretrained model remains deterministic and intact, while the adapter branch is regularized. During training, dropout discourages the LoRA branch from depending too heavily on specific activation coordinates; during inference, the dropout is disabled just as in standard neural network layers.
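A minimal sketch of this two-path forward pass with adapter-only dropout might look as follows. The function name `lora_forward`, its argument names, and the use of inverted dropout (rescaling kept coordinates by $1/(1-p)$, a common convention) are assumptions made for illustration; the text above does not specify how the mask is scaled.

```python
import numpy as np

def lora_forward(h, W0, b0, A, B, s, p_drop=0.0, training=False, rng=None):
    """Frozen path plus scaled low-rank path; dropout touches the adapter input only."""
    h_adapter = h
    if training and p_drop > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(h.shape) >= p_drop       # Bernoulli keep mask (M_drop)
        h_adapter = h * keep / (1.0 - p_drop)      # inverted-dropout rescaling (assumed)
    base = W0 @ h + b0                             # always sees the undropped input
    return base + s * (B @ (A @ h_adapter))

rng = np.random.default_rng(4)
d, k, r = 5, 7, 2                                  # illustrative sizes (assumed)
W0, b0 = rng.normal(size=(d, k)), rng.normal(size=d)
A, B = rng.normal(size=(r, k)), rng.normal(size=(d, r))
h = rng.normal(size=k)
s = 1.0                                            # stands in for alpha / r

z_train = lora_forward(h, W0, b0, A, B, s, p_drop=0.5, training=True, rng=rng)
z_eval = lora_forward(h, W0, b0, A, B, s, p_drop=0.5, training=False)
z_no_dropout = W0 @ h + b0 + s * (B @ (A @ h))
print(np.allclose(z_eval, z_no_dropout))
```

Because the mask is applied to a copy of the input used only by the adapter, the frozen term `W0 @ h + b0` is identical in training and evaluation mode.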
The visual below compresses this forward pass into a computation graph. The input $h$ splits into two parallel routes: the gray frozen route through $W_0$ and $b_0$, and the blue trainable route through the two small matrices $A$ and $B$. The narrow middle activation $Ah \in \mathbb{R}^r$ makes the low-rank bottleneck explicit.
It is worth reading the diagram as an implementation recipe as much as a mathematical identity. In training, gradients flow through $A$ and $B$, while $W_0$ and $b_0$ remain fixed. At inference time, the two-branch computation can either be evaluated directly as $W_0h + b_0 + sB(Ah)$, or merged into the base weight as $W_0 + sBA$ for a standard single-matrix forward pass.
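The merge mentioned above amounts to one matrix addition. The sketch below, with assumed sizes and random stand-in weights, compares the two-branch evaluation against the merged single-matrix evaluation on a small batch, and also shows that the merge can be undone (up to floating-point rounding) as long as the factors are kept.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, r = 16, 12, 2                  # illustrative sizes (assumed)
s = 2.0                              # stands in for alpha / r

W0, b0 = rng.normal(size=(d, k)), rng.normal(size=d)
A, B = rng.normal(size=(r, k)), rng.normal(size=(d, r))
H = rng.normal(size=(k, 8))          # a small batch of activations, one per column

# Two-branch evaluation: frozen path plus adapter path.
Z_two_branch = W0 @ H + b0[:, None] + s * (B @ (A @ H))

# Merged evaluation: fold the adapter into one dense matrix once, then reuse it.
W_merged = W0 + s * (B @ A)
Z_merged = W_merged @ H + b0[:, None]

# Unmerging recovers the base weight if A, B, and s are still available.
W_restored = W_merged - s * (B @ A)

outputs_match = np.allclose(Z_two_branch, Z_merged)
base_recovered = np.allclose(W_restored, W0)
print(outputs_match, base_recovered)
```

After merging, the layer has the same shape and cost as the original pretrained layer, at the price of no longer sharing one unmodified `W0` across several adapters.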

12.

Now that the forward pass has been reduced to “a frozen linear layer plus a small parallel correction,” the next important question is not mathematical but operational: what exactly is being trained? LoRA is easy to describe as an extra branch, but its practical value comes from treating the pretrained model as a fixed computational substrate while letting only the low-rank adapter carry task-specific change.
For a pretrained weight matrix $W_0$, the LoRA-augmented layer behaves like
$$z = W_0 h + \Delta W h + b_0,$$
where the update $\Delta W$ is represented indirectly by two small trainable matrices. The key implementation decision is that $W_0$ remains frozen. It is still used in the forward pass, and it still shapes the activations that later layers see, but the optimizer is not allowed to modify it.
That distinction matters. Freezing $W_0$ does not mean removing it from the computation. The base model still provides the pretrained function: all of its attention patterns, feature extractors, and learned representations remain active. LoRA simply adds a trainable residual direction on top of that function. In this sense, LoRA fine-tuning is closer to learning a task-specific correction field than relearning the model.
A useful way to think about the training graph is:
the main path computes the original pretrained transformation;
the adapter path computes a low-rank residual update;
the outputs are added;
gradients update only the adapter parameters.
This is also why LoRA can be parameter-efficient without being computationally invisible. During training, activations still need to flow through the full network, and backpropagation still needs to propagate error signals across layers. What is saved is not the need to run the model, but the need to store and update optimizer state for every pretrained parameter.
That saving is large. In standard full fine-tuning, each parameter typically needs not only its value but also gradients and optimizer statistics, such as Adam’s first and second moment estimates. LoRA avoids this cost for the frozen base weights. The pretrained matrices remain present in memory for inference and forward computation, but they do not accumulate gradients or optimizer state. The trainable state is concentrated in the small adapter matrices.
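As a rough illustration of this accounting, the sketch below counts bytes under a simplifying assumption that is not from the text: every stored value (weight, gradient, and each of Adam's two moments) is kept in 32-bit floats, and activation memory is ignored. The parameter counts are hypothetical round numbers.

```python
BYTES_FP32 = 4  # assumption: weights, gradients, and Adam moments all stored in fp32

def full_finetune_bytes(n_params):
    # value + gradient + Adam first moment + Adam second moment
    return 4 * BYTES_FP32 * n_params

def lora_bytes(n_base, n_adapter):
    # frozen base weights store only their values;
    # adapter weights carry the full training state
    return BYTES_FP32 * n_base + 4 * BYTES_FP32 * n_adapter

n_base = 1_000_000_000   # hypothetical base model size
n_adapter = 5_000_000    # hypothetical total adapter size

full = full_finetune_bytes(n_base)
lora = lora_bytes(n_base, n_adapter)
print(f"full fine-tuning: {full / 1e9:.2f} GB, LoRA: {lora / 1e9:.2f} GB")
```

Under these assumptions the base weights dominate the LoRA total, which matches the point above: the saving comes from the gradient and optimizer state, not from removing the pretrained matrices.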
There is a subtle implementation failure mode here. If we merely “freeze everything” too aggressively—especially by wrapping large portions of the model in a no-gradient context—we can accidentally prevent gradients from reaching adapters in earlier layers. A frozen module should usually mean its parameters do not receive gradients, not that its computation is detached from the graph. The network still needs a differentiable path through the frozen operations so that downstream loss signals can reach upstream LoRA modules.
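To make the "frozen but not detached" point concrete, here is a minimal NumPy sketch with the backward pass written by hand. The two-layer setup, the squared-error loss, and all variable names are illustrative assumptions. The adapter sits in the first layer, so its gradient has to travel back through the frozen second-layer matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, s = 6, 2, 0.5

W1 = rng.normal(size=(d, d))   # frozen, carries the LoRA adapter
W2 = rng.normal(size=(d, d))   # frozen, downstream of the adapter
A = rng.normal(size=(r, d))    # trainable
B = rng.normal(size=(d, r))    # trainable (nonzero here so the gradient is visible)
h = rng.normal(size=d)
y = rng.normal(size=d)         # hypothetical target

def loss(B_):
    z1 = W1 @ h + s * B_ @ (A @ h)
    z2 = W2 @ z1
    return 0.5 * np.sum((z2 - y) ** 2)

# Manual backward pass: the error signal passes through frozen W2,
# even though W2 itself never receives an update.
z1 = W1 @ h + s * B @ (A @ h)
g_z2 = W2 @ z1 - y
g_z1 = W2.T @ g_z2
grad_B = s * np.outer(g_z1, A @ h)

# Central finite-difference check on one entry of B
eps = 1e-6
E = np.zeros_like(B)
E[0, 0] = eps
fd = (loss(B + E) - loss(B - E)) / (2 * eps)
print(fd, grad_B[0, 0])
```

If the computation through `W2` were cut out of the graph, `g_z1` would never be formed and the upstream adapter would receive no signal, which is the failure mode described above.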
So the clean mental model is not “turn off the model and train LoRA.” It is:
$\text{use the pretrained model normally, but optimize only the adapter parameters.}$
This is the separation that makes LoRA modular. The base model can be shared across tasks, while each task stores only its lightweight adapter. Later, the adapter can be loaded dynamically, swapped for another one, or merged into the base weights for deployment.
The visual below compresses this idea into a single training-flow picture: a large frozen route carries the pretrained computation, while a much smaller trainable route runs in parallel and contributes the learned correction. The important takeaway is the asymmetry: both routes affect the forward output, but only the adapter route is seen by the optimizer.
It also sets up the next issue naturally. If LoRA begins as an additive branch, then its initial behavior must be chosen carefully. Before fine-tuning starts, we usually want the adapted model to behave exactly like the pretrained model—not approximately, but functionally unchanged at step zero. That requirement leads directly to the initialization scheme used for the two LoRA matrices.

13. Initialization: Preserve the Pretrained Function at Step Zero

Before deciding where to insert LoRA modules in a transformer, it is worth getting one small but crucial detail right: how the LoRA parameters are initialized. This detail is easy to overlook because LoRA is often summarized as “freeze $W_0$, train a low-rank update.” But if the update is initialized carelessly, the model’s behavior can change immediately before any learning has happened, destroying one of the main advantages of adaptation from a pretrained model.
Recall the LoRA parameterization for a frozen linear layer:
$z = (W_0 + \Delta W_{\mathrm{LoRA}})h + b_0,$
with
$\Delta W_{\mathrm{LoRA}} = sBA, \qquad s = \frac{\alpha}{r}.$
Here $W_0$ and $b_0$ are frozen pretrained parameters, $h$ is the layer input, $A \in \mathbb{R}^{r \times d_{\text{in}}}$, $B \in \mathbb{R}^{d_{\text{out}} \times r}$, and $r$ is the LoRA rank. The scalar $s$ controls the magnitude of the adaptation, often written as $\alpha/r$, so that changing the rank does not automatically make the update scale explode.
The desired behavior at the beginning of training is simple: the LoRA-augmented model should compute exactly the same function as the pretrained model at step zero. That is, before observing any task-specific gradients, we do not want the adapter to perturb the logits, hidden states, or attention projections. The pretrained model is already a highly optimized solution; LoRA should begin as a function-preserving reparameterization, not as random noise injected into a delicate computation graph.
LoRA achieves this by using an asymmetric initialization:
$A_0 \text{ random}, \qquad B_0 = 0.$
Then the low-rank update at initialization is exactly zero:
$\Delta W_{\mathrm{LoRA}} = sB_0A_0 = s\,0\,A_0 = 0 \qquad (t=0).$
Therefore the layer output is also exactly the pretrained output:
$z = (W_0+sB_0A_0)h+b_0 = W_0h+b_0 \qquad (t=0).$
This is stronger than saying the perturbation is “small.” It is not merely small in expectation, nor small under a variance calculation. It is identically zero for every input $h$. The entire network, with LoRA modules inserted, initially represents the same function $f_{\theta_0}$ as the frozen pretrained model.
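The function-preservation claim can be checked numerically. The following is a minimal NumPy sketch; the shapes, the seed, and the small standard deviation used for $A_0$ are arbitrary choices for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 12, 4, 8.0
s = alpha / r

W0 = rng.normal(size=(d_out, d_in))           # frozen pretrained weight (random stand-in)
b0 = rng.normal(size=d_out)                   # frozen bias
A0 = rng.normal(scale=0.01, size=(r, d_in))   # random A
B0 = np.zeros((d_out, r))                     # zero B

H = rng.normal(size=(d_in, 5))                # five inputs stored as columns

base = W0 @ H + b0[:, None]
lora = W0 @ H + s * B0 @ (A0 @ H) + b0[:, None]

print(np.array_equal(base, lora))
```

The comparison is exact rather than approximate because the adapter branch contributes a matrix of zeros for every input.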
The asymmetry raises a natural question: if $B_0=0$, have we accidentally blocked learning? The answer is no, but the gradient flow is staged. Let $g_z = \nabla_z \mathcal{L}$ be the gradient arriving at the output of this linear layer. For the LoRA branch,
$z = W_0h + sBAh + b_0.$
The gradients with respect to $A$ and $B$ are
$\nabla_A\mathcal{L} = sB_0^\top g_z h^\top = 0,$
while
$\nabla_B\mathcal{L} = s\,g_z(A_0h)^\top,$
which is generally nonzero as long as $A_0h$ is not always zero and the loss gradient $g_z$ is nonzero.
So the first update goes into $B$, not $A$. This is the important subtlety: random $A_0$ creates nonzero features $A_0h$ for $B$ to learn from, while zero $B_0$ prevents those features from affecting the model output at initialization. After one or more optimization steps, $B_t$ becomes nonzero, and then the gradient to $A$,
$\nabla_A\mathcal{L} = sB_t^\top g_z h^\top,$
also becomes nonzero. In other words, $B$ learns first; once $B$ opens the path, $A$ begins to adapt as well.
This also explains why initializing both factors to zero would be a bad idea. If
$A_0 = 0, \qquad B_0 = 0,$
then the update still preserves the pretrained function, but now
$\nabla_B\mathcal{L} = s\,g_z(A_0h)^\top = 0.$
Both factors would receive zero gradient at step zero, and the LoRA branch would be stuck. Conversely, initializing both $A_0$ and $B_0$ randomly would allow gradients to flow immediately, but it would produce a random nonzero $\Delta W_{\mathrm{LoRA}}$, perturbing the pretrained function before training begins. The LoRA initialization is therefore a careful compromise:
Random $A_0$: supplies a nondegenerate low-dimensional feature map.
Zero $B_0$: guarantees $\Delta W_{\mathrm{LoRA}}=0$ at initialization.
Nonzero $\nabla_B\mathcal{L}$: lets learning begin immediately.
Delayed $\nabla_A\mathcal{L}$: allows $A$ to learn once $B$ has moved away from zero.
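The staged gradient flow and the all-zero failure case can both be seen in a few lines of NumPy. This is a sketch under assumed conditions: a single input, a squared-error loss $\tfrac{1}{2}\|z-y\|^2$ (so that $g_z = z - y$), and one plain gradient step with an arbitrary step size of 0.1.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r, s = 8, 6, 2, 2.0
W0 = rng.normal(size=(d_out, d_in))
b0 = np.zeros(d_out)
h = rng.normal(size=d_in)
y = rng.normal(size=d_out)      # hypothetical regression target

def grads(A, B):
    """Gradients of 0.5 * ||z - y||^2 for z = W0 h + s B A h + b0."""
    z = W0 @ h + s * B @ (A @ h) + b0
    g_z = z - y
    grad_A = s * np.outer(B.T @ g_z, h)
    grad_B = s * np.outer(g_z, A @ h)
    return grad_A, grad_B

A_rand = rng.normal(size=(r, d_in))
B_zero = np.zeros((d_out, r))

# LoRA initialization: A random, B zero
gA0, gB0 = grads(A_rand, B_zero)

# After one plain gradient step on B, the gradient into A switches on
B1 = B_zero - 0.1 * gB0
gA1, _ = grads(A_rand, B1)

# Both factors zero: neither factor receives any gradient
gA_stuck, gB_stuck = grads(np.zeros((r, d_in)), B_zero)

print(np.abs(gA0).max(), np.abs(gB0).max(), np.abs(gA1).max())
```

Here `gA0` is exactly zero while `gB0` is not, and `gA1` becomes nonzero once `B` has moved, which mirrors the four bullet points above.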
The visual summary emphasizes exactly this “safe start” structure. The frozen pretrained branch carries $h$ through $W_0$ and produces the original contribution $W_0h+b_0$. In parallel, the LoRA branch computes $A_0h$, but because the next factor is $B_0=0$, the entire scaled adaptation $sB_0A_0h$ vanishes. The two branches recombine, yet the output remains
$z = W_0h+b_0 \qquad (t=0).$
The key intuition to keep in mind is that LoRA does not begin by changing the model. It begins by adding a trainable path whose output is exactly silent, but whose parameters are arranged so that gradients can wake it up. That is what makes LoRA a stable fine-tuning method rather than just a low-rank random perturbation of a pretrained network.

14. Where to Put LoRA in a Transformer

With the initialization story in place, the next question is not how to add a LoRA branch to one linear layer, but which linear layers should receive one. A transformer block contains many affine maps, and LoRA is not a new kind of transformer layer; it is a local replacement rule applied repeatedly to a chosen set of existing matrices. The practical art is choosing that set so that the adapter has enough expressive power to steer the model, without turning parameter-efficient fine-tuning back into something close to full fine-tuning.
Recall the single-layer LoRA modification. For a frozen pretrained linear map,
$z = W_0 h + b_0,$
LoRA keeps $W_0$ and $b_0$ fixed and learns a low-rank residual update:
$z = W_0h + sBAh + b_0, \qquad \Delta W_{\mathrm{LoRA}} = sBA.$
Here $A$ projects the input representation into a rank-$r$ adapter subspace, $B$ projects it back to the output dimension, and $s$ is a scaling factor, often written as $\alpha/r$. The key point is that this rule is matrix-local: wherever the transformer has a linear map $W_0$, we can decide independently whether to attach a LoRA branch.
In a self-attention block, the most visible candidates are the four attention projections. Given hidden states $H \in \mathbb{R}^{m \times d_{\mathrm{model}}}$, the model forms queries, keys, and values through learned projections $W_Q, W_K, W_V$, then applies another projection $W_O$ after the attention aggregation. Abstractly, LoRA asks us to choose a target set
$\mathcal{S}\subseteq\{W_Q,W_K,W_V,W_O\} \quad\text{plus selected feed-forward matrices.}$
For each $W_0 \in \mathcal{S}$, we replace the frozen map by the LoRA-augmented one:
$z = W_0h+b_0 \quad\leadsto\quad z = W_0h+sBAh+b_0.$
For matrices outside $\mathcal{S}$, nothing changes: they remain exactly frozen and receive no trainable low-rank branch.
This choice matters because different projections control different aspects of the computation. Roughly speaking, $W_Q$ influences what each token asks for, $W_K$ influences how tokens are matched, $W_V$ influences what information is carried forward once attention weights are chosen, and $W_O$ determines how the attended information is mixed back into the residual stream. Feed-forward matrices, meanwhile, often carry much of the model’s nonlinear feature transformation capacity. Adapting these locations gives LoRA different routes for changing behavior.
The original LoRA paper found that adapting only queries and values,
$\mathcal{S}=\{W_Q,W_V\},$
was often a strong default. This is intuitively appealing: modifying queries changes the attention pattern, while modifying values changes the content being routed. However, this is not a universal law. Many modern fine-tuning recipes adapt more projections, such as $W_Q,W_K,W_V,W_O$, and sometimes the MLP up/down/gate projections as well. The larger the target set, the more expressive the adapter becomes, but also the more parameters, memory traffic, and possible overfitting pressure it introduces.
A useful way to think about the design space is:
Fewer target matrices: cheaper, more compact, often surprisingly effective, but may underfit tasks requiring deeper representational change.
More target matrices: more flexible and often stronger on difficult instruction, reasoning, or domain-shift tasks, but less parameter-efficient.
Attention-only LoRA: directly changes routing and information selection.
Attention + MLP LoRA: also changes the tokenwise feature transformations inside each block.
There is also a subtle failure mode: if $\mathcal{S}$ is too restrictive, increasing the rank $r$ may not solve the problem. A high-rank update to the “wrong” matrices can still be less useful than a modest-rank update to the matrices where the task needs control. Placement and rank interact. LoRA’s low-rank assumption is about the needed update at a chosen location; it does not say that every downstream adaptation can be well represented by perturbing only one or two projections.
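To get a feel for how placement and rank trade off in parameter count, the sketch below tallies adapter parameters for a few target sets in a single block. The matrix names and shapes (`D_MODEL = 1024`, `D_FF = 4096`, a two-matrix MLP) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical (d_out, d_in) shapes for one transformer block
D_MODEL, D_FF = 1024, 4096
SHAPES = {
    "W_Q": (D_MODEL, D_MODEL),
    "W_K": (D_MODEL, D_MODEL),
    "W_V": (D_MODEL, D_MODEL),
    "W_O": (D_MODEL, D_MODEL),
    "W_up": (D_FF, D_MODEL),
    "W_down": (D_MODEL, D_FF),
}

def lora_params(targets, r):
    """Adapter parameters: each target adds A (r x d_in) and B (d_out x r)."""
    return sum(r * (SHAPES[name][0] + SHAPES[name][1]) for name in targets)

qv = lora_params(["W_Q", "W_V"], r=8)
attn = lora_params(["W_Q", "W_K", "W_V", "W_O"], r=8)
attn_mlp = lora_params(list(SHAPES), r=8)
qv_high_rank = lora_params(["W_Q", "W_V"], r=16)

print(qv, attn, attn_mlp, qv_high_rank)
```

With these shapes, doubling the rank on $\{W_Q, W_V\}$ costs the same number of parameters as adapting all four attention projections at the original rank. The two options spend the same budget in different places, which is one way to read the statement that placement and rank interact.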
The visual below consolidates this idea as a transformer attention block with LoRA branches attached only to selected frozen projections. The gray boxes represent pretrained matrices that remain fixed. The blue side branches represent the trainable low-rank path $sBAh$, which runs in parallel with the frozen map and is added back to its output.
Notice especially the highlighted $W_Q$ and $W_V$ projections: they mark the common original LoRA default. The same construction could be extended to $W_K$, $W_O$, or feed-forward matrices by adding identical low-rank side branches there too. In other words, LoRA placement is not a new derivation each time: it is the repeated application of the same constrained update rule over a chosen target set $\mathcal{S}$.

15. Theorem: Best Rank-r Approximation via Truncated SVD

Once we know where LoRA is typically inserted in a Transformer (often in attention projections such as $W_Q$, $W_V$, and sometimes $W_O$ or MLP matrices), the next question is more fundamental: why should a low-rank update be a reasonable restriction at all? Full fine-tuning allows every entry of a weight matrix to move independently. LoRA, by contrast, insists that the learned update must factor through a small bottleneck dimension $r$. That is a strong constraint, so we need a mathematical reason to believe it can preserve the important part of adaptation.
Recall the contrast. In full fine-tuning, a pretrained weight matrix $W\in\mathbb{R}^{d\times k}$ is changed to
$W+\Delta W,$
where $\Delta W$ can be any matrix of the same shape. LoRA freezes $W$ and learns only a structured update
$\Delta W_{\mathrm{LoRA}} = sBA,$
where $B\in\mathbb{R}^{d\times r}$, $A\in\mathbb{R}^{r\times k}$, and $s$ is a scalar scaling factor. Because this update passes through an $r$-dimensional inner space, its rank is limited:
$\operatorname{rank}(\Delta W_{\mathrm{LoRA}}) = \operatorname{rank}(sBA) \le r.$
So LoRA is not merely “using fewer parameters.” It is asserting that the useful task-specific movement in weight space can be expressed, or at least well-approximated, by a matrix whose action lives in only a few dominant directions.
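The rank bound is easy to confirm numerically. In this small NumPy sketch the dimensions and the scale are arbitrary illustrative values; `np.linalg.matrix_rank` estimates the rank from the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, s = 64, 48, 4, 0.5

B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))
delta_W = s * B @ A          # full d x k matrix built from two thin factors

rank = np.linalg.matrix_rank(delta_W)
print(delta_W.shape, rank)
```

The update has the same shape as the weight it modifies, yet its numerical rank never exceeds the bottleneck dimension `r`.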
The relevant theorem is the Eckart–Young–Mirsky theorem, specialized here to the Frobenius norm. Let $M\in\mathbb{R}^{d\times k}$ be any matrix, with singular value decomposition
$M = U\Sigma V^\top,$
where the singular values satisfy
$\sigma_1\ge \sigma_2\ge \sigma_3\ge \cdots \ge 0.$
The singular values measure how much “energy” or variation the matrix has along orthogonal input-output directions. Large singular values correspond to directions in which the matrix has a strong effect; small singular values correspond to weaker directions.
The theorem says that if we are allowed to approximate $M$ using only a matrix of rank at most $r$, then the best possible approximation in Frobenius norm is obtained by keeping only the top $r$ singular values and discarding the rest:
$M_r = U\operatorname{diag}(\sigma_1,\ldots,\sigma_r,0,\ldots)V^\top.$
Moreover, the approximation error is exactly the energy of the discarded tail:
$\|M-M_r\|_F^2 = \sum_{i>r}\sigma_i^2.$
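The error identity can be verified directly with NumPy's SVD. The matrix below is random and the sizes are arbitrary; this is only a numerical sanity check of the statement, not part of any LoRA training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 30, 20, 5
M = rng.normal(size=(d, k))

# Thin SVD: singular values in S come back sorted in descending order
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Keep the top-r singular triplets
M_r = (U[:, :r] * S[:r]) @ Vt[:r, :]

err = np.linalg.norm(M - M_r, "fro") ** 2   # squared Frobenius error
tail = np.sum(S[r:] ** 2)                   # energy of the discarded tail
print(err, tail)
```

Up to floating-point rounding, `err` and `tail` agree, and `M_r` has rank `r`.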
This is the key bridge to LoRA. Suppose we imagine performing full fine-tuning and obtaining some update $\Delta W$. If the singular values of $\Delta W$ decay quickly, then most of the update’s Frobenius energy is concentrated in a small number of directions. In that case, a rank-$r$ approximation may capture nearly all of the meaningful movement:
$\Delta W \approx (\Delta W)_r.$
LoRA does not explicitly compute the SVD of the full fine-tuning update during training. Instead, it directly optimizes a low-rank factorization $BA$. But the theorem tells us why this is a plausible parameterization: if the true useful update is approximately low-rank, then there exists a low-rank matrix that is close to it.
There is an important subtlety here. The theorem is an approximation theorem, not a guarantee that LoRA will always match full fine-tuning. It says that low-rank approximation works well when the spectrum decays. If the desired update has many equally important singular directions, then the tail energy
$\sum_{i>r}\sigma_i^2$
may remain large, and a small-rank LoRA adapter may underfit. This is one reason the choice of rank $r$, target modules, dataset size, and task difficulty all matter in practice.
The theorem also clarifies what LoRA is betting on:
Good case: the full update has a few dominant singular directions, so a small $r$ captures most of the useful change.
Hard case: the full update is high-rank with a flat singular spectrum, so truncating to rank $r$ discards substantial information.
Practical compromise: many adaptation tasks appear to require only a small subspace of changes, especially when starting from a strong pretrained model.
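The good and hard cases can be compared by looking only at singular values. The two spectra below are synthetic assumptions (a geometric decay and a perfectly flat spectrum); they are not measurements from any real fine-tuning run.

```python
import numpy as np

def retained_fraction(sigma, r):
    """Share of squared Frobenius energy kept by the best rank-r approximation."""
    sigma = np.sort(np.asarray(sigma, dtype=float))[::-1]
    return np.sum(sigma[:r] ** 2) / np.sum(sigma ** 2)

decaying = 0.5 ** np.arange(32)   # sigma_i halves at every step
flat = np.ones(32)                # every direction equally important

good = retained_fraction(decaying, r=4)
hard = retained_fraction(flat, r=4)
print(f"decaying spectrum: {good:.4f}, flat spectrum: {hard:.4f}")
```

With the decaying spectrum, four directions keep more than 99% of the energy; with the flat one they keep 4/32 of it, so a rank-4 adapter would leave most of the update unrepresented.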
The visual below compactly organizes this logic. The upper equation emphasizes the rank constraint imposed by the LoRA factorization $sBA$. The theorem block then states the optimal low-rank approximation result: among all rank-$r$ matrices, truncated SVD gives the closest approximation in Frobenius norm.
The singular-value spectrum on the side is the conceptual heart of the result. The first $r$ bars represent the directions retained by the rank-$r$ approximation, while the gray tail represents discarded energy. When that tail is small, LoRA’s low-rank restriction is not a severe limitation; it is an efficient way to focus learning on the dominant adaptation directions.

16. Proof Sketch: Why Truncated SVD Is Optimal

The previous result told us what the optimal rank-$r$ approximation is: keep the top $r$ singular values and discard the rest. But for LoRA, the more important lesson is why this is true. LoRA will soon restrict a weight update $\Delta W$ to the form $BA$, which automatically has rank at most $r$. So before we turn this into an algorithm, we want to understand the geometry of the constraint: if you are only allowed $r$ low-rank degrees of freedom, where should you spend them?
Let $M$ be the matrix we want to approximate, with singular value decomposition
$M = U \Sigma V^\top,$
where $\Sigma$ is diagonal with singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$. We compare $M$ against an arbitrary rank-$r$ candidate, written in factored form as $BA$. This notation is deliberately reminiscent of LoRA: if $B \in \mathbb{R}^{d_{\text{out}} \times r}$ and $A \in \mathbb{R}^{r \times d_{\text{in}}}$, then
$\operatorname{rank}(BA) \le r.$
The proof begins with a change of coordinates. Instead of measuring the approximation error directly in the original basis, rotate both sides into the singular-vector coordinates of $M$. The Frobenius norm is invariant under multiplication by orthogonal matrices, so
$\|M-BA\|_F^2 = \|U^\top(M-BA)V\|_F^2 = \|\Sigma - U^\top BA V\|_F^2.$
This step is powerful because it turns the target matrix into something extremely simple: a diagonal matrix $\Sigma$. All the complexity of $M$'s left and right singular vectors has been absorbed into the coordinate system. Meanwhile, the candidate approximation becomes $U^\top BA V$. It may no longer look like a clean product $BA$, but its rank cannot increase under orthogonal rotations:
$\operatorname{rank}(U^\top BA V) \le \operatorname{rank}(BA) \le r.$
So the problem has been reduced to a more transparent one: approximate a diagonal matrix $\Sigma$ using any matrix of rank at most $r$. In these coordinates, each diagonal entry $\sigma_i$ represents the amount of energy in an independent singular direction. The Frobenius error adds squared entrywise errors, so leaving singular value $\sigma_i$ unmatched costs $\sigma_i^2$.
The key intuition is that a rank-$r$ matrix can only fully capture $r$ independent directions. It cannot independently match all diagonal entries of $\Sigma$ if there are more than $r$ nonzero singular values. Therefore, the best strategy is obvious once we are in the right basis: spend the limited rank budget on the directions with the largest singular values, because those are the most expensive to ignore.
Formally, every rank-$r$ candidate must incur at least the squared tail energy of the singular values it cannot represent:
$\|\Sigma - U^\top BA V\|_F^2 \ge \sum_{i>r} \sigma_i^2.$
This inequality is the heart of the theorem. It says that no clever choice of $B$ and $A$, no off-diagonal mixing, and no alternative coordinate trick can beat the tail energy left behind after the top $r$ singular directions are accounted for. The singular-vector basis is already the coordinate system in which the approximation problem is most clearly decomposed.
Equality is achieved by the truncated SVD approximation
$M_r = U \operatorname{diag}(\sigma_1,\ldots,\sigma_r,0,\ldots) V^\top,$
for which
$\|M-M_r\|_F^2 = \sum_{i>r}\sigma_i^2.$
So the lower bound is not merely a pessimistic guarantee; it is tight. Truncated SVD attains the best possible Frobenius error among all rank-$r$ matrices.
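The lower bound can also be probed empirically. The sketch below draws random factors $A$, gives each one the best possible partner $B$ by least squares (via the pseudoinverse), and records the smallest error seen. The sizes, seed, and number of trials are arbitrary; a random search like this illustrates the bound but of course does not prove it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 12, 10, 3
M = rng.normal(size=(d, k))

# Tail energy: the error achieved by the truncated SVD
S = np.linalg.svd(M, compute_uv=False)
tail = np.sum(S[r:] ** 2)

best_seen = np.inf
for _ in range(200):
    A = rng.normal(size=(r, k))
    # For a fixed A, the B minimizing ||M - B A||_F is M A^+ (least squares)
    B = M @ np.linalg.pinv(A)
    err = np.linalg.norm(M - B @ A, "fro") ** 2
    best_seen = min(best_seen, err)

print(best_seen, tail)
```

Every candidate lands at or above the tail energy, consistent with the inequality above.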
This proof also clarifies a subtle but important point for LoRA. LoRA is not usually computing the truncated SVD of an optimal full update during training. Instead, it directly learns a low-rank update $BA$. But the SVD theorem tells us why such a parameterization can be reasonable: if the task-specific update mostly lives in a small number of dominant directions, then a low-rank factorization can capture most of its useful energy while using far fewer parameters.
The visual below compresses this argument into the essential chain: rotate into singular-vector coordinates, observe that the candidate still has rank at most $r$, and then see the diagonal singular values split into “kept” directions and the unavoidable residual tail. The blue entries correspond to the rank budget being spent on the largest singular directions; the gray tail corresponds to the error term $\sum_{i>r}\sigma_i^2$.
Read this picture as the bridge from linear algebra to LoRA: once updates are constrained to rank $r$, the best possible use of that rank is to align with the most important directions of change. LoRA turns that mathematical constraint into a trainable parameterization.

17. Algorithm: LoRA Fine-Tuning

Having argued that a best low-rank approximation is meaningful in the SVD sense, we can now turn that linear-algebra fact into an actual training algorithm. LoRA is not a mysterious new optimizer or a different learning objective. It is ordinary supervised fine-tuning with one structural constraint: instead of updating the pretrained weight matrix $W_0$, we freeze it and learn a low-rank additive correction.
For a target linear map, suppose the pretrained layer computes
$z = W_0 h + b_0,$
where $h$ is the input activation to that layer. Full fine-tuning would replace $W_0$ by $W_0 + \Delta W$, with $\Delta W$ unconstrained and the same shape as $W_0$. LoRA restricts that update to factor through a small rank-$r$ bottleneck:
$\Delta W = sBA, \qquad s = \frac{\alpha}{r}.$
So the adapted layer becomes
$z = W_0h + sBAh + b_0.$
Here $A \in \mathbb{R}^{r \times d_{\text{in}}}$ projects the input activation into a low-dimensional adaptation space, and $B \in \mathbb{R}^{d_{\text{out}} \times r}$ maps it back to the output dimension. The pretrained matrix $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ remains fixed throughout training.
The scaling factor $s=\alpha/r$ is a practical normalization convention. Since changing the rank $r$ changes the number of paths through the low-rank adapter, the scale $\alpha/r$ helps keep the magnitude of the update reasonably controlled as $r$ varies. In practice, $\alpha$ is often treated as a LoRA hyperparameter analogous to an adapter strength.
Training then minimizes the usual empirical risk over the trainable LoRA parameters $\phi$, which collect all inserted $A,B$ factors:
$\mathcal{L}(\phi) = \frac{1}{n} \sum_{i=1}^{n} \ell(\hat{y}_i,y_i).$
The important distinction is not the loss, but the parameter set receiving optimizer updates. Gradients still flow through the network computational graph, because the LoRA factors may be deep inside a transformer block. But the optimizer state is allocated only for $A$ and $B$, not for the original pretrained parameters $\theta_0$.
A common initialization is asymmetric:
initialize $A$ randomly,
initialize $B=0$,
freeze $W_0$ and $b_0$.
This makes the initial LoRA update exactly zero:
$sBA = 0,$
so the model begins training with precisely the pretrained function. That is a useful stability property: before seeing any task data, LoRA has not perturbed the base model at all. There is one subtle consequence: at the very first step, the gradient into $A$ is zero because it is multiplied by $B^\top$. The gradient into $B$, however, is generally nonzero because $A$ is random. After $B$ moves away from zero, both factors can train normally.
The algorithm also depends on a choice of target modules $\mathcal{S}$. LoRA is usually not inserted into every matrix in the model. In transformer language models, common targets include attention projection matrices such as query and value projections, and sometimes key, output, or MLP projection matrices. This choice is a systems and modeling trade-off:
targeting more matrices increases adaptation capacity,
targeting fewer matrices reduces memory and compute overhead,
attention projections often give strong performance per parameter,
very small ranks may underfit tasks requiring broad behavioral change.
The core training loop is therefore simple. Freeze the pretrained model, attach low-rank trainable branches to selected matrices, compute predictions using the modified layer rule, evaluate the supervised loss, and update only the LoRA factors with Adam or another optimizer. The base model participates in the forward computation, but it does not accumulate trainable updates.
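The loop described above can be sketched for a single adapted linear layer in plain NumPy. Several things here are simplifying assumptions rather than the method as described: the task is synthetic regression with a hidden low-rank shift, the loss is mean squared error, gradients are derived by hand, and plain gradient descent stands in for Adam to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha, n = 10, 8, 2, 4.0, 256
s = alpha / r

# Frozen pretrained layer (random stand-in values)
W0 = rng.normal(size=(d_out, d_in))
b0 = rng.normal(size=d_out)
W0_before = W0.copy()

# Synthetic task: targets follow the base layer plus a hidden low-rank shift
true_delta = 0.3 * rng.normal(size=(d_out, r)) @ rng.normal(size=(r, d_in))
X = rng.normal(size=(n, d_in))
Y = X @ (W0 + true_delta).T + b0

# Trainable LoRA factors: A random, B zero
A = rng.normal(scale=0.1, size=(r, d_in))
B = np.zeros((d_out, r))

lr, losses = 0.01, []
for step in range(1000):
    XA = X @ A.T                           # (n, r) adapter features A h
    pred = X @ W0.T + s * XA @ B.T + b0    # frozen path + low-rank path
    resid = pred - Y
    losses.append(0.5 * np.sum(resid ** 2) / n)
    G = resid / n                          # dL/dpred
    grad_B = s * G.T @ XA                  # sum_i s g_i (A h_i)^T
    grad_A = s * (G @ B).T @ X             # sum_i s B^T g_i h_i^T
    A -= lr * grad_A                       # only the adapter is updated
    B -= lr * grad_B

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Note that `W0` and `b0` appear in every forward pass but never on the left-hand side of an update, which is the whole structural constraint.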
This is why LoRA is best understood as constrained fine-tuning, not as prompt tuning or retrieval augmentation. The model’s internal computations are genuinely modified, but only through a low-rank subspace of possible weight updates. Compared with full fine-tuning, LoRA dramatically reduces trainable parameters and optimizer memory. Compared with purely external methods, it still changes the effective network function in a task-specific way.
The visual below condenses this into the algorithmic skeleton: first freeze $\theta_0$, then initialize one pair of low-rank factors for each selected target matrix $W_0 \in \mathcal{S}$, then run the ordinary supervised optimization loop. The highlighted layer equation
$z = W_0h + sBAh + b_0$
is the entire mechanism: the frozen pretrained path remains intact, while the trainable low-rank path learns the task-specific residual.
The key takeaway is that LoRA fine-tuning does not require a new learning objective or special gradient estimator. It is standard backpropagation with a carefully restricted parameterization. All of the efficiency gains follow from the fact that Adam moments, gradients, and trainable weights are maintained only for the small matrices $A$ and $B$, while the large pretrained matrices stay fixed.

18. Algorithm: Merge and Unmerge for Inference

Once the adapter matrices $A$ and $B$ have been learned, LoRA has one more practical trick that is easy to miss: the trained adapter does not have to remain as a separate computational branch at inference time. During training, keeping the branch explicit is useful because $W_0$ stays frozen while only the low-rank factors are updated. But after training, the model only needs to compute the same function. It does not care whether the update is represented as two skinny matrices applied to $h$ in sequence, or as one already-added dense matrix.

Recall the LoRA layer during training. For a base weight $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, LoRA learns

$A \in \mathbb{R}^{r \times d_{\text{in}}}, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},$

with scaling

$s = \frac{\alpha}{r}.$

The forward pass is

$z = W_0 h + sBAh + b_0.$

Since $BA$ has the same shape as $W_0$, the adapter is simply an additive weight update:

$\Delta W = sBA.$

So after training, we can define a merged weight

$W = W_0 + sBA.$

Then ordinary inference through the original linear layer gives

$z = Wh + b_0 = (W_0+sBA)h+b_0 = W_0h+sBAh+b_0.$

That equality is the entire reason merging works. The computation graph changes, but the mathematical function is the same, up to finite-precision arithmetic.

This distinction matters because LoRA’s training-time efficiency and inference-time efficiency come from different places. During training, LoRA saves optimizer state, gradient memory, and trainable parameters by freezing $W_0$. During inference, however, an explicit LoRA branch would still require extra operations: compute $Ah$, then $B(Ah)$, scale it, and add it to the base output. Merging folds those operations into the stored matrix once, so each future token uses the same linear layer implementation as the original model.

Operationally, the merge algorithm is almost embarrassingly simple:

set s = alpha / r
for each adapted matrix W_0:
    W = W_0 + s B A
    disable explicit LoRA branch

Unmerging reverses the operation:

set s = alpha / r
for each merged matrix W:
    W_0 = W - s B A
    enable explicit LoRA branch

Unmerging is useful when you want to continue training, swap adapters, evaluate multiple tasks on the same base model, or preserve the base model as a clean checkpoint. In practice, implementations usually track whether a layer is currently merged, because accidentally merging twice would add the adapter twice:

$W_0 + sBA + sBA.$

That kind of state bug is more common than the algebra suggests, especially when models are saved, reloaded, quantized, or wrapped by distributed training libraries.
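
A runnable version of the merge and unmerge pseudocode, including the state flag that guards against double merging, might look like the sketch below. The class name `LoRALinear` and its methods are illustrative and do not correspond to any particular library's API.

```python
import numpy as np

class LoRALinear:
    """Toy LoRA layer that tracks whether the adapter is merged (illustrative only)."""

    def __init__(self, W0, b0, A, B, alpha):
        self.W = W0.copy()               # holds W0, or W0 + s*B*A once merged
        self.b0, self.A, self.B = b0, A, B
        self.s = alpha / A.shape[0]      # s = alpha / r
        self.merged = False

    def forward(self, h):
        z = self.W @ h + self.b0
        if not self.merged:              # explicit branch only while unmerged
            z = z + self.s * self.B @ (self.A @ h)
        return z

    def merge(self):
        if not self.merged:              # guard: never add s*B*A twice
            self.W += self.s * self.B @ self.A
            self.merged = True

    def unmerge(self):
        if self.merged:
            self.W -= self.s * self.B @ self.A
            self.merged = False

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2
layer = LoRALinear(rng.normal(size=(d_out, d_in)), rng.normal(size=d_out),
                   rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r)), alpha=4.0)
h = rng.normal(size=d_in)

z_branch = layer.forward(h)
layer.merge()
layer.merge()                            # second call is a no-op thanks to the flag
z_merged = layer.forward(h)
layer.unmerge()
z_back = layer.forward(h)
print(np.allclose(z_branch, z_merged), np.allclose(z_branch, z_back))
```

The three outputs agree up to floating-point rounding, and the repeated `merge()` call shows why the flag is worth the extra line.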

There are also a few numerical caveats. In exact arithmetic, merge followed by unmerge restores $W_0$ exactly. In floating-point arithmetic, especially with low-precision weights, repeated additions and subtractions can introduce small drift. Quantized inference adds another wrinkle: if the base weight is stored in 8-bit or 4-bit form, merging may require dequantizing, adding $\Delta W$, and possibly requantizing. That can slightly change the effective function compared with the explicit branch. The conceptual equivalence still holds, but the storage format determines how exact the implementation can be.
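
The floating-point drift can be observed with a single merge and unmerge round trip at two precisions. This sketch uses random matrices of arbitrary size and only covers plain `float16` versus `float64` storage; it does not model 8-bit or 4-bit quantization schemes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, s = 64, 4, 2.0
W0_ref = rng.normal(size=(d, d))
delta_ref = s * rng.normal(size=(d, r)) @ rng.normal(size=(r, d))

def round_trip_error(dtype):
    """Max absolute change in W0 after one merge followed by one unmerge."""
    W0 = W0_ref.astype(dtype)
    dW = delta_ref.astype(dtype)
    W = W0 + dW          # merge: result is rounded to dtype
    W_back = W - dW      # unmerge: rounded again
    diff = W_back.astype(np.float64) - W0.astype(np.float64)
    return float(np.max(np.abs(diff)))

err16 = round_trip_error(np.float16)
err64 = round_trip_error(np.float64)
print(f"float16 round-trip error: {err16:.2e}, float64: {err64:.2e}")
```

The half-precision round trip leaves a visibly larger residue than the double-precision one, which is why keeping an untouched copy of $W_0$ can be safer than relying on unmerge when weights are stored at low precision.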

The key assumption behind merging is that the LoRA update is inserted as an additive linear update to a weight matrix. This is why the trick is so clean for attention projections and feed-forward projections: the adapted computation has the form

$W_0h + \Delta W h.$

If an adaptation method inserts nonlinear operations, routing decisions, activation-dependent gates, or other input-dependent transformations, then it may not be possible to fold the adapter into a single static matrix. LoRA’s mergeability is a direct consequence of its low-rank linear parameterization.

The visual below compresses this idea into two equivalent views of the same layer. Before merging, the input $h$ flows through the frozen base path and a separate LoRA path, and their outputs are summed. After merging, the LoRA update has been absorbed into a single matrix $W = W_0+sBA$, so inference uses the ordinary linear layer again.

The important takeaway is not merely that merging saves a few operations. It means LoRA can behave like a lightweight training-time modification while leaving behind a deployment artifact that looks like a normal dense model. That is why LoRA is attractive in production settings: adapters are cheap to train and store, but once selected for a task, they can be folded into the model so there is no extra inference branch after merge.

18. Algorithm: Merge and Unmerge for Inference

19. Systems Accounting: Parameters, Memory, and Latency

Once merging has made the inference story clean, the next question is the systems one: what exactly did LoRA buy us, and where did the costs move? The answer is not simply “fewer parameters” in the abstract. LoRA changes the accounting of trainable state, optimizer memory, and training-time compute, while leaving the merged inference path essentially identical to the original dense layer.

Consider one pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$. Full fine-tuning makes every entry trainable, so the number of trainable parameters for this matrix is

$p_{\mathrm{full}} = dk.$

LoRA instead freezes $W_0$ and learns a low-rank update

$\Delta W = sBA,$

where $A \in \mathbb{R}^{r \times k}$, $B \in \mathbb{R}^{d \times r}$, and $r \ll \min(d,k)$. The number of trainable parameters is therefore

$p_{\mathrm{LoRA}} = rk + dr = r(k+d).$

So the parameter fraction is

$\rho = \frac{p_{\mathrm{LoRA}}}{p_{\mathrm{full}}} = \frac{r(k+d)}{dk}.$

This ratio is the basic systems lever. When $d$ and $k$ are large and $r$ is small, $\rho$ can be tiny. For example, if $d=k=4096$ and $r=8$, then

$\rho = \frac{8(4096+4096)}{4096^2} \approx 0.0039,$

or about $0.39\%$ of the dense matrix’s parameter count. That is the core reason LoRA can make adaptation of very large models practical: the fine-tuning problem no longer requires storing gradients and optimizer statistics for every pretrained weight.
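
The arithmetic above is easy to check directly. The helper name `lora_param_fraction` is an assumption made for this sketch; it simply evaluates $\rho = r(k+d)/(dk)$ for one matrix.

```python
def lora_param_fraction(d: int, k: int, r: int) -> float:
    """Return rho = r(k + d) / (dk) for one d-by-k weight matrix."""
    p_full = d * k
    p_lora = r * (k + d)
    return p_lora / p_full


rho = lora_param_fraction(d=4096, k=4096, r=8)
print(f"rho = {rho:.6f} ({100 * rho:.2f}%)")  # rho = 0.003906 (0.39%)
```
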
The memory consequence is especially important because training memory is not just parameter memory. For each trainable parameter, we may store the parameter itself, its gradient, and optimizer state such as Adam’s first and second moments. Abstractly, we can write the trainable-state memory as

$M_{\mathrm{train}} \approx p(c_{\mathrm{param}} + c_{\mathrm{opt}}),$

where $p$ is the number of trainable parameters, $c_{\mathrm{param}}$ summarizes parameter and gradient storage, and $c_{\mathrm{opt}}$ summarizes optimizer state. Under full fine-tuning,

$M_{\mathrm{train}} \approx p_{\mathrm{full}}(c_{\mathrm{param}}+c_{\mathrm{opt}}),$

whereas under LoRA,

$M_{\mathrm{train}} \approx p_{\mathrm{LoRA}}(c_{\mathrm{param}}+c_{\mathrm{opt}}).$

If the same precision and optimizer are used, the trainable-state memory fraction is again approximately $\rho$.

There are a few assumptions hiding in this clean accounting. The frozen base weights $W_0$ still have to live somewhere during training; LoRA does not make the pretrained model disappear. Activations, attention caches, temporary buffers, communication overhead, and framework-level bookkeeping can dominate in some regimes. So $\rho$ is best understood as the reduction in trainable parameter state, not necessarily the exact reduction in total GPU memory. Still, for large models trained with optimizers like Adam, removing optimizer state for the frozen weights is often a major win.
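
To put rough numbers on the trainable-state formula, the sketch below plugs in one possible set of byte costs. The defaults are assumptions, not values from the text: an fp32 weight plus an fp32 gradient gives $c_{\mathrm{param}} = 8$ bytes, and Adam's two fp32 moments give $c_{\mathrm{opt}} = 8$ bytes. It deliberately ignores the frozen weights and activations discussed above.

```python
def trainable_state_bytes(p: int, c_param: int = 8, c_opt: int = 8) -> int:
    """M_train ~ p * (c_param + c_opt), in bytes.

    Assumed defaults (not from the text): fp32 weight + fp32 gradient gives
    c_param = 8 bytes, and Adam's two fp32 moments give c_opt = 8 bytes.
    """
    return p * (c_param + c_opt)


d = k = 4096
r = 8
full_bytes = trainable_state_bytes(d * k)          # full fine-tuning of one matrix
lora_bytes = trainable_state_bytes(r * (k + d))    # LoRA factors A and B only

MIB = 1024 ** 2
print(f"full fine-tuning: {full_bytes / MIB:.0f} MiB")  # 256 MiB
print(f"LoRA (r=8):       {lora_bytes / MIB:.0f} MiB")  # 1 MiB
```
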
Compute has a slightly different story. During training, LoRA usually evaluates the frozen dense path and the low-rank branch separately:

$z = W_0h + sBAh + b_0.$

The base multiplication $W_0h$ costs on the order of $dk$ multiply-adds for one input vector $h$. The LoRA branch can be evaluated as $Ah$ followed by $B(Ah)$, costing

$rk + dr = r(k+d)$

additional multiply-adds. This is small relative to $dk$ when $r$ is small, but it is not zero. LoRA saves dramatically on trainable state, while adding a modest training-time branch.
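
The evaluation order matters for this count. The sketch below compares the factored order $B(Ah)$ with first forming the dense product $BA$. The function name and dictionary keys are hypothetical, and the counts ignore the scalar multiplication by $s$ and the bias.

```python
def layer_macs(d: int, k: int, r: int) -> dict:
    """Multiply-add counts for one input vector h through one adapted layer."""
    base = d * k                      # W_0 h
    factored = r * k + d * r          # A h first, then B (A h)
    materialized = d * r * k + d * k  # form BA (d*r*k), then apply the d-by-k product
    return {
        "base": base,
        "factored_branch": factored,
        "materialized_branch": materialized,
        "factored_overhead": factored / base,
    }


cost = layer_macs(d=4096, k=4096, r=8)
print(cost)
```

Forming $BA$ on every step costs more than the base multiplication itself, which is why the branch is normally evaluated in factored form during training and only materialized once, at merge time.
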
The merge operation is what prevents that branch from becoming an inference penalty. After training, we can form

$W = W_0 + sBA,$

and then compute the layer as the ordinary dense affine map

$z = Wh + b_0.$

At that point, the deployed model no longer needs to explicitly compute $Ah$, then $B(Ah)$, then add the result to $W_0h$. The adaptation has been absorbed into the dense weight. This is why LoRA is often described as having no additional inference latency after merging, assuming the merged matrix is materialized and served like any other weight.

The main takeaway is a separation of concerns:

- Training memory scales with the number of trainable parameters, so LoRA reduces optimizer and gradient state by roughly $\rho$.
- Training latency includes a small extra low-rank branch, costing $r(k+d)$ operations per adapted matrix application.
- Merged inference latency can match the original dense path because the low-rank update has been folded into $W$.

The subtle failure mode is to confuse these three budgets. A method can be cheap in trainable state but still require the full frozen model in memory. It can reduce optimizer memory while leaving activation memory unchanged. It can add a branch during training but remove that branch at inference. Good systems accounting keeps these categories separate instead of collapsing them into a single claim like “LoRA is cheaper.”

The visual below compresses this accounting into two complementary views. The left side treats LoRA as a parameter and memory calculation: full fine-tuning pays for $dk$ trainable entries, while LoRA pays for only $r(k+d)$, giving the fraction $\rho$. The right side treats LoRA as a computation graph: during training, the input flows through both the frozen base path and the low-rank adaptation path.

The final part of the visual emphasizes the operational payoff of merging. Before merging, the LoRA factors $A$ and $B$ are explicit modules in the forward pass. After MERGE-LORA, the model serves a single dense matrix $W = W_0+sBA$, so the low-rank branch is no longer on the inference path.

20. Implementation Choices and Pitfalls

After accounting for parameters, memory, and latency, it is tempting to treat LoRA as “just add two small matrices.” Algebraically, that is almost true. Operationally, it is not. A LoRA layer is correct only if the frozen path, low-rank path, scaling convention, optimizer state, and merge state all agree with one another. Many failed LoRA runs are not failures of the low-rank hypothesis; they are bookkeeping errors that quietly change the effective model being trained.

For a pretrained linear map $W_0 \in \mathbb{R}^{d \times k}$, LoRA replaces direct fine-tuning of $W_0$ with a constrained update

$\Delta W_{\mathrm{LoRA}} = sBA,\qquad s=\alpha/r,$

where $A \in \mathbb{R}^{r \times k}$, $B \in \mathbb{R}^{d \times r}$, and $r \ll \min(d,k)$. The forward pass is therefore

$z = W_0h+sBAh+b_0.$

The crucial implementation detail is that $W_0$ is not merely “not optimized in spirit”; it must be frozen in the computational graph and excluded from the optimizer. If $W_0$ accidentally receives gradients or optimizer state, then the model is no longer doing LoRA fine-tuning. It is doing full or partial fine-tuning plus a low-rank side update, which changes both the memory accounting and the experimental interpretation.

The scale $s=\alpha/r$ is another small detail with large consequences. Without it, increasing the rank changes not only the expressivity of the update but also its typical magnitude. That makes rank comparisons misleading: a higher-rank adapter may appear better simply because it is allowed to produce larger updates. The usual convention keeps the effective update scale controlled as $r$ varies, so that changing $r$ more cleanly probes capacity rather than accidentally changing the learning dynamics.

Target selection also matters. LoRA is not automatically attached to every matrix in the network. One chooses a set of target matrices $\mathcal{S}$: for example attention query and value projections, sometimes key and output projections, and in some models MLP projections as well. Each adapted matrix with shape $d \times k$ contributes

$p_{\mathrm{LoRA}} = r(k+d)$

trainable parameters, so the memory cost scales linearly with the number and shapes of selected targets. A small rank applied broadly can cost more than a larger rank applied selectively. This is why LoRA design is partly an architectural choice, not only a numerical one.
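
The claim that a small rank applied broadly can cost more than a larger rank applied selectively can be checked with a few lines. The layer sizes below (`d_model = 4096`, `d_ff = 16384`) and the two target sets are hypothetical, chosen only to make the comparison concrete.

```python
def adapter_params(shapes, r):
    """Sum r(k + d) over the (d, k) shapes of the adapted matrices."""
    return sum(r * (k + d) for d, k in shapes)


d_model, d_ff = 4096, 16384            # hypothetical sizes, chosen for illustration
attn = (d_model, d_model)              # shape of W_Q, W_K, W_V, W_O
mlp_up, mlp_down = (d_ff, d_model), (d_model, d_ff)

# Rank 4 on four attention projections plus two MLP projections ...
broad = adapter_params([attn] * 4 + [mlp_up, mlp_down], r=4)
# ... versus rank 16 on query and value projections only.
selective = adapter_params([attn] * 2, r=16)
print(broad, selective)                # 294912 262144
```
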
Dropout introduces another subtle branch-specific distinction. If LoRA dropout is used, it is usually applied only to the adapter input, not to the frozen pretrained path:

$z = W_0h+sBA(M_{\mathrm{drop}}h)+b_0.$

Here $M_{\mathrm{drop}}$ denotes the random dropout mask sampled at each training step; at evaluation time it is the identity. This preserves the deterministic pretrained computation while regularizing the learned update. Applying dropout to the whole input $h$, or to both branches, changes the behavior of the frozen model itself and can degrade the very prior that LoRA is trying to preserve.
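
A forward pass with branch-only dropout might look like the sketch below. The function name `lora_forward` and its arguments are assumptions for illustration, and the mask uses the common inverted-dropout rescaling by $1/(1-p)$, which the text does not specify.

```python
import numpy as np


def lora_forward(h, W0, b0, A, B, alpha, p_drop=0.0, rng=None):
    """z = W_0 h + s B A (M_drop h) + b_0, with dropout on the adapter input only."""
    s = alpha / A.shape[0]
    h_adapter = h
    if p_drop > 0.0:
        rng = rng if rng is not None else np.random.default_rng()
        keep = rng.random(h.shape) >= p_drop
        h_adapter = h * keep / (1.0 - p_drop)    # inverted dropout (an assumed convention)
    base = W0 @ h + b0                           # frozen path always sees the clean input
    return base + s * (B @ (A @ h_adapter))


rng = np.random.default_rng(0)
d, k, r = 5, 7, 2
W0, b0 = rng.normal(size=(d, k)), rng.normal(size=d)
A, B = rng.normal(size=(r, k)), rng.normal(size=(d, r))
h = rng.normal(size=k)

z_eval = lora_forward(h, W0, b0, A, B, alpha=4.0)                        # no dropout
z_train = lora_forward(h, W0, b0, A, B, alpha=4.0, p_drop=0.5, rng=rng)  # adapter-only dropout
```
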
Bias terms deserve similar care. In the basic LoRA formulation, the pretrained bias $b_0$ remains frozen along with $W_0$. Some implementations optionally train biases, but then those biases are part of the trainable parameter set $\phi$, and the method is no longer the minimal “only $A,B$” variant. That may be perfectly reasonable, but it should be a deliberate experimental choice rather than an accidental default inherited from a framework.

The optimizer should also be configured for the adapter parameters, not for the original model. Adam statistics, learning rates, weight decay, and precision handling should apply to $A$ and $B$ only, unless additional trainable parameters have been explicitly enabled. A common bug is to freeze gradients but still pass too many parameters to the optimizer, or to create optimizer state before freezing. In large models, that mistake can erase much of LoRA’s memory advantage even if the numerical updates to $W_0$ are zero.

Finally, merging requires exact state bookkeeping. During training or adapter-based inference, the computation uses the separate form $W_0h+sBAh$. For deployment, one may merge the adapter into the base weight:

$W_0 \leftarrow W_0+sBA.$

If we later want to resume adapter training or switch adapters, we must unmerge exactly once:

$W_0 \leftarrow W_0-sBA.$

Merging twice, forgetting that a layer is already merged, or unmerging with a different adapter state can silently corrupt the base weights. A robust implementation therefore tracks whether each adapter is currently merged and treats merge/unmerge as inverse operations, not as informal convenience calls.

A good mental checklist for LoRA implementation is:

- Freeze the base path: $W_0$ and usually $b_0$ should not update.
- Scale the adapter: always include $s=\alpha/r$.
- Choose targets intentionally: $\mathcal{S}$ determines both capability and memory.
- Apply dropout only to the LoRA branch if using LoRA dropout.
- Optimize only trainable adapter parameters.
- Track merge state exactly to avoid double-adding or subtracting the update.

The visual below condenses these decisions into a comparison table: each row corresponds to a choice that looks minor in code but changes the mathematical object being trained. The highlighted scaling row emphasizes that $r$ and $\alpha$ should be interpreted together, while the warning strip collects the most common implementation bugs: missing the scale factor, accidentally updating $W_0$, and merging without consistent inverse bookkeeping.

Read the table as a systems-level companion to the formula $\Delta W_{\mathrm{LoRA}}=sBA$. The equation defines the adapter; the implementation choices define whether the code actually realizes that adapter.

21. Worked Example: Synthetic Rank-2 Linear Adaptation

After all the implementation details (where to insert adapters, how to scale them, how to initialize them, and how to merge them), it is useful to step back and test the central LoRA claim in the cleanest possible setting. The claim is not merely that LoRA uses fewer parameters. The stronger claim is that, when the needed change to a pretrained model is intrinsically low-rank, LoRA can represent that change exactly once the chosen adapter rank is large enough.

Consider a synthetic linear adaptation problem. We start with a frozen pretrained weight matrix $W_0$, and the data are generated by a true adapted model

$Y=(W_0+M)X+E,\qquad \operatorname{rank}(M)=2.$

Here $X$ is the input matrix, $Y$ is the target output matrix, $M$ is the unknown task-specific update we wish we could apply to $W_0$, and $E$ is noise. The important design choice is that $M$ has rank exactly $2$. This makes the LoRA hypothesis testable: if the target adaptation is truly rank-2, then a LoRA adapter of rank $r\ge 2$ should be expressive enough to recover it, up to noise and optimization error.

LoRA does not train $W_0$ directly. Instead, it freezes $W_0$ and learns a low-rank update

$W-W_0=\Delta W_{\mathrm{LoRA}}=sBA,\qquad s=\alpha/r.$

If $W_0\in\mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$, then typically

$B\in\mathbb{R}^{d_{\text{out}}\times r},\qquad A\in\mathbb{R}^{r\times d_{\text{in}}},$

so $BA$ has rank at most $r$. The scalar $s=\alpha/r$ controls the effective scale of the learned update, but it does not change the rank. Therefore the representational question is simple: can a matrix $sBA$ of rank at most $r$ equal the true update $M$?

When $r=1$, the answer is no in general. Since

$\operatorname{rank}(sBA)\le 1,$

a rank-1 LoRA adapter can only capture one independent adaptation direction. But the ground-truth update $M$ has two independent directions. Even with unlimited data and perfect optimization, one component of $M$ must be left out. This is a representation error, not merely a training failure.

The best possible rank-1 approximation to $M$ in Frobenius norm is given by truncated SVD. If

$M = U\Sigma V^\top$

has two nonzero singular values $\sigma_1\ge \sigma_2>0$, then the optimal rank-1 approximation $M_1$ keeps only the top singular component. The unavoidable squared Frobenius error is

$\|M-M_1\|_F^2=\sigma_2^2.$

This equation is the cleanest way to see what rank mismatch costs. The rank-1 adapter is not “almost LoRA enough” unless the second singular value is tiny. If $\sigma_2$ is substantial, the missing direction produces a visible loss gap.

Once $r\ge 2$, the situation changes qualitatively. Because $M$ itself has rank $2$, there exists a factorization of $M$ through a two-dimensional bottleneck. For example, using the compact SVD $M=U_2\Sigma_2V_2^\top$, we can choose factors of rank $2$ so that

$M=sBA.$

One valid construction is to absorb the singular values and scaling into the factors, such as

$B = U_2\Sigma_2,\qquad A = \frac{1}{s}V_2^\top,$

or any equivalent rescaling. For $r>2$, padding $B$ with zero columns and $A$ with zero rows leaves the product unchanged. More generally,

$\operatorname{rank}(M)=2,\ r\ge 2 \quad\Rightarrow\quad \exists\ A,B\ \text{such that}\ M=sBA.$

So in this synthetic example, LoRA’s expressivity becomes exact at $r=2$. Increasing the rank beyond $2$ does not improve the best achievable noiseless fit, because there is no additional signal direction to capture.
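
Both facts, the rank-1 error $\sigma_2^2$ and the exact rank-2 factorization, can be verified numerically. The sketch below uses a small random rank-2 matrix; the sizes and the choice $\alpha=4$ are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 8, 2, 4.0
s = alpha / r

# Build a ground-truth update M with rank exactly 2.
M = rng.normal(size=(d_out, 2)) @ rng.normal(size=(2, d_in))
U, sigma, Vt = np.linalg.svd(M, full_matrices=False)

# Best rank-1 approximation keeps only the top singular component.
M1 = sigma[0] * np.outer(U[:, 0], Vt[0])
rank1_error = np.linalg.norm(M - M1, "fro") ** 2     # equals sigma_2^2

# Rank-2 factors: B = U_2 Sigma_2 and A = V_2^T / s, so that s B A = M.
B = U[:, :2] * sigma[:2]
A = Vt[:2] / s
reconstruction_gap = np.max(np.abs(s * B @ A - M))
print(rank1_error, sigma[1] ** 2, reconstruction_gap)
```
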
That does not mean the training loss becomes zero. The data contain noise $E$, so even the correct model class cannot explain every observed target perfectly. Once $r\ge 2$, the remaining error should be governed by the noise floor and by optimization details, not by the rank constraint. This distinction is important in real experiments too: a plateau in validation or training loss after increasing $r$ may indicate that rank is no longer the bottleneck.

This example also clarifies a common failure mode. If LoRA underperforms at small rank, that does not automatically mean the method is broken; it may simply mean the chosen rank is below the intrinsic rank of the needed update. Conversely, if increasing $r$ beyond some point yields little benefit, the task adaptation may already be well captured by a low-dimensional subspace, or other bottlenecks (data quality, optimization, regularization, placement of adapters) may dominate.

The visual below compresses this reasoning into a single experiment: generate data from a frozen $W_0$ plus a true rank-2 update $M$, then fit LoRA adapters with ranks $1, 2, 4, 8$. The predicted curve is not gradual in the idealized setting. It should show a sharp drop when the LoRA rank reaches the true rank, followed by a plateau near the noise floor induced by $E$.

The key takeaway is therefore very concrete: LoRA succeeds exactly in this synthetic setting once its rank reaches the true rank of the required weight update. Rank $1$ misses a direction; rank $2$ can represent the target update; larger ranks mostly add unused capacity unless noise, finite data, or optimization behavior makes them matter.
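
The experiment can be approximated in a few lines of NumPy. Note the substitution: instead of training $A$ and $B$ by gradient descent, this sketch uses the closed-form reduced-rank least-squares solution, which gives the best loss any rank-$r$ update could reach on the sampled data. The dimensions, singular values (3 and 2), and noise level are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n, noise = 12, 16, 2000, 0.1

W0 = rng.normal(size=(d_out, d_in))
# True update with rank exactly 2 (orthonormal factors, singular values 3 and 2).
U = np.linalg.qr(rng.normal(size=(d_out, 2)))[0]
V = np.linalg.qr(rng.normal(size=(d_in, 2)))[0]
M = U @ np.diag([3.0, 2.0]) @ V.T

X = rng.normal(size=(d_in, n))
Y = (W0 + M) @ X + noise * rng.normal(size=(d_out, n))
R = Y - W0 @ X                      # residual the adapter has to explain


def best_rank_r_update(R, X, r):
    """Closed-form least-squares update with rank <= r (reduced-rank regression)."""
    D_ols = R @ np.linalg.pinv(X)              # unconstrained least-squares update
    Uf = np.linalg.svd(D_ols @ X, full_matrices=False)[0]
    P = Uf[:, :r] @ Uf[:, :r].T                # projector onto top-r output directions
    return P @ D_ols


losses = {}
for r in (1, 2, 4, 8):
    D = best_rank_r_update(R, X, r)
    losses[r] = float(np.mean((R - D @ X) ** 2))
    print(f"r={r}: mse={losses[r]:.4f}")
```

Under these assumptions the loss at $r=1$ is dominated by the missing $\sigma_2$ direction, while the losses for $r \ge 2$ sit near the noise variance $0.1^2$.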

22. Worked Example: LoRA on Attention Projections

The synthetic rank-2 example was deliberately stripped down: one linear map, one desired low-rank correction, and a clean demonstration that LoRA can learn the update without touching the pretrained matrix. The same logic becomes more useful when we place it inside a transformer block, because transformer layers contain several large linear projections repeated many times across depth. The practical question is not merely whether we can write $W_0 + \Delta W$, but which matrices should receive $\Delta W$ so that the adaptation is expressive without becoming another form of full fine-tuning.

A common and effective choice is to adapt selected attention projections, especially the query and value projections. In a standard attention block, hidden states are projected into queries, keys, and values:

$Q = H W_Q^\top,\qquad K = H W_K^\top,\qquad V = H W_V^\top.$

Here $H$ is the batch-and-sequence collection of hidden states, often represented as a row-vector matrix of shape roughly $(\text{tokens}, d_{\text{model}})$. The matrices $W_Q, W_K, W_V$ map from the model hidden dimension into attention-space representations. LoRA does not change the fact that attention is computed from $Q$, $K$, and $V$; it only changes how some of those projections are produced.

For this worked example, we choose the target set

$\mathcal{S}=\{W_Q,W_V\}.$

That means $W_Q$ and $W_V$ remain frozen pretrained matrices, but each receives its own trainable low-rank correction. The key projection $W_K$, the output projection $W_O$, the feed-forward layers, layer norms, residual connections, and attention softmax are all left structurally unchanged in this simplified placement rule. This is important: LoRA is not inserting a new transformer submodule. It is modifying selected linear maps by adding a parallel low-rank branch.

For the query projection, the adapted computation becomes

$Q = H\bigl(W_Q+sBA\bigr)^\top.$

Similarly, for the value projection,

$V = H\bigl(W_V+sBA\bigr)^\top.$

Strictly speaking, the $A$ and $B$ used for $W_Q$ are distinct trainable parameters from the $A$ and $B$ used for $W_V$. We often write the same symbols because the parameterization is the same, but each adapted matrix owns its own LoRA factors. The frozen matrix supplies the pretrained behavior; the learned product $BA$ supplies the task-specific displacement; and the scalar $s$, often $s=\alpha/r$, controls the scale of that displacement.
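
In row-vector form, the adapted projections can be computed without ever forming the dense update, as in the sketch below. The `_q` and `_v` suffixes are our own naming to keep the two sets of factors apart, and the factors are random here only so the equivalence check is nontrivial.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_model, r, alpha = 5, 16, 4, 8.0
s = alpha / r

H = rng.normal(size=(tokens, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Separate factors per adapted matrix; the _q/_v suffixes are our own naming.
A_q, B_q = rng.normal(size=(r, d_model)), rng.normal(size=(d_model, r))
A_v, B_v = rng.normal(size=(r, d_model)), rng.normal(size=(d_model, r))


def lora_project(H, W, A, B, s):
    """H (W + s B A)^T, evaluated without materializing the dense update."""
    return H @ W.T + s * (H @ A.T) @ B.T


Q = lora_project(H, W_Q, A_q, B_q, s)
K = H @ W_K.T                       # key projection is not in the target set
V = lora_project(H, W_V, A_v, B_v, s)
```
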
For a single token vector $h$, the same idea is easier to read in column-vector form:

$W_0h \mapsto W_0h+sBAh.$

This expression reveals the computational structure. The token first follows the original frozen path $W_0h$. In parallel, it passes through a rank-$r$ bottleneck: $A$ maps the hidden vector into a smaller $r$-dimensional space, and $B$ maps it back to the output dimension. The two paths are then added. Because $BA$ has rank at most $r$, LoRA can only express updates inside a low-dimensional subspace of all possible weight changes, but that constraint is precisely what makes it parameter-efficient and regularizing.

There are a few subtle implementation details worth keeping straight. First, adapting $W_Q$ and $W_V$ changes the behavior of attention in different ways. A query update changes what each token asks for when attending to the sequence; a value update changes what information is carried back once attention weights have selected positions. This pairing often gives strong downstream performance because it can alter both the addressing mechanism and the retrieved content. By contrast, leaving $W_K$ fixed means the representation being matched against queries remains anchored to the pretrained model’s key space.

Second, LoRA preserves the surrounding computation graph. There are:

- no new prompt tokens inserted into the sequence,
- no adapter MLP placed between transformer blocks,
- no change to the attention softmax formula,
- no update to the frozen pretrained projection weights.

This matters operationally. During training, only the LoRA factors receive gradients. During inference, the low-rank update can either remain as a small side branch or be merged into the frozen weight by replacing $W_0$ with $W_0+sBA$. After merging, the runtime computation is just an ordinary dense projection again.

The main failure mode to keep in mind is under-allocating rank or adapting too few modules. If the downstream task requires a broad reconfiguration of the representation space, a tiny low-rank correction on only $W_Q$ and $W_V$ may not provide enough capacity. On the other hand, increasing the rank or adapting many more projections gradually trades away the parameter-efficiency advantage. LoRA’s empirical success comes from the observation that many useful task adaptations appear to occupy a much lower-dimensional subspace than the full parameter space would suggest.

The visual below consolidates this placement rule inside one attention block. The hidden states $H$ feed the usual projection matrices, but only $W_Q$ and $W_V$ receive orange low-rank side branches. The key and output projections remain gray and unchanged, reinforcing the point that the transformer architecture is not being rebuilt around the adaptation.

The equation panel in the visual is the compact algebraic summary: for selected projections, replace $W$ by $W+sBA$; for unselected projections, keep the pretrained matrix as-is. In other words, LoRA’s intervention is local, additive, and low-rank: the base attention computation remains intact while the task learns small trainable corrections exactly where we choose to attach them.

23.

After working through a concrete attention-projection example, it is worth stepping back and noticing what LoRA really gives us: not merely a smaller set of trainable matrices, but a controlled adaptation budget. We freeze the pretrained model as the stable computational backbone, then decide where and how much freedom to give the model through low-rank residual updates.

That distinction matters because LoRA is not uniformly powerful everywhere. A low-rank update can only express changes that lie in a small-dimensional subspace. If the task mainly requires redirecting existing features (changing which tokens attend to which others, or slightly reshaping an already useful representation), then a small adapter can be surprisingly effective. But if the task requires learning entirely new internal features, new reasoning routines, or domain-specific transformations absent from pretraining, the low-rank constraint may become a real bottleneck.

In practice, a LoRA configuration is defined by a few coupled choices:

- Target modules: which weight matrices receive adapters.
- Rank $r$: the dimension of the trainable update subspace.
- Scaling: how strongly the adapter contribution is injected.
- Dropout or regularization: whether the adapter is encouraged not to overfit.
- Merge strategy: whether the learned update is folded into the base weights for inference.

The most common insertion points are attention projections, especially $W_Q$ and $W_V$. This is not accidental. Query and value projections directly affect where information is retrieved from and what content is passed forward. For many instruction-following or task-adaptation settings, modifying these projections gives the model enough behavioral flexibility without touching every dense layer.

But this is also an assumption. If the adaptation requires changing token mixing, output composition, or feed-forward feature transformations, then limiting LoRA to $W_Q$ and $W_V$ may underfit. Adding adapters to $W_K$, $W_O$, or MLP matrices increases capacity, but it also increases trainable parameters, optimizer memory, and potentially training instability. LoRA’s efficiency is therefore not just a theorem about low rank; it is an empirical design trade-off.

A useful way to think about the rank $r$ is that it controls the number of independent “directions of change” available to the update. Small ranks often work because fine-tuning gradients in large pretrained models are empirically structured: many useful adaptations appear to live in low-dimensional subspaces. However, this is not guaranteed. If $r$ is too small, the adapter cannot represent the needed update. If $r$ is too large, LoRA begins to lose its parameter-efficiency advantage and can overfit, especially on small datasets.

There is also an important deployment distinction. During training, the model computes the frozen path and the adapter path separately. At inference time, however, the adapter update can often be merged into the frozen weight, producing an effective weight matrix with no extra adapter computation. This is one reason LoRA is attractive operationally: it can behave like a lightweight fine-tuning method during training while still allowing ordinary dense inference afterward.

The main failure mode is forgetting that LoRA is a constrained update, not magic compression. Its success depends on the pretrained model already containing reusable structure. LoRA is strongest when the task is close enough to pretraining that adaptation can be expressed by recombining or redirecting existing capabilities. It is weaker when the task demands large architectural, representational, or domain shifts.

The visual below serves as a compact map of these ideas: the frozen model remains the center of gravity, while small trainable adapters are inserted only at selected locations. The key intuition is that LoRA lets us choose a thin path for learning through a very large network, rather than reopening every parameter.

It also prepares us for the next question: once we have this design space, which choices actually matter most? Rank, target modules, and inference latency are not independent details; they are the knobs that determine whether LoRA is merely cheap, genuinely effective, or both.

24. Ablations: Rank, Target Modules, and Latency

A useful way to make LoRA feel less like a trick and more like an engineering tool is to ask a practical question: if the adapter is cheap, how cheap should we make it? The low-rank update gives us a knob, the rank rrr, but that knob does not behave like an ordinary “more is always better” hyperparameter. Empirically, LoRA often enters a regime where task quality has mostly saturated while cost continues to grow predictably. The art is to stop increasing capacity when the model has enough adaptation freedom, rather than when the adapter is as large as we can afford.
For a single adapted weight matrix W0∈Rd×kW_0 \in \mathbb{R}^{d \times k}W0​∈Rd×k, LoRA replaces a dense trainable update ΔW\Delta WΔW with a factorized update BABABA, where B∈Rd×rB \in \mathbb{R}^{d \times r}B∈Rd×r and A∈Rr×kA \in \mathbb{R}^{r \times k}A∈Rr×k. The number of trainable LoRA parameters is therefore
pLoRA=r(k+d),ρ=r(k+d)dk.p_{\mathrm{LoRA}} = r(k+d),
\qquad
\rho = \frac{r(k+d)}{dk}.pLoRA​=r(k+d),ρ=dkr(k+d)​.
Here ρ\rhoρ is the fraction of parameters trained relative to full fine-tuning of that matrix. The important structural fact is that LoRA cost grows linearly in rrr, while the full matrix has dkdkdk degrees of freedom. When r≪min⁡(d,k)r \ll \min(d,k)r≪min(d,k), the adapter is a small subspace of possible dense updates.
This linear dependence makes rank ablations especially interpretable. If we double rrr, we roughly double the number of trainable adapter parameters for that matrix. We also roughly double the optimizer state associated with those parameters. A simple training-memory proxy is
Mtrain≈pLoRA(cparam+copt),M_{\mathrm{train}}
\approx
p_{\mathrm{LoRA}}(c_{\mathrm{param}} + c_{\mathrm{opt}}),Mtrain​≈pLoRA​(cparam​+copt​),
where cparamc_{\mathrm{param}}cparam​ accounts for storing trainable parameters and coptc_{\mathrm{opt}}copt​ accounts for optimizer state such as Adam moments. The frozen base weights still occupy memory, of course, but the additional trainable state scales with the LoRA adapter size. This is one of the main reasons LoRA is attractive: it decouples adaptation capacity from the full parameter count of the pretrained model.
The subtle point is that accuracy does not usually grow linearly with rrr. A common empirical pattern is:
for very small rrr, the adapter may underfit because the allowed update subspace is too restrictive;
for moderate rrr, accuracy can improve quickly as the adapter captures the dominant task-specific directions;
after some point, additional rank gives diminishing returns, because the relevant update directions have already been represented well enough.
This is not guaranteed for every task, dataset, or model, but it is a strong practical prior. LoRA works best when the downstream adaptation is intrinsically lower-dimensional than the full weight space. If the task requires broad changes across many features, layers, or attention pathways, increasing rrr may continue to help for longer. If the task is close to the pretraining distribution, small ranks can be surprisingly competitive.
The second ablation knob is where to insert LoRA. In Transformer attention blocks, a strong default is often
S={WQ,WV},\mathcal{S} = \{W_Q, W_V\},S={WQ​,WV​},
meaning that LoRA adapters are placed on the query and value projection matrices. This choice is popular because it gives the model direct control over what tokens attend to and what information is transmitted through attention, while keeping the adapter budget small. Expanding the target set to include more projections, such as
$$\mathcal{S} = \{W_Q, W_K, W_V, W_O\},$$
can improve performance on harder tasks, but it increases $p_{\mathrm{LoRA}}$ proportionally to the number and size of adapted matrices. The same trade-off applies if LoRA is added to MLP projections: more insertion points mean more adaptation capacity, but also more trainable parameters, optimizer state, and training compute.
This creates a useful diagnostic distinction. If performance is poor with a small rank and narrow target set, the model may be under-adapted. But the remedy is not always “increase rank.” Sometimes adding LoRA to additional modules is better than making the same modules higher rank. Rank controls the expressiveness of each local update; target-module selection controls where in the computation graph the model is allowed to change. A small adapter in the right place can beat a larger adapter in the wrong place.
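To make the rank-versus-placement distinction concrete, the following sketch counts adapter parameters for two configurations. The module shapes, the model width of 1024, and the helper name `adapter_params` are illustrative assumptions rather than values from any particular model.

```python
# Hypothetical (d_out, d_in) shapes for the attention projections of one layer.
D_MODEL = 1024
SHAPES = {name: (D_MODEL, D_MODEL) for name in ("W_Q", "W_K", "W_V", "W_O")}


def adapter_params(targets, r, n_layers=1, shapes=SHAPES):
    """Total LoRA parameters: r * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(r * (shapes[t][0] + shapes[t][1]) for t in targets)
    return n_layers * per_layer


# Same budget spent two ways: fewer modules at higher rank, or more modules at lower rank.
narrow_high_rank = adapter_params(["W_Q", "W_V"], r=16)
wide_low_rank = adapter_params(["W_Q", "W_K", "W_V", "W_O"], r=8)
print(narrow_high_rank, wide_low_rank)
```

Under these assumed shapes the two configurations have identical parameter counts, so an ablation that compares them isolates the effect of placement from the effect of budget.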
Latency introduces a final practical wrinkle. During unmerged LoRA training, the forward pass through an adapted linear layer has two contributions: the frozen base projection and the low-rank residual. Schematically, the model computes both $W_0h$ and $sBAh$, then adds them. That extra branch is usually acceptable during training, especially because $r$ is small, but it is still an additional computation.
At inference time, LoRA has a clean escape hatch: merge the adapter into the base weight. After merging, the adapted layer is just
$$z = (W_0 + sBA)h + b_0.$$
This is algebraically equivalent to the unmerged computation, up to numerical precision, but it removes the separate adapter branch. The deployed model can therefore run like an ordinary dense model with modified weights. This is why LoRA can offer parameter-efficient training without necessarily imposing extra inference latency, provided the adapters are merged before serving.
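The equivalence between the two-branch and merged computations is easy to check numerically. The sketch below uses NumPy with small random matrices; the shapes, the seed, and the value of `alpha` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 5, 2, 4.0
s = alpha / r  # LoRA scaling s = alpha / r

W0 = rng.normal(size=(d_out, d_in))  # frozen base weight
b0 = rng.normal(size=d_out)          # frozen bias
B = rng.normal(size=(d_out, r))      # low-rank factors
A = rng.normal(size=(r, d_in))
h = rng.normal(size=d_in)            # input activation

# Unmerged: frozen path plus low-rank residual, computed as two branches.
z_unmerged = W0 @ h + s * (B @ (A @ h)) + b0

# Merged: fold the adapter into a single dense matrix first.
W_merged = W0 + s * (B @ A)
z_merged = W_merged @ h + b0

print(np.max(np.abs(z_unmerged - z_merged)))  # difference is at floating-point rounding level
```

The two outputs agree up to floating-point rounding, which is the "up to numerical precision" caveat in practice.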
The practical rule is simple: start small, then expand only when the ablation says you are underfitting. A reasonable workflow is:
begin with a small rank $r$, such as 4, 8, or 16, depending on model size and task difficulty;
adapt a narrow target set, often $\{W_Q, W_V\}$;
inspect whether validation accuracy is still improving meaningfully with rank;
if quality saturates, stop increasing $r$;
if quality remains low, try broader target modules or a modest rank increase;
merge adapters for inference when you want the standard dense forward path.
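One way to operationalize "stop when quality saturates" is to pick the smallest rank whose validation score is within a tolerance of the best score in the sweep. The sketch below is only one possible rule: the function name, the tolerance, and especially the accuracy values are made up for illustration and are not measurements.

```python
from typing import Mapping


def smallest_rank_on_plateau(val_acc: Mapping[int, float], tol: float = 0.005) -> int:
    """Return the smallest rank whose validation accuracy is within `tol` of the best."""
    best = max(val_acc.values())
    return min(r for r, acc in val_acc.items() if acc >= best - tol)


# Made-up sweep results, used only to exercise the selection rule.
sweep = {2: 0.81, 4: 0.86, 8: 0.889, 16: 0.891, 32: 0.892}
chosen = smallest_rank_on_plateau(sweep)
print(chosen)
```

With noisy validation metrics, the tolerance would normally be set from the run-to-run variance rather than fixed in advance.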
The visual below compresses these ablation lessons into one picture. The central empirical pattern is the mismatch between the blue accuracy curve and the orange parameter-cost line: accuracy may rise quickly and flatten, while $p_{\mathrm{LoRA}}$ keeps increasing linearly with $r$. That gap is exactly where LoRA’s efficiency decision lives. We are not looking for the largest adapter; we are looking for the smallest adapter that reaches the plateau.
The target-module bars summarize the second axis of the search: $\{W_Q, W_V\}$ is the compact baseline, while adding more attention projections expands capacity at additional cost. The merge schematic captures the deployment story: unmerged LoRA is a two-branch computation during adaptation, but after merging it becomes the single dense expression $(W_0+sBA)h$. Together, these ablations support the core recipe: tune rank and insertion sites for accuracy, but merge for inference so that parameter-efficient training does not become unnecessary serving overhead.

25. Limitations and Extensions

The ablations make an important point that is easy to miss: LoRA is not a universally “small version” of full fine-tuning. It is a structured hypothesis about what kind of change the pretrained model needs. When the rank, target modules, and scaling choices line up with the task, LoRA can look almost magically efficient. When they do not, the same mechanism can become a bottleneck.
At the center of the limitation is the low-rank constraint itself. For a frozen pretrained weight matrix $W_0 \in \mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$, LoRA replaces full fine-tuning of $W$ with
$$W = W_0 + \Delta W_{\mathrm{LoRA}}, \qquad \Delta W_{\mathrm{LoRA}} = sBA,$$
where
$$B \in \mathbb{R}^{d_{\text{out}}\times r}, \qquad A \in \mathbb{R}^{r\times d_{\text{in}}}, \qquad s = \alpha/r.$$
Because the update factors through an $r$-dimensional bottleneck, its rank is bounded:
$$\operatorname{rank}(\Delta W_{\mathrm{LoRA}}) = \operatorname{rank}(sBA) \le r.$$
This is exactly why LoRA is parameter-efficient, but it is also exactly why it can fail. If the task requires a genuinely high-rank change to some weight matrix, then no optimizer can recover that update from a rank-$r$ parameterization. Increasing training time or tuning the learning rate will not remove the geometric constraint. The best LoRA can do is find a good low-rank approximation to the desired update.
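Both halves of this argument can be checked on a toy matrix. The sketch below first confirms the rank bound for a random product, then uses a truncated SVD to compute the best rank-$r$ approximation of a hypothetical full-rank target update in Frobenius norm (the Eckart-Young result). The target matrix and all sizes are invented for illustration, and real LoRA training minimizes a task loss rather than the distance to a known target.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2

# Any product B @ A with inner dimension r has rank at most r.
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
rank_BA = np.linalg.matrix_rank(B @ A)

# A hypothetical "desired" update that is full rank.
target = rng.normal(size=(d_out, d_in))
U, S, Vt = np.linalg.svd(target, full_matrices=False)

# Best rank-r approximation in Frobenius norm: keep the top r singular triplets.
best_rank_r = (U[:, :r] * S[:r]) @ Vt[:r, :]
residual = np.linalg.norm(target - best_rank_r)

# The unavoidable error equals the energy in the discarded singular values.
floor = np.sqrt(np.sum(S[r:] ** 2))
print(rank_BA, residual, floor)
```

The residual is nonzero no matter how the factors are optimized, which is the geometric constraint described above.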
This connects back to the rank-2 toy example: LoRA succeeds cleanly when the needed movement in parameter space is itself low-dimensional. In large pretrained models, this is often a reasonable assumption because many downstream tasks reuse most of the pretrained representation and only need to steer a relatively small number of features. But “often” is not “always.” Some tasks may require broad changes across many features, layers, or attention pathways. In those cases, a small rank may underfit even if the training loss appears stable.
A second failure mode is placing the adapter in the wrong part of the network. LoRA does not adapt every frozen weight unless we explicitly choose to insert it everywhere. We usually define a target set $\mathcal{S}$, such as the query and value projection matrices $W_Q$ and $W_V$ in attention layers. If the task-relevant change lives mostly in $W_K$, $W_O$, MLP projections, embeddings, or later normalization-sensitive pathways, then adapting only $W_Q$ and $W_V$ may leave performance on the table. In this sense, LoRA’s efficiency comes not only from low rank, but also from a sparsity assumption over which modules need to move.
There is also a more practical limitation: LoRA has fewer trainable parameters, but it does not eliminate the frozen model. At inference time we still need access to $W_0$, either explicitly or after merging the update into it. For very large models, storing the frozen base can dominate the memory footprint. This is why LoRA is often paired with quantization: keep the large frozen backbone in low precision, and train or store the low-rank adapter separately. Methods in this family preserve LoRA’s adaptation efficiency while attacking the memory cost that LoRA alone does not solve.
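The pairing with quantization can be sketched in a few lines. The example below uses a deliberately simple symmetric per-tensor int8 scheme, chosen only because it is easy to read; it is an assumption for illustration and not the quantization format of any specific published method. The function names and sizes are likewise hypothetical.

```python
import numpy as np


def quantize_int8(W):
    """Symmetric per-tensor int8 quantization: W is approximated by q * scale."""
    scale = float(np.abs(W).max()) / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale


def lora_forward(q, scale, B, A, s, h):
    """Dequantize the frozen base on the fly, then add the full-precision adapter branch."""
    W0_hat = q.astype(np.float32) * np.float32(scale)
    return W0_hat @ h + s * (B @ (A @ h))


rng = np.random.default_rng(0)
d, r, s = 64, 4, 2.0
W0 = rng.normal(size=(d, d)).astype(np.float32)
q, scale = quantize_int8(W0)

B = rng.normal(size=(d, r)).astype(np.float32)
A = rng.normal(size=(r, d)).astype(np.float32)
h = rng.normal(size=d).astype(np.float32)

z = lora_forward(q, scale, B, A, s, h)
print(q.nbytes, W0.nbytes)  # bytes used by the int8 base vs. the fp32 base
```

The frozen base is stored at a quarter of its fp32 size here, while the adapter factors stay in full precision and remain the only trainable state.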
The hyperparameters are another source of fragility. The scaling $s=\alpha/r$, adapter dropout $p_{\mathrm{drop}}$, optimizer choice, learning rate $\eta$, and initialization all influence the effective size and stability of the update. A common mistake is to treat LoRA rank as the only meaningful knob. In practice, rank, scale, dropout, and target modules interact. For example:
increasing $r$ expands the possible update subspace;
increasing $\alpha$ changes the update magnitude without changing rank;
dropout can regularize the adapter but may slow or weaken adaptation;
changing $\mathcal{S}$ can matter more than changing $r$.
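The separation between rank and scale in the list above can be seen directly: with the factors held fixed, changing $\alpha$ rescales the update but leaves its rank untouched. The shapes, seed, and the two $\alpha$ values in this sketch are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))

results = {}
for alpha in (4.0, 16.0):
    delta_W = (alpha / r) * (B @ A)  # s = alpha / r
    results[alpha] = (np.linalg.matrix_rank(delta_W), np.linalg.norm(delta_W))
    print(f"alpha={alpha}: rank={results[alpha][0]}, Frobenius norm={results[alpha][1]:.3f}")
```

Because $\alpha$ acts as a multiplier on the update, it interacts with the learning rate $\eta$, which is one reason the two are usually tuned together rather than independently.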
Finally, some extensions modify the parameterization itself. Standard LoRA writes the update as $BA$, which couples the direction of the update with its magnitude through the same learned factors. Decomposition variants try to separate or reweight these roles, or allocate rank adaptively across layers. The common theme is not to abandon LoRA’s frozen-base-plus-trainable-update template, but to relax one of its bottlenecks: fixed rank, fixed target modules, fixed scaling behavior, storage cost, or the particular $BA$ factorization.
The visual below condenses these limitations into a comparison: each weakness of the basic LoRA recipe corresponds to a natural family of extensions. High-rank tasks motivate larger or adaptive ranks; poor target selection motivates expanding $\mathcal{S}$; memory pressure motivates quantized frozen bases; sensitivity to $s$, $p_{\mathrm{drop}}$, and $\eta$ motivates systematic ablations rather than fixed defaults.
The key takeaway is that LoRA is powerful when the desired update is both low-rank and placed where the model needs to change. Its extensions are best understood as ways of making that assumption less brittle, while preserving the central advantage: most of the pretrained network remains frozen, and only a compact adaptation mechanism is trained.

26. Unifying Summary: Forms of Adaptation

After looking at the limitations and extensions, it is useful to step back and name the common structure underneath all of these methods. The original problem was never merely that fine-tuning works poorly; full fine-tuning often works very well. The problem is that, for large pretrained models, it asks us to store and optimize a nearly complete copy of the model for every downstream task. If the base parameters are $\theta_0$, full fine-tuning replaces them with task-specific parameters $\theta$, and in each adapted linear layer we can write the change as
$$W = W_0 + \Delta W.$$
That equation is the organizing lens for the whole lecture. Fine-tuning says: “learn an unrestricted $\Delta W$.” Parameter-efficient adaptation asks: “can we get most of the benefit of $\Delta W$, while making the trainable object $\phi$ much smaller?”
For a single dense matrix $W_0 \in \mathbb{R}^{d \times k}$, full fine-tuning trains every entry of the update. The adapted computation is ordinary linear inference:
$$z = Wh + b_0 = (W_0 + \Delta W)h + b_0,$$
and the trainable parameter count for that matrix is
$$p_{\mathrm{full}} = dk.$$
This is maximally flexible, but it scales badly when we replicate it across many layers, many tasks, and many deployment variants. The inefficiency is not only storage; it also affects optimizer memory, checkpoint management, multi-task serving, and the operational burden of deciding which full model copy should be loaded for which request.
The alternatives differ mainly in where they place the small trainable object $\phi$. Prompt and prefix methods keep the model weights frozen and instead alter the input-side or attention-side context. They can be extremely compact, but their adaptation is mediated through tokens or attention states rather than through the internal weight matrices themselves. This can introduce sequence-length or attention overhead, and the adapted behavior is not usually mergeable into the original weights.
Adapters choose a different compromise: they insert small trainable modules between frozen layers. These modules often use a bottleneck structure, so the number of trainable parameters remains modest. But the adapted computation is no longer exactly the original network graph. Even if the base model is frozen, inference now includes extra module calls, which can add latency and complicate deployment. Adapters are powerful because they introduce new nonlinear computation; they are less attractive when the goal is to preserve the original architecture and recover a single ordinary weight matrix at inference.
LoRA occupies a particularly elegant point in this design space. It keeps the full fine-tuning viewpoint that adaptation is an additive weight update, but restricts that update to a low-rank factorization. For selected matrices $W_0 \in \mathcal{S}$, LoRA uses
$$\Delta W = sBA,$$
where $A \in \mathbb{R}^{r \times k}$, $B \in \mathbb{R}^{d \times r}$, and $r \ll \min(d,k)$. The adapted computation becomes
$$z = W_0h + sBAh + b_0.$$
The key is that LoRA does not need to train the full $d \times k$ matrix. It trains only the two skinny factors, so the per-matrix trainable parameter count is
$$p_{\mathrm{LoRA}} = r(k+d),$$
which is much smaller than $dk$ when $r$ is small.
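As a worked illustration of how small the ratio can be, divide the two counts. The sizes $d = k = 4096$ and $r = 8$ below are assumed purely for the arithmetic and do not refer to a specific model:
$$\frac{p_{\mathrm{LoRA}}}{p_{\mathrm{full}}} = \frac{r(k+d)}{dk} = r\left(\frac{1}{d} + \frac{1}{k}\right), \qquad \text{so for } d = k:\ \frac{p_{\mathrm{LoRA}}}{p_{\mathrm{full}}} = \frac{2r}{d}.$$
$$p_{\mathrm{full}} = 4096 \cdot 4096 = 16{,}777{,}216, \qquad p_{\mathrm{LoRA}} = 8 \cdot (4096 + 4096) = 65{,}536, \qquad \frac{p_{\mathrm{LoRA}}}{p_{\mathrm{full}}} = \frac{1}{256} \approx 0.39\%.$$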
This is why LoRA is best understood not as a mysterious new module, but as a rank-constrained version of full fine-tuning. It says that the task-specific displacement in weight space does not need to point in an arbitrary $dk$-dimensional direction. Instead, useful adaptation often lies in a much lower-dimensional subspace. The assumption is empirical but plausible: pretrained models already contain rich features, so many downstream tasks may require coordinated changes along relatively few directions rather than independent changes to every scalar weight.
There are two important subtleties. First, low rank is a constraint on each adapted matrix update, not necessarily on the entire function learned by the network. When LoRA is inserted into many layers, the resulting functional change can still be expressive because these low-rank perturbations compose through the nonlinear transformer. Second, LoRA’s efficiency depends on choosing the insertion set $\mathcal{S}$ wisely. Applying LoRA to attention projection matrices is common because those matrices strongly control routing and representation mixing, but different tasks and model families may benefit from adapting MLP projections, output projections, or additional layers.
The deployment story is what makes LoRA especially attractive. During training, we compute the frozen path $W_0h$ and the low-rank residual $sBAh$. After training, however, the update can be merged:
$$W_0 \leftarrow W_0 + sBA.$$
Once merged, inference uses an ordinary dense matrix multiplication again. There is no extra adapter module, no extra prompt length, and no need to evaluate the low-rank branch separately unless we want to keep adapters dynamically swappable. This mergeability is one of LoRA’s defining practical advantages: it behaves like a compact training-time parameterization but can become standard weights at serving time.
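If adapters should stay swappable, one option is to undo a merge by subtracting the same update before merging a different adapter. The sketch below assumes this merge-then-unmerge approach; the task names, the `merge` and `unmerge` helpers, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, s = 6, 5, 2, 2.0
W0 = rng.normal(size=(d, k))
W0_original = W0.copy()  # kept only so the round trip can be checked

# Hypothetical per-task adapters, each stored as a (B, A) pair.
adapters = {
    "task_a": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    "task_b": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
}


def merge(W, B, A, s):
    """In-place merge: W <- W + s * B @ A."""
    W += s * (B @ A)


def unmerge(W, B, A, s):
    """In-place inverse of merge: W <- W - s * B @ A."""
    W -= s * (B @ A)


merge(W0, *adapters["task_a"], s)    # serve task_a with a plain dense matrix
unmerge(W0, *adapters["task_a"], s)  # restore the base, up to rounding
restored_ok = np.allclose(W0, W0_original)
merge(W0, *adapters["task_b"], s)    # now serve task_b
print(restored_ok)
```

Repeated merge and unmerge cycles accumulate rounding error, especially in low precision, so keeping an untouched copy of the base weights is the safer design when memory allows.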
The final comparison is therefore not “which method is universally best?” but rather “which compromise matches the constraint?” A useful summary is:
Full fine-tuning gives maximum flexibility but requires a full task-specific parameter copy.
Prompt and prefix tuning minimize trainable parameters but adapt indirectly through context.
Adapters add compact trainable modules but change the inference graph.
LoRA directly approximates the full fine-tuning update with a low-rank, mergeable residual.
The visual summary that follows compresses this whole design space into one table. Each row answers the same questions: what is trained, how the forward computation changes, how many parameters are introduced, whether inference becomes more expensive, and whether the adaptation can be folded back into ordinary weights.
The LoRA row is the important synthesis. It connects the full fine-tuning equation $W=W_0+\Delta W$ to the constrained update $sBA$, shows the parameter reduction from $dk$ to $r(k+d)$, and highlights the merge operation $W_0 \leftarrow W_0+sBA$. In one line, it captures the central takeaway of the lecture: LoRA is full fine-tuning’s additive matrix update, restricted to a trainable low-rank factorization that is cheap to optimize and recoverable as ordinary weights for inference.