Maximum Likelihood Reinforcement Learning (MaxRL): A Compute-Indexed Bridge from RL to Log-Likelihood - FeynmanWiki

MACHINE LEARNING, REINFORCEMENT LEARNING - 45 MIN READ

1. Why Standard RL Fails on Hard Correctness Tasks

When we fine-tune language models on reasoning and problem-solving tasks where the feedback is a simple pass/fail judgment, the choice of objective fundamentally shapes which problems the model learns from. Two natural population-level objectives stand out: the expected pass rate, favored by standard reinforcement learning, and the expected log pass probability, which corresponds to maximum likelihood estimation over successful trajectories. At first glance they seem like minor variants, but their gradients reveal a dramatic difference in how they allocate learning signal across easy and hard problems.
Let $p_\theta^{\text{pass}}(x)$ denote the probability that the model, parameterized by $\theta$, produces a correct answer for input $x$. The RL objective that simply maximizes the overall proportion of solved problems is
$$J_{\text{RL}}(\theta) = \mathbb{E}_{x\sim\rho}\big[p_\theta^{\text{pass}}(x)\big],$$
where $\rho$ is the distribution over prompts. In contrast, the log-likelihood (ML) objective, often used in supervised fine-tuning where we have a set of correct demonstrations, can be written at the population level as
$$J_{\text{ML}}(\theta) = \mathbb{E}_{x\sim\rho}\big[\log p_\theta^{\text{pass}}(x)\big].$$
Both are legitimate goals, but they encode very different preferences. The RL objective rewards the model for achieving high average accuracy; it doesn't care whether that average comes from acing easy problems while ignoring the hardest ones. The ML objective, on the other hand, penalizes the model heavily whenever a problem's pass probability stays near zero, however difficult the problem is, because $\log p$ diverges to $-\infty$ as $p \to 0$.
The real tension appears when we inspect the gradients. Taking the derivative under the expectation gives
$$\nabla_\theta J_{\text{RL}} = \mathbb{E}_x\Big[\nabla_\theta p_\theta^{\text{pass}}(x)\Big], \qquad \nabla_\theta J_{\text{ML}} = \mathbb{E}_x\!\left[\frac{1}{p_\theta^{\text{pass}}(x)}\nabla_\theta p_\theta^{\text{pass}}(x)\right].$$
In the RL gradient, each problem's gradient vector $\nabla_\theta p_\theta^{\text{pass}}(x)$ is weighted equally, with a coefficient of $1$. In the ML gradient, that same vector is amplified by the inverse pass probability $1/p_\theta^{\text{pass}}(x)$. Consequently, hard problems, where $p$ is tiny, receive an enormous effective weight in the ML update, while easy problems, with $p$ close to $1$, contribute roughly the same as they do under RL. Since $\nabla_\theta p = p\,\nabla_\theta \log p$, the vector $\nabla_\theta p$ is itself small when $p$ is small, which is why an unweighted RL update barely registers hard prompts. The practical outcome is a binary feedback dilemma: RL almost entirely ignores the hardest prompts, whereas ML amplifies them to the point of instability.
To see how severe this imbalance can be, consider a batch of five problems: two hard ones with $p = 0.01$ and three easy ones with $p = 0.9$. Rewriting $\nabla p = p\,\nabla\log p$ puts both gradients on a common footing. Under the RL gradient, each hard problem's score direction $\nabla\log p$ carries a coefficient of $p = 0.01$, against $0.9$ for each easy problem. Under the ML gradient, every problem carries a coefficient of $1$ on $\nabla\log p$; equivalently, measured against $\nabla p$, the hard problems are each multiplied by $1/0.01 = 100$, while the easy ones receive a modest $1/0.9 \approx 1.11$. In other words, the two hard problems, which together constitute 40% of the batch, are virtually invisible to the RL update, yet under ML each receives $100$ times its RL weight, and its weight relative to an easy problem rises by a factor of $0.9/0.01 = 90$.
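The arithmetic of the five-problem example can be checked in a few lines. The sketch below is illustrative only (the function and variable names are not from the original); it tabulates the RL and ML coefficients in both bases, $\nabla p$ and $\nabla \log p$, and confirms that in either basis the ML objective raises a hard prompt's weight relative to an easy one by the same factor, $0.9/0.01 = 90$.

```python
# Pass rates from the five-problem example: hard prompts and easy prompts.
HARD_P = 0.01
EASY_P = 0.9


def coefficients(p):
    """Coefficients each objective places on grad p and on grad log p."""
    return {
        ("rl", "grad_p"): 1.0,        # RL gradient is E[grad p]
        ("ml", "grad_p"): 1.0 / p,    # ML gradient is E[(1/p) grad p]
        ("rl", "grad_log_p"): p,      # because grad p = p * grad log p
        ("ml", "grad_log_p"): 1.0,    # ML gradient is E[grad log p]
    }


hard = coefficients(HARD_P)
easy = coefficients(EASY_P)

# How much more ML emphasizes hard-vs-easy than RL does, per basis.
shift = {}
for basis in ("grad_p", "grad_log_p"):
    rl_ratio = hard[("rl", basis)] / easy[("rl", basis)]
    ml_ratio = hard[("ml", basis)] / easy[("ml", basis)]
    shift[basis] = ml_ratio / rl_ratio
    print(f"{basis}: RL hard/easy={rl_ratio:.4f}, "
          f"ML hard/easy={ml_ratio:.4f}, shift={shift[basis]:.1f}")
```

The choice of basis changes the individual coefficients but not the relative shift, which is the quantity that matters when comparing the two objectives.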
This extreme asymmetry is not just a quirk; it exposes a fundamental gap for correctness-based tasks where we have binary reward signals. Neither objective gives a principled way to control how much we care about hard examples relative to easy ones. We need a compute-indexed bridge that allows us to smoothly interpolate between these two poles, giving enough emphasis to challenging problems to learn from them while still maintaining the stability that comes from solving a broad set of tasks. The MaxRL framework, introduced after this motivation, defines exactly such a family of objectives, controlled by a single hyperparameter $N$ that governs how many samples we invest per prompt.
The visual below, a clean diagrammatic slide titled Why Standard RL Fails on Hard Correctness Tasks, reinforces this idea at a glance. On the left, it displays the gradient equations side by side, highlighting the coefficient $1$ versus $1/p$. On the right, a simple color-coded table lays out the five-task example, with hard rows in a muted red and easy rows in green. An arrow connects the RL weight column to the ML column, underscoring the factor-of-$10{,}000$ jump from the RL coefficient $0.01$ (on $\nabla\log p$) to the ML coefficient $100$ (on $\nabla p$); on a common basis, the relative emphasis on hard rows shifts by a factor of $90$. The image distills the quantitative argument into a single snapshot: neither RL nor ML alone provides a satisfactory answer for binary-correctness training, and the missing ingredient is a tunable knob between them.

2. Latent Generation Model and Pass Rate

The limitations of standard RL on hard correctness tasks force us to look beneath the surface of the final answer. If the only signal a model receives is whether its decoded output matches the ground truth, then any two models that achieve identical pass rates are indistinguishable to the optimizer, regardless of how they produce their answers. To make precise statements about what an objective can and cannot recover, we need a generative model that exposes the unobserved reasoning process while still connecting to the observable binary reward. This is the latent generation model, and it is the formal backbone of the entire MaxRL analysis.
We assume an input $x$ is drawn from a distribution $\rho$ over a space $X$. The model itself is a policy $m_\theta(\cdot|x)$ that, given $x$, produces a trajectory $z \in Z$. Crucially, $z$ is latent: it may correspond to a chain-of-thought, a sequence of tool calls, or an internal navigation plan. The model does not output $z$ directly as the user-visible answer. Instead, a deterministic decoding function $f: Z \to Y$ maps the trajectory to a final answer $y = f(z) \in Y$. For training, we assume that for every input $x$ we know the correct answer $y^*(x)$. This abstraction is general: it accommodates mathematical reasoning, code generation, multi-step retrieval, and any task where correctness can be judged by comparing the decoded output against a known target.
Because the decoder is deterministic, the only source of randomness in the final answer is the stochastic policy $m_\theta$. So we can define a binary reward that indicates correctness:
$$r(x,z) = \mathbb{I}\{ f(z) = y^*(x) \}.$$
This reward is all-or-nothing: 1 if the final answer matches, 0 otherwise. The expected reward over the model's own latent distribution, conditioned on $x$, is the per-input pass rate:
$$p_\theta^{\text{pass}}(x) = \mathbb{E}_{z \sim m_\theta(\cdot|x)}[\, r(x,z) \,].$$
In words, $p_\theta^{\text{pass}}(x)$ is the probability that a single answer obtained by sampling $z$ from the policy and then applying $f$ will be exactly correct. It is the fundamental quantity that standard RL methods optimize, but as we saw earlier, optimizing only the pass rate discards all information about the latent trajectories that produced the correct output.
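To make the pipeline from policy to pass rate concrete, here is a minimal sketch on a made-up toy task (not from the original): the latent trajectory $z$ is a pair of uniformly random digits standing in for $m_\theta(\cdot|x)$, the decoder $f$ adds them, and the target $y^*$ is a fixed sum. All names are hypothetical; the point is only that the pass rate is the mean of the binary reward over sampled trajectories.

```python
import random


def sample_trajectory(rng):
    """Stand-in for z ~ m_theta(. | x): two independent uniform digits."""
    return (rng.randrange(10), rng.randrange(10))


def decode(z):
    """Deterministic decoder f: Z -> Y (here, the sum of the two digits)."""
    return z[0] + z[1]


def reward(z, y_star):
    """Binary reward r(x, z): 1 if f(z) equals the target, else 0."""
    return 1 if decode(z) == y_star else 0


def estimate_pass_rate(y_star, num_samples, seed=0):
    """Monte Carlo estimate of p_pass = E_z[r(x, z)]."""
    rng = random.Random(seed)
    hits = sum(reward(sample_trajectory(rng), y_star)
               for _ in range(num_samples))
    return hits / num_samples


# Exact pass rate by enumeration: the fraction of digit pairs summing to 9.
exact = sum(1 for a in range(10) for b in range(10) if a + b == 9) / 100
estimate = estimate_pass_rate(y_star=9, num_samples=20000)
print(f"exact={exact:.3f}, estimate={estimate:.3f}")
```

Many different trajectories decode to the same correct answer here, which is exactly the information the scalar pass rate collapses.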
That latent information, however, is exactly what we need if we hope to recover the true conditional distribution over correct trajectories, i.e., maximum likelihood. The pass rate is a coarser statistic: it collapses the rich structure of $z$ into a single number between 0 and 1. Two models with radically different reasoning patterns can have the same pass rate, yet one might produce a correct answer by genuine understanding while the other might guess wildly but sometimes land on the right token sequence. Distinguishing them requires looking at the collection of trajectories, not just the aggregate success frequency.
This is where the idea of multiple rollouts enters naturally. If we draw $k$ independent trajectories $z_1, \dots, z_k \sim m_\theta(\cdot|x)$, we can compute the probability that at least one of the corresponding decoded answers is correct:
$$\text{pass@k}(x) = 1 - \bigl(1 - p_\theta^{\text{pass}}(x)\bigr)^k, \qquad \text{fail@k}(x) = \bigl(1 - p_\theta^{\text{pass}}(x)\bigr)^k.$$
The pass@k metric quantifies how compute, in the form of additional sampling, improves the chance of seeing a correct answer. If the base pass rate is tiny (say $p = 0.01$), then with $k=100$ attempts we get $\text{pass@k} = 1 - (0.99)^{100} \approx 0.634$. The complement $\text{fail@k}$ decays exponentially in $k$, which will later become crucial for designing objectives that connect the number of samples to the order of a Taylor-like expansion of the log-pass probability.
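The closed form can be checked against simulation. The sketch below (helper names are illustrative) evaluates pass@k and fail@k for $p = 0.01$, $k = 100$ and compares the closed form with a simple Monte Carlo run in which each rollout succeeds independently with probability $p$.

```python
import random


def pass_at_k(p, k):
    """Probability that at least one of k independent rollouts is correct."""
    return 1.0 - (1.0 - p) ** k


def fail_at_k(p, k):
    """Probability that all k independent rollouts are incorrect."""
    return (1.0 - p) ** k


def simulate_pass_at_k(p, k, trials, seed=0):
    """Fraction of trials in which some rollout out of k succeeds."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        if any(rng.random() < p for _ in range(k)):
            successes += 1
    return successes / trials


closed_form = pass_at_k(0.01, 100)
simulated = simulate_pass_at_k(0.01, 100, trials=5000)
print(f"closed form={closed_form:.4f}, simulated={simulated:.4f}")
```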
Note that pass@k is still a function of the per-input pass rate; it adds no new information about the latent trajectories themselves. However, the joint distribution of the $k$ rollouts, and in particular the number of correct answers among them, does contain additional information that, with the right objective, can guide the model toward high-likelihood reasoning paths. The stage is now set to ask: how can we design a training signal that uses this richer data, ideally recovering something akin to maximum likelihood as a limiting case?
The visual below consolidates this generative story. At a glance you see the flow from input distribution, through the stochastic policy and deterministic decoder, into the binary comparator that yields the reward. The pass rate appears as the expected reward, and the separate inset shows how $k$ independent draws give rise to the pass@k formula. The color coding (blue for input/output, green for the latent policy, red for the binary reward, gray for the known target) makes the signal-flow interpretation immediate. This diagram is not just a static definition; it will recur as the core abstraction throughout the MaxRL development, anchoring every subsequent theorem and estimator construction in the same latent generation model.

3. ML vs RL Objectives in the Binary-Correctness Setting

The previous section formalized the notion of a latent generation model: a policy $m_\theta(z \mid x)$ that stochastically produces latent trajectories $z$, and a correctness oracle $r(x,z) \in \{0,1\}$ that judges each one. For any fixed prompt $x$, the pass rate
$$p_\theta^{\text{pass}}(x) \;=\; \mathbb{E}_{z \sim m_\theta(\cdot \mid x)}\bigl[r(x,z)\bigr]$$
captures the probability that a single attempt from the policy succeeds. From this, we can define two natural optimization targets that extract different summaries of the distribution of pass rates over prompts.
The first, maximum likelihood (ML), asks: what parameters make the observed successes most probable, in the sense of maximizing the expected log pass rate? Its objective is
$$J_{\text{ML}}(\theta) \;=\; \mathbb{E}_{x \sim \rho}\bigl[\log p_\theta^{\text{pass}}(x)\bigr].$$
The second, which we will call RL, simply maximizes the expected pass rate itself:
$$J_{\text{RL}}(\theta) \;=\; \mathbb{E}_{x \sim \rho}\bigl[p_\theta^{\text{pass}}(x)\bigr].$$
These two criteria yield parallel gradients only when every prompt has the same pass rate, a degenerate case. In any realistic setting, they drive optimization in importantly different directions, and that divergence is the pivot of this entire lecture.
To see why, consider the gradients. Under mild interchangeability conditions we can push $\nabla_\theta$ inside the expectation over $x$, yielding
$$\nabla_\theta J_{\text{ML}} \;=\; \mathbb{E}_{x \sim \rho}\!\left[\frac{1}{p_\theta^{\text{pass}}(x)} \nabla_\theta p_\theta^{\text{pass}}(x)\right], \qquad \nabla_\theta J_{\text{RL}} \;=\; \mathbb{E}_{x \sim \rho}\!\left[\nabla_\theta p_\theta^{\text{pass}}(x)\right].$$
The RL gradient is the population version of a standard policy gradient update: it increases the pass rate wherever its derivative points, but it does so without caring about the absolute magnitude of that pass rate. A prompt with current success probability $0.01$ and a prompt with $0.99$ both receive the same gradient weight of $1$. In contrast, the ML gradient weights each prompt's update by $1/p$, so it puts enormous emphasis on prompts where the policy is currently failing (small $p$), and almost no extra emphasis on those already mastered, where $1/p \approx 1$.
This contrast is not just a mathematical curiosity. In fully differentiable classification tasks, where the policy directly outputs a softmax distribution over a finite set of labels $Y$ and correctness is simply the indicator of hitting the right label, the ML objective $J_{\text{ML}}$ becomes the familiar cross-entropy loss (up to sign), and the gradient $\frac{1}{p}\nabla_\theta p$ arises naturally from the derivative of the log. In our setting, however, the policy produces a latent trajectory $z$ and we only observe the binary correctness $r(x,z)$; the inner expectation $\mathbb{E}_z[r]$ is not available in closed form and cannot be differentiated by backpropagating through the sampled $z$. To obtain a gradient estimate we must resort to score-function (REINFORCE) estimators that use the log-likelihood of the sampled trajectory. Targeting the log pass rate with such an estimator inevitably brings in a $1/p$ term, because the full gradient of $\log p$ equals $(1/p)\nabla_\theta p$ and we must estimate both $p$ and its gradient from finite samples.
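A small sketch may help fix ideas about the score-function estimator. It assumes a toy softmax policy over four latent trajectories with a hypothetical set of "correct" indices, neither of which comes from the original. For this policy the score is $\nabla_\theta \log m_\theta(z) = e_z - m_\theta$, so $\nabla_\theta p$ can be computed exactly and compared with the REINFORCE average of $r \cdot S$. The last lines show the naive route to the ML gradient: dividing by an estimated pass rate.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(z, probs):
    """grad_theta log m_theta(z) for a softmax policy: onehot(z) - probs."""
    return [(1.0 if i == z else 0.0) - probs[i] for i in range(len(probs))]


def exact_grad_pass_rate(probs, correct):
    """grad p = sum over correct z of m(z) * score(z)."""
    grad = [0.0] * len(probs)
    for z in correct:
        s = score(z, probs)
        for i in range(len(probs)):
            grad[i] += probs[z] * s[i]
    return grad


def reinforce_estimates(probs, correct, num_samples, seed=0):
    """Monte Carlo estimates of grad p (mean of r * score) and of p."""
    rng = random.Random(seed)
    n = len(probs)
    grad_sum = [0.0] * n
    hits = 0
    for _ in range(num_samples):
        z = rng.choices(range(n), weights=probs)[0]
        if z in correct:  # reward r = 1; failures contribute zero
            hits += 1
            s = score(z, probs)
            for i in range(n):
                grad_sum[i] += s[i]
    return [g / num_samples for g in grad_sum], hits / num_samples


logits = [0.0, 1.0, 2.0, -1.0]   # illustrative parameters theta
correct = {0, 3}                 # hypothetical set of correct trajectories
probs = softmax(logits)
p = sum(probs[z] for z in correct)

grad_p_exact = exact_grad_pass_rate(probs, correct)
grad_p_mc, p_hat = reinforce_estimates(probs, correct, num_samples=20000)

# Naive plug-in route to the ML gradient: divide by the estimated pass rate.
grad_log_p_exact = [g / p for g in grad_p_exact]
grad_log_p_plugin = [g / p_hat for g in grad_p_mc]
```

The plug-in division is only defined once at least one success has been sampled, which is exactly where hard prompts cause trouble.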
This is the critical point: RL is not "better" than ML; it is a practical necessity born from the sampling step. The RL objective $J_{\text{RL}}$ yields the familiar policy gradient form without an explicit $1/p$ factor, precisely because the derivative of $p$ does not involve a $1/p$ weighting. That simplifies estimation enormously, but at the cost of abandoning the log-pass-rate criterion, which in many correctness tasks would be the principled target (it maximizes the likelihood of observing a correct answer under the latent model).
So we are left with a sharp question: Can we recover the log-pass-rate objective $J_{\text{ML}}$ using only finite samples, without requiring knowledge of the true pass rates? The rest of this lecture frames a family of estimators that interpolate between the RL gradient and the ML gradient, indexed by the number of samples we are willing to draw per prompt. The conceptual bridge is a Maclaurin expansion of the log, which we'll begin to unfold in the next section.
The visual below distills this tension into its bare essentials. On the left, the ML side displays the log pass rate objective and its gradient with the telltale $1/p$ factor. On the right, the RL side shows the expected pass rate objective and the simpler, factor-free gradient. The two gradients are aligned to emphasize the single difference: the weighting inside the expectation. Beneath them, the key contrast is spelled out in plain terms (the differentiable case versus the sampled latent case), and the central question is boxed in blue as a prompt for the analytical bridge we are about to build.

4. Maclaurin Expansion of Log Pass Rate

Having seen that standard RL on binary correctness reduces to optimizing the pass probability $p$, we might assume that scaling up RL by increasing sample budgets and reward granularity would naturally approach maximum likelihood. However, the true ML objective does not simply weight the reward by the empirical pass rate; it optimizes the log pass probability $\log p_\theta^{\text{pass}}(x)$. This objective has a hidden depth: it encodes far more than the first-moment probability of a single correct answer. It distills information from the entire distribution of failures across multiple independent attempts, a structure that a pass-rate-only signal entirely misses.
To uncover that structure, we expand $\log p$ through the lens of the Maclaurin series for $-\log(1-z)$, where $z$ is here a scalar dummy variable rather than a latent trajectory. This classic expansion holds for $|z|<1$ and writes the logarithm as an infinite sum of powers:
$$-\log(1-z) = \sum_{k=1}^{\infty} \frac{z^{k}}{k}, \qquad |z| < 1.$$
Now, set $z = 1-p$, which satisfies $|1-p|<1$ for any $p\in(0,1]$. We obtain
$$\log p = -\log(1 - (1-p)) = -\sum_{k=1}^{\infty} \frac{(1-p)^{k}}{k}.$$
The term $(1-p)^{k}$ is precisely the probability that all $k$ independent samples from the policy are incorrect, that is, the fail@k probability $\mathrm{fail@k}(x)$. Substituting this notation yields a crisp identity:
$$\log p_\theta^{\text{pass}}(x) = -\sum_{k=1}^{\infty} \frac{1}{k}\,\mathrm{fail@k}(x).$$
This series is not merely a formal manipulation; it decomposes the log-likelihood into an infinite harmonic mixture of higher-order failure probabilities. The weighting factors $1/k$ decay gently, meaning that fail@10 carries one-tenth the coefficient of fail@1, but its contribution is far from negligible.
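The identity is easy to verify numerically. The sketch below (the function name is illustrative) sums the first terms of $-\sum_k \mathrm{fail@k}/k$ for a fixed $p$ and compares the partial sum with `math.log(p)`.

```python
import math


def log_p_partial_sum(p, num_terms):
    """Partial sum -sum_{k=1}^{num_terms} (1 - p)^k / k of the series for log p."""
    q = 1.0 - p
    total = 0.0
    power = 1.0
    for k in range(1, num_terms + 1):
        power *= q           # power == (1 - p)^k == fail@k
        total -= power / k   # harmonic weight 1/k
    return total


p = 0.3
approx = log_p_partial_sum(p, 200)
exact = math.log(p)
print(f"partial sum={approx:.12f}, log p={exact:.12f}")
```

With $p = 0.3$ the terms shrink like $0.7^k$, so a couple of hundred terms are far more than enough; for very small $p$ many more terms would be needed.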
The real power of this expansion appears when we differentiate with respect to the model parameters $\theta$. Differentiating term by term, assuming standard smoothness conditions that allow the gradient to pass through the sum, we get
$$\nabla_\theta \log p = -\sum_{k=1}^{\infty} \frac{1}{k}\,\nabla_\theta\,\mathrm{fail@k}(x).$$
Each $\nabla_\theta\,\mathrm{fail@k}(x)$ is the policy gradient of the joint failure event over $k$ samples. But since the pass event is complementary, $\mathrm{pass@k}(x) = 1 - \mathrm{fail@k}(x)$ and the gradient of the constant vanishes, we can flip the sign:
$$\boxed{\nabla_\theta J_{\mathrm{ML}}(x) = \sum_{k=1}^{\infty} \frac{1}{k}\,\nabla_\theta\,\mathrm{pass@k}(x)}.$$
This boxed equation is the central revelation of the MaxRL framework. The maximum-likelihood gradient is an infinite harmonic mixture of the policy gradients for pass@k events. In other words, maximising $\log p$ automatically encourages not only that the model passes on its first try (pass@1) but also that it passes with high probability when given $k=2, 3, 4, \dots$ independent attempts, each weighted by a diminishing factor $1/k$. It rewards a model that becomes robustly reliable under repeated sampling, not just a model that occasionally gets the right answer.
Why does this matter? Standard RL with a binary correctness reward produces a gradient proportional to $\nabla_\theta\,\mathrm{pass@1}(x)$, capturing only the first term of this infinite series. It ignores all $k \ge 2$, discarding information about whether the model overcomes its failures when given multiple tries. The ML gradient, by contrast, explicitly accounts for the full spectrum of sample budgets, revealing that the true likelihood objective inherently relies on multiple samples. This is not an ad-hoc trick; it is a direct consequence of the logarithmic transformation.
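Because every pass@k depends on $\theta$ only through $p$, the boxed identity can be sanity-checked in one dimension: the harmonic mixture of $\mathrm{d}\,\mathrm{pass@k}/\mathrm{d}p = k(1-p)^{k-1}$ should sum to $1/p$, the derivative of $\log p$. The sketch below (names are illustrative) does this check, and also shows that keeping only the $k=1$ term gives the RL coefficient of $1$.

```python
def dpass_at_k_dp(p, k):
    """Derivative of pass@k = 1 - (1 - p)^k with respect to p."""
    return k * (1.0 - p) ** (k - 1)


def harmonic_mixture_slope(p, num_terms):
    """sum_{k=1}^{num_terms} (1/k) * d pass@k / dp."""
    return sum(dpass_at_k_dp(p, k) / k for k in range(1, num_terms + 1))


p = 0.2
rl_slope = harmonic_mixture_slope(p, 1)      # first term only: pass@1
ml_slope = harmonic_mixture_slope(p, 500)    # long partial sum
print(f"k=1 only: {rl_slope:.3f}, 500 terms: {ml_slope:.6f}, 1/p: {1.0 / p:.6f}")
```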
The visual below distils this entire derivation into a clean, colour-coded equation chain. Starting from the known Maclaurin series for $-\log(1-z)$, it substitutes $z=1-p$, rewrites the powers in terms of $\mathrm{fail@k}(x)$, differentiates, and then uses the complement relation to arrive at the harmonic sum over $\mathrm{pass@k}$ gradients. The final identity is placed in a prominent box, with annotations that highlight the interpretation: the harmonic mixture of pass@k gradients. The use of blue for $(1-p)$, red for fail@k terms, and green for pass@k terms makes the sign flip and the transformation visually immediate, reinforcing the conceptual shift from failure-centered to success-centered weighting.

5. MaxRL: A Compute-Indexed Family of Objectives

With the Maclaurin expansion of $\log p$ established as an exact, albeit infinite, series representation of the log-pass probability, we can now build a bridge between standard correctness-based RL and maximum likelihood estimation. Truncating that series at a finite order produces a family of objectives that are directly controllable by a single integer parameter: the truncation length $T$. This parameter becomes a compute index, dictating how close the objective moves toward the full log-likelihood and, correspondingly, how many rollouts are required to obtain reliable gradient estimates.
Recall from the previous expansion that for a fixed input $x$ and model parameters $\theta$, letting $p \equiv p_\theta^{\mathrm{pass}}(x)$ be the probability of generating a correct answer in one attempt, we have
$$\log p = -\sum_{k=1}^{\infty}\frac{(1-p)^k}{k} = -\sum_{k=1}^{\infty}\frac{\mathrm{fail@k}(x)}{k}.$$
The terms $\mathrm{fail@k}(x) = (1-p)^k$ decay exponentially as $k$ grows; the infinite sum is convergent for $p>0$, but in practice we cannot compute infinitely many terms. The truncation idea is simple yet powerful: keep only the first $T$ terms of the series and discard the remainder. For any truncation level $T \in \mathbb{N}$, the MaxRL truncated objective is defined as
$$J^{(T)}_{\mathrm{MaxRL}}(x) \;:=\; -\sum_{k=1}^{T}\frac{(1-p)^k}{k}.$$
Note the direction of the approximation: since $\log p = J^{(T)}_{\mathrm{MaxRL}}(x) - \sum_{k=T+1}^{\infty}(1-p)^k/k$ and the neglected tail is nonnegative (strictly positive whenever $p<1$), $J^{(T)}$ is an upper bound that overestimates $\log p$. The gap shrinks geometrically in $T$ at rate $1-p$, so it closes quickly for easy prompts and more slowly when $p$ is small, and, crucially, the gradient of $J^{(T)}$ has a remarkably clean form.
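A quick numerical look at the truncation gap can make its sign and decay rate concrete. Because the discarded tail enters $\log p$ with a negative sign, the truncated value sits above $\log p$. The sketch below (function name and the choice $p = 0.1$ are illustrative) prints $J^{(T)} - \log p$ for a few values of $T$.

```python
import math


def maxrl_objective(p, T):
    """Truncated objective J^(T) = -sum_{k=1}^{T} (1 - p)^k / k."""
    q = 1.0 - p
    return -sum(q ** k / k for k in range(1, T + 1))


p = 0.1
gaps = {T: maxrl_objective(p, T) - math.log(p) for T in (1, 4, 16, 64)}
for T, gap in gaps.items():
    print(f"T={T:3d}  J^(T) - log p = {gap:.6f}")
```

The gap stays positive and shrinks as $T$ grows; rerunning with a smaller $p$ shows the slower decay on hard prompts.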
Differentiating with respect to $\theta$ (and recalling that $p$ depends on $\theta$) yields
$$\nabla_\theta J^{(T)}_{\mathrm{MaxRL}}(x) = \sum_{k=1}^{T}\frac{1}{k}\,\nabla_\theta\,\mathrm{pass@k}(x),$$
because $\mathrm{pass@k}(x) = 1 - (1-p)^k$ and $\nabla_\theta \mathrm{pass@k}(x) = k(1-p)^{k-1}\nabla_\theta p$, so the factor $1/k$ leaves $(1-p)^{k-1}\nabla_\theta p$. Summing the geometric series $1 + (1-p) + \dots + (1-p)^{T-1}$ gives a compact gradient expression
$$\nabla_\theta J^{(T)}_{\mathrm{MaxRL}}(x) = \Bigl(1 - (1-p)^T\Bigr)\,\frac{\nabla_\theta p}{p} = \Bigl(1 - (1-p)^T\Bigr)\,\nabla_\theta \log p .$$
From this, the weight factor $1-(1-p)^T$ directly reveals the trade-off: it is a multiplicative damping that approaches $1$ from below as $T$ increases. When $T=1$, $1-(1-p)=p$, and the gradient reduces to $\nabla_\theta p$, which is exactly $\nabla_\theta\,\mathrm{pass@1}(x)$, the gradient of the standard RL objective that optimizes pass rate. As $T\to\infty$, $(1-p)^T \to 0$ (for $p>0$), the damping factor tends to $1$, and the gradient converges to $\nabla_\theta\log p$, the full maximum-likelihood gradient. Thus the MaxRL family interpolates smoothly between RL ($T=1$) and full ML ($T\to\infty$), with every intermediate $T$ defining a partially damped gradient that approximates the ultimate log-likelihood target.
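The interpolation can be tabulated directly. The sketch below (helper names are illustrative) evaluates the damping factor $1-(1-p)^T$ on $\nabla_\theta \log p$, and the equivalent coefficient $(1-(1-p)^T)/p$ on $\nabla_\theta p$, for a hard prompt with $p = 0.01$. At $T=1$ the coefficient on $\nabla_\theta p$ is the RL value $1$; as $T$ grows it approaches the ML value $1/p = 100$.

```python
def damping(p, T):
    """Coefficient on grad log p in the truncated gradient: 1 - (1 - p)^T."""
    return 1.0 - (1.0 - p) ** T


def coeff_on_grad_p(p, T):
    """Equivalent coefficient on grad p: (1 - (1 - p)^T) / p."""
    return damping(p, T) / p


p = 0.01
table = {T: (damping(p, T), coeff_on_grad_p(p, T)) for T in (1, 10, 100, 1000)}
for T, (w_log, w_p) in table.items():
    print(f"T={T:5d}  weight on grad log p = {w_log:.4f}  weight on grad p = {w_p:.2f}")
```

Note that the damping factor equals pass@T, so a prompt is weighted at nearly full ML strength once $T$ rollouts would almost surely contain a success.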
This view re-frames the problem as a compute-accuracy trade-off. Estimating $\nabla_\theta\mathrm{pass@k}$ for $k>1$ requires at least $k$ independent rollout samples to reliably assess whether any of the $k$ attempts is correct. So larger $T$ demands more samples, but it also supplies a gradient that is closer to the true ML direction. In this sense, $T$ acts as a compute knob: for a fixed compute budget one can choose the largest affordable truncation level, thereby maximising the approximation quality while respecting practical constraints.
The visual below captures this idea at a glance. It opens with a faint reference to the Maclaurin series, reminding the reader of the expansion's structure. At the centre, a prominently boxed definition displays the truncated objective $J^{(T)}_{\mathrm{MaxRL}}(x)$ and its gradient as a sum of weighted $\mathrm{pass@k}$ gradients, with $T$ highlighted as the controlling index. To the sides, two parallel mini-boxes anchor the endpoints: $T=1$ (standard RL, gradient driven only by $\mathrm{pass@1}$) and $T\to\infty$ (full ML, exact log-likelihood gradient). A horizontal arrow connects these extremes, labeled with the progression "increasing $T$ → more compute, better ML approximation". Concise takeaways at the bottom reinforce the central trade-off: larger $T$ brings the objective closer to ML but demands proportionally more rollouts. The composition makes the family's conceptual structure immediately legible: a graded spectrum from RL to ML, with compute as the currency that determines how far along that spectrum we can afford to go.

6. Theorem 1: Conditional Form of the ML Gradient

The family of objectives we just explored (MaxRL with different truncation orders $T$) provides a practical ladder from modest compute to ideal behaviour. But to understand precisely what those objectives are trying to approximate, and why a conditional expectation is the right building block, we need to examine the exact gradient of the maximum likelihood objective without any approximation. That gradient turns out to have a strikingly clean conditional form, something that is not immediately obvious from the definition $J_{\text{ML}}(x)=\log p_\theta^{\text{pass}}(x)$.
Recall the starting point for any input xxx. The ML gradient is simply the derivative of the log pass probability,
$$\nabla_\theta J_{\text{ML}}(x) = \frac{1}{p_\theta^{\text{pass}}(x)}\,\nabla_\theta\,p_\theta^{\text{pass}}(x).$$
This expression already hints at the need to know the pass rate itself—a quantity we cannot cheaply evaluate—but we can rewrite it into a form that eliminates the explicit division and reveals a much more intuitive structure.
Theorem 1 (Conditional Gradient Identity). Assume $p_\theta^{\text{pass}}(x) > 0$. Then
$$\nabla_\theta J_{\text{ML}}(x) = \mathbb{E}_{z\sim m_\theta(\cdot|x)}\!\left[\nabla_\theta\log m_\theta(z|x) \;\middle|\; f(z) = y^*(x)\right].$$
Equivalently,
$$\nabla_\theta J_{\text{ML}}(x) = \mathbb{E}\!\bigl[\,S(x,z) \;|\; r(x,z)=1\,\bigr],$$
where $S(x,z)=\nabla_\theta\log m_\theta(z|x)$ is the score function (the gradient of the log-likelihood of a single latent trajectory).
Why is this identity both surprising and useful? If we estimated the ratio $\nabla_\theta p / p$ directly with a score-function (REINFORCE) estimator, we would multiply the score by the reward and then still have to divide by the marginal pass probability, so the denominator would remain in any Monte Carlo estimate. Here, however, the gradient of the log pass probability collapses to a simple conditional expectation: the average score computed only over those latent trajectories that actually produce the correct answer. The denominator $p_\theta^{\text{pass}}(x)$ disappears because the act of conditioning on success automatically re-weights the distribution.
The derivation is short but instructive. Expand $\nabla_\theta p_\theta^{\text{pass}}(x)$ inside the fraction, using $\nabla_\theta m_\theta = m_\theta\,\nabla_\theta\log m_\theta$:
$$\nabla_\theta p_\theta^{\text{pass}}(x) = \nabla_\theta \int m_\theta(z|x)\,\mathbb{I}[f(z)=y^*(x)]\,dz = \int m_\theta(z|x)\,\nabla_\theta\log m_\theta(z|x)\,\mathbb{I}[f(z)=y^*(x)]\,dz.$$
This is just the unnormalised expectation of the score function over successful trajectories. Now substitute back:
$$\nabla_\theta J_{\text{ML}}(x) = \frac{1}{p_\theta^{\text{pass}}(x)} \int m_\theta(z|x)\,S(x,z)\,\mathbb{I}[f(z)=y^*(x)]\,dz = \mathbb{E}_{z\sim m_\theta}\!\left[S(x,z) \;\middle|\; f(z)=y^*(x)\right].$$
The final equality uses the definition of conditional expectation: $\mathbb{E}[Y \mid A] = \mathbb{E}[Y\,\mathbb{I}_A] / P(A)$. That step is what removes the explicit pass rate, which is expensive to estimate when it is small.
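The identity can be checked exactly on a toy model. The sketch below assumes a softmax policy over a handful of discrete outcomes (so the "latent trajectory" is just an outcome index and the score is `one_hot(z) - probs`); the function names and the example logits are hypothetical. It compares the conditional mean of the score with a finite-difference gradient of $\log p_\theta^{\text{pass}}$.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pass_prob(logits, correct):
    """p_pass: total probability of the outcomes in `correct`."""
    probs = softmax(logits)
    return sum(probs[z] for z in correct)


def ml_gradient_conditional(logits, correct):
    """E[S | r = 1]: score averaged over correct outcomes, weighted by m(z) / p."""
    probs = softmax(logits)
    p = sum(probs[z] for z in correct)
    grad = [0.0] * len(logits)
    for z in correct:
        for i in range(len(logits)):
            score_i = (1.0 if i == z else 0.0) - probs[i]  # softmax score
            grad[i] += (probs[z] / p) * score_i
    return grad


def ml_gradient_finite_diff(logits, correct, eps=1e-6):
    """Central finite difference of log p_pass with respect to each logit."""
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        dn = list(logits)
        up[i] += eps
        dn[i] -= eps
        diff = math.log(pass_prob(up, correct)) - math.log(pass_prob(dn, correct))
        grad.append(diff / (2 * eps))
    return grad


if __name__ == "__main__":
    logits, correct = [0.3, -1.2, 0.8, 0.0], {1, 3}
    print(ml_gradient_conditional(logits, correct))
    print(ml_gradient_finite_diff(logits, correct))
```

The two printed vectors should agree up to finite-difference error, without the conditional form ever dividing a sampled quantity by an estimated pass rate.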
The assumption $p_\theta^{\text{pass}}(x) > 0$ is not a mere technicality. It ensures the conditioning makes sense: there must be at least some chance of generating a correct output under the current model. In practice, for tasks with massive latent spaces, this typically holds from the outset when we start from a reasonable pre-trained model; entropy regularisation is one way to try to keep some probability mass on correct outputs.
Now we have a gradient that can be described purely in terms of what happens on successful rollouts. This perspective immediately illuminates a failure mode of naive RL on correctness tasks. Standard policy gradient methods (and even GRPO) optimise a surrogate that may use all sampled trajectories, perhaps up‑weighting correct ones and down‑weighting incorrect ones, but they do not exactly mimic this conditional average unless the weighting scheme matches the model’s own normalised distribution over successes. The ML gradient, in contrast, is surgically precise: take the model, sample latent trajectories, and keep only those that solve the problem. Then compute the average of their score functions. No reward‑shaping, no value baseline, no denominator—just a conditional mean.
This interpretation also offers one explanation for why maximum likelihood objectives can behave differently from pure RL objectives on reasoning benchmarks. Because the entire gradient signal comes from correct trajectories alone, the model learns to increase the likelihood of paths that demonstrably lead to the right answer, without any signal from incorrect attempts.
The two equivalent forms in the theorem, conditioning on $f(z)=y^*(x)$ and conditioning on $r(x,z)=1$, are simply two notations for the same event. The second form, using the binary reward $r(x,z)$, will be especially convenient when we later build finite-sample estimators and connect to the MaxRL truncation family.
The visual below captures the essence of Theorem 1 in a compact, lecture-ready diagram. The slide first reminds us of the starting point: the gradient of the log pass probability as a ratio. The central framed box then states the conditional identity itself, with the two equivalent expectation forms placed one above the other for immediate comparison. The use of the score notation $S(x,z)$ is made explicit, and a brief interpretation line ("the ML gradient is the average score over correct outputs only") sits beneath the theorem box, reinforcing the main takeaway. The hand-drawn, academic aesthetic keeps the focus on the algebraic insight, while the clean separation of equation blocks highlights the two parallel ways of writing the same gradient. It serves as a summary before we delve into the proof that formalises the connection between the conditional form and the MaxRL estimators.

7. Proof of Theorem 1

Building a maximum‑likelihood objective on top of correctness feedback forces us to graduate from a naive REINFORCE gradient of pass rate to the gradient of the log‑pass probability. In the previous section we saw the population‑level identity for the pass rate:
$$\nabla_\theta\,p = \mathbb{E}_{z \sim m_\theta(\cdot|x)}\big[ r(x,z)\,S(z) \big], \qquad S(z)=\nabla_\theta\log m_\theta(z|x),$$
which is a standard REINFORCE expression that weights the score by the binary correctness reward. This quantity alone tells us how to increase the probability of passing, but not how to maximise the log‑likelihood of the correct answer under the model. The jump from one to the other is the content of Theorem 1, and its proof is remarkably compact once the right algebraic relations are written down.
Our target is $J_{\text{ML}}(x) = \log p$, where $p = \mathbb{E}_z[r(x,z)]$ is the marginal pass probability. Differentiating directly gives
$$\nabla_\theta J_{\text{ML}}(x) = \frac{\nabla_\theta\,p}{p}.$$
Substituting the REINFORCE form of $\nabla_\theta\,p$ yields
$$\nabla_\theta J_{\text{ML}}(x) = \frac{\mathbb{E}_z[ r(x,z)\,S(z) ]}{p}.$$
At this point everything looks like a simple ratio of expectations. The denominator $p = \mathbb{E}_z[ r(x,z) ]$ is just the mean reward. So the gradient equals $\mathbb{E}[r\,S] / \mathbb{E}[r]$.
Now recall a basic probabilistic identity: for any random vector $X$ and a binary event $r \in \{0,1\}$ with $\mathbb{P}(r=1) > 0$, the conditional expectation of $X$ given the event is
$$\mathbb{E}[X \mid r=1] = \frac{\mathbb{E}[ r\,X ]}{\mathbb{E}[r]}.$$
This identity is nothing more than the definition of conditional expectation restricted to an event, written in terms of unconditional expectations. Its value here is that it translates a ratio of expectations back into an expected value under the success-conditioned distribution, that is, the distribution of model outputs that happen to solve the task.
Applying this identity with $X = S(z)$ and $r = r(x,z)$ gives
$$\frac{\mathbb{E}_z[ r(x,z)\,S(z) ]}{p} = \mathbb{E}_{z \sim m_\theta(\cdot|x)}[ S(z) \mid r(x,z)=1 ].$$
That is the entire proof: a pure algebraic rearrangement that replaces the ratio of two population averages with a single conditional expectation. The right-hand side is precisely the expected score, but only over those rollouts that actually produce the ground-truth answer. In symbols,
$$\boxed{\;\nabla_\theta J_{\text{ML}}(x) = \mathbb{E}\big[\nabla_\theta\log m_\theta(z|x) \;\big|\; f(z) = y^*(x)\big]\;}.$$
Why does this matter? The gradient that maximises log-likelihood turns out to be exactly what you would compute if you could sample only from the model's success distribution. In other words, the ML gradient is the average score of the model on its correct answers. This immediately suggests a practical finite-sample estimator: draw $N$ unconditional rollouts, keep the ones that pass, and average their scores. That estimator is unbiased for the gradient of a closely related objective, as we will formalise in the next section. For now, the crucial point is that the ML gradient can be expressed without ever inverting the conditional distribution or needing a separate "teacher": the structure is fully contained inside the model's own pass/fail behaviour.
The visual that supports this proof serves as a minimalistic derivation map. It presents the REINFORCE expression for $\nabla_\theta p$, the division by $p$, the rearrangement into the ratio $\mathbb{E}[r\,S]/\mathbb{E}[r]$, and the application of the conditional expectation identity, each step appearing as a single aligned equation. The final line is the boxed theorem statement, set apart with a subtle coloured background. This depiction is not meant to replace the logic, but to consolidate it into a glance-able chain of reasoning, making the essential move (from a ratio of expectations to a conditional mean) immediately visible. The viewer can see at a glance why "discard failures and average the scores" is the population-level truth behind maximum-likelihood training with binary rewards.

8. Empirical Gradient Estimator $\widehat{g}_N$

Having established that the gradient of the log pass probability can be expressed as a conditional expectation over successful trajectories, we now turn to its finite-sample approximation. Applying Theorem 1 to $N$ i.i.d. rollouts, and conditioning on at least one of them succeeding, gives
$$\nabla_\theta \log p_{\theta}(\text{success}\mid x) \;=\; \mathbb{E}_{z_{1:N}}\!\left[\,\frac{1}{K}\sum_{i=1}^N r_i\, \nabla_\theta \log m_\theta(z_i\mid x) \;\Big|\; K\ge 1\right],$$
which naturally suggests an intuitive Monte Carlo estimator: draw $N$ latent trajectories $z_1,\dots,z_N \sim m_\theta(\cdot\mid x)$, compute the binary rewards $r_i = \mathbb{I}\{f(z_i)=y^*(x)\}$, count the successes $K = \sum_{i=1}^N r_i$, and then average the score gradients of the successful trajectories, each weighted equally. However, when no successes occur ($K=0$), the conditional expectation is undefined and no learning signal can be extracted from a sample that contains only failures; in that case the estimator is simply set to zero.
This yields the MaxRL gradient estimator:
$$\widehat{g}_N(x) \;=\; \begin{cases} \displaystyle\frac{1}{K}\sum_{i=1}^N r_i\, \nabla_\theta \log m_\theta(z_i\mid x), & K \ge 1 \\[1.2em] 0, & K = 0. \end{cases}$$
The piecewise definition echoes the truncation-based logic that underpins the whole MaxRL framework: when the sample reveals no successes, we cannot improve the log pass probability from that mini-batch, so the update is null. When at least one success is present, the estimator uses only the successful draws, normalising by the observed number of successes $K$ rather than by the total sample size $N$.
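The piecewise definition translates almost line for line into code. The sketch below assumes the score vectors have already been computed and are passed in as plain lists of floats; the function name `maxrl_gradient` is hypothetical.

```python
def maxrl_gradient(scores, rewards):
    """Average the score vectors of successful rollouts; return zeros if none succeed.

    scores  : list of N score vectors (each a list of floats of equal length)
    rewards : list of N binary rewards (0 or 1)
    """
    dim = len(scores[0])
    k = sum(rewards)  # K, the number of successes
    if k == 0:
        return [0.0] * dim  # K = 0 branch of the definition
    grad = [0.0] * dim
    for s, r in zip(scores, rewards):
        if r:
            for i in range(dim):
                grad[i] += s[i] / k  # normalise by K, not by N
    return grad


if __name__ == "__main__":
    scores = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
    print(maxrl_gradient(scores, [1, 0, 1]))  # averages the first and third scores
    print(maxrl_gradient(scores, [0, 0, 0]))  # all failures: zero vector
```

In a real training setup the score vectors would not be materialised; one would typically weight per-sample log-probabilities and let automatic differentiation produce the same sum.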
It is instructive to contrast this with the classic REINFORCE estimator commonly used in reinforcement learning for correctness tasks:
$$\widehat{g}_{\mathrm{REINFORCE}}(x) \;=\; \frac{1}{N}\sum_{i=1}^N (r_i - b)\,\nabla_\theta \log m_\theta(z_i\mid x),$$
where $b$ is a baseline (often a learned value function or a moving average of rewards). REINFORCE estimates the gradient of the expected reward, which in this binary case is the pass probability $p_{\theta}(\text{success}\mid x)$. Every trajectory, successful or not, contributes to the estimate. The baseline $b$ reduces variance but does not change the objective being optimized: the expected update is $\nabla_\theta p$ rather than $\nabla_\theta p / p$, so prompts with a low pass rate receive no extra weight. In other words, REINFORCE optimises the pass rate (pass@1) but does not recover the maximum-likelihood principle that underpins standard supervised fine-tuning.
The MaxRL estimator replaces the uniform weighting of all $N$ trajectories with a conditional average over the $K$ successful ones. This seemingly minor change has significant consequences:
Normalisation by $K$, not $N$: The gradient magnitude does not shrink when the success rate is low. If only a single trajectory succeeds in a large batch, that trajectory's score gradient receives full weight, while REINFORCE would scale it by $1/N$, possibly requiring a carefully tuned learning rate schedule to make any progress.
No baseline in the definition: Because the estimator averages the score gradients of the successes only, the failure trajectories are simply ignored and there is no constant term to subtract. Conditional on $K\ge 1$, the average is unbiased for $\nabla_\theta\log p_\theta(\text{success}\mid x)$ by Theorem 1. Variance reduction is a separate matter, addressed by the control variate of Section 11.
Direct connection to the log pass probability: By construction, $\widehat{g}_N$ is the finite-sample counterpart of the conditional expectation in Theorem 1, and as Theorem 2 states, using this estimator in a stochastic gradient ascent procedure is equivalent in expectation to optimising a truncated version of $\log p_{\theta}(\text{success}\mid x)$ where the truncation order equals the sample size $N$.
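The difference in normalisation is easiest to see in the per-sample weights that multiply each score vector. The following sketch (function names hypothetical, baseline defaulting to zero for simplicity) compares the two schemes on a batch with a single success out of 64 rollouts.

```python
def reinforce_weights(rewards, baseline=0.0):
    """Per-sample weights (r_i - b) / N used by the REINFORCE estimator."""
    n = len(rewards)
    return [(r - baseline) / n for r in rewards]


def maxrl_weights(rewards):
    """Per-sample weights r_i / K used by the raw MaxRL estimator (zeros if K = 0)."""
    k = sum(rewards)
    if k == 0:
        return [0.0] * len(rewards)
    return [r / k for r in rewards]


if __name__ == "__main__":
    rewards = [0] * 63 + [1]  # one success in a batch of 64
    print("REINFORCE weight on the success:", reinforce_weights(rewards)[-1])
    print("MaxRL weight on the success:    ", maxrl_weights(rewards)[-1])
```

With a zero baseline the lone success receives weight `1/64` under REINFORCE and weight `1` under MaxRL, which is the dilution effect described in the first item above.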
A critical subtlety is the treatment of the $K=0$ case. Setting the gradient to zero when no successes are found may seem wasteful, but it is perfectly aligned with the truncation viewpoint: the all-failure event has probability $(1-p)^N$, and assigning it zero gradient is what removes the terms of order above $N$ from the Maclaurin expansion of the log pass probability. Any ad-hoc non-zero value would change the expected gradient, so that it would no longer match the truncated objective. This design choice is not a limitation but a deliberate bridge between sample size and truncation order that will be formalised in Theorem 2.
The visual below makes this contrast immediate. At the top, the main MaxRL estimator is displayed prominently, with its piecewise definition centred in a clean box. The handling of the two cases, $K\ge 1$ and $K=0$, is spelled out, emphasising the normalisation by $K$. Immediately underneath, in a slightly smaller, shaded box, the REINFORCE estimator is shown for side-by-side comparison. An arrow annotation calls out the denominator: $K$ in MaxRL versus $N$ in REINFORCE, highlighting the fundamental difference in weighting. Another arrow points to a small tag referencing Theorem 1 from the preceding slide, reminding the viewer that this estimator is not an arbitrary choice but the natural sample analogue of the conditional expectation that lies at the heart of the MaxRL derivation. The muted shading of the REINFORCE box further reinforces that MaxRL is the estimator of primary interest, with REINFORCE serving as a familiar foil.

9. Theorem 2: Estimator–Objective Equivalence

The definition of $\widehat{g}_N(x)$ in the previous section may look like a pragmatic hack: average the score vectors of correct rollouts, ignore all failures, and return zero when nothing succeeds. Yet there is a strikingly clean reason that this particular construction works. Theorem 2 uncovers the exact objective whose gradient is estimated without bias by $\widehat{g}_N$. It reveals that the estimator is not just a clever trick but a direct conduit to the truncated Maclaurin series of the log pass probability, the very series we derived earlier when expanding $\log p_\theta^{\mathrm{pass}}(x)$.
Theorem 2 (Estimator–Objective Equivalence).
For any integer $N \ge 1$ and any prompt $x$, draw $N$ i.i.d. completions $z_1,\dots,z_N \sim m_\theta(\cdot\mid x)$. Let $r_i = \mathbf{1}\{f(z_i)=y^*(x)\}$ indicate correctness, $S_i = \nabla_\theta \log m_\theta(z_i\mid x)$ be the score, and $K = \sum_{i=1}^N r_i$ count successes. With
$$\widehat g_N(x) = \begin{cases} \frac{1}{K}\sum_{i=1}^N r_i S_i, & K \ge 1,\\[4pt] 0, & K=0, \end{cases}$$
and the MaxRL objective of order $T$
$$J^{(T)}_{\mathrm{MaxRL}}(x) = -\sum_{k=1}^T \frac{(1-p)^k}{k}, \qquad p = p_\theta^{\mathrm{pass}}(x),$$
the expectation over the $N$ rollouts satisfies
$$\mathbb{E}\bigl[\widehat g_N(x)\bigr] = \nabla_\theta\, J^{(N)}_{\mathrm{MaxRL}}(x).$$
To appreciate the claim, recall that the Maclaurin expansion of $\log p$ in powers of $(1-p)$ is
$$\log p = -\sum_{k=1}^{\infty} \frac{(1-p)^k}{k}.$$
Thus $J^{(T)}_{\mathrm{MaxRL}}(x)$ is exactly the $T$-th order truncation of the true log-likelihood of correctness. When we later aggregate over a training set of prompts, maximising the full log-likelihood $\sum_x \log p(x)$ is the standard MLE target. The theorem therefore asserts that each invocation of $\widehat{g}_N$ is an unbiased Monte Carlo estimate of the gradient of a truncated version of that MLE objective, and the truncation order automatically equals the number of rollouts we have just drawn.
Why should this hold? The proof (explored in the next section) hinges on a cancellation. The direct approach would be to compute $\nabla_\theta\bigl(-\sum_{k=1}^N (1-p)^k/k\bigr)$ by differentiating term by term, which produces a sum involving powers of $(1-p)$ and the gradient of $p$. The estimator $\widehat{g}_N$ instead averages scores only over successes and divides by the empirical success count $K$. Conditional on $K = k \ge 1$, the expectation of $\frac{1}{K}\sum_{i=1}^K S_{(i)}$ over the successful trajectories is the same for every $k$, and combining it with the binomial probabilities for $K$ collapses into exactly the derivative of the finite sum. The $K=0$ case, which occurs with probability $(1-p)^N$, contributes zero, and this missing mass is what produces the damping factor $1-(1-p)^N$.
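A compact version of that argument, offered as a sketch rather than the full proof, can be written in two lines. It assumes $0 < p \le 1$, uses only the definition of $J^{(N)}_{\mathrm{MaxRL}}$, Theorem 1, and the fact that, given $K = k \ge 1$, the $k$ successful rollouts are i.i.d. draws from the success-conditioned distribution.

$$
\nabla_\theta J^{(N)}_{\mathrm{MaxRL}}(x)
= \sum_{k=1}^{N} (1-p)^{k-1}\,\nabla_\theta p
= \frac{1-(1-p)^N}{p}\,\nabla_\theta p
= \bigl(1-(1-p)^N\bigr)\,\nabla_\theta \log p,
$$

$$
\mathbb{E}\bigl[\widehat g_N(x)\bigr]
= \sum_{k=1}^{N} \Pr(K=k)\,\mathbb{E}\bigl[\widehat g_N(x) \mid K=k\bigr]
= \Pr(K\ge 1)\,\mathbb{E}[S \mid r=1]
= \bigl(1-(1-p)^N\bigr)\,\nabla_\theta \log p.
$$

The first line is a geometric sum; the second uses $\Pr(K \ge 1) = 1-(1-p)^N$ for $N$ independent rollouts. The two right-hand sides coincide, which is the statement of Theorem 2.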
The practical consequence is significant. We never need to manually decide what truncation order to use or to store explicit representations of the series. Simply drawing $N$ rollouts and calculating the empirical average score over correct rollouts produces a gradient that, in expectation, corresponds to a specific point on the ladder of Maclaurin approximations. For $N=1$, $\mathbb{E}[\widehat{g}_1] = \nabla p = \nabla (-(1-p))$, which is exactly the standard REINFORCE gradient for the pass-rate objective (RL). As $N$ grows, the expected gradient moves step by step toward $\nabla \log p$, the gradient of the full maximum-likelihood objective. The estimator therefore builds a compute-indexed bridge: the more rollouts you can afford, the closer you drive the model to maximum likelihood on the correctness task, without any change in the algorithmic mechanism.
In teaching, this equivalence is often presented as a compact, boxed statement that isolates the theorem from the surrounding algebra. The visual below captures that style: a centered theorem box containing the crucial equation $\mathbb{E}[\widehat g_N(x)] = \nabla_\theta J^{(N)}_{\mathrm{MaxRL}}(x)$, together with a minimal italic remark beneath: "Increasing $N$ climbs the MaxRL ladder." This one-liner distills the core insight that more compute automatically lifts the objective towards ML, making the ladder metaphor a handy mnemonic for the entire MaxRL framework.
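Theorem 2 can be verified exactly on a small example by enumerating every possible batch of $N$ rollouts. The sketch below assumes a toy softmax policy over a few discrete outcomes, so the score is `one_hot(z) - probs`; all function names and the example logits are hypothetical.

```python
import itertools
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_maxrl_gradient(logits, correct, n):
    """Exact E[g_hat_N], enumerating all n-tuples of outcomes of the softmax policy."""
    probs = softmax(logits)
    dim = len(logits)
    expect = [0.0] * dim
    for zs in itertools.product(range(dim), repeat=n):
        weight = math.prod(probs[z] for z in zs)  # probability of this batch
        wins = [z for z in zs if z in correct]
        if not wins:
            continue  # K = 0 contributes zero
        for z in wins:
            for i in range(dim):
                score_i = (1.0 if i == z else 0.0) - probs[i]
                expect[i] += weight * score_i / len(wins)
    return expect


def truncated_objective_gradient(logits, correct, n):
    """grad J^(N) = sum_{k=1..N} (1 - p)^(k-1) * grad p, with grad p in closed form."""
    probs = softmax(logits)
    dim = len(logits)
    p = sum(probs[z] for z in correct)
    grad_p = [
        sum(probs[z] * ((1.0 if i == z else 0.0) - probs[i]) for z in correct)
        for i in range(dim)
    ]
    factor = sum((1.0 - p) ** (k - 1) for k in range(1, n + 1))
    return [factor * g for g in grad_p]


if __name__ == "__main__":
    logits, correct = [0.5, -0.3, 0.1], {0}
    for n in (1, 2, 3):
        print(n, expected_maxrl_gradient(logits, correct, n))
        print(n, truncated_objective_gradient(logits, correct, n))
```

The enumeration grows exponentially in `n`, so this is only a sanity check for tiny cases, not something one would run in training.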

11. Variance Reduction via a Control Variate

The previous section proved that the population gradient for MaxRL is exactly the expectation of a truncated score-weighted sum, which yields a natural finite-sample estimator $\widehat{g}_N(x)$. The proof established that the estimator is unbiased, but it did not address its variance, which turns out to be the central practical obstacle when we actually draw $N$ on-policy trajectories. We must now face the fact that for challenging correctness tasks, $K$, the number of successful completions in a batch, can be very small. The raw estimator
$$\widehat{g}_N(x) = \frac{1}{K}\sum_{i=1}^{N} r_i S_i, \qquad K \ge 1,$$
and zero otherwise, inherits an acute instability: the division by the random count $K$ amplifies fluctuations, especially when $K$ takes values like 1, 2, or 3. In those regimes, a single extra success or failure drastically changes the weight $1/K$, causing large jumps in the gradient estimate. Variance reduction therefore becomes essential for any on-policy implementation that hopes to converge reliably.
A natural first thought is to borrow the classic REINFORCE baseline trick. In standard policy gradients, we can subtract a state-dependent baseline $b(x)$ from the return without biasing the gradient, because the score has zero expectation: $\mathbb{E}_{z\sim m_\theta}[b(x) \nabla_\theta \log m_\theta(z|x)] = 0$. Here, however, the normalisation by $K$ breaks that property. The weight $r_i/K$ is a random variable that depends on all $N$ drawn solutions, making it correlated with the scores in a nontrivial way; a simple baseline no longer yields a zero-mean correction. We need a control variate that remains zero-mean regardless of the mixture of successes and failures.
The solution is elegant and easy to compute: use the unconditional average score over all $N$ samples, with no reference to correctness. Define
$$V_N = \frac{1}{N}\sum_{i=1}^{N} S_i, \qquad S_i = \nabla_\theta \log m_\theta(z_i | x).$$
Because the score function always has zero expectation under the sampling distribution, $\mathbb{E}_{z\sim m_\theta}[S_i] = \mathbf{0}$, we immediately obtain $\mathbb{E}[V_N] = \mathbf{0}$, for any $N$ and any policy. This zero-mean property holds exactly, no matter the batch composition; it does not rely on independence from the rewards. Subtracting $V_N$ from the raw estimator therefore cannot introduce bias:
$$\mathbb{E}\bigl[\widehat{g}_N - V_N\bigr] = \mathbb{E}[\widehat{g}_N] - \mathbb{E}[V_N] = \mathbb{E}[\widehat{g}_N].$$
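Thus, $\widetilde{g}_N(x) = \widehat{g}_N(x) - V_N$ has the same expectation as $\widehat{g}_N(x)$: by Theorem 2 it remains an unbiased estimator of $\nabla_\theta J^{(N)}_{\mathrm{MaxRL}}(x)$, the gradient of the order-$N$ truncation of $\log \mathbb{P}(\text{correct})$, not of the full log-probability itself.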
Now we can write the combined estimator in a revealing form. When $K \ge 1$,
$$\widetilde{g}_N(x) = \frac{1}{K}\sum_{i=1}^{N} r_i S_i \;-\; \frac{1}{N}\sum_{i=1}^{N} S_i = \sum_{i=1}^{N} \left(\frac{r_i}{K} - \frac{1}{N}\right) S_i,$$
and the implementation sets $\widetilde{g}_N(x) = 0$ when $K = 0$. Strictly speaking, $\widehat{g}_N - V_N$ equals $-V_N$ on an all-failure batch, so the zero-mean argument above covers the version that subtracts $V_N$ on every batch; zeroing the $K=0$ batches instead changes the expectation by $\mathbb{E}[V_N\,\mathbb{I}\{K=0\}] = -(1-p)^{N-1}\nabla_\theta p$, which removes the last term of $\nabla_\theta J^{(N)}_{\mathrm{MaxRL}} = \sum_{k=1}^{N}(1-p)^{k-1}\nabla_\theta p$ and is small when $N$ is large. The per-sample weight shifts from $r_i/K$ to $(r_i/K - 1/N)$. For a correct sample ($r_i = 1$) this weight becomes $1/K - 1/N$; for an incorrect sample ($r_i = 0$) it becomes $-1/N$. The weights now sum to zero across all $N$ draws, because $\sum_i r_i = K$. Intuitively, the control variate $V_N$ captures the aggregate random fluctuation of the score vectors, and because the same vectors appear in $\widehat{g}_N$, subtracting $V_N$ cancels a large portion of the stochastic noise, especially when $K$ is small and $(\frac{1}{K} - \frac{1}{N})$ can be large in magnitude.
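The per-sample weights of the variance-reduced estimator are simple enough to compute directly. The sketch below (function name hypothetical) follows the convention stated in the text of returning all zeros when $K = 0$, and lets one confirm the zero-sum property numerically.

```python
def variance_reduced_weights(rewards):
    """Weights (r_i / K - 1 / N) multiplying each score S_i; all zeros when K = 0."""
    n = len(rewards)
    k = sum(rewards)
    if k == 0:
        return [0.0] * n  # convention from the text: no update on all-failure batches
    return [r / k - 1.0 / n for r in rewards]


if __name__ == "__main__":
    weights = variance_reduced_weights([1, 0, 0, 1, 0])
    print(weights)       # successes get 1/K - 1/N, failures get -1/N
    print(sum(weights))  # zero up to floating-point rounding
```

Note that a batch consisting only of successes also yields all-zero weights, since $1/K - 1/N = 0$ when $K = N$.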
The diagram that accompanies this section distills the construction into a compact visual. The left column presents the raw estimator $\widehat{g}_N$ in red, marked with its high-variance affliction for small $K$. The right column displays the control variate $V_N$ in blue, with an arrow pointing to the key property $\mathbb{E}[V_N] = \mathbf{0}$. A horizontal dashed line separates these building blocks from the final result below: a double-bordered box in dark green containing the variance-reduced estimator $\widetilde{g}_N(x) = \sum_{i=1}^N (\frac{r_i}{K} - \frac{1}{N})S_i$. A small italic note reminds us that the estimator falls back to zero when $K = 0$. The colour coding and spatial layout reinforce the logical flow: we start from a noisy but unbiased estimate, subtract a zero-mean companion that shares its stochastic source, and obtain a stabilised estimator that is far more practical for iterative on-policy training. This control variate trick turns Theorem 2's unbiased estimator into a tool we can actually deploy, setting the stage for the complete on-policy algorithm that follows.

12. Algorithm 1: On-Policy MaxRL Implementation

With the control-variate estimator $\widetilde{g}_N(x)$ fully derived, the next step is to embed it inside a practical training loop. The result is Algorithm 1, an on-policy update that replaces the usual REINFORCE gradient with a variance-reduced contribution for each input $x$. The algorithm is simple to implement, yet its structure directly realises the truncated log-likelihood objective we recovered from the Maclaurin expansion.
For a batch of inputs $B$, the algorithm samples $N$ responses from the current policy $m_\theta(\cdot | x)$ for each $x$, evaluates the binary correctness reward $r_j = \mathbf{1}\{ f(z_j) = y^*(x) \}$, and records the score vectors $S_j = \nabla_\theta \log m_\theta(z_j | x)$. The crucial step is forming the empirical success rate $\hat{r}(x) = \frac{1}{N} \sum_{j=1}^N r_j$ and then computing the gradient contribution
$$\hat{g}(x) \;=\; \frac{1}{N\,\hat{r}(x)} \sum_{j=1}^{N} \bigl( r_j - \hat{r}(x) \bigr) \, S_j,$$
whenever $\hat{r}(x) > 0$; otherwise $\hat{g}(x) = 0$. Since $N\hat{r}(x) = K$, this is the same quantity as $\widetilde{g}_N(x)$. Comparing it with the standard REINFORCE estimator $\frac{1}{N}\sum_j (r_j - b) S_j$ reveals two fundamental differences: the denominator is the total number of successes $N \hat{r}(x)$ rather than $N$, and the baseline is the per-input empirical mean $\hat{r}(x)$ instead of an exogenous baseline. Both modifications arise directly from the derivation; they are not heuristic tweaks.
The normalisation by $N \hat{r}(x)$ is what ties the algorithm to maximum likelihood. Recall from Theorem 1 that the ML gradient is a conditional expectation over the successful trajectories only. In the estimator $\widetilde{g}_N(x)$, the sum $\sum_j (r_j - \hat{r}(x)) S_j$ is equivalent to $\sum_{j: r_j=1} S_j - \hat{r}(x) \sum_j S_j$. When we divide by $N \hat{r}(x)$, we are effectively forming the estimator
$$\frac{1}{N \hat{r}(x)} \sum_{j: r_j=1} S_j \;-\; \frac{1}{N} \sum_{j} S_j,$$
which is a consistent sample approximation of $\mathbb{E}_{z\sim m_\theta(\cdot|x)}[S \mid r=1] - \mathbb{E}_{z\sim m_\theta(\cdot|x)}[S]$. The first term is the gradient of the log-likelihood restricted to correct completions, while the second term (the average over all samples) acts as a control variate whose expected value is zero under the policy. Thus the estimator targets the truncated log-likelihood objective: it shifts probability mass toward correct responses, and the full ML objective $\log p$ is maximised when all of the mass sits on correct responses ($p = 1$).
The condition $\hat{r}(x) > 0$ is a practical safeguard. If the policy never produces a correct answer for a given input within the $N$ samples, then there is no information about which directions would improve correctness, and the gradient contribution is set to zero. This prevents the update from being corrupted by division by zero and, more importantly, avoids misleading the policy when it is completely unsuccessful. It is also the same convention used for the raw estimator $\widehat{g}_N$, whose $K=0$ case is what produces the damping factor $1-(1-p)^N$ in expectation.
After accumulating $\hat{g}(x)$ across the batch, the final update direction is simply $\frac{1}{|B|} \sum_{x\in B} \hat{g}(x)$. This is a standard stochastic gradient step that averages gradient contributions over the batch. Note that the sampling of $N$ completions per input is done on-policy, so the rollouts must be re-drawn after each parameter update to maintain consistency with the current policy, exactly as in any REINFORCE-style algorithm.
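The loop described above can be sketched end to end on a toy problem. This is not the source's Algorithm 1 verbatim: it assumes a hypothetical tabular softmax policy with one logit vector per prompt (so the score is `one_hot(z) - probs`), two made-up prompts `"x1"` and `"x2"`, and plain gradient ascent with a fixed learning rate.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def maxrl_contribution(logits, correct, n, rng):
    """g_hat(x) = (1 / (N r_hat)) * sum_j (r_j - r_hat) S_j, or zeros if r_hat = 0."""
    probs = softmax(logits)
    samples = rng.choices(range(len(logits)), weights=probs, k=n)  # on-policy rollouts
    rewards = [1 if z in correct else 0 for z in samples]
    k = sum(rewards)
    grad = [0.0] * len(logits)
    if k == 0:
        return grad  # no success observed: zero contribution
    r_hat = k / n
    for z, r in zip(samples, rewards):
        w = (r - r_hat) / (n * r_hat)
        for i in range(len(logits)):
            grad[i] += w * ((1.0 if i == z else 0.0) - probs[i])
    return grad


def train(steps=200, n=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Hypothetical tabular policy: four candidate answers per prompt.
    policy = {"x1": [0.0] * 4, "x2": [0.0] * 4}
    answers = {"x1": {2}, "x2": {0}}
    for _ in range(steps):
        batch = list(policy)  # use every prompt as the batch B
        for x in batch:
            g = maxrl_contribution(policy[x], answers[x], n, rng)
            # Batch average 1/|B|; prompts have disjoint parameters here, so
            # updating them one after another equals a single joint step.
            policy[x] = [v + lr * gi / len(batch) for v, gi in zip(policy[x], g)]
    return {x: sum(softmax(policy[x])[z] for z in answers[x]) for x in policy}


if __name__ == "__main__":
    print(train())  # pass probability per prompt after training
```

Rollouts are re-drawn from the current logits at every step, which is the on-policy requirement noted above. With a neural policy the explicit score vectors would be replaced by a weighted sum of log-probabilities passed to an automatic differentiation framework.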
The algorithm's design highlights the role of $N$ as a compute-indexed knob. When $N=1$, $\hat{r}(x)$ can only be 0 or 1, and the variance-reduced estimator is identically zero: a lone success receives weight $(r_1 - \hat{r}(x))/(N\hat{r}(x)) = 0$, and a lone failure triggers the $\hat{r}(x) = 0$ fallback. At least two rollouts per input are therefore needed for a non-zero update. As $N$ grows, the estimator concentrates around the true truncated gradient, reducing variance and enabling finer-grained updates that approach MLE behaviour. Later we will see that the sample count $N$ directly corresponds to the truncation order in the objective, making the compute budget an explicit parameter that interpolates between RL and pure log-likelihood training.
It is also instructive to contrast MaxRL with GRPO, which subtracts the group mean reward and divides by the group standard deviation to form an advantage. That approach works well for scalar reward shaping but does not target a log-likelihood objective. MaxRL, by using the per-input success fraction $\hat{r}(x)$ as both baseline and normaliser, instead targets the gradient of a truncated log-likelihood when rewards are binary and the success probability is non-negligible.
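To make the contrast concrete, the sketch below computes the per-rollout coefficient that multiplies each score under the two schemes for one group of binary rewards. The function names are hypothetical, the GRPO version uses the population standard deviation with a small `eps` added for numerical safety (an assumption, since implementations differ), and the common $1/N$ factor is left out of both.

```python
import math
from typing import List, Sequence


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """(r_i - mean) / (std + eps), using the population standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def maxrl_coefficients(rewards: Sequence[float]) -> List[float]:
    """(r_i - r_hat) / r_hat, or all zeros when no rollout succeeded."""
    n = len(rewards)
    r_hat = sum(rewards) / n
    if r_hat == 0:
        return [0.0] * n
    return [(r - r_hat) / r_hat for r in rewards]


if __name__ == "__main__":
    group = [1, 0, 0, 0]  # one success in four rollouts
    print("GRPO :", grpo_advantages(group))
    print("MaxRL:", maxrl_coefficients(group))
```

For a hard prompt with one success in four, the MaxRL coefficient on the lone success is $0.75 / 0.25 = 3$, while dividing by the standard deviation gives roughly $\sqrt{3}$; the gap widens as the success fraction shrinks.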
The accompanying slide provides a compact, at-a-glance summary of these ideas. A central pseudocode block faithfully renders Algorithm 1, with the line computing $\hat{g}(x)$ highlighted; this is the key line that embodies the variance-reduced estimator we have just deconstructed. Beneath the block, two concise bullet points contrast MaxRL with REINFORCE and GRPO: one notes that REINFORCE normalises by $N$ and subtracts a baseline, while MaxRL normalises by success count $N\hat{r}$ and subtracts the mean reward $\hat{r}$; the other points out that GRPO divides by standard deviation, whereas MaxRL’s normaliser is tied directly to the sample success rate. These contrasts crystallise the implementation differences that make MaxRL distinct, turning the slide into a quick reference for anyone implementing the method.

13. Unifying Weight-Function View

Having implemented MaxRL as an on-policy procedure in Algorithm 1, we now have a concrete algorithm that uses multiple independent attempts to construct an unbiased gradient estimate. But stepping back, it becomes clear that MaxRL is not an isolated trick: it sits inside a broader family of methods that share a common mathematical skeleton. All of them can be understood as different ways of choosing how much to amplify the gradient signal from a prompt $x$ based on the model’s current pass probability $p = p_\theta^{\text{pass}}(x)$. This suggests a simple, unifying language: a weight function $w(p)$ that scales the per-example gradient $\nabla_\theta p$. Once we adopt this view, the design space of policy-gradient algorithms for correctness tasks collapses to selecting the weight function $w(p)$, and we can compare methods side-by-side by looking at their weight curves.
The shared template is deceptively compact. Let $\rho$ denote the distribution over prompts (or more generally the state visitation distribution under the current policy). Then the gradient of any objective that factors through the pass probability can be written as
$$\nabla_{\theta} J = \mathbb{E}_{x\sim\rho}\!\Big[ w(p) \; \nabla_{\theta} p \Big], \qquad p = p_\theta^{\text{pass}}(x).$$
The weight function $w(p)$ encodes the objective’s sensitivity to examples of different difficulty. An example where the model nearly always succeeds ($p \approx 1$) and an example where it nearly always fails ($p \approx 0$) can receive drastically different weights depending on the method. The template itself is just the chain rule: if $J = \mathbb{E}_{x\sim\rho}[F(p)]$ for a differentiable $F$, then $w(p) = F'(p)$, and $\nabla_\theta p$ is the ordinary policy gradient of the binary correctness reward. Its real value is that it separates the scale of the update, $w(p)$, from its direction, $\nabla_\theta p$, allowing us to design algorithms by reasoning about $w(p)$ in isolation.
Standard reinforcement learning, the most common baseline, treats every completed trajectory equally: a correct answer yields a reward of 1, an incorrect one 0. Under this reward scheme the expected return is the pass probability, and its gradient reduces to $\mathbb{E}[\nabla_\theta p]$, i.e., $w_{\text{RL}}(p) = 1$. This constant weight ignores how certain the model already is about a prompt. An easy prompt contributes exactly as much gradient as a borderline one, which is wasteful when many gradient samples are dominated by high-probability noise. GRPO, the method used in DeepSeek-R1, attempts to remedy this by normalising rewards within a group of rollouts. Its effective weight function becomes the reciprocal of the standard deviation of the binary outcome: $w_{\text{GRPO}}(p) = 1/\sqrt{p(1-p)}$. This function is symmetric and strongly U-shaped: it heavily upweights prompts where $p \approx 0$ or $p \approx 1$ because those are the cases with the smallest variance. While this gives maximal weight to confidently correct or confidently wrong answers, it also amplifies noise, since the variance estimate itself is unstable for extreme probabilities.
At the opposite extreme sits maximum likelihood estimation over correct trajectories. Maximising log-likelihood of correct answers (or, equivalently, minimising the cross-entropy loss on only positive examples) gives a gradient of the form $\mathbb{E}_{x\sim\rho}[(1/p)\nabla_\theta p]$, so $w_{\text{ML}}(p) = 1/p$. This hyperbola is gentle for easy prompts ($p$ near 1, weight near 1) but grows without bound as $p$ shrinks, desperately trying to lift the tiniest success probabilities. It is compute-hungry because it needs reliable estimates of $\nabla_\theta \log p$ from very rare successes: on the order of $1/p$ rollouts are needed per prompt just to observe a single one. MaxRL bridges these extremes. Its weight function, derived from the truncated Maclaurin expansion of $\log p$, is
$$w_{\text{MaxRL}(T)}(p) = \frac{1 - (1-p)^T}{p}.$$
For $T=1$ this reduces to $w_{\text{RL}}$ (since $1-(1-p)=p$), while as $T\to\infty$ it approaches $1/p$ for any $p>0$, recovering ML. At finite $T$, the function tracks the ML hyperbola $1/p$ wherever $p \gg 1/T$, because $(1-p)^T$ is then negligible, and it flattens out at the finite value $T$ as $p \to 0$ instead of diverging. Near $p = 1$ every member of the family agrees with $w_{\text{RL}} = 1$. This interpolation is controlled solely by the truncation order $T$, which equals the number of independent attempts used in the MaxRL estimator: a direct compute-indexed bridge.
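The four weight functions are simple enough to tabulate directly. The sketch below uses hypothetical function names and only the formulas stated in the text; it is meant for checking the limiting cases numerically, not as part of any training code.

```python
import math


def w_rl(p: float) -> float:
    """Standard RL: constant weight."""
    return 1.0


def w_grpo(p: float) -> float:
    """GRPO: reciprocal standard deviation of a Bernoulli(p) outcome."""
    return 1.0 / math.sqrt(p * (1.0 - p))


def w_ml(p: float) -> float:
    """Maximum likelihood: derivative of log p."""
    return 1.0 / p


def w_maxrl(p: float, t: int) -> float:
    """MaxRL with truncation order t: (1 - (1 - p)^t) / p."""
    return (1.0 - (1.0 - p) ** t) / p


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1, 0.01, 0.001):
        row = [w_rl(p), w_maxrl(p, 4), w_maxrl(p, 50), w_ml(p), w_grpo(p)]
        print(p, ["%.3f" % v for v in row])
```

Running the loop shows the pattern described above: the MaxRL columns stay close to $1/p$ until $p$ drops well below $1/T$, after which they level off near $T$.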
Viewing all four weight functions on a single log-log plot makes the relationships immediate and intuitive. The x-axis is the pass probability $p$, spanning several orders of magnitude from near-impossible prompts to near-certain ones. The y-axis is the weight $w(p)$, also on a logarithmic scale to expose power-law behaviour. In such a visual, RL appears as a flat horizontal line at $w=1$, utterly indifferent to $p$. Maximum likelihood traces a straight line with slope $-1$ (since $\log w = -\log p$), a hyperbola that skyrockets for tiny $p$. GRPO forms a bowl that rises sharply at both ends, visually distinct from everything else. The MaxRL family fans out between RL and ML: for $T=2$ the curve is $w = 2 - p$, which never rises above 2; $T=10$ bends significantly earlier; $T=50$ hugs $1/p$ over a wide range before saturating at $w \approx T$ for $p\to0$. This saturation is the key: MaxRL never overweights hopeless prompts as severely as ML does, because the truncation caps the weight at $T$. The plot also reveals a subtle danger: GRPO’s peak at $p \to 1$ can be far larger than any MaxRL curve for confident successes, potentially causing overfitting to already-mastered prompts instead of focusing compute on the tail. In contrast, MaxRL concentrates gradient credit on the examples that are neither impossible nor already solved, which matches the intuition of efficient learning.
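The saturation at $w \approx T$ follows from a one-line identity. Writing $q = 1 - p$ and applying the finite geometric sum gives the worked step below; nothing beyond the stated formula for $w_{\text{MaxRL}(T)}$ is assumed.

```latex
\begin{align*}
w_{\text{MaxRL}(T)}(p)
  &= \frac{1-(1-p)^T}{p}
   = \frac{1-q^T}{1-q}
   = \sum_{j=0}^{T-1} q^{\,j}
   = \sum_{j=0}^{T-1} (1-p)^{j}, \\
\text{so for } p \in (0,1]: \quad
  1 &\le w_{\text{MaxRL}(T)}(p) \le T,
  \qquad
  \lim_{p\to 0} w_{\text{MaxRL}(T)}(p) = T,
  \qquad
  w_{\text{MaxRL}(T)}(1) = 1 .
\end{align*}
```

The same identity shows that the weight is a polynomial of degree $T-1$ in $p$, so it is smooth on the whole interval, unlike $1/p$ and $1/\sqrt{p(1-p)}$, which diverge at the endpoints.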
The visual below distills the entire discussion into a single, glanceable comparison. The four families are colour-coded and a legend identifies each by its functional form. The family of MaxRL($T$) curves, plotted for $T \in \{2,4,10,50\}$, visibly threads the needle between the flat RL baseline and the steep ML target, illustrating how the hyperparameter $T$ indexes a smooth trade-off. The log-log axes make it obvious that the weight functions differ most dramatically in the low-probability regime, precisely where data scarcity forces an algorithm to choose between high variance and high bias. This unified weight-function view not only organises existing methods but also suggests new ones: any monotonically decreasing $w(p)$ with controlled growth near zero could be a candidate for compute-efficient reinforcement fine-tuning, and this plot gives us the mental model to design it.

14. Empirical Highlights

With a unified weight-function view of the optimization landscape, the theoretical promise of MaxRL becomes concrete: by indexing the gradient estimate with a truncation order $N$, the algorithm interpolates between a raw reinforcement signal and the exact maximum-likelihood gradient. The natural next step is to validate whether that interpolation translates into genuine empirical gains, especially in regimes where existing methods are known to struggle. The empirical highlights across image classification, spatial reasoning, mathematical problem-solving, and large-scale language model fine-tuning all point to the same conclusion: MaxRL consistently outperforms REINFORCE-style baselines and the popular GRPO family, often by dramatic margins.
Recall the core predicament that standard policy-gradient methods face on correctness tasks. When a binary reward only flags whether a sampled answer is correct, the gradient estimator is fundamentally limited to the support of positive rollouts. In problems with a low initial pass rate, that support can be extremely sparse; the resulting signal is weak, high-variance, and entirely blind to the structure of incorrect answers. REINFORCE and its modern derivatives (RLOO, GRPO) consequently stall in these cold-start conditions: they simply do not see enough correct traces to climb out of the low-performance basin. MaxRL sidesteps this trap by exploiting the Maclaurin expansion of the log-pass-probability: its estimator reweights each prompt’s rollouts according to the truncated series, which amplifies the learning signal on prompts where successes are rare. The truncation order $N$ becomes a compute-indexed dial that, when turned up, recovers the full log-likelihood gradient in the limit $N \to \infty$.
The first striking demonstration comes from an ImageNet classification proxy, where the model is trained from a low initial pass rate. Standard REINFORCE plateaus early: its reliance on sporadic positive examples prevents convergence toward the cross-entropy teacher. MaxRL, in contrast, closely tracks the cross-entropy baseline as the number of rollouts per sample grows (Figure 2). This is a direct consequence of Theorem 2: with $N$ rollouts, the finite-sample MaxRL estimator is unbiased for the gradient of the log-pass-probability series truncated at order $N$. As $N$ increases, the objective smoothly morphs into a proper maximum-likelihood loss, explaining why it can eventually match cross-entropy performance. The experiment thus confirms that MaxRL’s compute-indexed bridge is not just a formal curiosity but a practical mechanism for escaping the cold-start trap.
Equally telling is a Maze navigation task with access to effectively infinite training data. Here the question is not data scarcity but gradient efficiency: how many environment interactions are needed to reach a strong policy? MaxRL scales far more gracefully with the number of training rollouts than GRPO does. Notably, MaxRL with only 4 rollouts per problem instance already outperforms GRPO using 128 rollouts (Figure 3, Table 3). This counter‑intuitive result makes sense through the weight‑function lens. GRPO applies a severe advantage‑clipping operation that discards fine‑grained credit assignment among rollouts, effectively compressing the information into a crude relative‑ranking signal. MaxRL’s weight function, being a smooth polynomial in the pass rate, preserves richer per‑sample feedback even with a small rollout budget, so it needs far fewer samples to build an accurate gradient. Empirically, this translates into a massive reduction in required compute.
Perhaps the most dramatic warning for practitioners comes from the GSM8K data-scarce regime. Here, fine-tuning a language model on only a handful of math word-problem chains reveals a dark side of optimizing solely for pass rate. GRPO and RLOO drive the model to high pass@1, but they simultaneously suffer catastrophic pass@k degradation: the set of valid solution paths collapses, and the model loses the diversity that makes test-time majority voting effective (Figure 4, Table 4). MaxRL achieves a higher peak pass@1 while preserving pass@k diversity, so the distribution over correct reasoning chains remains rich. This is exactly what we would expect when the objective approximates maximum likelihood rather than a mode-seeking RL signal. The truncation order $N$ acts as an implicit regularizer; even with finite $N$, the log-probability target encourages coverage of all correct modes, not just the easiest one. For safety-critical or reasoning-intensive tasks, that property is invaluable.
The modern scale test cements the case. Fine‑tuning Qwen3 1.7B and 4B models on mathematical benchmarks (AIME, BeyondAIME, MATH‑500, Minerva) with a perfect outcome verifier reveals that MaxRL Pareto‑dominates GRPO on both pass@1 and pass@k across all tasks (Figure 5). The dominance is particularly tangible in test‑time compute scaling: when allowed to sample and majority‑vote at test time, models trained with MaxRL achieve up to a 20× efficiency gain over GRPO‑trained counterparts. In other words, to reach a target accuracy, a MaxRL model needs 20 times fewer test‑time samples, directly capitalizing on its preserved distributional diversity. The visual below captures this cluster of results in a 2×2 grid of summary bullet points, with green‑coded successes where MaxRL excels and red‑coded pitfalls for competing methods. A dedicated highlight box underscores the 20× test‑time scaling advantage, reminding us that the bridge from RL to log‑likelihood is not merely an academic equivalence but a recipe for substantially better sample efficiency at both training and inference time.

15. MaxRL at a Glance

The empirical results we just examined show a striking pattern: a model fine-tuned with a simple reinforcement learning reward—answer correctness—can boost its pass rate on held-out prompts, yet it often fails to capture the full statistical richness of the data. The learned policy may ignore subtle failure modes, become overconfident, and ultimately leave a gap when we measure its log-probability rather than the binary pass rate. This raises a deeper question: can we design a family of objectives that, at low compute, behaves like an RL pass-rate maximizer but, as we increase the number of samples, converges to the maximum likelihood estimate? Maximum Likelihood Reinforcement Learning (MaxRL) does exactly that by carving a compute-indexed path from the binary reward world to the log-likelihood ideal.
The key mathematical observation is that for any prompt $x$ and any evaluation protocol that ultimately extracts a binary correctness outcome, the model’s pass probability $p=p_\theta^{\text{pass}}(x)$ satisfies
$$\log p = \log(1-(1-p)) = -\sum_{k=1}^\infty \frac{(1-p)^k}{k},$$
a Maclaurin series in $1-p$ that converges for $0 < p \le 1$. Truncating this expansion after $T$ terms yields a family of surrogate objectives
$$J^{(T)}_{\text{MaxRL}} = -\sum_{k=1}^{T} \frac{(1-p)^k}{k},$$
where we treat $p$ as the quantity to be optimized over the policy parameters. When $T=1$, $J^{(1)}=-(1-p)$, a linear function of the pass rate; maximizing it is equivalent (up to an additive constant) to maximizing the expected binary reward, which is standard RL on correctness. For $T>1$, higher-order terms enter the objective. Each $(1-p)^k$ is the fail@k probability, the complement of the pass@k probability that at least one of $k$ sampled answers is correct, and these terms become increasingly influential when $p$ is small. In other words, the truncation order $T$ acts as a dial that controls how far we push beyond a single binary success toward the full log-probability surface.
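A quick numerical check makes the role of the dial $T$ tangible. The sketch below evaluates the truncated objective for a single prompt and compares it with $\log p$; the function name is hypothetical and the code uses only the series stated above.

```python
import math


def truncated_log_pass(p: float, t: int) -> float:
    """J^(T) for one prompt: -sum_{k=1..T} (1 - p)^k / k."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -sum((1.0 - p) ** k / k for k in range(1, t + 1))


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):
        values = [truncated_log_pass(p, t) for t in (1, 4, 16, 64)]
        print(p, values, "log p =", math.log(p))
```

Because every dropped term is negative, each truncation sits above $\log p$ and approaches it from above as $T$ grows. The approach is fast when $p$ is large and slow when $p$ is small, which is why hard prompts need a larger truncation order before the surrogate resembles the log-likelihood.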
Gradient-based optimization is made practical by the following identity (Theorem 1 of the MaxRL derivation):
$$\nabla_\theta J^{(T)} = \sum_{k=1}^{T} \frac{1}{k}\,\nabla_\theta\,\text{pass@k}(x).$$
Each term $\nabla_\theta\,\text{pass@k}(x)$ can be estimated without bias from a finite number of rollouts. Crucially, when we draw $N$ independent rollouts per prompt and construct the natural estimator $\widehat{g}_N$ (described earlier in the lecture), that estimator is unbiased for $\nabla_\theta J^{(N)}$. Thus the rollout count $N$ directly sets the effective truncation order: with $N$ samples we are, in expectation, optimizing $J^{(N)}_{\text{MaxRL}}$. This compute-indexed link is the heart of MaxRL: the level of sampling determines which member of the objective family we are actually pursuing.
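The unbiasedness claim can be checked exactly on a toy problem by enumerating every possible set of rollouts. The sketch below makes several assumptions that are not from the source: the policy is a softmax over three latent outcomes, a subset of them counts as correct, and the estimator is taken to be the average score over successful rollouts (zero when there are none), without the mean-score control variate. All function names are hypothetical.

```python
import itertools
import math
from typing import List, Sequence, Set


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(z: int, probs: Sequence[float]) -> List[float]:
    """Gradient of log pi(z) with respect to the softmax logits."""
    return [(1.0 if j == z else 0.0) - probs[j] for j in range(len(probs))]


def expected_estimator(logits: Sequence[float], correct: Set[int],
                       n: int) -> List[float]:
    """Exact expectation of the success-averaged score over all n-tuples."""
    probs = softmax(logits)
    dim = len(probs)
    expectation = [0.0] * dim
    for rollouts in itertools.product(range(dim), repeat=n):
        weight = math.prod(probs[z] for z in rollouts)
        wins = [z for z in rollouts if z in correct]
        if not wins:
            continue  # no success observed: the estimate is zero
        for z in wins:
            s = score(z, probs)
            for j in range(dim):
                expectation[j] += weight * s[j] / len(wins)
    return expectation


def truncated_gradient(logits: Sequence[float], correct: Set[int],
                       n: int) -> List[float]:
    """sum_{k=1..n} (1/k) * grad pass@k, with grad pass@k = k (1-p)^(k-1) grad p."""
    probs = softmax(logits)
    dim = len(probs)
    p = sum(probs[z] for z in correct)
    grad_p = [0.0] * dim
    for z in correct:
        s = score(z, probs)
        for j in range(dim):
            grad_p[j] += probs[z] * s[j]
    coeff = sum((1.0 / k) * k * (1.0 - p) ** (k - 1) for k in range(1, n + 1))
    return [coeff * g for g in grad_p]
```

Calling both functions with the same logits, correct set, and $n$ should give vectors that agree to floating-point precision, which is the statement that $N$ rollouts target $J^{(N)}$ in expectation. The enumeration costs $3^n$ evaluations, so it is only practical for small $n$.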
Why does this matter? As we increase $N$, the objective $J^{(N)}$ progressively incorporates pass@k terms for larger $k$, each weighted by $1/k$. Hard problems, where the pass probability $p$ is low, see relatively larger contributions from higher-order terms because $\nabla_\theta\,\text{pass@k}(x)$ tends to be more pronounced when the model struggles to get any sample correct. The gradient thus concentrates on the most difficult prompts, preventing the model from simply learning a uniform “easy mode” strategy. At the same time, the explicit dependence on pass@k for $k>1$ discourages pass@k collapse, a degenerate behavior observed in vanilla RL where the policy becomes deterministic and all $k$ attempts produce the same (possibly wrong) answer, making pass@k estimates unreliable and learning stale. With MaxRL, even if pass@1 is high, the model retains incentive to produce diverse correct solutions, because any failure to achieve at least one correct answer among $k$ trials is penalized by the $(1-p)^k/k$ terms.
Taken together, the MaxRL framework cleanly unifies pure RL, maximum likelihood, and a spectrum of intermediate objectives under a single weight-function perspective (when compared to GRPO and similar algorithms, MaxRL can be seen as adjusting the sampling weight according to $1/p$ truncated at order $N$). The visual summary below condenses this unified view into a compact table, listing the core objective, its stochastic gradient, and the unbiased estimator that realizes the truncation via rollouts. Below the table, highlighted bullet points reinforce the compute-indexed property (more samples $N$ imply a higher truncation $T=N$ and a better ML approximation) and the practical advantages: concentrating gradient on hard examples, preventing pass@k collapse, and scaling effectively with both compute and data. This at-a-glance reference grounds the more detailed theorems we have explored and serves as a quick mental model for why MaxRL behaves as a smooth bridge from simple reward maximization to full log-probability learning.

2. Latent Generation Model and Pass Rate

The limitations of standard RL on hard correctness tasks force us to look beneath the surface of the final answer. If the only signal a model receives is whether its decoded output matches the ground truth, then any two models that achieve identical pass rates are indistinguishable to the optimizer, regardless of how they produce their answers. To make precise statements about what an objective can and cannot recover, we need a generative model that exposes the unobserved reasoning process while still connecting to the observable binary reward. This is the latent generation model, and it is the formal backbone of the entire MaxRL analysis.
We assume an input $x$ is drawn from a distribution $\rho$ over a space $X$. The model itself is a policy $m_\theta(\cdot|x)$ that, given $x$, produces a trajectory $z \in Z$. Crucially, $z$ is latent: it may correspond to a chain-of-thought, a sequence of tool calls, or an internal navigation plan. The model does not output $z$ directly as the user-visible answer. Instead, a deterministic decoding function $f: Z \to Y$ maps the trajectory to a final answer $y = f(z) \in Y$. For training, we assume that for every input $x$ we know the correct answer $y^*(x)$. This abstraction is remarkably general: it accommodates mathematical reasoning, code generation, multi-step retrieval, and any task where correctness can be judged by comparing the decoded output against a known target.
Because the decoder is deterministic, the only source of randomness in the final answer is the stochastic policy $m_\theta$. So we can define a binary reward that indicates correctness:
$$r(x,z) = \mathbb{I}\{ f(z) = y^*(x) \}.$$
This reward is all-or-nothing: 1 if the final answer matches, 0 otherwise. The expected reward over the model’s own latent distribution, conditioned on $x$, is the per-input pass rate:
$$p_\theta^{\text{pass}}(x) = \mathbb{E}_{z \sim m_\theta(\cdot|x)}[\, r(x,z) \,].$$
In words, $p_\theta^{\text{pass}}(x)$ is the probability that a single answer obtained by sampling $z$ from the policy and then applying $f$ will be exactly correct. It is the fundamental quantity that standard RL methods optimize, but as we saw earlier, optimizing only the pass rate discards all information about the latent trajectories that produced the correct output.
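A tiny concrete instance may help fix the notation. In the sketch below the policy is an explicit table of trajectory probabilities, the decoder $f$ keeps whatever follows the last equals sign, and the pass rate is computed exactly as the total probability of trajectories that decode to the target. The trajectories, probabilities, and function names are all invented for illustration.

```python
from typing import Callable, Dict


def pass_rate(policy: Dict[str, float],
              decode: Callable[[str], str],
              target: str) -> float:
    """Exact p_pass(x): sum of m(z|x) over trajectories z with f(z) = y*."""
    return sum(prob for z, prob in policy.items() if decode(z) == target)


def decode_after_equals(z: str) -> str:
    """Toy decoder f: the final answer is the text after the last '='."""
    return z.rsplit("=", 1)[-1].strip()


# Hypothetical policy m(. | x) for the prompt "what is 2 + 2?".
toy_policy = {
    "2 + 2 = 4": 0.50,   # correct reasoning, correct answer
    "2 * 2 = 4": 0.25,   # different trajectory, same decoded answer
    "2 + 2 = 5": 0.25,   # wrong answer
}

if __name__ == "__main__":
    print(pass_rate(toy_policy, decode_after_equals, "4"))
```

Two distinct trajectories contribute to the same pass rate here, which is the sense in which the pass rate collapses the structure of $z$ into a single number.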
That latent information, however, is exactly what we need if we hope to recover the true conditional distribution over correct trajectories, i.e., maximum likelihood. The pass rate is a coarser statistic: it collapses the rich structure of $z$ into a single number between 0 and 1. Two models with radically different reasoning patterns can have the same pass rate, yet one might produce a correct answer by genuine understanding while the other might guess wildly but sometimes land on the right token sequence. Distinguishing them requires looking at the collection of trajectories, not just the aggregate success frequency.
This is where the idea of multiple rollouts enters naturally. If we draw $k$ independent trajectories $z_1, \dots, z_k \sim m_\theta(\cdot|x)$, we can compute the probability that at least one of the corresponding decoded answers is correct:
$$\text{pass@k}(x) = 1 - \bigl(1 - p_\theta^{\text{pass}}(x)\bigr)^k, \qquad \text{fail@k}(x) = \bigl(1 - p_\theta^{\text{pass}}(x)\bigr)^k.$$
The pass@k metric quantifies how compute, in the form of additional sampling, improves the chance of seeing a correct answer. If the base pass rate is tiny (say $p = 0.01$), then with $k=100$ attempts we get $\text{pass@k} \approx 1 - (0.99)^{100} \approx 0.634$. The complement $\text{fail@k}$ decays exponentially in $k$, which will later become crucial for designing objectives that connect the number of samples to the order of a Taylor-like expansion of the log-pass probability.
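The two formulas translate directly into code. The sketch below also inverts pass@k to ask how many independent samples are needed to reach a target success probability; the helper names are hypothetical, and the inversion simply solves $1-(1-p)^k \ge \text{target}$ for the smallest integer $k$.

```python
import math


def fail_at_k(p: float, k: int) -> float:
    """Probability that all k independent rollouts are incorrect."""
    return (1.0 - p) ** k


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent rollouts is correct."""
    return 1.0 - fail_at_k(p, k)


def samples_needed(p: float, target: float) -> int:
    """Smallest k with pass@k >= target, for 0 < p < 1 and 0 < target < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    print(pass_at_k(0.01, 100))        # the worked example from the text
    print(samples_needed(0.01, 0.5))   # samples for an even chance of success
```

With $p = 0.01$, an even chance of seeing one correct answer already takes 69 samples, which gives a feel for why rare successes make the $1/p$-weighted gradient expensive to estimate.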
Note that pass@k is still a function of the per-input pass rate; it adds no new information about the latent trajectories themselves. However, the joint distribution of the $k$ rollouts, and in particular the number of correct answers among them, does contain stochastic information that, with the right objective, can guide the model toward high-likelihood reasoning paths. The stage is now set to ask: how can we design a training signal that uses this richer data, ideally recovering something akin to maximum likelihood as a limiting case?
The visual below consolidates this generative story. At a glance you see the flow from input distribution, through the stochastic policy and deterministic decoder, into the binary comparator that yields the reward. The pass rate appears as the expected reward, and the separate inset shows how $k$ independent draws give rise to the pass@k formula. The color coding (blue for input/output, green for the latent policy, red for the binary reward, gray for the known target) makes the signal-flow interpretation immediate. This diagram is not just a static definition; it will recur as the core abstraction throughout the MaxRL development, anchoring every subsequent theorem and estimator construction in the same latent generation model.

3. ML vs RL Objectives in the Binary-Correctness Setting

The previous section formalized the notion of a latent generation model: a policy $m_\theta(z \mid x)$ that stochastically produces candidate trajectories $z$, and a correctness oracle $r(x,z) \in \{0,1\}$ that judges each one. For any fixed prompt $x$, the pass rate
$$p_\theta^{\text{pass}}(x) \;=\; \mathbb{E}_{z \sim m_\theta(\cdot \mid x)}\bigl[r(x,z)\bigr]$$
captures the probability that a single attempt from the policy succeeds. From this, we can define two natural optimization targets that extract different summaries of the pass rate distribution over prompts.
The first, maximum likelihood (ML), asks: what parameters make the observed successes most probable, in the sense of maximizing the expected log pass rate? Its objective is
$$J_{\text{ML}}(\theta) \;=\; \mathbb{E}_{x \sim \rho}\bigl[\log p_\theta^{\text{pass}}(x)\bigr].$$
The second, which we will call RL, simply maximizes the expected pass rate itself:
$$J_{\text{RL}}(\theta) \;=\; \mathbb{E}_{x \sim \rho}\bigl[p_\theta^{\text{pass}}(x)\bigr].$$
The gradients of these two criteria point in the same direction only when every prompt has the same pass rate, a degenerate case. In any realistic setting, they drive optimization in importantly different directions, and that divergence is the pivot of this entire lecture.
To see why, consider the gradients. Under mild interchangeability conditions we can push $\nabla_\theta$ inside the expectation over $x$, yielding
$$\nabla_\theta J_{\text{ML}} \;=\; \mathbb{E}_{x \sim \rho}\!\left[\frac{1}{p_\theta^{\text{pass}}(x)} \nabla_\theta p_\theta^{\text{pass}}(x)\right], \qquad \nabla_\theta J_{\text{RL}} \;=\; \mathbb{E}_{x \sim \rho}\!\left[\nabla_\theta p_\theta^{\text{pass}}(x)\right].$$
The RL gradient is the population version of a standard policy gradient update: it increases the pass rate wherever its derivative points, but it does so without caring about the absolute magnitude of that pass rate. A prompt with current success probability $0.01$ and a prompt with $0.99$ both receive the same gradient weight of $1$. In contrast, the ML gradient weights each prompt’s update by $1/p$ (100 for the first prompt, about 1.01 for the second), so it puts enormous emphasis on prompts where the policy is currently failing (small $p$), and little emphasis on those already mastered.
This contrast is not just a mathematical curiosity. In fully differentiable classification tasks, where the policy directly outputs a softmax distribution over a finite set of labels $Y$ and correctness is simply the indicator of hitting the right label, the ML objective $J_{\text{ML}}$ is the negative of the familiar cross-entropy loss. The gradient $\frac{1}{p}\nabla_\theta p$ arises naturally from the derivative of the log. However, in our setting the policy produces a latent answer $z$ and we only observe the binary correctness $r(x,z)$; the inner expectation $\mathbb{E}_z[r]$ ranges over an intractably large space of trajectories, so it cannot be evaluated or differentiated in closed form. To obtain a gradient estimate we must resort to score-function (REINFORCE) estimators that use the log-likelihood of the sampled trajectory. That estimator inevitably introduces a $1/p$ term when we target the log pass rate, because the full gradient of $\log p$ equals $(1/p)\nabla_\theta p$ and we must estimate both $p$ and its gradient from finite samples.
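The score-function step mentioned above can be written out explicitly. The derivation below is the standard log-derivative identity applied to the binary reward; the symbol $S = \nabla_\theta \log m_\theta(z \mid x)$ for the score is introduced here for brevity, and $r$ is assumed not to depend on $\theta$.

```latex
\begin{align*}
\nabla_\theta\, p_\theta^{\text{pass}}(x)
  &= \nabla_\theta \sum_{z} m_\theta(z \mid x)\, r(x,z)
   = \sum_{z} m_\theta(z \mid x)\, \nabla_\theta \log m_\theta(z \mid x)\, r(x,z)
   = \mathbb{E}_{z \sim m_\theta(\cdot \mid x)}\bigl[\, r\, S \,\bigr], \\[4pt]
\nabla_\theta \log p_\theta^{\text{pass}}(x)
  &= \frac{\mathbb{E}_{z}\bigl[\, r\, S \,\bigr]}{p_\theta^{\text{pass}}(x)}
   = \frac{\Pr(r=1)\;\mathbb{E}_{z}\bigl[\, S \mid r=1 \,\bigr]}{p_\theta^{\text{pass}}(x)}
   = \mathbb{E}_{z \sim m_\theta(\cdot \mid x)}\bigl[\, S \mid r = 1 \,\bigr].
\end{align*}
```

The first line is what a REINFORCE estimator samples. The second shows why the ML gradient is harder: it is an average of scores over successful trajectories only, and those are exactly the samples that are scarce when $p$ is small.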
This is the critical point: RL is not “better” than ML; it is a practical necessity born from the sampling step. The RL objective $J_{\text{RL}}$ yields the familiar policy gradient form without an explicit $1/p$ factor, precisely because the derivative of $p$ does not involve a $1/p$ weighting. That simplifies estimation enormously, but at the cost of abandoning the log-pass-rate criterion, which in many correctness tasks would be the principled target (it maximizes the likelihood of observing a correct answer under the latent model).
So we are left with a sharp question: Can we recover the log-pass-rate objective $J_{\text{ML}}$ using only finite samples, without requiring knowledge of the true pass rates? The rest of this lecture frames a family of estimators that interpolate between the RL gradient and the ML gradient, indexed by the number of samples we are willing to draw per prompt. The conceptual bridge is a Maclaurin expansion of the log, which we’ll begin to unfold in the next section.
The visual below distills this tension into its bare essentials. On the left, the ML side displays the log pass rate objective and its gradient with the telltale $1/p$ factor. On the right, the RL side shows the expected pass rate objective and the simpler, factor-free gradient. The two gradients are aligned to emphasize the single difference: the weighting inside the expectation. Beneath them, the key contrast is spelled out in plain terms (the differentiable case versus the sampled latent case), and the central question is boxed in blue as a prompt for the analytical bridge we are about to build.

4. Maclaurin Expansion of Log Pass Rate

Having seen that standard RL on binary correctness reduces to optimizing the pass probability $p$, we might assume that scaling up RL by increasing sample budgets and reward granularity would naturally approach maximum likelihood. However, the true ML objective does not simply weight the reward by the empirical pass rate; it optimizes the log pass probability $\log p_\theta^{\text{pass}}(x)$. This objective has a hidden depth: it encodes far more than the first-moment probability of a single correct answer. It distills information from the entire distribution of failures across multiple independent attempts, a structure that a pass-rate-only signal entirely misses.
To uncover that structure, we expand $\log p$ through the lens of the Maclaurin series for $-\log(1-z)$ (here $z$ is a dummy variable, not a trajectory). This classic expansion holds for $|z|<1$ and writes the logarithm as an infinite sum of powers:
$$-\log(1-z) = \sum_{k=1}^{\infty} \frac{z^{k}}{k}, \qquad |z| < 1.$$
Now, set $z = 1-p$, which satisfies $|1-p|<1$ for any $p\in(0,1]$. Instantly we obtain
$$\log p = \log(1 - (1-p)) = -\sum_{k=1}^{\infty} \frac{(1-p)^{k}}{k}.$$
The term $(1-p)^{k}$ is precisely the probability that all $k$ independent samples from the policy are incorrect, that is, the fail@k event, $\mathrm{fail@k}(x)$. Substituting this notation yields a crisp identity:
$$\log p_\theta^{\text{pass}}(x) = -\sum_{k=1}^{\infty} \frac{1}{k}\,\mathrm{fail@k}(x).$$
This series is not merely a formal manipulation; it decomposes the log-likelihood into an infinite, harmonically weighted mixture of higher-order failure probabilities. The weights $1/k$ decay slowly: fail@10 enters with one-tenth the weight of fail@1, so its contribution is far from negligible whenever $\mathrm{fail@10}(x)$ itself is not small, which is the case for hard prompts.
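As a quick numerical sanity check of the identity above, the following minimal sketch compares partial sums of the series with `math.log`. The function name `log_p_partial_sum` and the chosen values of `p` are illustrative assumptions, not part of the original derivation.

```python
import math

def log_p_partial_sum(p: float, terms: int) -> float:
    """Partial sum -sum_{k=1}^{terms} (1-p)^k / k of the series for log p.

    (1-p)^k is fail@k for a policy with single-attempt pass probability p.
    The function name is hypothetical, chosen for this illustration.
    """
    fail_1 = 1.0 - p
    return -sum(fail_1 ** k / k for k in range(1, terms + 1))

# Compare partial sums with the exact logarithm for easy and hard prompts.
for p in (0.9, 0.5, 0.1):
    for terms in (1, 10, 100, 1000):
        partial = log_p_partial_sum(p, terms)
        print(f"p={p:<4} terms={terms:<5} partial={partial:+.6f} log p={math.log(p):+.6f}")
```

For small `p` the partial sums approach `log p` noticeably more slowly, which is the numerical face of the statement that higher-order fail@k terms matter most on hard prompts.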
The real power of this expansion appears when we differentiate with respect to the model parameters $\theta$. Differentiating term by term (assuming standard smoothness conditions that allow the gradient to pass through the sum), we get
$$\nabla_\theta \log p = -\sum_{k=1}^{\infty} \frac{1}{k}\,\nabla_\theta\,\mathrm{fail@k}(x).$$
Each $\nabla_\theta\,\mathrm{fail@k}(x)$ is the policy gradient of the joint failure event over $k$ samples. Since the pass event is complementary, $\mathrm{pass@k}(x) = 1 - \mathrm{fail@k}(x)$, and the gradient of the constant vanishes, we can flip the sign:
$$\boxed{\nabla_\theta J_{\mathrm{ML}}(x) = \sum_{k=1}^{\infty} \frac{1}{k}\,\nabla_\theta\,\mathrm{pass@k}(x)}.$$
This boxed equation is the central revelation of the MaxRL framework. The maximum-likelihood gradient is an infinite harmonic mixture of the policy gradients for pass@k events. In other words, maximising $\log p$ automatically encourages not only that the model passes on its first try (pass@1) but also that it passes with high probability when given $k = 2, 3, 4, \dots$ independent attempts, each weighted by a diminishing factor $1/k$. It rewards a model that becomes robustly reliable under repeated sampling, not just a model that occasionally gets the right answer.
Why does this matter? Standard RL with a binary correctness reward produces a gradient proportional to $\nabla_\theta\,\mathrm{pass@1}(x)$, capturing only the first term of this infinite series. It ignores all $k \ge 2$, discarding information about whether the model overcomes its failures when given multiple tries. The ML gradient, by contrast, explicitly accounts for the full spectrum of sample budgets, revealing that the true likelihood objective inherently relies on multiple samples. This is not an ad hoc trick; it is a direct consequence of the logarithmic transformation.
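The "first term only" point can be made concrete in terms of the scalar pass probability. By the chain rule, $\nabla_\theta\,\mathrm{pass@k}(x) = \frac{d\,\mathrm{pass@k}}{dp}\,\nabla_\theta p$, so it is enough to compare scalar slopes in $p$. The sketch below is an illustration under that simplification; the function names are hypothetical.

```python
def pass_at_k_slope(p: float, k: int) -> float:
    """d/dp of pass@k = 1 - (1-p)^k, i.e. k * (1-p)^(k-1)."""
    return k * (1.0 - p) ** (k - 1)

def harmonic_mixture_slope(p: float, max_k: int) -> float:
    """sum_{k=1}^{max_k} (1/k) * d pass@k / dp (hypothetical helper name)."""
    return sum(pass_at_k_slope(p, k) / k for k in range(1, max_k + 1))

p = 0.2  # illustrative pass probability of a fairly hard prompt
for max_k in (1, 2, 5, 20, 200):
    print(f"max_k={max_k:<4} mixture slope={harmonic_mixture_slope(p, max_k):.6f}  1/p={1.0 / p:.6f}")
```

Keeping only `max_k = 1` gives slope 1, which corresponds to the pass@1 gradient $\nabla_\theta p$; as more terms are included the slope approaches $1/p$, the factor that turns $\nabla_\theta p$ into $\nabla_\theta \log p$.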
The visual below distils this entire derivation into a clean, colour-coded equation chain. Starting from the known Maclaurin series for $-\log(1-z)$, it substitutes $z = 1-p$, rewrites the powers in terms of $\mathrm{fail@k}(x)$, differentiates, and then uses the complement relation to arrive at the harmonic sum over $\mathrm{pass@k}$ gradients. The final identity is placed in a prominent box, with annotations that highlight the interpretation: the harmonic mixture of pass@k gradients. The use of blue for $(1-p)$, red for fail@k terms, and green for pass@k terms makes the sign flip and the transformation visually immediate, reinforcing the conceptual shift from failure-centered to success-centered weighting.

5. MaxRL: A Compute-Indexed Family of Objectives

With the Maclaurin expansion of $\log p$ established (an exact, albeit infinite, series representation of the log-pass probability), we can now build a bridge between standard correctness-based RL and maximum likelihood estimation. Truncating that series at a finite order produces a family of objectives that are directly controllable by a single integer parameter: the truncation length $T$. This parameter becomes a compute index, dictating how close the objective moves toward the full log-likelihood and, correspondingly, how many rollouts are required to obtain reliable gradient estimates.
Recall from the previous expansion that for a fixed input $x$ and model parameters $\theta$, letting $p \equiv p_\theta^{\mathrm{pass}}(x)$ be the probability of generating a correct answer in one attempt, we have
$$\log p = -\sum_{k=1}^{\infty}\frac{(1-p)^k}{k} = -\sum_{k=1}^{\infty}\frac{\mathrm{fail@k}(x)}{k}.$$
The terms $\mathrm{fail@k}(x) = (1-p)^k$ decay exponentially as $k$ grows (for $p > 0$); the infinite sum is convergent, but in practice we cannot compute infinitely many terms. The truncation idea is simple yet powerful: keep only the first $T$ terms of the series and discard the remainder. For any truncation level $T \in \mathbb{N}$, the MaxRL truncated objective is defined as
$$J^{(T)}_{\mathrm{MaxRL}}(x) \;:=\; -\sum_{k=1}^{T}\frac{(1-p)^k}{k}.$$
Note that $J^{(T)}$ is an upper bound on $\log p$, not a lower bound: the neglected tail $\sum_{k=T+1}^{\infty}(1-p)^k/k$ is nonnegative (strictly positive for $p<1$) and enters $\log p$ with a minus sign, so $J^{(T)} \ge \log p$. The gap shrinks geometrically in $T$ for fixed $p$ (more slowly when $p$ is small) and, crucially, the gradient of $J^{(T)}$ has a remarkably clean form.
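The relationship between the truncated objective and $\log p$ can be checked numerically. The sketch below (function names are illustrative assumptions) evaluates $J^{(T)}_{\mathrm{MaxRL}} - \log p$, which by the series equals the neglected tail $\sum_{k>T}(1-p)^k/k$ and is therefore nonnegative.

```python
import math

def maxrl_objective(p: float, T: int) -> float:
    """Truncated objective J^(T) = -sum_{k=1}^{T} (1-p)^k / k for one prompt."""
    return -sum((1.0 - p) ** k / k for k in range(1, T + 1))

def truncation_gap(p: float, T: int) -> float:
    """J^(T) - log p, i.e. the neglected tail of the series (hypothetical helper)."""
    return maxrl_objective(p, T) - math.log(p)

for p in (0.5, 0.05):
    for T in (1, 4, 16, 64):
        print(f"p={p:<5} T={T:<3} J^(T)={maxrl_objective(p, T):+.5f} gap={truncation_gap(p, T):.5f}")
```

The gap closes quickly at `p = 0.5` and much more slowly at `p = 0.05`, so the truncation level needed for a given accuracy depends on how hard the prompt is.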
Differentiating with respect to $\theta$ (and recalling that $p$ depends on $\theta$) yields
$$\nabla_\theta J^{(T)}_{\mathrm{MaxRL}}(x) = \sum_{k=1}^{T}\frac{1}{k}\,\nabla_\theta\,\mathrm{pass@k}(x),$$
because $\mathrm{pass@k}(x) = 1 - (1-p)^k$ and $\nabla_\theta \mathrm{pass@k}(x) = k(1-p)^{k-1}\nabla_\theta p$, so the factor $1/k$ leaves $(1-p)^{k-1}\nabla_\theta p$. Summing the geometric series $1 + (1-p) + \dots + (1-p)^{T-1} = \bigl(1-(1-p)^T\bigr)/p$ gives a compact gradient expression
$$\nabla_\theta J^{(T)}_{\mathrm{MaxRL}}(x) = \Bigl(1 - (1-p)^T\Bigr)\,\frac{\nabla_\theta p}{p} = \Bigl(1 - (1-p)^T\Bigr)\,\nabla_\theta \log p .$$
From this, the weight factor $1-(1-p)^T$ directly reveals the trade-off: it is a multiplicative damping that approaches $1$ from below as $T$ increases. When $T=1$, $1-(1-p)=p$, and the gradient reduces to $\nabla_\theta p$, which is exactly $\nabla_\theta\,\mathrm{pass@1}(x)$, the gradient of the standard RL objective that optimizes pass rate. As $T\to\infty$, $(1-p)^T \to 0$ (for $p>0$), the damping factor tends to $1$, and the gradient converges to $\nabla_\theta\log p$, the full maximum-likelihood gradient. Thus the MaxRL family interpolates between RL ($T=1$) and full ML ($T\to\infty$), with every intermediate $T$ defining a partially damped gradient that approximates the ultimate log-likelihood target.
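The damping factor is easy to tabulate. The following sketch computes it both from the closed form $1-(1-p)^T$ and from the geometric series it came from; the helper names and the sample values of `p` and `T` are assumptions made for illustration.

```python
def maxrl_weight(p: float, T: int) -> float:
    """Closed-form damping factor 1 - (1-p)^T that multiplies grad log p."""
    return 1.0 - (1.0 - p) ** T

def maxrl_weight_from_series(p: float, T: int) -> float:
    """Same factor as p * sum_{k=1}^{T} (1-p)^(k-1), the summed geometric series."""
    return p * sum((1.0 - p) ** (k - 1) for k in range(1, T + 1))

for p in (0.5, 0.1, 0.01):
    row = "  ".join(f"T={T}: {maxrl_weight(p, T):.3f}" for T in (1, 4, 16, 64, 256))
    print(f"p={p:<5} {row}")
```

At `T = 1` the factor equals `p` (the RL end of the family); for easy prompts it saturates near 1 after a handful of terms, while for `p = 0.01` it is still visibly below 1 at `T = 256`.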
This view re-frames the problem as a compute-accuracy trade-off. Estimating $\nabla_\theta\mathrm{pass@k}$ for $k>1$ requires at least $k$ independent rollout samples to reliably assess whether any of the $k$ attempts is correct. So larger $T$ demands more samples, but it also supplies a gradient that is closer to the true ML direction. In this sense, $T$ acts as a compute knob: for a fixed compute budget one can choose the largest affordable truncation level, thereby maximising the approximation quality while respecting practical constraints.
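One way to read the compute knob is to ask how large $T$ must be before the damping factor $1-(1-p)^T$ reaches a chosen fraction of the full ML gradient. The selection rule below is a hypothetical illustration of that question, not a procedure taken from the text; `smallest_truncation` and the `target` parameter are assumed names.

```python
def maxrl_weight(p: float, T: int) -> float:
    """Damping factor 1 - (1-p)^T from the closed-form MaxRL gradient."""
    return 1.0 - (1.0 - p) ** T

def smallest_truncation(p: float, target: float) -> int:
    """Smallest T with 1 - (1-p)^T >= target (hypothetical selection rule)."""
    if not (0.0 < p <= 1.0) or not (0.0 < target < 1.0):
        raise ValueError("need 0 < p <= 1 and 0 < target < 1")
    T = 1
    while maxrl_weight(p, T) < target:
        T += 1
    return T

for p in (0.5, 0.1, 0.01):
    print(f"p={p:<5} smallest T with weight >= 0.9: {smallest_truncation(p, 0.9)}")
```

The required truncation level grows roughly like $1/p$, which is one way to quantify why hard prompts need many rollouts before the gradient resembles the ML direction.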
The visual below captures this idea at a glance. It opens with a faint reference to the Maclaurin series, reminding the reader of the expansion's structure. At the centre, a prominently boxed definition displays the truncated objective $J^{(T)}_{\mathrm{MaxRL}}(x)$ and its gradient as a sum of weighted $\mathrm{pass@k}$ gradients, with $T$ highlighted as the controlling index. To the sides, two parallel mini-boxes anchor the endpoints: $T=1$ (standard RL, gradient driven only by $\mathrm{pass@1}$) and $T\to\infty$ (full ML, exact log-likelihood gradient). A horizontal arrow connects these extremes, labeled with the progression "increasing $T$ → more compute, better ML approximation". Concise takeaways at the bottom reinforce the central trade-off: larger $T$ brings the objective closer to ML but demands proportionally more rollouts. The composition makes the family's conceptual structure immediately legible: a graded spectrum from RL to ML, with compute as the currency that determines how far along that spectrum we can afford to go.

6. Theorem 1: Conditional Form of the ML Gradient

The family of objectives we just explored (MaxRL with different truncation orders $T$) provides a practical ladder from modest compute to ideal behaviour. But to understand precisely what those objectives are trying to approximate, and why a conditional expectation is the right building block, we need to examine the exact gradient of the maximum likelihood objective without any approximation. That gradient turns out to have a strikingly clean conditional form, something that is not immediately obvious from the definition $J_{\text{ML}}(x)=\log p_\theta^{\text{pass}}(x)$.
Recall the starting point for any input $x$. The ML gradient is simply the derivative of the log pass probability,
$$\nabla_\theta J_{\text{ML}}(x) = \frac{1}{p_\theta^{\text{pass}}(x)}\,\nabla_\theta\,p_\theta^{\text{pass}}(x).$$
This expression already hints at the need to know the pass rate itself, a quantity we cannot cheaply evaluate, but we can rewrite it into a form that eliminates the explicit division and reveals a much more intuitive structure.
Theorem 1 (Conditional Gradient Identity). Assume $p_\theta^{\text{pass}}(x) > 0$. Then
$$\nabla_\theta J_{\text{ML}}(x) = \mathbb{E}_{z\sim m_\theta(\cdot|x)}\!\left[\nabla_\theta\log m_\theta(z|x) \;\middle|\; f(z) = y^*(x)\right].$$
Equivalently,
$$\nabla_\theta J_{\text{ML}}(x) = \mathbb{E}\!\bigl[\,S(x,z) \;\big|\; r(x,z)=1\,\bigr],$$
where $S(x,z)=\nabla_\theta\log m_\theta(z|x)$ is the score function (the gradient of the log-likelihood of a single latent trajectory).
Why is this identity both surprising and useful? A standard score-function (REINFORCE) estimator multiplies the score by the reward and therefore targets $\nabla_\theta p_\theta^{\text{pass}}(x)$; to obtain the gradient of the log pass probability from it we would still have to divide by the marginal pass probability, so the denominator lurks in any Monte Carlo estimate. Here, however, the gradient of the log pass probability collapses to a simple conditional expectation: the average score computed only over those latent trajectories that actually produce the correct answer. The denominator $p_\theta^{\text{pass}}(x)$ no longer appears explicitly because conditioning on success re-normalises the distribution by exactly that factor.
The derivation is short but instructive. Expand $\nabla_\theta p_\theta^{\text{pass}}(x)$ inside the fraction:
$$\nabla_\theta p_\theta^{\text{pass}}(x) = \nabla_\theta \int m_\theta(z|x)\,\mathbb{I}[f(z)=y^*(x)]\,dz = \int m_\theta(z|x)\,\nabla_\theta\log m_\theta(z|x)\,\mathbb{I}[f(z)=y^*(x)]\,dz.$$
This is just the unnormalised expectation of the score function over successful trajectories. Now substitute back:
$$\nabla_\theta J_{\text{ML}}(x) = \frac{1}{p_\theta^{\text{pass}}(x)} \int m_\theta(z|x)\,S(x,z)\,\mathbb{I}[f(z)=y^*(x)]\,dz = \mathbb{E}_{z\sim m_\theta}\!\left[S(x,z) \;\middle|\; f(z)=y^*(x)\right].$$
The final equality uses the definition of conditional expectation: $\mathbb{E}[Y \mid A] = \mathbb{E}[Y\,\mathbb{I}_A] / P(A)$. This step absorbs the hard-to-estimate pass rate into the conditioning, so it never has to be computed as a separate quantity.
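Theorem 1 can be checked exactly on a toy model. The sketch below assumes a softmax policy over four discrete outputs, with the logits themselves playing the role of $\theta$ and two outputs counted as correct; none of these choices come from the text. It compares a finite-difference gradient of $\log p_\theta^{\text{pass}}$ with the conditional mean of the score.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(z, probs):
    """Score of a softmax policy: grad_theta log m_theta(z) = one_hot(z) - probs."""
    return [(1.0 if j == z else 0.0) - probs[j] for j in range(len(probs))]

def log_pass(logits, correct):
    """log p_pass: log of the total probability assigned to correct outputs."""
    probs = softmax(logits)
    return math.log(sum(probs[z] for z in correct))

def conditional_score_mean(logits, correct):
    """E[S | r = 1], computed exactly by enumerating the correct outputs."""
    probs = softmax(logits)
    p = sum(probs[z] for z in correct)
    mean = [0.0] * len(logits)
    for z in correct:
        s = score(z, probs)
        for j in range(len(logits)):
            mean[j] += probs[z] * s[j] / p
    return mean

def finite_difference_grad(logits, correct, eps=1e-6):
    """Central-difference approximation of grad_theta log p_pass."""
    grad = []
    for j in range(len(logits)):
        up, dn = list(logits), list(logits)
        up[j] += eps
        dn[j] -= eps
        grad.append((log_pass(up, correct) - log_pass(dn, correct)) / (2 * eps))
    return grad

logits = [0.2, -1.0, 0.5, 1.3]  # hypothetical toy parameters
correct = {0, 1}                # hypothetical set of correct outputs
lhs = finite_difference_grad(logits, correct)
rhs = conditional_score_mean(logits, correct)
for a, b in zip(lhs, rhs):
    print(f"grad log p: {a:+.6f}   E[S | correct]: {b:+.6f}")
```

The two columns agree up to finite-difference error, which is the identity of Theorem 1 specialised to this toy policy.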
The assumption $p_\theta^{\text{pass}}(x) > 0$ is not a mere technicality. It ensures the conditioning makes sense: there must be at least some chance of generating a correct output under the current model. In practice, for tasks with massive latent spaces, starting from a reasonable pre-trained model can be enough to satisfy it, or we can try to maintain a small probability mass on correct outputs via entropy regularisation.
Now we have a gradient that can be described purely in terms of what happens on successful rollouts. This perspective immediately illuminates a failure mode of naive RL on correctness tasks. Standard policy gradient methods (and even GRPO) optimise a surrogate that may use all sampled trajectories, perhaps up‑weighting correct ones and down‑weighting incorrect ones, but they do not exactly mimic this conditional average unless the weighting scheme matches the model’s own normalised distribution over successes. The ML gradient, in contrast, is surgically precise: take the model, sample latent trajectories, and keep only those that solve the problem. Then compute the average of their score functions. No reward‑shaping, no value baseline, no denominator—just a conditional mean.
This interpretation also suggests one reason why maximum likelihood objectives may yield models that are more robust on reasoning benchmarks than pure RL-tuned models. Because the entire gradient signal comes from correct trajectories, the model learns to increase the likelihood of paths that demonstrably lead to the right answer, without being distracted by any signal from incorrect attempts. It is a cleaner, sharper optimisation signal.
The two equivalent forms in the theorem, conditioning on $f(z)=y^*(x)$ and conditioning on $r(x,z)=1$, are simply two notations for the same event. The second form, using the binary reward $r(x,z)$, will be especially convenient when we later build finite-sample estimators and connect to the MaxRL truncation family.
The visual below captures the essence of Theorem 1 in a compact, lecture-ready diagram. The slide first reminds us of the starting point: the gradient of the log pass probability as a ratio. The central framed box then states the conditional identity itself, with the two equivalent expectation forms placed one above the other for immediate comparison. The use of the score notation $S(x,z)$ is made explicit, and a brief interpretation line ("the ML gradient is the average score over correct outputs only") sits beneath the theorem box, reinforcing the main takeaway. The hand-drawn, academic aesthetic keeps the focus on the algebraic insight, while the clean separation of equation blocks highlights the two parallel ways of writing the same gradient. It serves as a summary before we turn to the proof that formalises the connection between the conditional form and the MaxRL estimators.

7. Proof of Theorem 1

Building a maximum-likelihood objective on top of correctness feedback forces us to graduate from a naive REINFORCE gradient of pass rate to the gradient of the log-pass probability. In the previous section we saw the population-level identity for the pass rate:
$$\nabla_\theta\,p = \mathbb{E}_{z \sim m_\theta(\cdot|x)}\big[ r(x,z)\,S(z) \big], \qquad S(z)=\nabla_\theta\log m_\theta(z|x),$$
which is a standard REINFORCE expression that weights the score by the binary correctness reward. This quantity alone tells us how to increase the probability of passing, but not how to maximise the log-likelihood of the correct answer under the model. The jump from one to the other is the content of Theorem 1, and its proof is remarkably compact once the right algebraic relations are written down.
Our target is $J_{\text{ML}}(x) = \log p$, where $p = \mathbb{E}_z[r(x,z)]$ is the marginal pass probability. Differentiating directly gives
$$\nabla_\theta J_{\text{ML}}(x) = \frac{\nabla_\theta\,p}{p}.$$
Substituting the REINFORCE form of $\nabla_\theta\,p$ yields
$$\nabla_\theta J_{\text{ML}}(x) = \frac{\mathbb{E}_z[ r(x,z)\,S(z) ]}{p}.$$
At this point everything looks like a simple ratio of expectations. The denominator $p = \mathbb{E}_z[ r(x,z) ]$ is just the mean reward. So the gradient equals $\mathbb{E}[r\,S] / \mathbb{E}[r]$.
Now recall a basic probabilistic identity: for any random vector $X$ and a binary event indicator $r \in \{0,1\}$ with $\mathbb{P}(r=1) > 0$, the conditional expectation of $X$ given the event is
$$\mathbb{E}[X \mid r=1] = \frac{\mathbb{E}[ r\,X ]}{\mathbb{E}[r]}.$$
This identity is nothing more than the definition of conditional expectation restricted to an event, written in terms of unconditional expectations. Its value here is that it translates a ratio of expectations back into an expected value under the success-conditioned distribution, that is, the distribution of model outputs that happen to solve the task.
Applying this identity with $X = S(z)$ and $r = r(x,z)$ gives
$$\frac{\mathbb{E}_z[ r(x,z)\,S(z) ]}{p} = \mathbb{E}_{z \sim m_\theta(\cdot|x)}[ S(z) \mid r(x,z)=1 ].$$
That is the entire proof: a pure algebraic rearrangement that replaces the ratio of two population averages with a single conditional expectation. The right-hand side is precisely the expected score, but only over those rollouts that actually produce the ground-truth answer. In symbols,
$$\boxed{\;\nabla_\theta J_{\text{ML}}(x) = \mathbb{E}\big[\nabla_\theta\log m_\theta(z|x) \;\big|\; f(z) = y^*(x)\big]\;}.$$
Why does this matter? The gradient that maximises log-likelihood turns out to be exactly what you would compute if you could sample only from the model's success distribution. In other words, the ML gradient is the average score of the model on its correct answers. This immediately suggests a practical finite-sample estimator: draw $N$ unconditional rollouts, keep the ones that pass, and average their scores. That estimator is unbiased for the gradient of a closely related objective, as we will formalise in the following sections. For now, the crucial point is that the ML gradient can be expressed without a separate "teacher" or any externally supplied demonstrations: the structure is fully contained inside the model's own pass/fail behaviour.
The visual that supports this proof serves as a minimalistic derivation map. It presents the REINFORCE expression for $\nabla_\theta p$, the division by $p$, the rearrangement into the ratio $\mathbb{E}[r\,S]/\mathbb{E}[r]$, and the application of the conditional expectation identity, each step appearing as a single aligned equation. The final line is the boxed theorem statement, set apart with a subtle coloured background. This depiction is not meant to replace the logic, but to consolidate it into a glanceable chain of reasoning, making the essential move (from a ratio of expectations to a conditional mean) immediately visible. The viewer can see at a glance why "discard failures and average the scores" is the population-level truth behind maximum-likelihood training with binary rewards.

8. Empirical Gradient Estimator $\widehat{g}_N$

Having established that the gradient of the log pass probability can be expressed as a conditional expectation over successful trajectories, we now turn to its finite-sample approximation. Applied to a batch of $N$ i.i.d. rollouts with $K$ successes, Theorem 1 implies
$$\nabla_\theta \log p_{\theta}(\text{success}\mid x) \;=\; \mathbb{E}_{z_{1:N}}\!\left[\,\frac{1}{K}\sum_{i=1}^N r_i\, \nabla_\theta \log m_\theta(z_i\mid x) \;\Big|\; K\ge 1\right],$$
which naturally suggests an intuitive Monte Carlo estimator: draw $N$ latent trajectories $z_1,\dots,z_N \sim m_\theta(\cdot\mid x)$, compute the binary rewards $r_i = \mathbb{I}\{f(z_i)=y^*(x)\}$, count the successes $K = \sum_{i=1}^N r_i$, and then average the score gradients of the successful trajectories, each weighted equally. However, when no successes occur ($K=0$), the conditional expectation is undefined and no learning signal can be extracted from a sample that contains only failures; in that case the estimator is simply set to zero.
This yields the MaxRL gradient estimator:
$$\widehat{g}_N(x) \;=\; \begin{cases} \displaystyle\frac{1}{K}\sum_{i=1}^N r_i\, \nabla_\theta \log m_\theta(z_i\mid x), & K \ge 1, \\[1.2em] 0, & K = 0. \end{cases}$$
The piecewise definition echoes the truncation-based logic that underpins the whole MaxRL framework: when the sample reveals no successes, we cannot improve the log pass probability from that mini-batch, so the update is null. When at least one success is present, the estimator uses only the successful draws, normalising by the observed number of successes $K$ rather than by the total sample size $N$.
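The piecewise definition translates almost line for line into code. The following is a minimal sketch that assumes the per-rollout score vectors have already been computed and are passed in as plain lists of floats; the function name `maxrl_gradient_estimate` and the example numbers are hypothetical.

```python
from typing import List, Sequence

def maxrl_gradient_estimate(rewards: Sequence[int],
                            scores: Sequence[Sequence[float]]) -> List[float]:
    """Average the score vectors of successful rollouts (r_i = 1).

    Returns the zero vector when no rollout succeeds (K = 0).
    `scores[i]` stands for grad_theta log m_theta(z_i | x), assumed precomputed.
    """
    dim = len(scores[0])
    k = sum(rewards)  # number of successes K
    if k == 0:
        return [0.0] * dim
    return [sum(r * s[j] for r, s in zip(rewards, scores)) / k for j in range(dim)]

# Illustrative batch of N = 4 rollouts with two-dimensional scores.
rewards = [0, 1, 0, 1]
scores = [[1.0, -1.0], [0.5, 0.5], [-2.0, 0.0], [1.5, -0.5]]
print(maxrl_gradient_estimate(rewards, scores))
print(maxrl_gradient_estimate([0, 0, 0, 0], scores))
```

In the first call only the second and fourth rollouts contribute and are averaged with weight $1/K = 1/2$; in the second call the all-failure batch returns the zero vector.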
It is instructive to contrast this with the classic REINFORCE estimator commonly used in reinforcement learning for correctness tasks:
$$\widehat{g}_{\mathrm{REINFORCE}}(x) \;=\; \frac{1}{N}\sum_{i=1}^N (r_i - b)\,\nabla_\theta \log m_\theta(z_i\mid x),$$
where $b$ is a baseline (often a learned value function or a moving average of rewards). REINFORCE estimates the gradient of the expected reward, which in this binary case is the pass probability $p_{\theta}(\text{success}\mid x)$. Every trajectory, successful or not, contributes to the estimate. The baseline $b$ reduces variance but does not change the objective being optimized: REINFORCE targets $\nabla_\theta p$ and so lacks the $1/p$ reweighting that the ML gradient applies to rarely solved prompts. In other words, it optimises the pass rate (pass@1) but does not recover the maximum-likelihood principle that underpins standard supervised fine-tuning.
The MaxRL estimator replaces the uniform weighting of all $N$ trajectories with a conditional average over the $K$ successful ones. This seemingly minor change has profound consequences:
Normalisation by $K$, not $N$ – The gradient magnitude does not shrink when the success rate is low. If only a single trajectory succeeds in a large batch, that trajectory's score gradient receives full weight, while REINFORCE would dilute it by $1/N$, possibly requiring a carefully tuned learning rate schedule to make any progress.
No baseline required for unbiasedness – Because the estimator conditions on $K\ge 1$ and then averages the score gradients of the successes, the failure trajectories are simply ignored. There is no constant term to subtract: conditional on $K \ge 1$, the average is an unbiased estimate of the conditional expectation in Theorem 1. Variance is a separate matter, taken up in Section 11.
Direct connection to the log pass probability – By construction, $\widehat{g}_N$ is the finite-sample counterpart of the conditional expectation in Theorem 1, and as Theorem 2 in the next section states, using this estimator in a stochastic gradient ascent procedure is equivalent in expectation to optimising a truncated version of $\log p_{\theta}(\text{success}\mid x)$ where the truncation order equals the sample size $N$.
A critical subtlety is the treatment of the $K=0$ case. Setting the gradient to zero when no successes are found may seem wasteful, but it is perfectly aligned with the truncation viewpoint: a sample of size $N$ that contains zero successes provides no information about the success-conditioned score. The gradient is set to zero because the conditional expectation is undefined; any ad hoc non-zero value would move the expected update away from the gradient of the order-$N$ truncation of $\log p_{\theta}$. This design choice is not a limitation but a deliberate bridge between sample size and truncation order that will be formalised in Theorem 2.
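The difference in normalisation is easiest to see in the per-sample weights that multiply each score vector. The sketch below is an illustration with hypothetical helper names; the batch of eight rollouts with a single success is made up for the example.

```python
from typing import List, Sequence

def maxrl_weights(rewards: Sequence[int]) -> List[float]:
    """Per-sample weights r_i / K of the MaxRL estimator (all zero if K = 0)."""
    k = sum(rewards)
    return [r / k if k > 0 else 0.0 for r in rewards]

def reinforce_weights(rewards: Sequence[int], baseline: float = 0.0) -> List[float]:
    """Per-sample weights (r_i - b) / N of the REINFORCE estimator."""
    n = len(rewards)
    return [(r - baseline) / n for r in rewards]

rewards = [0, 0, 1, 0, 0, 0, 0, 0]  # hypothetical batch: one success out of N = 8
mean_reward = sum(rewards) / len(rewards)

print("MaxRL:                ", maxrl_weights(rewards))
print("REINFORCE (b = 0):    ", reinforce_weights(rewards))
print("REINFORCE (b = mean): ", reinforce_weights(rewards, baseline=mean_reward))
```

With one success in eight rollouts, the successful sample carries weight 1 under MaxRL and 1/8 under REINFORCE without a baseline; a mean-reward baseline additionally gives the failures small negative weights.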
The visual below makes this contrast immediate. At the top, the main MaxRL estimator is displayed prominently, with its piecewise definition centred in a clean box. The handling of the two cases, $K\ge 1$ and $K=0$, is spelled out, emphasising the normalisation by $K$. Immediately underneath, in a slightly smaller, shaded box, the REINFORCE estimator is shown for side-by-side comparison. An arrow annotation calls out the denominator: $K$ in MaxRL versus $N$ in REINFORCE, highlighting the fundamental difference in weighting. Another arrow points to a small tag referencing Theorem 1 from the preceding slide, reminding the viewer that this estimator is not an arbitrary choice but the natural sample analogue of the conditional expectation that lies at the heart of the MaxRL derivation. The muted shading of the REINFORCE box further reinforces that MaxRL is the estimator of primary interest, with REINFORCE serving as a familiar foil.

9. Theorem 2: Estimator–Objective Equivalence

The definition of $\widehat{g}_N(x)$ in the previous section may look like a pragmatic hack: average the score vectors of correct rollouts, ignore all failures, and return zero when nothing succeeds. Yet there is a strikingly clean reason that this seemingly ad hoc construction works. Theorem 2 uncovers the exact objective whose gradient is estimated without bias by $\widehat{g}_N$. It reveals that the estimator is not just a clever trick but a direct conduit to the truncated Maclaurin series of the log pass probability, the very series we derived earlier when expanding $\log p_\theta^{\mathrm{pass}}(x)$.
Theorem 2 (Estimator–Objective Equivalence).
For any integer $N \ge 1$ and any prompt $x$, draw $N$ i.i.d. completions $z_1,\dots,z_N \sim m_\theta(\cdot\mid x)$. Let $r_i = \mathbf{1}\{f(z_i)=y^*(x)\}$ indicate correctness, $S_i = \nabla_\theta \log m_\theta(z_i\mid x)$ be the score, and $K = \sum_{i=1}^N r_i$ count successes. With
$$\widehat g_N(x) = \begin{cases} \frac{1}{K}\sum_{i=1}^N r_i S_i, & K \ge 1,\\[4pt] 0, & K=0, \end{cases}$$
and the MaxRL objective of order $T$
$$J^{(T)}_{\mathrm{MaxRL}}(x) = -\sum_{k=1}^T \frac{(1-p)^k}{k}, \qquad p = p_\theta^{\mathrm{pass}}(x),$$
the expectation over the $N$ rollouts satisfies
$$\mathbb{E}\bigl[\widehat g_N(x)\bigr] = \nabla_\theta\, J^{(N)}_{\mathrm{MaxRL}}(x).$$
To appreciate the claim, recall that the Maclaurin expansion of $\log p$ in powers of $(1-p)$ is
$$\log p = -\sum_{k=1}^{\infty} \frac{(1-p)^k}{k}.$$
Thus $J^{(T)}_{\mathrm{MaxRL}}(x)$ is exactly the $T$-th order truncation of the true log-likelihood of correctness. When we later aggregate over a training set of prompts, maximising the full log-likelihood $\sum_x \log p(x)$ is the standard MLE target. The theorem therefore asserts that each invocation of $\widehat{g}_N$ is an unbiased Monte Carlo estimate of the gradient of a truncated version of that MLE objective, and the truncation order automatically equals the number of rollouts we have just drawn.
Why should this hold? The proof (explored in the next section) hinges on a clean cancellation. The direct route would be to compute $\nabla_\theta\bigl(-\sum_{k=1}^N (1-p)^k/k\bigr)$ by differentiating term by term, which produces a sum involving powers of $(1-p)$ and the gradient of $p$. The estimator $\widehat{g}_N$ instead averages scores only over successes and divides by the empirical success count $K$. Conditional on $K = k \ge 1$, the $k$ successful trajectories are i.i.d. draws from the success-conditioned distribution, so by Theorem 1 the average $\frac{1}{K}\sum_{i:\,r_i=1} S_i$ has expectation $\nabla_\theta \log p$ for every such $k$. Weighting by the binomial probabilities of $K$ therefore gives $\mathbb{E}[\widehat{g}_N] = \mathbb{P}(K \ge 1)\,\nabla_\theta \log p = \bigl(1-(1-p)^N\bigr)\nabla_\theta \log p$, which is exactly the closed-form gradient of the finite sum. The $K=0$ case contributes zero and occurs with probability $(1-p)^N$; that missing mass is precisely the damping that separates $\nabla_\theta J^{(N)}_{\mathrm{MaxRL}}$ from $\nabla_\theta \log p$.
The practical consequence is profound. We never need to manually decide what truncation order to use or to store explicit representations of the series. Simply drawing $N$ rollouts and calculating the empirical correct-average score produces a gradient that, in expectation, corresponds to a specific point on the ladder of Maclaurin approximations. For $N=1$, $\mathbb{E}[\widehat{g}_1] = \nabla p = \nabla (-(1-p))$, which is exactly the standard REINFORCE gradient for the pass-rate objective (RL). As $N$ grows, the expected gradient moves step by step toward $\nabla \log p$, the gradient of the full maximum-likelihood objective. The estimator therefore builds a compute-indexed bridge: the more rollouts you can afford, the closer you drive the model to maximum likelihood on the correctness task, without any change in the algorithmic mechanism.
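For a small discrete policy, the equivalence can be verified exactly by enumerating every possible batch of $N$ rollouts instead of sampling. The sketch below assumes a three-output softmax policy whose logits are the parameters, with outputs 0 and 2 treated as correct; these choices and all function names are illustrative. The expected estimate is compared with the closed form $\bigl(1-(1-p)^N\bigr)\nabla_\theta \log p$ from Section 5.

```python
import itertools
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(z, probs):
    """Score of a softmax policy: one_hot(z) - probs."""
    return [(1.0 if j == z else 0.0) - probs[j] for j in range(len(probs))]

def expected_maxrl_estimate(logits, correct, n):
    """Exact E[g_hat_N] by summing over all dim**n batches of rollouts."""
    probs = softmax(logits)
    dim = len(logits)
    expectation = [0.0] * dim
    for batch in itertools.product(range(dim), repeat=n):
        successes = [z for z in batch if z in correct]
        if not successes:
            continue  # K = 0: the estimator returns the zero vector
        batch_prob = math.prod(probs[z] for z in batch)
        for z in successes:
            s = score(z, probs)
            for j in range(dim):
                expectation[j] += batch_prob * s[j] / len(successes)
    return expectation

def truncated_objective_grad(logits, correct, n):
    """(1 - (1-p)^n) * grad log p, with grad log p written out for softmax."""
    probs = softmax(logits)
    p = sum(probs[z] for z in correct)
    damping = 1.0 - (1.0 - p) ** n
    return [damping * ((probs[j] if j in correct else 0.0) / p - probs[j])
            for j in range(len(logits))]

logits = [0.3, 1.0, -0.4]  # hypothetical toy parameters
correct = {0, 2}           # hypothetical set of correct outputs
for n in (1, 2, 4):
    lhs = expected_maxrl_estimate(logits, correct, n)
    rhs = truncated_objective_grad(logits, correct, n)
    print(n, [round(v, 6) for v in lhs], [round(v, 6) for v in rhs])
```

Both columns coincide for each `n`, and the vectors lengthen toward $\nabla_\theta \log p$ as `n` increases, matching the ladder picture.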
In teaching, this equivalence is often presented as a compact, boxed statement that isolates the theorem from the surrounding algebra. The visual below captures that style: a centered theorem box containing the crucial equation $\mathbb{E}[\widehat g_N(x)] = \nabla_\theta J^{(N)}_{\mathrm{MaxRL}}(x)$, together with a minimal italic remark beneath: "Increasing $N$ climbs the MaxRL ladder." This one-liner distills the core insight that more compute automatically lifts the objective towards ML, making the ladder metaphor a handy mnemonic for the entire MaxRL framework.

11. Variance Reduction via a Control Variate

The previous section proved that the finite-sample estimator $\widehat{g}_N(x)$ is unbiased for the gradient of the truncated MaxRL objective. The proof did not address its variance, which turns out to be the central practical obstacle when we actually draw $N$ on-policy trajectories. We must now face the fact that for challenging correctness tasks, $K$, the number of successful completions in a batch, can be very small. The raw estimator
$$\widehat{g}_N(x) = \frac{1}{K}\sum_{i=1}^{N} r_i S_i, \qquad K \ge 1,$$
and zero otherwise, inherits an acute instability: the division by the random count $K$ amplifies fluctuations, especially when $K$ takes values like 1, 2, or 3. In those regimes, a single extra success or failure drastically changes the weight $1/K$, causing large jumps in the gradient estimate. Variance reduction therefore becomes essential for any on-policy implementation that hopes to converge reliably.
A natural first thought is to borrow the classic REINFORCE baseline trick. In standard policy gradients, we can subtract a state‑dependent baseline b(x)b(x)b(x) from the return without biasing the gradient, because the score has zero expectation: Ez∼mθ[b(x)∇θlog⁡mθ(z∣x)]=0\mathbb{E}_{z\sim m_\theta}[b(x) \nabla_\theta \log m_\theta(z|x)] = 0Ez∼mθ​​[b(x)∇θ​logmθ​(z∣x)]=0. Here, however, the normalisation by KKK breaks that property. The weight ri/Kr_i/Kri​/K is a random variable that depends on all NNN drawn solutions, making it correlated with the scores in a nontrivial way; a simple baseline no longer yields a zero‑mean correction. We need a control variate that remains incontrovertibly zero‑mean regardless of the mixture of successes and failures.
The solution is elegant and easy to compute: use the unconditional average score over all $N$ samples, with no reference to correctness. Define
$$V_N = \frac{1}{N}\sum_{i=1}^{N} S_i, \qquad S_i = \nabla_\theta \log m_\theta(z_i \mid x).$$
Because the score function always has zero expectation under the sampling distribution, $\mathbb{E}_{z_i\sim m_\theta}[S_i] = \mathbf{0}$, we immediately obtain $\mathbb{E}[V_N] = \mathbf{0}$ for any $N$ and any policy. This zero-mean property holds exactly, no matter the batch composition; it does not rely on independence from the rewards. Subtracting $V_N$ from the raw estimator therefore cannot introduce bias:
$$\mathbb{E}\bigl[\widehat{g}_N - V_N\bigr] = \mathbb{E}[\widehat{g}_N] - \mathbb{E}[V_N] = \mathbb{E}[\widehat{g}_N].$$
Thus $\widetilde{g}_N(x) = \widehat{g}_N(x) - V_N$ has the same expectation as $\widehat{g}_N(x)$: it remains an unbiased estimator of $\nabla_\theta J^{(N)}_{\mathrm{MaxRL}}(x)$, the gradient of the $N$-term truncation of $\log \mathbb{P}(\text{correct})$.
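The zero-mean property of the score is easy to confirm numerically. The sketch below assumes a softmax policy over a handful of outcomes (a hypothetical stand-in for $m_\theta$), for which the score with respect to the logits is the one-hot vector of the sampled outcome minus the probability vector, and computes $\mathbb{E}[S]$ exactly by summing over outcomes.

```python
from math import exp

def softmax(logits):
    m = max(logits)
    exps = [exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(z, probs):
    """Gradient of log softmax probability of outcome z with respect to the logits."""
    return [(1.0 if a == z else 0.0) - probs[a] for a in range(len(probs))]

def expected_score(logits):
    """Exact E_z[S(z)] under the softmax policy defined by the logits."""
    probs = softmax(logits)
    dim = len(probs)
    return [sum(probs[z] * score(z, probs)[a] for z in range(dim)) for a in range(dim)]

mean_score = expected_score([-1.0, 0.0, 0.5, 2.0])   # arbitrary illustrative logits
```

Every component of `mean_score` is zero up to rounding, and since $V_N$ is an average of such scores, its expectation is zero as well.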
Now we can write the combined estimator in a revealing form. When $K \ge 1$,
$$\widetilde{g}_N(x) = \frac{1}{K}\sum_{i=1}^{N} r_i S_i \;-\; \frac{1}{N}\sum_{i=1}^{N} S_i = \sum_{i=1}^{N} \left(\frac{r_i}{K} - \frac{1}{N}\right) S_i,$$
and $\widetilde{g}_N(x) = 0$ when $K = 0$. The per-sample weight shifts from $r_i/K$ to $r_i/K - 1/N$. For a correct sample ($r_i = 1$) this weight becomes $1/K - 1/N$; for an incorrect sample ($r_i = 0$) it becomes $-1/N$. The weights now sum to zero across all $N$ draws, because $\sum_i r_i = K$. This zero-sum property effectively removes the baseline drift that plagues the raw estimator. Intuitively, the control variate $V_N$ captures the aggregate random fluctuation of the score vectors, and because the same vectors appear in $\widehat{g}_N$, subtracting $V_N$ cancels a large portion of the stochastic noise, especially when $K$ is small and $\frac{1}{K} - \frac{1}{N}$ can be large in magnitude.
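The per-sample weights are simple enough to compute by hand, but a small helper makes the zero-sum property concrete. This is a minimal sketch; the function name `control_variate_weights` is hypothetical and the reward vector is made up for illustration.

```python
def control_variate_weights(rewards):
    """Per-sample weights r_i / K - 1 / N, or all zeros when no sample is correct."""
    n = len(rewards)
    k = sum(rewards)
    if k == 0:
        return [0.0] * n            # fallback: the estimator is defined as zero
    return [r / k - 1.0 / n for r in rewards]

# Eight rollouts, two of them correct: K = 2, N = 8.
weights = control_variate_weights([1, 0, 0, 0, 1, 0, 0, 0])
```

Here each correct sample receives $1/2 - 1/8$ and each incorrect sample receives $-1/8$, so the eight weights cancel exactly.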
The diagram that accompanies this section distills the construction into a compact visual. The left column presents the raw estimator $\widehat{g}_N$ in red, marked with its high-variance affliction for small $K$. The right column displays the control variate $V_N$ in blue, with an arrow pointing to the key property $\mathbb{E}[V_N] = \mathbf{0}$. A horizontal dashed line separates these building blocks from the final result below: a double-bordered box in dark green containing the variance-reduced estimator $\widetilde{g}_N(x) = \sum_{i=1}^N \left(\frac{r_i}{K} - \frac{1}{N}\right) S_i$. A small italic note reminds us that the estimator gracefully falls back to zero when $K = 0$. The colour coding and spatial layout reinforce the logical flow: we start from a noisy but unbiased estimate, subtract a zero-mean companion that shares its stochastic source, and obtain a stabilised estimator that remains unbiased and is far more practical for iterative on-policy training. This control variate trick turns Theorem 2’s unbiased estimator into a tool we can actually deploy, setting the stage for the complete on-policy algorithm that follows.

12. Algorithm 1: On-Policy MaxRL Implementation

With the control-variate estimator $\widetilde{g}_N(x)$ fully derived, the next step is to embed it inside a practical training loop. The result is Algorithm 1, an on-policy update that replaces the usual REINFORCE gradient with a variance-reduced contribution for each input $x$. The algorithm is simple to implement, yet its structure directly realises the truncated log-likelihood objective we recovered from the Maclaurin expansion.
For a batch of inputs $B$, the algorithm samples $N$ responses from the current policy $m_\theta(\cdot \mid x)$ for each $x$, evaluates the binary correctness reward $r_j = \mathbf{1}\{ f(z_j) = y^*(x) \}$, and records the score vectors $S_j = \nabla_\theta \log m_\theta(z_j \mid x)$. The crucial step is forming the empirical success rate $\hat{r}(x) = \frac{1}{N} \sum_{j=1}^N r_j$ and then computing the gradient contribution
$$\hat{g}(x) = \frac{1}{N\,\hat{r}(x)} \sum_{j=1}^{N} \bigl( r_j - \hat{r}(x) \bigr)\, S_j$$
whenever $\hat{r}(x) > 0$; otherwise $\hat{g}(x) = 0$. Because $N\hat{r}(x) = K$, this is the control-variate estimator $\widetilde{g}_N(x)$ of the previous section written in terms of the success rate. Comparing it with the standard REINFORCE estimator $\frac{1}{N}\sum_j (r_j - b) S_j$ reveals two fundamental differences: the denominator is the total number of successes $N \hat{r}(x)$ rather than $N$, and the baseline is the per-input empirical mean $\hat{r}(x)$ instead of an exogenous baseline. Both modifications arise directly from the derivation; they are not heuristic tweaks.
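The per-input contribution can be sketched in a few lines. The code below is an illustration under stated assumptions, not the source's reference implementation: score vectors are passed in as plain lists, the function names are hypothetical, and the numeric scores are made up. It also evaluates the $(r_i/K - 1/N)$ form so the two expressions can be compared.

```python
def maxrl_contribution(rewards, scores):
    """g_hat(x) = (1 / (N * r_hat)) * sum_j (r_j - r_hat) * S_j, or zero if r_hat == 0."""
    n = len(rewards)
    dim = len(scores[0])
    r_hat = sum(rewards) / n
    if r_hat == 0:
        return [0.0] * dim
    grad = [0.0] * dim
    for r, s in zip(rewards, scores):
        coeff = (r - r_hat) / (n * r_hat)
        for a in range(dim):
            grad[a] += coeff * s[a]
    return grad

def control_variate_form(rewards, scores):
    """The same quantity written as sum_i (r_i / K - 1 / N) * S_i."""
    n = len(rewards)
    dim = len(scores[0])
    k = sum(rewards)
    if k == 0:
        return [0.0] * dim
    grad = [0.0] * dim
    for r, s in zip(rewards, scores):
        for a in range(dim):
            grad[a] += (r / k - 1.0 / n) * s[a]
    return grad

rewards = [1, 0, 0, 1]
scores = [[0.2, -0.1], [-0.4, 0.3], [0.1, 0.1], [0.5, -0.2]]   # made-up score vectors
g_algo = maxrl_contribution(rewards, scores)
g_cv = control_variate_form(rewards, scores)
```

With two successes out of four, both functions weight correct samples by $+0.25$ and incorrect ones by $-0.25$, so `g_algo` and `g_cv` coincide.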
The normalisation by $N \hat{r}(x)$ is what ties the algorithm to maximum likelihood. Recall from the expansion of $\log \mathbb{P}[\text{correct} \mid x]$ that the $k$-th term involves the conditional expectation over the successful trajectories only. In the estimator $\widetilde{g}_N(x)$, the sum $\sum_j (r_j - \hat{r}(x)) S_j$ is equivalent to $\sum_{j: r_j=1} S_j - \hat{r}(x) \sum_j S_j$. When we divide by $N \hat{r}(x)$, we are effectively forming the estimator
$$\frac{1}{N \hat{r}(x)} \sum_{j: r_j=1} S_j \;-\; \frac{1}{N} \sum_{j} S_j,$$
which is a consistent sample approximation of $\mathbb{E}_{z\sim m_\theta(\cdot\mid x)}[S \mid r=1] - \mathbb{E}_{z\sim m_\theta(\cdot\mid x)}[S]$. The first term is the gradient of the log-likelihood restricted to correct completions, while the second term (the average over all samples) acts as a control variate whose expected value is zero under the policy. Thus the estimator targets the truncated log-likelihood objective: it pushes the policy to shift probability mass toward correct responses, as the full maximum-likelihood solution demands.
The condition $\hat{r}(x) > 0$ is a practical safeguard. If the policy never produces a correct answer for a given input within the $N$ samples, then there is no information about which directions would improve correctness, and the gradient contribution is set to zero. This prevents the update from being corrupted by division by zero and, more importantly, avoids misleading the policy when it is completely unsuccessful. It also reflects the truncation order: if no successes are observed, the corresponding term in the Maclaurin expansion would vanish, so the estimator stays coherent.
After accumulating $\hat{g}(x)$ across the batch, the final update direction is simply $\frac{1}{|B|} \sum_{x\in B} \hat{g}(x)$. This is a standard stochastic gradient step that averages gradient contributions over the batch. Note that the sampling of $N$ completions per input is done on-policy, so the rollouts must be re-drawn after each parameter update to maintain consistency with the current policy, exactly as in any REINFORCE-style algorithm.
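The whole loop can be sketched end to end on a toy problem. Everything below is an assumption made for illustration: the policy is a tabular softmax with one logit vector per prompt (a hypothetical stand-in for $m_\theta$), there are four candidate answers of which index 0 is correct, and the prompt names, learning rate, and step count are arbitrary. The update uses gradient ascent with the $1/|B|$ batch average, and fresh rollouts are drawn at every step.

```python
import random
from math import exp

def softmax(logits):
    m = max(logits)
    exps = [exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def maxrl_step(policy, answers, n_rollouts, lr, rng):
    """One on-policy update of a tabular softmax policy (toy stand-in for m_theta)."""
    batch = list(policy)
    updated = {}
    for x in batch:
        logits = policy[x]
        probs = softmax(logits)
        # Fresh rollouts from the current policy for this prompt.
        actions = rng.choices(range(len(logits)), weights=probs, k=n_rollouts)
        rewards = [1 if z == answers[x] else 0 for z in actions]
        r_hat = sum(rewards) / n_rollouts
        grad = [0.0] * len(logits)
        if r_hat > 0:                       # otherwise the contribution stays zero
            for z, r in zip(actions, rewards):
                coeff = (r - r_hat) / (n_rollouts * r_hat)
                for a in range(len(logits)):
                    score_a = (1.0 if a == z else 0.0) - probs[a]
                    grad[a] += coeff * score_a
        # Gradient ascent with the 1/|B| batch average.
        updated[x] = [v + lr * g / len(batch) for v, g in zip(logits, grad)]
    return updated

def pass_probability(logits, answer):
    return softmax(logits)[answer]

rng = random.Random(0)
policy = {"easy": [0.0, 0.0, 0.0, 0.0], "hard": [-2.0, 0.0, 0.0, 0.0]}
answers = {"easy": 0, "hard": 0}
before = {x: pass_probability(policy[x], answers[x]) for x in policy}
for _ in range(300):
    policy = maxrl_step(policy, answers, n_rollouts=8, lr=1.0, rng=rng)
after = {x: pass_probability(policy[x], answers[x]) for x in policy}
```

In this tabular setting the update for each prompt reduces to raising the correct logit by $1 - K/N$ and lowering every other logit in proportion to how often it was sampled, so the pass probability never decreases. A neural policy shares parameters across prompts and would not have that guarantee.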
The algorithm’s design highlights the role of $N$ as a compute-indexed knob. When $N=1$, $\hat{r}(x)$ can only be 0 or 1, and the centred weight $r_1 - \hat{r}(x)$ is zero in both cases, so $\hat{g}(x)$ vanishes identically; at least two rollouts per input are needed for a non-zero update. As $N$ grows, the estimator concentrates around the true truncated gradient, reducing variance and enabling finer-grained updates that approach MLE behaviour. Later we will see that the sample count $N$ directly corresponds to the truncation order in the objective, making the compute budget an explicit parameter that interpolates between RL and pure log-likelihood training.
It is also instructive to contrast MaxRL with GRPO, which normalises the reward by the standard deviation of rewards within a group of rollouts. GRPO subtracts the group mean and divides by the group standard deviation to measure advantage, an approach that works well for scalar reward shaping but does not target a log-likelihood objective. MaxRL, by using the per-input success fraction $\hat{r}(x)$ as both baseline and normaliser, recovers the exact gradient of a truncated log-likelihood when rewards are binary and the success probability is non-negligible.
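The difference between the two normalisers shows up directly in the per-sample coefficients that multiply the score vectors before the $1/N$ average. The sketch below is a simplification: the GRPO side keeps only the mean-and-standard-deviation normalisation described above, uses the population standard deviation, and adds a small `eps` for numerical safety, all of which are assumptions of this example. Other parts of a full GRPO implementation are omitted.

```python
from math import sqrt

def maxrl_coefficients(rewards):
    """(r_j - r_hat) / r_hat, so that (1/N) * sum_j coeff_j * S_j equals g_hat(x)."""
    n = len(rewards)
    r_hat = sum(rewards) / n
    if r_hat == 0:
        return [0.0] * n
    return [(r - r_hat) / r_hat for r in rewards]

def grpo_coefficients(rewards, eps=1e-8):
    """(r_j - mean) / (std + eps), using the population standard deviation of the group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

rewards = [1, 0, 0, 0, 0, 0, 0, 0]      # one success in eight rollouts
maxrl = maxrl_coefficients(rewards)
grpo = grpo_coefficients(rewards)
```

With a single success in eight rollouts, the success rate is $1/8$, so MaxRL scales the lone success by $(1 - \hat{r})/\hat{r} = 7$, whereas the standard-deviation normaliser scales it by $\sqrt{7} \approx 2.65$. The gap widens as successes become rarer.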
The accompanying slide provides a compact, at-a-glance summary of these ideas. A central pseudocode block faithfully renders Algorithm 1, with the line computing $\hat{g}(x)$ highlighted; this is the key line that embodies the variance-reduced estimator we have just deconstructed. Beneath the block, two concise bullet points contrast MaxRL with REINFORCE and GRPO: one notes that REINFORCE normalises by $N$ and subtracts a baseline, while MaxRL normalises by success count $N\hat{r}$ and subtracts the mean reward $\hat{r}$; the other points out that GRPO divides by standard deviation, whereas MaxRL’s normaliser is tied directly to the sample success rate. These contrasts crystallise the implementation differences that make MaxRL distinct, turning the slide into a quick reference for anyone implementing the method.

13. Unifying Weight-Function View

Having implemented MaxRL as an on-policy procedure in Algorithm 1, we now have a concrete algorithm that uses multiple independent attempts to construct an unbiased gradient. But stepping back, it becomes clear that MaxRL is not an isolated trick; it sits inside a broader family of methods that share a common mathematical skeleton. All of them can be understood as different ways of choosing how much to amplify the gradient signal from a prompt $x$ based on the model’s current pass probability $p = p_\theta^{\text{pass}}(x)$. This suggests a simple, unifying language: a weight function $w(p)$ that scales the per-example gradient $\nabla_\theta p$. Once we adopt this view, the design space of policy-gradient algorithms for correctness tasks collapses to selecting the weight function $w(p)$, and we can compare methods side-by-side by looking at their weight curves.
The shared template is deceptively compact. Let $\rho$ denote the distribution over prompts (or more generally the state visitation distribution under the current policy). Then the gradient of any objective that factors through the pass probability can be written as
$$\nabla_{\theta} J = \mathbb{E}_{x\sim\rho}\Big[ w(p)\, \nabla_{\theta} p \Big], \qquad p = p_\theta^{\text{pass}}(x).$$
The weight function $w(p)$ encodes the objective’s sensitivity to examples of different difficulty. An example where the model nearly always succeeds ($p \approx 1$) and an example where it nearly always fails ($p \approx 0$) can receive drastically different weights depending on the method. The template itself is a direct consequence of the policy gradient theorem when the per-step reward is replaced by a binary correctness signal (for an objective of the form $J = \mathbb{E}_{x\sim\rho}[f(p)]$ it is simply the chain rule, with $w(p) = f'(p)$), but the real value is that it decouples the scale of the update from the raw probability, allowing us to design algorithms by reasoning about $w(p)$ in isolation.
Standard reinforcement learning, the most common baseline, treats every completed trajectory equally: a correct answer yields a reward of 1, an incorrect one 0. Under this reward scheme the expected return is the pass probability, and its gradient reduces to $\mathbb{E}[\nabla_\theta p]$, i.e., $w_{\text{RL}}(p) = 1$. This constant weight ignores how certain the model already is about a prompt. An easy prompt contributes exactly as much gradient as a borderline one, which is wasteful when many gradient samples are dominated by high-probability noise. GRPO, the method used in DeepSeek-R1, attempts to remedy this by normalising rewards within a group of rollouts. Its effective weight function becomes the reciprocal of the standard deviation of the binary outcome: $w_{\text{GRPO}}(p) = 1/\sqrt{p(1-p)}$. This function is symmetric and strongly U-shaped: it heavily upweights prompts where $p \approx 0$ or $p \approx 1$ because those are the cases with the smallest variance. While this gives maximal weight to confidently correct or confidently wrong answers, it also amplifies noise, since the variance estimate itself is unstable for extreme probabilities.
At the opposite extreme sits maximum likelihood estimation over correct trajectories. Maximising log-likelihood of correct answers (equivalently, minimising the cross-entropy loss on only positive examples) gives a gradient of the form $\mathbb{E}_{x\sim\rho}[(1/p)\nabla_\theta p]$, so $w_{\text{ML}}(p) = 1/p$. This hyperbola is gentle for easy prompts ($p$ near 1, weight near 1) but grows without bound as $p$ shrinks, desperately trying to lift the tiniest success probabilities. It is compute-hungry because it needs reliable estimates of $\nabla_\theta \log p$ from very rare successes: on the order of $1/p$ samples are needed just to observe a single success. MaxRL bridges these extremes. Its weight function, derived from the truncated Maclaurin expansion of $\log p$, is
$$w_{\text{MaxRL}(T)}(p) = \frac{1 - (1-p)^T}{p}.$$
For $T=1$ this reduces to $w_{\text{RL}}$ (since $1-(1-p)=p$), while as $T\to\infty$ it approaches $1/p$ for any $p>0$, recovering ML. At finite $T$, the function tracks the ML hyperbola $1/p$ wherever $p \gg 1/T$, because $(1-p)^T$ is negligible there, and it levels off at $w \to T$ as $p \to 0$, since $1-(1-p)^T \approx Tp$ for small $p$. This interpolation is controlled solely by the truncation order $T$, which equals the number of independent attempts used in the MaxRL estimator: a direct compute-indexed bridge.
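The four weight functions are one-liners, which makes their limiting behaviour easy to probe. The sketch below uses hypothetical function names and a few arbitrary probe values; it also tabulates the MaxRL weights for several $T$ at sample pass probabilities.

```python
from math import sqrt

def w_rl(p):
    return 1.0

def w_ml(p):
    return 1.0 / p

def w_grpo(p):
    return 1.0 / sqrt(p * (1.0 - p))

def w_maxrl(p, t):
    """(1 - (1 - p)^T) / p, equivalently sum_{k=0}^{T-1} (1 - p)^k."""
    return (1.0 - (1.0 - p) ** t) / p

# Rows of (p, w_RL, w_MaxRL(2), w_MaxRL(10), w_MaxRL(50), w_ML, w_GRPO).
table = [
    (p, w_rl(p), w_maxrl(p, 2), w_maxrl(p, 10), w_maxrl(p, 50), w_ml(p), w_grpo(p))
    for p in (0.001, 0.01, 0.1, 0.5, 0.9)
]

t1_weight = w_maxrl(0.3, 1)          # T = 1 should match the RL weight of 1
large_t_weight = w_maxrl(0.3, 200)   # large T should approach 1 / p
tiny_p_weight = w_maxrl(1e-6, 10)    # p -> 0 should level off near T = 10
```

For every row of `table` the MaxRL weights are ordered between the RL and ML columns and increase with $T$, which is the "fan" between the two extremes described in the text.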
Viewing all four weight functions on a single log-log plot makes the relationships immediate and intuitive. The x-axis is the pass probability $p$, spanning several orders of magnitude from near-impossible prompts to near-certain ones. The y-axis is the weight $w(p)$, also on a logarithmic scale to expose power-law behaviour. In such a visual, RL appears as a flat horizontal line at $w=1$, utterly indifferent to $p$. Maximum likelihood traces a straight line with slope $-1$ (since $\log w = -\log p$), which skyrockets for tiny $p$. GRPO forms a symmetric bowl that rises sharply at both ends, visually distinct from everything else. The MaxRL family fans out between RL and ML: for $T=2$ the curve is $w = 2 - p$, which never rises above 2; $T=10$ bends significantly earlier; $T=50$ hugs $1/p$ over a wide range before saturating at $w \approx T$ for $p\to0$. This saturation is the key: MaxRL never overweights hopeless prompts as severely as ML does, because the truncation caps the weight at $T$. The plot also reveals a subtle danger: GRPO’s peak at $p \to 1$ can be far larger than any MaxRL curve for confident successes, potentially causing overfitting to already-mastered prompts instead of focusing compute on the tail. In contrast, MaxRL concentrates gradient credit on the examples that are neither impossible nor already solved, which matches the intuition of efficient learning.
The visual below distills the entire discussion into a single, glanceable comparison. The four families are colour-coded and a legend identifies each by its functional form. The family of MaxRL($T$) curves, plotted for $T \in \{2,4,10,50\}$, visibly threads the needle between the flat RL baseline and the steep ML target, illustrating how the hyperparameter $T$ indexes a smooth trade-off. The log-log axes make it obvious that the weight functions differ most dramatically in the low-probability regime, precisely where data scarcity forces an algorithm to choose between high variance and high bias. This unified weight-function view not only organises existing methods but also suggests new ones: any monotonically decreasing $w(p)$ with controlled growth near zero could be a candidate for compute-efficient reinforcement fine-tuning, and this plot gives us the mental model to design it.

14. Empirical Highlights

With a unified weight-function view of the optimization landscape, the theoretical promise of MaxRL becomes concrete: by indexing the gradient estimate with a truncation order $N$, the algorithm interpolates between a raw reinforcement signal and the exact maximum-likelihood gradient. The natural next step is to validate whether that interpolation translates into genuine empirical gains, especially in regimes where existing methods are known to struggle. The empirical highlights across image classification, spatial reasoning, mathematical problem-solving, and large-scale language model fine-tuning all point to the same conclusion: MaxRL consistently outperforms REINFORCE-style baselines and the popular GRPO family, often by dramatic margins.
Recall the core predicament that standard policy-gradient methods face on correctness tasks. When a binary reward only flags whether a sampled answer is correct, the gradient estimator is fundamentally limited to the support of positive rollouts. In problems with a low initial pass rate, that support can be extremely sparse; the resulting signal is weak, high-variance, and entirely blind to the structure of incorrect answers. REINFORCE and its modern derivatives (RLOO, GRPO) consequently stall in these cold-start conditions: they simply do not see enough correct traces to climb out of the low-performance basin. MaxRL sidesteps this trap by exploiting the Maclaurin expansion of the log-pass-probability: instead of ignoring negative rollouts, it weights them according to the truncated series for $\log p$, so that both correct and incorrect samples receive a meaningful learning signal. The truncation order $N$ becomes a compute-indexed dial that, as it is turned up, approaches the full log-likelihood gradient.
The first striking demonstration comes from an ImageNet classification proxy, where the model is trained from a low initial pass rate. Standard REINFORCE plateaus early: its reliance on sporadic positive examples prevents convergence toward the cross-entropy teacher. MaxRL, in contrast, closely tracks the cross-entropy baseline as the number of rollouts per sample grows (Figure 2). This is a direct consequence of Theorem 2: with $N$ rollouts, the finite-sample MaxRL estimator is unbiased for the gradient of the $N$-term truncation of the Maclaurin series of the log-pass-probability. As $N$ increases, the objective smoothly morphs into a proper maximum-likelihood loss, explaining why it can eventually match cross-entropy performance. The experiment thus confirms that MaxRL’s compute-indexed bridge is not just a formal curiosity but a practical mechanism for escaping the cold-start trap.
Equally telling is a Maze navigation task with access to effectively infinite training data. Here the question is not data scarcity but gradient efficiency: how many environment interactions are needed to reach a strong policy? MaxRL scales far more gracefully with the number of training rollouts than GRPO does. Notably, MaxRL with only 4 rollouts per problem instance already outperforms GRPO using 128 rollouts (Figure 3, Table 3). This counter‑intuitive result makes sense through the weight‑function lens. GRPO applies a severe advantage‑clipping operation that discards fine‑grained credit assignment among rollouts, effectively compressing the information into a crude relative‑ranking signal. MaxRL’s weight function, being a smooth polynomial in the pass rate, preserves richer per‑sample feedback even with a small rollout budget, so it needs far fewer samples to build an accurate gradient. Empirically, this translates into a massive reduction in required compute.
Perhaps the most dramatic warning for practitioners comes from the GSM8K data-scarce regime. Here, fine-tuning a language model on only a handful of math word-problem chains reveals a dark side of optimizing solely for pass rate. GRPO and RLOO drive the model to high pass@1, but they simultaneously suffer catastrophic pass@k degradation: the set of valid solution paths collapses, and the model loses the diversity that makes test-time majority voting effective (Figure 4, Table 4). MaxRL achieves a higher peak pass@1 while preserving pass@k diversity: the distribution over correct reasoning chains remains rich. This is exactly what we would expect when the objective approximates maximum likelihood rather than a mode-seeking RL signal. The truncation order $N$ acts as an implicit regularizer; even with finite $N$, the log-probability target encourages coverage of all correct modes, not just the easiest one. For safety-critical or reasoning-intensive tasks, that property is invaluable.
The modern scale test cements the case. Fine-tuning Qwen3 1.7B and 4B models on mathematical benchmarks (AIME, BeyondAIME, MATH-500, Minerva) with a perfect outcome verifier reveals that MaxRL Pareto-dominates GRPO on both pass@1 and pass@k across all tasks (Figure 5). The dominance is particularly tangible in test-time compute scaling: when allowed to sample and majority-vote at test time, models trained with MaxRL achieve up to a 20× efficiency gain over GRPO-trained counterparts. In other words, to reach a target accuracy, a MaxRL model needs up to 20 times fewer test-time samples, directly capitalizing on its preserved distributional diversity. The visual below captures this cluster of results in a 2×2 grid of summary bullet points, with green-coded successes where MaxRL excels and red-coded pitfalls for competing methods. A dedicated highlight box underscores the 20× test-time scaling advantage, reminding us that the bridge from RL to log-likelihood is not merely an academic equivalence but a recipe for substantially better sample efficiency at both training and inference time.

15. MaxRL at a Glance

The empirical results we just examined show a striking pattern: a model fine-tuned with a simple reinforcement learning reward—answer correctness—can boost its pass rate on held-out prompts, yet it often fails to capture the full statistical richness of the data. The learned policy may ignore subtle failure modes, become overconfident, and ultimately leave a gap when we measure its log-probability rather than the binary pass rate. This raises a deeper question: can we design a family of objectives that, at low compute, behaves like an RL pass-rate maximizer but, as we increase the number of samples, converges to the maximum likelihood estimate? Maximum Likelihood Reinforcement Learning (MaxRL) does exactly that by carving a compute-indexed path from the binary reward world to the log-likelihood ideal.
The key mathematical observation is that for any prompt $x$ and any evaluation protocol that ultimately extracts a binary correctness outcome, the model’s pass probability $p=p_\theta^{\text{pass}}(x)$ satisfies
$$\log p = \log\bigl(1-(1-p)\bigr) = -\sum_{k=1}^\infty \frac{(1-p)^k}{k},$$
a Maclaurin series in $1-p$ that converges for $0 < p \le 1$. Truncating this expansion after $T$ terms yields a family of surrogate objectives
$$J^{(T)}_{\text{MaxRL}} = -\sum_{k=1}^{T} \frac{(1-p)^k}{k},$$
where we treat $p$ as the quantity to be optimized over the policy parameters. When $T=1$, $J^{(1)}=-(1-p)$, a linear function of the pass rate; maximizing it is equivalent (up to an additive constant) to maximizing the expected binary reward, which is standard RL on correctness. For $T>1$, higher-order terms enter the objective. Since $(1-p)^k = 1 - \text{pass@}k$, where pass@$k$ is the probability that at least one of $k$ sampled answers is correct, the $k$-th term rewards pass@$k$ with weight $1/k$, and these terms become increasingly influential when $p$ is small. In other words, the truncation order $T$ acts as a dial that controls how far we push beyond a single binary success toward the full log-probability surface.
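A few lines of code make the truncation concrete. The sketch below evaluates the truncated objective directly from its definition (the function name is hypothetical) and compares it with $\log p$ for a few illustrative values of $p$ and $T$.

```python
from math import log

def maxrl_objective(p: float, t: int) -> float:
    """J^(T)(p) = -sum_{k=1}^{T} (1 - p)^k / k, the T-term truncation of log p."""
    return -sum((1.0 - p) ** k / k for k in range(1, t + 1))

rl_like = maxrl_objective(0.5, 1)        # T = 1: equals p - 1, linear in the pass rate
near_log = maxrl_objective(0.5, 200)     # large T: essentially log(0.5)
hard_t5 = maxrl_objective(0.1, 5)        # hard prompt, few terms
hard_t50 = maxrl_objective(0.1, 50)      # hard prompt, many terms
gap_t5 = hard_t5 - log(0.1)              # truncation error, always positive
gap_t50 = hard_t50 - log(0.1)
```

Every term of the series is negative, so the truncated objective always sits above $\log p$ and moves down toward it as $T$ grows. The gap closes quickly for easy prompts and slowly for hard ones, which is why hard prompts need more rollouts to feel the full log-likelihood pressure.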
Gradient-based optimization is made practical by the following identity (Theorem 1 of the MaxRL derivation):
$$\nabla_\theta J^{(T)} = \sum_{k=1}^{T} \frac{1}{k}\,\nabla_\theta\,\text{pass@}k(x).$$
Each term $\nabla_\theta\,\text{pass@}k(x)$ can be estimated without bias from a finite number of rollouts. Crucially, when we draw $N$ independent rollouts per prompt and construct the natural estimator $\widehat{g}_N$ (described earlier in the lecture), that estimator is unbiased for $\nabla_\theta J^{(N)}$. Thus the rollout count $N$ directly sets the effective truncation order: with $N$ samples we are, in expectation, optimizing $J^{(N)}_{\text{MaxRL}}$. This compute-indexed link is the heart of MaxRL: the level of sampling determines which member of the objective family we are actually pursuing.
Why does this matter? As we increase $N$, the objective $J^{(N)}$ progressively incorporates pass@$k$ terms for larger $k$, each weighted by $1/k$. Hard problems, where the pass probability $p$ is low, see relatively larger contributions from higher-order terms: since $\text{pass@}k = 1-(1-p)^k$, the $k$-th term is $\frac{1}{k}\nabla_\theta\,\text{pass@}k(x) = (1-p)^{k-1}\nabla_\theta p$, which is close to $\nabla_\theta p$ when $p$ is small and negligible when $p$ is close to 1. The gradient thus concentrates on the most difficult prompts, preventing the model from simply learning a uniform “easy mode” strategy. At the same time, the explicit dependence on pass@$k$ for $k>1$ discourages pass@$k$ collapse, a degenerate behavior observed in vanilla RL where the policy becomes deterministic and all $k$ attempts produce the same (possibly wrong) answer, making pass@$k$ estimates unreliable and learning stale. With MaxRL, even if pass@1 is high, the model retains incentive to produce diverse correct solutions, because any failure to achieve at least one correct answer among $k$ trials is penalized by the $(1-p)^k/k$ terms.
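The link between this pass@$k$ decomposition and the weight function $w_{\text{MaxRL}(T)}(p)$ of the weight-function section is a short geometric sum. The derivation below is a worked sketch that uses only $\text{pass@}k = 1-(1-p)^k$; the numerical example at the end uses illustrative values $p = 0.1$ and $T = 4$.

```latex
\begin{aligned}
\nabla_\theta\,\text{pass@}k
  &= \nabla_\theta\bigl[1-(1-p)^k\bigr]
   = k\,(1-p)^{k-1}\,\nabla_\theta p, \\
\nabla_\theta J^{(T)}
  &= \sum_{k=1}^{T}\frac{1}{k}\,\nabla_\theta\,\text{pass@}k
   = \Bigl(\sum_{k=1}^{T}(1-p)^{k-1}\Bigr)\nabla_\theta p
   = \frac{1-(1-p)^{T}}{p}\,\nabla_\theta p .
\end{aligned}
% Example: p = 0.1, T = 4 gives 1 + 0.9 + 0.81 + 0.729 = 3.439 = (1 - 0.9^4) / 0.1.
```

The $1/k$ weights exactly cancel the factor $k$ produced by differentiating pass@$k$, which is why the per-prompt weight is a plain geometric series that saturates at $T$ as $p \to 0$.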
Taken together, the MaxRL framework cleanly unifies pure RL, maximum likelihood, and a spectrum of intermediate objectives under a single weight-function perspective (when compared to GRPO and similar algorithms, MaxRL can be seen as adjusting the sampling weight according to $1/p$ truncated at order $N$). The visual summary below condenses this unified view into a compact table, listing the core objective, its stochastic gradient, and the unbiased estimator that realizes the truncation via rollouts. Below the table, highlighted bullet points reinforce the compute-indexed property (more samples $N$ imply a higher truncation $T=N$ and a better ML approximation) and the practical advantages: concentrating gradient on hard examples, preventing pass@$k$ collapse, and scaling effectively with both compute and data. This at-a-glance reference grounds the more detailed theorems we have explored and serves as a quick mental model for why MaxRL behaves as a smooth bridge from simple reward maximization to full log-probability learning.
