ECHO: Turning Terminal Feedback into Dense Supervision for Agent RL - FeynmanWiki

CONTENTS

Bookmark this paper

Save for later reading

MACHINE LEARNING, REINFORCEMENT LEARNING - 45 MIN READ

ECHO: Turning Terminal Feedback into Dense Supervision for Agent RL

1. The Wasted Supervision in Terminal Agents

When we train an autonomous agent to interact with a terminal—editing code, running test suites, invoking the compiler—every rollout generates a long sequence of interleaved agent actions and environment observations. A typical trace might begin with a system prompt, continue with the assistant’s command python run.py, and then receive a burst of diagnostic output: a stack trace, an error type, and a line number pointing to the offending indentation. This raw interaction is rich with information about what went wrong, yet in the dominant paradigm of reinforcement learning with Group Relative Policy Optimization (GRPO), almost all of it is deliberately discarded before the optimizer ever sees it. Understanding why and what we lose is the first step toward the ECHO method.
The GRPO loss is computed token by token over the sequence, but a crucial loss mask restricts gradient flow to a specific subset. Let the rollout be x=(x1,…,xT)x = (x_1, \dots, x_T)x=(x1​,…,xT​). The model assigns a probability ratio ρt(i)\rho^{(i)}_tρt(i)​ at each token relative to the reference policy, and the advantage A^(i)\hat{A}^{(i)}A^(i) is computed from the final reward of the full rollout (or group-normalised reward). The per-token loss is
loss(xt)={clipped(ρt(i)) A^(i),t∈A0,t∉A\text{loss}(x_t) = 
\begin{cases} 
\text{clipped}\big(\rho^{(i)}_t\big)\,\hat{A}^{(i)}, & t \in A \\[4pt]
0, & t \notin A
\end{cases}loss(xt​)={clipped(ρt(i)​)A^(i),0,​t∈At∈/A​
where AAA is the set of action tokens generated by the assistant. Everything else—system messages, environment outputs, error reports, file contents, and build logs—carries a loss of zero. This masking scheme is a natural extension of the language model fine-tuning paradigm that treats only the model’s own text as trainable, but in agentic trajectories it creates a severe credit assignment bottleneck.
The problem emerges from the interaction between masking and the advantage signal. In terminal-based tasks, success is often binary and sparse: the agent either completes the objective (a passing test, a correct edit) or it does not. In practice, when fine-tuning a model like Qwen3-8B on a realistic coding benchmark, fewer than 15% of rollouts succeed. For the remaining >85%, the final reward—and therefore the advantage A^(i)\hat{A}^{(i)}A^(i)—is either zero or lies in a tightly grouped cluster where the group-normalized advantage becomes vanishingly small. The masked loss renders most interaction data invisible to the parameter update. The forward pass still processes the observation tokens (they consume compute and shape the internal states for subsequent actions), but the backward pass ignores them entirely. Learning is starved: thousands of failed explorations carry zero effective gradient for the portions of the sequence that describe why the failure occurred.
This would be merely inefficient if the discarded tokens were pure noise, but they are not. Every failed rollout still contains terminal feedback that explains the failure. A Python traceback tells the model which line caused an IndentationError and the exact context around it. A build log might reveal a missing header file. A test run output shows which assertion failed and by how much. These signals are supervision—just not in the form of a scalar reward, and just not placed on the action tokens themselves. Under standard GRPO masking, the traceback is a silent passenger: it enters the transformer stack, influences the hidden representations, but the loss function never explicitly rewards the model for generating actions that would have fixed the error. The supervision is, in effect, wasted.
The core insight of ECHO is that this wasted supervision can be reclaimed as a dense auxiliary training signal. By adding an extra objective—a cross-entropy loss that predicts the observation tokens themselves, conditioned on the preceding context—the model is forced to actively model the terminal feedback that follows its actions. If the assistant predicts that running a block of code will produce a certain error, it is effectively learning a world model of the terminal environment, and this predictive competence directly improves its ability to generate corrective actions. The mask that once silenced the environment becomes a teacher.
The accompanying diagram distills this imbalance. On the left, a terminal window shows the agent’s action python run.py and the resulting multi-line traceback ending in IndentationError: unexpected indent. In the center, the loss mask is rendered as a sequence of colored blocks: the action tokens are highlighted (say, in green) with an arrow pointing to the loss gradient, while the observation tokens—the entire error message—are grayed out and marked with a red “X” and the label loss = 0. From those grayed blocks, a branching arrow leads to a separate box labeled “Unused Supervision,” making the waste explicit. To the right, a small bar chart compares the tiny fraction of rollouts that receive a non-zero advantage (~15%) against the vast majority that contribute near-zero advantage (~85%), annotated “Policy gradient signal vanishes.” The visual layout forces the reader to confront the reality that nearly all diagnostic content is systematically excluded from training, and it frames ECHO’s ambition to turn that silent resource into a learning signal. In the next section, we will dissect the multi-turn rollout structure that makes this auxiliary supervision both feasible and effective.

CONTENTS

Bookmark this paper

Save for later reading

MACHINE LEARNING, REINFORCEMENT LEARNING - 45 MIN READ

ECHO: Turning Terminal Feedback into Dense Supervision for Agent RL

1. The Wasted Supervision in Terminal Agents

When we train an autonomous agent to interact with a terminal—editing code, running test suites, invoking the compiler—every rollout generates a long sequence of interleaved agent actions and environment observations. A typical trace might begin with a system prompt, continue with the assistant’s command python run.py, and then receive a burst of diagnostic output: a stack trace, an error type, and a line number pointing to the offending indentation. This raw interaction is rich with information about what went wrong, yet in the dominant paradigm of reinforcement learning with Group Relative Policy Optimization (GRPO), almost all of it is deliberately discarded before the optimizer ever sees it. Understanding why and what we lose is the first step toward the ECHO method.
The GRPO loss is computed token by token over the sequence, but a crucial loss mask restricts gradient flow to a specific subset. Let the rollout be x=(x1,…,xT)x = (x_1, \dots, x_T)x=(x1​,…,xT​). The model assigns a probability ratio ρt(i)\rho^{(i)}_tρt(i)​ at each token relative to the reference policy, and the advantage A^(i)\hat{A}^{(i)}A^(i) is computed from the final reward of the full rollout (or group-normalised reward). The per-token loss is
loss(xt)={clipped(ρt(i)) A^(i),t∈A0,t∉A\text{loss}(x_t) = 
\begin{cases} 
\text{clipped}\big(\rho^{(i)}_t\big)\,\hat{A}^{(i)}, & t \in A \\[4pt]
0, & t \notin A
\end{cases}loss(xt​)={clipped(ρt(i)​)A^(i),0,​t∈At∈/A​
where AAA is the set of action tokens generated by the assistant. Everything else—system messages, environment outputs, error reports, file contents, and build logs—carries a loss of zero. This masking scheme is a natural extension of the language model fine-tuning paradigm that treats only the model’s own text as trainable, but in agentic trajectories it creates a severe credit assignment bottleneck.
The problem emerges from the interaction between masking and the advantage signal. In terminal-based tasks, success is often binary and sparse: the agent either completes the objective (a passing test, a correct edit) or it does not. In practice, when fine-tuning a model like Qwen3-8B on a realistic coding benchmark, fewer than 15% of rollouts succeed. For the remaining >85%, the final reward—and therefore the advantage A^(i)\hat{A}^{(i)}A^(i)—is either zero or lies in a tightly grouped cluster where the group-normalized advantage becomes vanishingly small. The masked loss renders most interaction data invisible to the parameter update. The forward pass still processes the observation tokens (they consume compute and shape the internal states for subsequent actions), but the backward pass ignores them entirely. Learning is starved: thousands of failed explorations carry zero effective gradient for the portions of the sequence that describe why the failure occurred.
This would be merely inefficient if the discarded tokens were pure noise, but they are not. Every failed rollout still contains terminal feedback that explains the failure. A Python traceback tells the model which line caused an IndentationError and the exact context around it. A build log might reveal a missing header file. A test run output shows which assertion failed and by how much. These signals are supervision—just not in the form of a scalar reward, and just not placed on the action tokens themselves. Under standard GRPO masking, the traceback is a silent passenger: it enters the transformer stack, influences the hidden representations, but the loss function never explicitly rewards the model for generating actions that would have fixed the error. The supervision is, in effect, wasted.
The core insight of ECHO is that this wasted supervision can be reclaimed as a dense auxiliary training signal. By adding an extra objective—a cross-entropy loss that predicts the observation tokens themselves, conditioned on the preceding context—the model is forced to actively model the terminal feedback that follows its actions. If the assistant predicts that running a block of code will produce a certain error, it is effectively learning a world model of the terminal environment, and this predictive competence directly improves its ability to generate corrective actions. The mask that once silenced the environment becomes a teacher.
The accompanying diagram distills this imbalance. On the left, a terminal window shows the agent’s action python run.py and the resulting multi-line traceback ending in IndentationError: unexpected indent. In the center, the loss mask is rendered as a sequence of colored blocks: the action tokens are highlighted (say, in green) with an arrow pointing to the loss gradient, while the observation tokens—the entire error message—are grayed out and marked with a red “X” and the label loss = 0. From those grayed blocks, a branching arrow leads to a separate box labeled “Unused Supervision,” making the waste explicit. To the right, a small bar chart compares the tiny fraction of rollouts that receive a non-zero advantage (~15%) against the vast majority that contribute near-zero advantage (~85%), annotated “Policy gradient signal vanishes.” The visual layout forces the reader to confront the reality that nearly all diagnostic content is systematically excluded from training, and it frames ECHO’s ambition to turn that silent resource into a learning signal. In the next section, we will dissect the multi-turn rollout structure that makes this auxiliary supervision both feasible and effective.

2. Multi-Turn Rollout Structure

Building on the previous observation that terminal tasks waste a surprising amount of supervisory signal, we now zoom into the exact shape of the data that a policy model sees during multi‑turn agent training. Before we can understand how to recover that wasted signal, we need a precise, token‑level picture of a rollout and where the standard GRPO algorithm does—and more importantly, does not—apply a loss. This will make the sparsity problem concrete and lay the groundwork for the dense auxiliary objective that ECHO introduces.
A training rollout in an agent‑environment loop is not a flat slab of text; it is a structured conversation where the assistant and the environment take turns. After an initial system prompt and a task description, the model generates its first action, the environment responds with an observation, the model generates a second action, the environment emits a second observation, and so on for KKK turns. When we tokenize this entire exchange, we get a long sequence that alternates between tokens produced by the assistant and tokens produced by the environment. We can write it schematically as
[sys][task]  [action1]  [obs1]  [action2]  [obs2]  …  [actionK]  [obsK]\texttt{[sys][task]}\; \texttt{[action}_1]\; \texttt{[obs}_1]\; \texttt{[action}_2]\; \texttt{[obs}_2]\; \dots\; \texttt{[action}_K]\; \texttt{[obs}_K][sys][task][action1​][obs1​][action2​][obs2​]…[actionK​][obsK​]
where each block is a contiguous span of tokens. Let the full token sequence be x1,x2,…,xTx_1, x_2, \dots, x_Tx1​,x2​,…,xT​, with TTT the total length.
To talk precisely about what parts of this sequence influence the model’s parameters during training, we define two disjoint index sets that partition the action and observation tokens (the fixed prefix [sys][task] is omitted from both because it is neither). The set of action positions AAA contains every token index ttt where xtx_txt​ belongs to an assistant‑generated action block. The set of observation positions OOO contains every index ttt where xtx_txt​ belongs to an environment‑generated observation block. Formally,
A={t∈{1,…,T}∣xt is part of an assistant action},O={t∈{1,…,T}∣xt is part of an environment observation}.A = \{ t \in \{1,\dots,T\} \mid x_t \text{ is part of an assistant action} \},
\qquad
O = \{ t \in \{1,\dots,T\} \mid x_t \text{ is part of an environment observation} \}.A={t∈{1,…,T}∣xt​ is part of an assistant action},O={t∈{1,…,T}∣xt​ is part of an environment observation}.
These sets are complementary within the interactive portion of the sequence; every token after the prefix falls into exactly one of them.
The critical design choice in standard GRPO‑based agent training is that the policy‑gradient loss is computed only for tokens in AAA. For each action token, the model evaluates the log‑probability (or a related advantage‑weighted quantity) and backpropagates to encourage or discourage that generation. The tokens in OOO, on the other hand, are treated purely as conditioning context: the model reads them in order to compute the probability of subsequent actions, but it never receives a direct learning signal from whether those observation tokens are surprising, consistent, or informative. This means the parameters are never pushed to understand or predict the environment’s response as a target per se; the observations only matter insofar as they change the log‑probabilities of future actions that the loss actually touches.
This asymmetry has a far‑reaching consequence. Because the model typically generates only a small fraction of the total tokens—often a single short action followed by a much longer observation from a symbolic environment or a tool—the set AAA is vastly sparser than OOO. The model might see hundreds of observation tokens for every handful of action tokens, yet the gradient only flows through the thin blue slices of the sequence. All the rich information contained in the environment’s deterministic or stochastic transitions, coded into the tokens that spell out a file listing, an execution trace, or a web‑page dump, is ignored as a potential target. The sparsity is not merely cosmetic; it directly slows down learning because the model cannot exploit the density of supervision that the environment naturally provides.
The visual below encapsulates this rollout structure and the resulting training asymmetry in a single diagrammatic snapshot. A horizontal bar stretches from position 1 to TTT, with the initial [sys][task] segment colored in gray, followed by alternating blue (action) and green (observation) blocks. Beneath the bar, large labels mark each blue segment with “A” and each green segment with “O”, making the index sets immediately visible. Above the sequence, a curly brace indicates that the model conditions on all prior tokens, but a prominent callout emphasizes that only the blue (action) positions receive a policy‑gradient loss; the green observation positions are not supervised. This compact layout makes the core imbalance tangible: the model is forced to learn a dialogue through a handful of action‑level nudges while drowning in observation tokens it is never asked to predict. Recognizing this waste of dense, free supervision is the first motivation behind ECHO, which we will develop by converting those un‑supervised observation tokens into a complementary training signal.

3. GRPO Recap and Its Sparsity

Having unpacked the multi‑turn rollout structure where an agent and environment alternate between actions and observations, we can now examine how standard Group Relative Policy Optimization (GRPO) interacts with that sequence. GRPO, popularized in the DeepSeek‑R1 framework, adapts the PPO clipped surrogate objective to a group‑wise advantage normalization scheme. For each prompt xxx we sample nnn independent rollouts {y(i)}i=1n\{y^{(i)}\}_{i=1}^{n}{y(i)}i=1n​; after execution a terminal binary reward r(i)∈{0,1}r^{(i)}\in\{0,1\}r(i)∈{0,1} (success/failure) is assigned. Instead of estimating a value function, GRPO computes a simple, group‑normalized advantage for each rollout:
A^(i)=r(i)−r‾σr+ε,\hat{A}^{(i)} = \frac{r^{(i)} - \overline{r}}{\sigma_r + \varepsilon},A^(i)=σr​+εr(i)−r​,
where r‾\overline{r}r is the mean reward in the group and σr\sigma_rσr​ is the standard deviation. This normalization ensures ∑iA^(i)=0\sum_i \hat{A}^{(i)}=0∑i​A^(i)=0 — a property that will prove to be both a blessing and a curse when rewards are sparse.
The policy gradient is then applied only at the positions where the assistant model has agency: the action tokens. Formally, let A(i)A^{(i)}A(i) denote the set of time‑indices within rollout iii that correspond to tokens generated by the policy (the assistant). The GRPO loss is the clipped surrogate:
LGRPO(θ;A)=−1∑i∣A(i)∣∑i=1n∑t∈A(i)min⁡(ρt(i)A^(i),  clip⁡(ρt(i),1−ϵ,1+ϵ)A^(i)),L_{\text{GRPO}}(\theta; A) = - \frac{1}{\sum_{i}|A^{(i)}|} \sum_{i=1}^{n} \sum_{t \in A^{(i)}}
\min\Big(
\rho^{(i)}_t \hat{A}^{(i)},\;
\operatorname{clip}\big(\rho^{(i)}_t, 1-\epsilon, 1+\epsilon\big) \hat{A}^{(i)}
\Big),LGRPO​(θ;A)=−∑i​∣A(i)∣1​i=1∑n​t∈A(i)∑​min(ρt(i)​A^(i),clip(ρt(i)​,1−ϵ,1+ϵ)A^(i)),
where the importance ratio ρt(i)=pθ(xt(i)∣x<t(i))/pold(xt(i)∣x<t(i))\rho^{(i)}_t = p_\theta (x^{(i)}_t \mid x^{(i)}_{<t}) / p_{\text{old}}(x^{(i)}_t \mid x^{(i)}_{<t})ρt(i)​=pθ​(xt(i)​∣x<t(i)​)/pold​(xt(i)​∣x<t(i)​) measures how much the new policy deviates from the old one at token ttt. The clip operation, with typical ϵ=0.2\epsilon=0.2ϵ=0.2, limits destructive updates. Crucially, the double summation runs exclusively over t∈A(i)t\in A^{(i)}t∈A(i); every environment observation token — game states, tool outputs, system messages — is excluded from this sum. The model receives no gradient for the tokens in the observation set O(i)O^{(i)}O(i).
This design reflects a standard RL assumption: the environment is beyond the agent’s control, so we cannot, and should not, change the probability of an observation token that was generated by a simulator. However, when the only reward signal is a sparse terminal binary, the cost of ignoring observations becomes severe. During training, many rollouts end in failure. In a group where every rollout receives r(i)=0r^{(i)}=0r(i)=0, the group‑normalized advantage is exactly zero for all samples, and the loss gradient vanishes. The policy update collapses to a no‑op — the model learns nothing from that entire batch of interactions.
When a group does contain one or more successes, the normalization creates contrast: successes get a positive advantage, while failures receive a uniform negative advantage. All failure rollouts are pushed down by the same magnitude, regardless of why they failed. There is no mechanism to credit or blame specific actions or to differentiate between failures that were almost successful and those that were hopeless from the start. Even more troubling, the model trains almost exclusively on the handful of successful trajectories, because those are the only ones that can receive a positive gradient. The rich information contained in the failed rollouts — the observation sequences that might signal where things went wrong — is completely discarded.
The visual below (Figure 3) distills this GRPO recap into a single, equation‑centric slide. The core loss function appears centered in large LaTeX, with the importance ratio definition rendered succinctly underneath. Three red‑highlighted bullet points at the bottom spell out the sparsity consequences: no gradient on observation tokens, zero‑contrast groups when all rewards are identical, and uniform negative signal for all failure rollouts. This compact summary reinforces the key insight: standard GRPO, when applied to terminal tasks, squeezes its entire learning signal through a tiny number of action‑level updates that see a non‑zero advantage, leaving a huge volume of environment feedback unutilized. The result is extremely sample‑inefficient learning — a problem that ECHO directly addresses by converting terminal feedback into dense supervision.

4. Failure Case: Signal Discarded in Failed Rollouts

Having revisited the mechanics of GRPO, we can now scrutinize exactly where its advantages stop short. The policy gradient estimator in GRPO is computed only over a predefined set of tokens—the action tokens AAA. All other tokens that the model sees and must interpret, such as environment feedback, fall into the complement set OOO. The loss, recapped from the previous discussion, is
LGRPO(θ;A)=−1∑i∣A(i)∣∑i∑t∈A(i)min⁡ ⁣(ρt(i)A^(i),  clip(ρt(i),1−ϵ,1+ϵ) A^(i)).L_{\text{GRPO}}(\theta;A) = -\frac{1}{\sum_i |A^{(i)}|} \sum_i \sum_{t\in A^{(i)}} \min\!\big(\rho^{(i)}_t \hat{A}^{(i)},\; \text{clip}(\rho^{(i)}_t, 1-\epsilon, 1+\epsilon)\,\hat{A}^{(i)}\big).LGRPO​(θ;A)=−∑i​∣A(i)∣1​i∑​t∈A(i)∑​min(ρt(i)​A^(i),clip(ρt(i)​,1−ϵ,1+ϵ)A^(i)).
The outer summation runs over rollouts, and the inner summation over only those time steps where the token belongs to AAA. For any token position ttt whose token is not an action token—for instance, an environment observation or a prompt continuation—there is simply no term in the loss. Backpropagation never visits those positions; their contribution to the model’s log‑probabilities receives zero gradient.
This design is intentional—we don’t want to “train” the model to predict what the environment produced as if it were the agent’s own output—but it introduces a severe sparsity problem when rollouts fail. Consider a typical debugging scenario: the agent is asked “Find error messages in logs.” and it suggests a command. The generated action tokens are grep ERROR ./logs/. The terminal then prints: grep: ./logs/: No such file or directory. In the sequence that the model processes, the initial prompt is separate, the action tokens belong to AAA, and the terminal error output belongs to OOO. GRPO will compute the policy gradient only on the handful of tokens in grep ERROR ./logs/. The much longer (and far more informative) error message—grep:, ./logs/:, No, such, file, or, directory—receives no learning signal whatsoever.
That would be acceptable if the advantage signal on the action tokens were rich enough to teach the model what went wrong. But in a failed rollout the reward is typically r=0r=0r=0, and after group‑normalization the advantage A^\hat{A}A^ for that rollout is negative. Because GRPO applies this single scalar advantage uniformly to every action token, the entire action block is penalised, but without any token‑level differentiation. The model learns that this whole command was bad, but it has no way of knowing that the mistake was the assumption that ./logs/ exists as a directory, not, say, the use of grep or the pattern ERROR. The error message itself contains that nuance—it encodes file‑system constraints—and yet it is completely discarded.
The downstream consequence is a massive waste of supervision. In a typical agent‑environment rollout, the number of tokens belonging to OOO can far exceed the number of action tokens. The prompt is often short; the environment’s responses—error messages, file listings, stack traces—can be long. Each turn the agent makes, the model processes a wall of observation text that carries structured information about the problem state. When those tokens never receive gradient, the model cannot learn to anticipate the consequences of its actions or to interpret the environment’s signals more effectively. The only tokens that ever drive learning are the few action tokens, and for failed trajectories those tokens are updated with a coarse, uninformative negative signal.
Over many episodes, this sparsity dramatically slows down sample efficiency. The model must see a huge number of failed rollouts before it stumbles upon a correct one, because the informative feedback within each failure is wasted. Even then, the successful rollout’s positive advantage on its action tokens is again a uniform scalar—it tells the model “this exact sequence of actions is good,” but it doesn’t carve out understanding of why the environment responded the way it did. The learning is brittle and inefficient.
The visual below crystallises this waste. It shows a token‑coloured bar for the example rollout: the prompt in gray, the action tokens in blue, and the observation tokens in red. Brackets delineate the sets AAA and OOO. A red cross and the label “No gradient” mark the large red block, making it instantly clear that all those informative tokens are untouched by the loss. The right‑side table quantifies the imbalance: approximately 5 action tokens receive a gradient, while about 8 observation tokens do not, and in real sequences that ratio is far more lopsided. The annotation at the bottom—“GRPO applies loss only to blue tokens”—is a compact reminder of the root cause. This diagram takes the algebraic statement of the loss and renders it as a tangible picture of wasted information, preparing us to ask the central question: can we turn those red tokens into dense supervision?

5. Core Intuition: Observation Tokens as Supervision

In the previous section we examined a painful failure mode: when a reinforcement‑learning agent produces a rollout that never contacts a reward signal, the entire trajectory is discarded — no gradient flows to the tokens that caused the mistake. Standard GRPO, despite its formal sophistication, operates under the assumption that only rewarded actions provide a meaningful learning direction. For a terminal‑control agent that interacts with an operating system through a shell, this means that a rollout ending in an error message, a non‑zero return code, or an unexpected directory listing is effectively wasted. Yet those failed attempts contain a wealth of implicit knowledge about the environment that the policy desperately needs to learn.
Ilya Sutskever once remarked, “Predicting the next token well means you understand the underlying reality that led to the creation of that token.” The insight runs deeper than the famous quote suggests: the training objective of next‑token prediction, when applied to sequences generated by a structured external process, forces the model to construct an internal representation of the generative mechanism itself. For a language model that produces commands in a terminal, the environment observations — error messages, file listings, standard output, return codes — are not arbitrary text; they are the deterministic, token‑level fingerprints of an OS state. The command ls in a certain directory yields a specific set of filenames, and gcc main.c -o main under an outdated compiler version yields a predictable error pattern. Every token of the observation is a causal consequence of the preceding action and the hidden state of the system.
This perspective suggests a radical reframing of the agent’s training objective. Instead of treating observation tokens as inert context that merely conditions future action distributions, we can treat them as supervised prediction targets. If the policy can be trained to forecast the exact bytes the OS will return, then it must internalize command semantics, file‑system constraints, error formats, and even the quirks of specific tools. The act of predicting "Permission denied" after rm /etc/passwd teaches the agent about file permissions without any crafted reward function; predicting the output of find . -name "*.py" teaches it to model directory structures. This is dense, token‑level feedback that the terminal already provides for free — every observation becomes a training signal, regardless of whether the rollout eventually receives a sparse downstream reward.
The ECHO hypothesis is built directly on this reframing: augment the standard policy‑gradient loss LGRPO(θ;A)L_{\text{GRPO}}(\theta; A)LGRPO​(θ;A) with an auxiliary cross‑entropy loss LEnv(θ;O′)L_{\text{Env}}(\theta; O')LEnv​(θ;O′) computed over a carefully chosen subset O′O'O′ of the observation tokens:
LECHO(θ)=LGRPO(θ;A)+λ LEnv(θ;O′)L_{\text{ECHO}}(\theta) = L_{\text{GRPO}}(\theta; A) + \lambda\, L_{\text{Env}}(\theta; O')LECHO​(θ)=LGRPO​(θ;A)+λLEnv​(θ;O′)
Here AAA denotes the GRPO advantages, and λ\lambdaλ controls the trade‑off between reward‑seeking behaviour and world‑model fidelity. The subset O′O'O′ is critical: not every observation token is equally informative. As we will explore in later sections, warning tokens that are purely cosmetic, highly repetitive, or unrelated to the underlying task logic are excluded to prevent the auxiliary loss from diluting the policy’s focus. But the core idea is independent of the filtering details: we transform a sparse, terminal‑only reward problem into a mixed objective that provides token‑level supervision on every environment turn.
The consequences are immediate and powerful. First, the gradient now reaches the policy at every observation token, not only at action positions where an advantage‑weighted log‑probability is computed. In standard GRPO the model receives no gradient on the tokens it produced that led to a failed outcome; with ECHO, the model is penalized for not predicting the error message that actually appeared, forcing it to adjust the action‑generation distribution that caused it. Second, the auxiliary loss acts as a form of state representation learning, pushing the policy to build a rich latent model of the environment’s transition dynamics just by mimicking the ground‑truth outputs. Finally, the most dramatic qualitative shift is that failed rollouts are no longer wasted. A rollout that ends with command not found is now a valuable supervised example: the agent learns the mapping between a misspelled tool name and the corresponding error pattern, a lesson that would otherwise vanish into a zero‑advantage gradient.
The visual that follows distills this core intuition into a compact schematic. At the top, Sutskever’s observation sits as an italicized quote, directly linking the philosophical motivation — understanding reality through prediction — to the technical proposal. Below it, three bullet points unpack what it means for a terminal agent: every observation is a token sequence emitted by the OS state; learning to predict those tokens forces the policy to absorb command semantics, filesystem rules, and error formats; therefore, the auxiliary loss turns the environment output into a dense stream of supervision. On the right, a highlighted equation block displays the ECHO loss decomposition LECHO=LGRPO+λLEnvL_{\text{ECHO}} = L_{\text{GRPO}} + \lambda L_{\text{Env}}LECHO​=LGRPO​+λLEnv​, with a faint arrow connecting the quote to the equation, reinforcing the idea that “understanding reality” maps directly onto the prediction objective. The bottom compares gradient flow: the standard GRPO path only touches action positions, while ECHO pushes gradient through all observation tokens, with a red indicator for the now‑activated supervision signal on failed rollouts. Together, the diagram serves not as a stand‑alone lesson, but as a one‑glance consolidation of why plugging a next‑token prediction loss into the agent’s training loop transforms sparse terminal feedback into a world‑modeling engine.

6. ECHO Loss Definition

The GRPO family of algorithms trains an agent policy through sparse terminal rewards: the learner receives a single scalar judgment only after a multi‑step trajectory concludes. This sparsity leaves vast stretches of the trajectory—environment observations, intermediate outputs, partial results—uninformed by any explicit learning signal. The core intuition we developed earlier is that observation tokens, the raw text the environment prints in response to the agent’s actions, carry dense information about whether the trajectory is on a successful path. If the model can be made to anticipate those observations, it must internalize the dynamics of the environment and the consequences of its own actions. The question becomes how to turn that insight into a concrete, differentiable loss that can be minimized alongside the policy gradient.
We formalize this idea by introducing an environment prediction loss, LEnvL_{\text{Env}}LEnv​. Let the full sequence of tokens in a trajectory be x1,x2,…,xTx_1, x_2, \dots, x_Tx1​,x2​,…,xT​. The subset that corresponds to environment observations is denoted OOO. Within that subset, we further select a refined set O′⊆OO' \subseteq OO′⊆O of tokens that carry meaningful feedback about the task, deliberately excluding noisy or uninformative content such as routine warnings or diagnostic messages that do not correlate with eventual success. The loss is defined as the average negative log‑likelihood of the model on those selected tokens, conditioned on all preceding tokens:
LEnv(θ;O′)=−1∣O∣∑t∈O′log⁡pθ(xt∣x<t)L_{\text{Env}}(\theta; O') = -\frac{1}{|O|} \sum_{t \in O'} \log p_\theta(x_t \mid x_{<t})LEnv​(θ;O′)=−∣O∣1​t∈O′∑​logpθ​(xt​∣x<t​)
The normalization uses the total size of the observation set, Z=∣O∣Z = |O|Z=∣O∣, rather than the size of the filtered set O′O'O′. This choice is subtle but important. By dividing by a larger denominator that is constant across trajectories, the loss magnitude remains stable and comparable to the policy gradient term LGRPOL_{\text{GRPO}}LGRPO​, even when the number of selected tokens O′O'O′ varies. If we normalized by ∣O′∣|O'|∣O′∣, the loss scale could fluctuate wildly depending on how many warning tokens were dropped, introducing instability into optimization. The current design ensures that the auxiliary loss contributes a consistent, controlled gradient signal.
With the environment prediction loss in hand, the ECHO objective combines it with the standard GRPO policy gradient:
LECHO(θ)=LGRPO(θ;A)+λ LEnv(θ;O′)L_{\text{ECHO}}(\theta) = L_{\text{GRPO}}(\theta; A) + \lambda\, L_{\text{Env}}(\theta; O')LECHO​(θ)=LGRPO​(θ;A)+λLEnv​(θ;O′)
Here λ\lambdaλ is a scalar hyper‑parameter that balances the two objectives. When λ=0\lambda = 0λ=0, we recover vanilla GRPO. Increasing λ\lambdaλ injects progressively denser supervision, encouraging the model to predict environment feedback at every observation step, not just to chase the terminal reward. This joint minimization can be seen as a form of multi‑task learning where the shared representation must simultaneously support optimal action selection and accurate environment modeling.
A key practical advantage is that both losses are computed from a single forward pass through the network. For each trajectory, we collect the log‑probabilities at all positions t∈O′t \in O't∈O′ during the ordinary rollout; no additional model evaluations, teacher distillations, or extra rollouts are required. The gradient of LEnvL_{\text{Env}}LEnv​ flows back through exactly the same transformer weights as the gradient of LGRPOL_{\text{GRPO}}LGRPO​. This means ECHO imposes no inference overhead and negligible implementation complexity—one simply aggregates the cross‑entropy terms on selected observation tokens and adds them to the policy loss.
Why is it beneficial to exclude warning tokens from O′O'O′? In many interactive environments, warnings are emitted deterministically in certain states but carry no predictive power about eventual failure or success. For example, a compiler warning about an unused variable might appear in both correct and incorrect solutions. Including such tokens in the auxiliary loss would encourage the model to predict them regardless of trajectory quality, effectively adding noise to the learning signal and potentially distracting the policy from more consequential cues. By restricting O′O'O′ to tokens that genuinely signal task progress—such as test pass/fail messages, error codes, or final output summaries—the dense supervision focuses the model on the environmental dynamics that matter.
The introduction of this auxiliary loss transforms the learning dynamics. Instead of waiting until the trajectory’s end to receive any gradient information, the model now gets frequent updates that nudge it toward trajectories where the environment yields favorable observations. In early training, this dramatically reduces the variance of the policy gradient estimate and accelerates the discovery of successful action sequences. Even in later stages, maintaining the pressure to predict observations helps the policy avoid degenerate solutions that would otherwise exploit reward hacking, because the environment model must remain self‑consistent.
The slide that accompanies this section—titled “ECHO Loss Definition”—distills these ideas into a compact, at‑a‑glance visual. It displays the two central equations prominently, with the environment prediction loss annotated by its normalization rationale Z=∣O∣Z = |O|Z=∣O∣ and a note about filtering to O′O'O′. The joint objective follows directly below, marked by the weighting coefficient λ\lambdaλ. A concise callout reinforces that both losses are evaluated from a single forward pass, requiring zero extra computation. By seeing the two losses side by side, the reader can immediately grasp how ECHO augments GRPO without altering its core machinery. The handwritten, diagram‑style layout emphasizes the additive relationship and the flow of gradients through a shared network, making the theoretical definition feel tangible and ready for implementation.

7. Shared Forward Pass and Gradient Flow

Understanding the ECHO loss definition is only the first step; a practical algorithm must also ensure that adding dense auxiliary supervision does not come at the cost of exploding compute or brittle multi‑stage training. In standard actor–critic or GRPO pipelines the agent is trained exclusively on the tokens it emits—actions and thoughts—while the environment’s responses (observations) serve merely as conditioning context for the next step. If we now ask the model to also predict those observation tokens, a naive implementation might require a separate forward pass for the dense loss, maybe even a second model, which would double memory and time. ECHO avoids that entirely by reusing the same forward pass that already produces the policy logits, turning it into a single integrated computation graph.
The central idea is mechanically simple: the agent processes the full trajectory—interleaved action tokens and environment-observation tokens—through the language model exactly once. The model’s output logits are produced at every position, including those corresponding to environment tokens. For GRPO, we collect the log‑probabilities of the action (and thought) tokens under the policy, compute the advantage‑weighted loss, and discard the rest. For ECHO, we take the logits at the positions of the environment tokens and compute the auxiliary cross‑entropy loss against the ground‑truth observations, but crucially we mask out warning tokens as explained earlier. The two losses are then simply added together with a balancing coefficient λ\lambdaλ:
Ltotal=LGRPO+λ LECHO.\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{GRPO}} + \lambda \, \mathcal{L}_{\text{ECHO}}.Ltotal​=LGRPO​+λLECHO​.
Because both losses are constructed from the very same logits tensor, the entire forward pass is shared. Gradient computation then flows backward from this combined loss through a single computation graph; there is no extra teacher model, no separate inference step, and no need to replay the trajectory.
This shared forward pass has several deep implications for training dynamics. First, it makes the auxiliary supervision computationally cheap. At each training step we already pay the cost of running the model on the full sequence; adding LECHO\mathcal{L}_{\text{ECHO}}LECHO​ only introduces an extra cross‑entropy summation term, which is negligible compared to the forward pass itself. Second, the shared computation couples the two learning signals tightly: the hidden representations that the model builds while attending over the trajectory must simultaneously serve the policy’s decision‑making and the prediction of environment states. That coupling acts as an implicit regularizer, discouraging representations that are useful only for the sparse terminal reward but forgetful of the environment dynamics that generated the trajectory. In agent tasks where understanding the world is essential for good actions, this inductive bias is likely to be beneficial.
However, sharing the forward pass also introduces a risk of gradient interference. The two losses may pull the model’s parameters in conflicting directions, especially early in training when the policy is poor and the environment‑prediction task might be trivial or misleading. The ECHO paper mitigates this by carefully choosing which observation tokens to target (excluding warnings) and by tuning λ\lambdaλ so that the auxiliary loss behaves as a gentle nudging signal rather than a dominant objective. In practice, the observed improvements in sample efficiency and terminal dynamics prediction suggest that the shared pass works synergistically: the model learns a richer internal world model, and the GRPO signal can focus on strategy and planning.
The visual below condenses this design into a single forward‑pass diagram. Imagine the model consuming an interleaved sequence of action and observation tokens from left to right. At each observation position, an extra head (or simply the same output projection) extracts logits that are compared against the true environment observation. Two loss bubbles branch out of the same computational trunk: one feeds the GRPO advantage loss from action tokens, the other feeds the ECHO cross‑entropy loss from selected observation tokens. Backward arrows then show the gradients recombining and flowing into the shared model backbone. The diagram makes immediately obvious that no second model nor separate inference pass exists; the whole process is a single end‑to‑end training step, turning what might have been an expensive bolt‑on supervision into a lightweight, unified gradient flow.
This architectural choice is what lets ECHO scale to long‑horizon agent tasks without compromising throughput. More importantly, because the same forward pass that evaluates the policy also generates the dense supervision targets, the agent never has to choose between environment understanding and reward optimization—the two are trained together, one step at a time.

8. Choosing Observation Targets: Env Tokens Only

Transitioning from the shared forward pass, we now face a subtle but decisive design choice: which tokens in the environment observation should actually receive the auxiliary prediction loss. Observations returned by interactive environments such as a shell or a code interpreter are not monolithic text blocks. They typically follow a structured template that includes boilerplate warning messages (e.g., “WARNINGS: …”) and the actual terminal output wrapped in a marker like &lt;command_output&gt; ... &lt;/command_output&gt;. The warning substrings are low‑entropy by construction — they repeat fixed patterns, error boilerplates, and rule‑based headers that change little across trajectories. If our auxiliary environment loss indiscriminately targets every token in the observation, the model’s learning dynamics become distorted in a way that undermines the very purpose of dense supervision.
The core problem is early memorization. Because warnings are highly regular and predictable, the language model rapidly reduces their cross‑entropy to near‑zero. Empirically, when training with the full observation token set, the cross‑entropy loss on warning tokens collapses to below 0.05 nats within roughly the first 60 gradient steps. This might sound like rapid convergence, but it is a Pyrrhic victory: once a token’s prediction probability approaches 1, the log‑likelihood gradient with respect to that token vanishes, contributing essentially zero gradient flow. Even worse, these now‑dead tokens still occupy a position in the per‑token average loss, diluting the informativeness of the overall environment loss. The model effectively stops learning anything useful about the dynamic, reward‑bearing parts of the observation because the majority of the loss signal is swamped by near‑zero terms.
To prevent this loss collapse, ECHO restricts the environment prediction target to a carefully selected subset O′⊂OO' \subset OO′⊂O, where O′O'O′ consists only of the env tokens — the content inside &lt;command_output&gt; — and excludes the warning preamble. Formally, the auxiliary loss becomes
LEnv(θ;O′)=−1Z∑t∈O′log⁡pθ(xt∣x<t),Z=∣O∣,  O′⊂O (env tokens only).L_{\text{Env}}(\theta; O') = -\frac{1}{Z} \sum_{t \in O'} \log p_\theta(x_t \mid x_{<t}), \qquad Z = |O|,\; O' \subset O \text{ (env tokens only)}.LEnv​(θ;O′)=−Z1​t∈O′∑​logpθ​(xt​∣x<t​),Z=∣O∣,O′⊂O (env tokens only).
The normalization constant ZZZ is still set to the full observation length ∣O∣|O|∣O∣, not to ∣O′∣|O'|∣O′∣. This deliberate choice preserves a consistent scaling with respect to the original sequence length (as discussed earlier in the shared forward‑pass section), ensuring that the magnitude of the gradient signal remains comparable across episodes with varying amounts of boilerplate. The model must still process all tokens but is only penalized for mispredicting the informative environment output.
The effect of this token‑type masking is striking. With O′O'O′ restricted to env tokens only, the cross‑entropy on the targeted tokens no longer plummets to zero. Instead, it plateaus at a healthy ~0.1 nats, indicating sustained uncertainty that provides a steady, meaningful gradient. The loss remains sensitive to the distribution of command outputs, keeping the model engaged with the environmental consequences of its actions throughout training. This dense, non‑degenerate signal is exactly what enables ECHO to serve as effective auxiliary supervision alongside the GRPO‑based RL objective.
The visual below consolidates this dynamic in a simple line plot of per‑token‑type cross‑entropy over training. The red dashed curve tracks the warning tokens: it plunges from around 0.5 nats to below 0.05 nats by step 60, after which it is effectively flat and gradient‑free. The blue solid curve traces the env tokens: it starts near 0.35 nats, decays gradually, and stabilizes around 0.1 nats. A vertical dotted line at step 60 marks the memorization point, making visually clear why warning tokens must be excluded. The plot provides the empirical backbone of the design choice: if we had succumbed to the seduction of quickly‑solved boilerplate and left warnings in the target set, the environment loss would have become a hollow number, and ECHO’s benefit as dense supervision would have evaporated.

9. ECHO Algorithm

Having chosen to train on environment tokens only and exclude warnings, we can now embed that selection into a clean training step. The ECHO algorithm is trivial to describe once the masks are defined: take the GRPO policy gradient computed on action tokens as usual, then add a cross‑entropy term that asks the language model to predict the ground‑truth output of the environment exactly as it appeared in the rollouts.
The motivation is straightforward. In terminal‑reward tasks, the agent only sees a sparse outcome signal at the end of a trajectory. GRPO, like any policy‑gradient method, propagates that reward backward through the action tokens, but the environment‑observation tokens—containing rich feedback from the world—receive zero learning pressure. If the model can be forced to anticipate how the environment responds to its actions, it develops an internal forward model that grounds the policy and makes better use of every sample. The ECHO loss turns that idea into a single differentiable term that re‑uses the same log‑probabilities already computed for the policy update.
Recall the standard GRPO (Group Relative Policy Optimization) step. For a sequence x1:Tx_{1:T}x1:T​ that interleaves actions and environment outputs, we run a forward pass and collect logits. For every position ttt we can compute the log‑probability of the actual token xtx_txt​ under the model:
log⁡pt=log⁡softmax(logitst)[xt].\log p_t = \log \mathrm{softmax}(\mathrm{logits}_t)[x_t].logpt​=logsoftmax(logitst​)[xt​].
In a pure GRPO update we would only use the subset {log⁡pt}t∈A\{\log p_t\}_{t \in A}{logpt​}t∈A​ corresponding to the agent’s actions, together with pre‑computed advantages {A^t}t∈A\{\hat{A}_t\}_{t \in A}{A^t​}t∈A​, and apply a clipped surrogate objective. All other positions—the environment responses—are ignored. The ECHO algorithm simply gathers the log‑probabilities for a second set of positions: the env‑observation mask O′O'O′ that we decided earlier contains only the informative environment tokens, excluding warnings. On those tokens we compute a cross‑entropy loss:
LCE=−1∣O∣∑t∈O′log⁡pt,L_{\mathrm{CE}} = -\frac{1}{|O|} \sum_{t \in O'} \log p_t,LCE​=−∣O∣1​t∈O′∑​logpt​,
where the normalizer ∣O∣|O|∣O∣ is the total length of the true environment observation in the episode (not just the masked subset), which guarantees that the scale of the auxiliary loss remains consistent across episodes with different observation lengths. The final ECHO loss is a simple blend:
LECHO=LGRPO+λ LCE.L_{\mathrm{ECHO}} = L_{\mathrm{GRPO}} + \lambda \, L_{\mathrm{CE}}.LECHO​=LGRPO​+λLCE​.
Everything comes from a single forward pass; there is no additional model evaluation, no teacher network, and no extra rollouts. The cross‑entropy term nudges the model to increase the probability of what the environment actually produced, effectively turning every environment token into a dense supervised target. Because the targets are the rollouts’ own tokens, the signal is available for every trajectory in the batch, including failed ones—no extra labelling required.
The design choices matter. Normalizing by ∣O∣|O|∣O∣ avoids under‑weighting long episodes; if we divided by ∣O′∣|O'|∣O′∣ instead, trajectories with many warning tokens would have a smaller effective loss after masking, defeating the purpose of excluding warnings. The coefficient λ\lambdaλ controls the balance; setting it too high could cause the model to overfit to the environment’s surface form at the expense of the policy, while a moderate value gives a reliable improvement in sample efficiency and terminal‑task success rates that we will see in the experiments.
This simple recipe means that any GRPO trainer can be upgraded to ECHO with just a few lines of code: after the forward pass, log‑probabilities are gathered at positions O′O'O′ and a scalar cross‑entropy is added to the policy loss before backpropagation. No extra hyperparameter beyond λ\lambdaλ is required, and the auxiliary loss works with any advantage estimator by leaving the action‑token computation unchanged.
The visual below distills the entire algorithm into a self‑contained pseudocode block. The function takes the full sequence together with the pre‑defined masks and advantages, runs the model once, computes the policy loss with ClippedGRPO, then adds the length‑normalized environment‑prediction loss, and returns their sum. The italic annotation emphasizes the remarkable simplicity: no extra model evaluations, teacher, or rollouts—only adding log‑probability gathering at positions O′O'O′. That succinct picture is the central contribution of ECHO: dense, verifier‑free supervision that costs almost nothing beyond the standard RL loop.

10. Experimental Setup

With the ECHO algorithm fully defined, the next step is to test whether its auxiliary observation-prediction objective genuinely improves agent training in realistic, sparse-reward terminal tasks. The evaluation is designed to be broad, reproducible, and deliberately tough, combining a large-scale training corpus with multiple out-of-distribution benchmarks that probe both in-domain mastery and generalization. The result is a setup that pushes GRPO-based RL recipes to their limit—and gives ECHO a demanding proving ground.
The training corpus comprises 8,770 terminal tasks, sourced from the Endless Terminals and OpenThoughts collections. These tasks span a wide range of command-line interactions: file system navigation, text processing, system administration snippets, and error recovery. Each task is effectively a mini-episode where the agent must issue a sequence of shell commands in a Docker sandbox and is rewarded only when the final environment state matches a target condition (e.g., a file contains a specific string). The sheer diversity of tasks ensures that an agent cannot succeed by memorizing action templates; it must learn to interpret environment feedback dynamically and adapt its commands across varied contexts.
The runtime constraints tighten the challenge further. Every episode unfolds inside an ephemeral Docker container, with a hard limit of 16 conversational turns and a 16k-token context window. After 16 command–observation exchanges, the episode terminates. This bounded horizon forces the agent to reason efficiently under uncertainty: a single misinterpreted error message can derail an entire attempt. It also magnifies the importance of dense learning signals, because waiting for a sparse terminal reward across 16 turns creates a weak gradient bottleneck. ECHO’s core idea—transforming the raw environment observation tokens into an immediate auxiliary loss—is a direct antidote to this sparsity.
The experiments center on three model variants, all based on the Qwen3 family. The primary RL starting point is the Qwen3‑8B model, a modern 8‑billion‑parameter language model. To isolate the effect of reinforcement learning, we also compare against a strong supervised fine‑tuning baseline: OpenThinker‑Agent‑v1‑SFT (OT‑SFT), which was trained on roughly 15k expert demonstrations of terminal‑agent interactions. Additionally, Qwen3‑14B is included to examine how scaling the base model interacts with both GRPO and ECHO. All RL training initializes from the same SFT checkpoint, placing the comparison on a common footing: any improvement beyond the SFT baseline can be squarely attributed to the RL phase and the additional ECHO signal.
The GRPO recipe itself is kept intentionally simple and robust. For each prompt, we sample n=16n=16n=16 independent rollouts, generating a group of terminal trajectories whose outcomes are ranked via a task‑specific verifier. The batch collects 16 prompts, so every update step sees 256 trajectories, providing a rich ensemble for advantage estimation. The optimizer uses a learning rate of 1×10−61\times10^{-6}1×10−6 and gradient clipping at 0.2 to prevent extreme parameter updates. Notably, no KL penalty is applied: we found that the combination of clipping and the large rollout group naturally kept the policy within a stable region without the need for an explicit divergence term, simplifying the loss landscape and letting ECHO operate on equal footing with standard GRPO.
The ECHO auxiliary loss is added to the policy gradient objective with a coefficient λ=0.05\lambda = 0.05λ=0.05, a value selected after sweeping over {0.001,0.005,0.01,0.05,0.1,0.2}\{0.001, 0.005, 0.01, 0.05, 0.1, 0.2\}{0.001,0.005,0.01,0.05,0.1,0.2}. A coefficient this small ensures that the primary RL signal—the terminal reward—remains the dominant training force, while the observation-prediction term gently injects gradients at every token position. Larger values risk over‑regularizing the policy towards mimicking exact environment outputs and can degrade exploration; the chosen 0.05 struck the best balance in preliminary validation, accelerating convergence without sacrificing final pass rates.
Generalization is measured across four distinct evaluation suites. val100 contains 100 held‑in‑distribution tasks that were never seen during training, gauging in‑domain reliability. ITD (71 tasks) and OpenThoughts‑TBLite (100 tasks) are entirely out‑of‑distribution, with novel command vocabularies and unseen compositional structures. Finally, TerminalBench‑2.0 (89 tasks) serves as a standardized, community‑recognized benchmark for terminal agent performance. Together these suites form a rigorous stress‑test: a method that improves only in‑distribution but fails on OOD tasks is of limited practical interest. ECHO is expected to deliver consistent gains across all of them by teaching the agent a transferable skill—reading and anticipating environment state.
Every experiment runs for 500 GRPO steps on 8 GPUs, a deliberately modest compute budget that reflects a practical, resource‑conscious setting. The entire setup underscores that ECHO is not merely evaluated in isolation but pitched against robust baselines under realistic constraints.
The accompanying diagram condenses this multi‑faceted experimental design into a single, glanceable table. On the left, bold Setting labels organize the configuration into logical groups—task corpus, runtime, models, RL recipe, the ECHO coefficient, evaluation benchmarks, and compute. The right column delivers crisp, concrete values, with numeric quantities set in a monospaced font for immediate readability. This structured layout makes it effortless to cross‑reference the scale of data, the training hyperparameters, and the exact evaluation domains as the upcoming sections present empirical comparisons. It functions as a compact anchor: everything needed to understand how ECHO was tested is captured in one place, ready to support the pass‑rate figures and learning curve analyses that follow.

11. ECHO Doubles TerminalBench-2.0 Pass Rate

With the experimental frame in place, we turn to the central quantitative question: does ECHO’s dense auxiliary supervision actually improve agent performance on hard, out-of-distribution terminal tasks? The answer, captured on TerminalBench‑2.0, is unambiguous. Even a casual glance at the headline numbers reveals that ECHO nearly doubles the pass@1 of standard GRPO—a result that holds across two model sizes and multiple evaluation protocols.
The TerminalBench‑2.0 suite consists of 89 carefully curated OOD problems that stress‑test an agent’s ability to reason through multi‑step interactions and produce the correct terminal action. Because the final reward is binary and arrives only after the last environment step, credit assignment under vanilla GRPO becomes extremely sparse. ECHO injects a dense learning signal by treating the sequence of environment observations as targets for an auxiliary cross‑entropy loss, forcing the policy model to build an internal terminal world model. That side‑channel forces the model to pay attention to the unfolding state dynamics, not just to a handful of reward‑bearing tokens. The empirical payoff is dramatic.
For the Qwen3‑8B model, GRPO alone achieves a pass@1 of 2.70; ECHO lifts this to 5.17, a 1.9× improvement. Scaling up to the 14‑billion‑parameter variant, GRPO’s 5.17 jumps to 10.79 under ECHO—a 2.1× multiplier. These are not cherry‑picked numbers from a single favorable split. The same pattern emerges consistently across val100, ITD, and TBLite evaluation slices, each of which stresses a different aspect of generalization. The robustness of the gain argues that the benefit comes from better grounding in environment dynamics rather than from exploiting peculiarities of a specific test distribution.
Perhaps even more instructive than the final scores are the learning curves. Standard GRPO displays the agonisingly slow climb characteristic of sparse‑reward RL: many thousands of steps of virtually flat performance before a gradual lift‑off. ECHO’s training trajectory paints a different picture. From early in training the orange curve separates visibly from the dashed blue GRPO baseline, and the gap widens steadily. The improvement is not merely asymptotic; it is kinetic.
For the 8B model, ECHO reaches the final pass@1 plateau that GRPO ultimately attains after roughly 1.5–2.3× fewer environment steps. In practical terms, this means that an RL practitioner could cut training time by more than half and still match the best that sparse‑reward optimisation can offer—or continue training to a substantially higher final performance. The 14B model shows a similar acceleration, though the speedup is partly masked by the fact that larger models already generalise better and benefit from a higher starting point. Still, the orange asymptote sits clearly above the blue one, confirming that ECHO’s supervision continues to pay dividends long after the initial burst of learning.
The visual below distills this evidence into a single at‑a‑glance comparison. A two‑panel plot—one per model size—displays pass@1 versus GRPO training steps, with shaded ±1 standard‑deviation regions to give a sense of run‑to‑run variance. An annotation on the 8B panel explicitly marks the point where ECHO matches GRPO’s final performance in roughly half the steps. On the right, a compact table isolates the terminal pass@1 numbers for both models, showing the multiplier effect in bold. A small caption reminds us of the abysmal before‑RL baselines (1.35 for 8B, 3.37 for 14B), underscoring just how steep the cliff is that RL must climb—and how much ECHO flattens it.
What emerges is a consistent narrative: by converting terminal feedback into dense, observation‑level supervision, ECHO effectively solves the credit‑assignment bottleneck that holds back standard GRPO. The framework not only raises the performance ceiling; it also brings the agent to that ceiling much faster, a dual advantage that carries profound implications for scaling agent RL.

12. Evidence of Terminal World Model Learning

While the previous results show that ECHO substantially lifts pass rates, a skeptic might ask whether this improvement stems from a genuine understanding of the environment’s dynamics or merely from better exploitation of the terminal reward signal. After all, a policy that learns to act successfully need not have formed an internal model of why those actions succeed—it may simply be a reactive function mapping observations to actions without predicting future observations. In embodied agent settings, however, a reliable policy should anticipate the consequences of its actions; in other words, it should develop a terminal world model that captures how the environment responds over a trajectory. ECHO’s auxiliary objective directly encourages this by penalizing the model when it mispredicts the tokens that the environment itself generates. To isolate whether this supervision actually teaches dynamics, we need a clean evaluation that separates task success from predictive understanding.
A direct way to probe the model’s grasp of environment dynamics is to measure how well it predicts the tokens of held‑out trajectories that it has never seen, and that were generated by a different, stronger policy. This off‑policy setup removes any confounding effects of the model’s own action distribution: if the model truly internalizes the generative rules of the environment, it should be a good predictor even on trajectories produced by another agent. We therefore collect trajectories from Qwen3‑32B, a model significantly larger and more capable than the 8B and 14B policies under study. These trajectories contain the same observation/action structure from TerminalBench tasks but were produced by a policy that the trained models never encountered during their own reinforcement learning. The test slices include the in‑distribution val100 set and two additional evaluation spreads, ITD and TBLite, which together cover varying levels of task diversity.
The metric we use is the per‑token cross‑entropy (CE) over observation positions O′O'O′:
CE=−1∣O′∣∑t∈O′log⁡pθ(xt∣x<t).\text{CE} = -\frac{1}{|O'|}\sum_{t\in O'} \log p_\theta(x_t \mid x_{<t}).CE=−∣O′∣1​t∈O′∑​logpθ​(xt​∣x<t​).
By computing the model’s log‑probability only on the tokens that correspond to environment outputs (e.g., system messages, command results, file listings), we directly quantify how surprised the policy is by the world’s response. A low CE means the model expects the environment to produce exactly those tokens, a hallmark of an accurate internal dynamics model. Crucially, this evaluation does not involve any reward signal; it purely tests next‑token prediction accuracy on an off‑policy, purely observational set of subsequences.
Comparing the variants yields a striking dissociation. GRPO alone—which optimizes only the terminal reward—barely moves the CE needle relative to the base model, even when its task success rate climbs (as shown in the earlier slide). In other words, the standard RL‑tuned policy can learn to act effectively without learning to predict what the world will do next. This finding reveals a blind spot in pure reward‑maximization: it can produce competent agents that nevertheless lack a robust mental model of their environment, making them brittle under distribution shift.
In stark contrast, ECHO slashes cross‑entropy across every evaluation slice. For the 14B model, CE on the val100 slice falls from 0.24 to 0.07; for the 8B model, from 0.28 to 0.11. On ITD the drop goes from 0.39 to 0.31, and on TBLite from 0.30 to 0.23. These are substantial reductions—in many cases cutting the prediction error by more than half—demonstrating that the auxiliary observation‑level loss indeed trains the policy to anticipate terminal feedback. The fact that the improvement is largest on the in‑distribution val100 slice (which shares the same task family as the training set) but still pronounced on the other slices confirms that ECHO learns transferable dynamics, not a brittle memorization of training noise. The model has acquired a genuine, reusable environment predictor.
It is worth pausing on what this implies. ECHO’s dense supervision turns the policy into a terminal world model as a side effect of its training. The model no longer merely reacts to environment tokens; it internalizes their generative structure. This property is not only theoretically pleasing—it directly supports more stable data‑efficient learning and opens the door to zero‑reward adaptation, which we will examine next. The contrast between GRPO and ECHO underscores that rich, token‑level learning signals act as a catalyst that transforms a policy optimizer into a dynamics learner.
The visual that accompanies this analysis consolidates the full set of CE measurements into a compact, comparative format. An Excalidraw‑style multi‑panel bar chart is arranged with two groups—one for each model scale (8B and 14B)—and within each group, three clusters of three bars represent the val100, ITD, and TBLite slices. The bars for the base model and GRPO are nearly overlapping, colored in muted gray and orange, while the ECHO bars in green are dramatically shorter, with downward arrows emphasizing the magnitude of the drop. The sparse, hand‑drawn aesthetic ensures the core story is absorbed at a glance: dense observation supervision teaches the model to expect what the environment will do, whereas pure RL, however successful on task metrics, teaches little more than how to get the reward. The figure’s large, readable labels and equation callout tie the visual back to the CE formula, making it both a summary of the results and a prompt for the informed reader to appreciate the deeper claim—ECHO yields policies with a genuine, testable understanding of terminal dynamics.

13. Reducing Dependence on Expert Demonstrations

The previous section showed that ECHO training on terminal outcomes induces a rich terminal world model: the policy learns to forecast the future state of the environment without any auxiliary prediction head. This emergent capability suggests that ECHO is compressing environment dynamics into the policy itself. But an immediately practical question follows: if the policy is already internalizing the environment through ECHO, does it still need the kind of environmental familiarity that is typically injected via expensive expert demonstrations? For agent tasks where the reward signal only arrives at the end of a multi-step trajectory, the standard recipe today is supervised fine‑tuning (SFT) on expert trajectories, followed by GRPO‑based reinforcement learning. The SFT phase teaches the model what a successful rollout looks like, giving it a crucial head start in a sparse‑reward world. However, collecting high‑quality demonstrations can be costly or impossible in new domains. ECHO’s dense, step‑level supervision from environment observations raises the possibility that we can largely eliminate that dependency.
To quantify this, the authors define two gaps that measure the benefit of an intervention over a baseline that only uses GRPO on frozen base‑model weights. The SFT gap is the performance improvement of expert‑SFT + GRPO over base + GRPO. This is the traditional lift that expensive demonstrations provide. The ECHO lift is the improvement of base + GRPO + ECHO over base + GRPO. If ECHO can match the SFT gap, then the model achieves the same terminal‑task performance without any expert trajectories. The ratio of these two quantities, expressed as a percentage, answers the headline question: What fraction of the expert‑demonstration advantage is recovered by auxiliary environment prediction?
Experiments on Qwen3‑8B reveal a striking result. Across three standard internal benchmarks—val100, ITD, and TBLite—ECHO closes the SFT gap almost entirely. On val100 it recovers 101.6%, on ITD 103.9%, and on TBLite 88.9%. These numbers indicate that for most tasks in these suites, the environment familiarity gained by observing and imitating expert rollouts can be replaced wholesale by the dense autoregressive objective that forces the policy to predict the next observation in successful trajectories. The policy, forced to model the consequences of its own actions, learns to navigate the environment without a single human or scripted demonstration.
The story changes when we move to the more challenging TerminalBench‑2.0, which contains harder planning and tool‑use tasks. Here the recovery percentages sit at 50.0% for pass@1, 48.6% for pass@3, and 50.0% for pass@5. ECHO still provides half of the SFT advantage, a substantial fraction, but the remaining gap signals a deeper requirement. Harder tasks demand not only knowledge of what happens next in the environment, but also which strategic action sequences are likely to succeed—a kind of procedural know‑how that is directly embedded in the expert demonstrations themselves. ECHO’s auxiliary loss does not explicitly teach trajectory‑level planning; it teaches local dynamics, leaving the policy to discover effective action selection through RL. The half‑recovery therefore suggests that while ECHO internalises the world model, the action model still benefits from seeing expert choices.
A visual summary of these experiments brings the quantitative comparison into focus. The diagram below depicts a horizontal bar chart for each evaluation benchmark, contrasting the SFT gap and the ECHO lift side‑by‑side. Each bar is conceptually split: the lower portion (in blue) represents the ECHO lift relative to base GRPO, and the upper portion (in light grey) represents whatever residual advantage expert SFT still holds over ECHO. When the blue bar fills the entire span, recovery exceeds 100%—ECHO actually surpasses the SFT variant. The dashed vertical line marking 100% recovery makes it immediately obvious that val100 and ITD cross that threshold, TBLite comes close, and TerminalBench‑2.0 hovers around 50%. The annotated percentages on each bar (101.6%, 103.9%, 88.9%, 50.0%) keep the numerical precision while the graphical encoding conveys the pattern at a glance.
What the reader takes away from this figure is twofold. First, ECHO is an extraordinarily effective substitute for expert demonstrations in many terminal‑task domains, essentially making SFT optional rather than mandatory. Second, the residual gap on harder benchmarks is not a failure of ECHO’s world‑model learning but a sign that purely environment‑side supervision cannot fully replace the action‑distribution guidance of expert trajectories—a limitation that invites future methods to combine dense observation prediction with some form of strategic imitation. For the practitioner, this means that when expert data is scarce or expensive, turning on ECHO recovers the vast majority of the benefit at a fraction of the cost, while pointing to the hardest cases where a handful of expert examples might still be worth their weight.

14. Verifier-Free Adaptation from Environment Prediction Alone

The previous results showed that augmenting GRPO with an auxiliary environment‑observation loss substantially improves pass@k, terminal‑state prediction, and sample efficiency. Yet the policy remained trained with both the reward signal and the dense supervision from ECHO. A natural next question is whether the dense environment signal is strong enough to stand on its own – can a capable agent bootstrap further improvement using only its own predictions about the environment, with no reward whatsoever?
That question matters because, if environment‑prediction alone can lift performance, we have a path toward verifier‑free adaptation: an agent that continues to refine its behavior simply by practising its ability to anticipate what the environment will return, without needing a hand‑crafted or learned verifier. In the ECHO study, the authors isolate this effect by taking the best Qwen3‑8B+ECHO checkpoint and running 100 additional gradient steps where the GRPO term is masked entirely – only the environment loss is active. The environment loss is the auxiliary cross‑entropy computed over a well‑chosen set of observation tokens:
LEnv(θ)=−1Z∑t∈O′log⁡pθ(xt∣x<t),Z=∣O′∣L_{\text{Env}}(\theta) = -\frac{1}{Z} \sum_{t \in O'} \log p_\theta(x_t \mid x_{<t}), \qquad Z = |O'|LEnv​(θ)=−Z1​t∈O′∑​logpθ​(xt​∣x<t​),Z=∣O′∣
Here O′O'O′ denotes the set of environment‑observation tokens that were excluded from the warning tokens (system‑error messages and other uninformative outputs) as described in earlier slides. By forcing the model to assign high probability to the actual environment outputs, LEnvL_{\text{Env}}LEnv​ acts as a dense, self‑supervised push toward a more accurate internal world model – one that captures the causal link between actions and their consequences.
What happens under this pure environment‑prediction regime? The in‑distribution validation set (val100) gains +3.8 percentage points in success rate even when all trajectories are used, without any filtering. This is a striking result: without a single scalar reward, the agent improves on tasks that resemble those seen during reward‑based training. The more discriminating test comes from three out‑of‑distribution (OOD) task suites – PyTerm (928 synthetic Python tasks), ITD (a diverse instruction‑following benchmark), and TBLite (a table‑based environment). For OOD evaluation, it is necessary to apply a filter: trajectories that contain parse errors or tool‑execution failures are discarded before averaging the success rate. The reason is that invalid trajectories produce garbled or empty environment observations, and forcing the model to predict garbage distorts its latent representations. After filtering, the gains become even more pronounced: PyTerm jumps by +10.0 pp, and ITD improves by +5.2 pp. In stark contrast, TBLite degrades.
The contrasting behaviour of TBLite is illuminating. In that environment, the observations are relatively shallow – they mostly consist of formatted table snippets that change little from step to step and carry little predictive information about future success. Attempting to predict such observations does not force the model to build a richer causal model of the task; instead, it may overfit to spurious patterns or simply waste capacity. PyTerm and ITD, on the other hand, expose the model to rich, intermediate artifacts (Python shell outputs, error traces, partial results) that are tightly coupled to the underlying execution state. Successfully predicting those outputs demands an accurate mental simulation of the environment’s dynamics, which naturally transfers to better action selection – even though no external reward is ever calculated.
The visual below consolidates these empirical findings into a grouped bar chart titled Verifier‑Free Adaptation Gains (Δ success rate). The x‑axis lists the four datasets, with two contrasting bar groups: an Unfiltered bar (shown only for val100) measuring raw improvement without trajectory cleaning, and Filtered bars (for all datasets) showing the change after invalid runs are removed. The numeric gains – +3.8 for val100, a striking +10.0 for PyTerm, +5.2 for ITD, and a negative value for TBLite – are plotted on a tight y‑axis ranging from ‑5 to +12 percentage points. A legend distinguishes ECHO‑only from ECHO‑only (filtered). This layout makes immediately visible the two central takeaways: first, environment‑prediction alone can be a powerful self‑improvement mechanism when the environment is sufficiently predictive, and second, that power collapses when the observational channel becomes too impoverished to support model‑based credit assignment.

15. ECHO: Summary and Broader Implications

ECHO’s central idea is disarmingly simple: when a language model agent acts in a terminal environment, the observations it receives at each step are not just state signals—they are dense, token‑level targets that can be learned. In standard GRPO‑based training for such agents, the reward signal is sparse: a binary success indicator appears only at the very end of a trajectory, and all intermediate generations are governed purely by the policy gradient. The model never gets explicit feedback that this action, in this context, should have led to that observation. ECHO fills that gap by turning every environment‑observation token into a supervised learning target, simultaneously with the reinforcement learning signal.
To understand how this works, recall that in a terminal‑task rollout, the agent autoregressively produces an action string, the environment responds with an observation string (often containing status updates, error messages, or numeric outputs), and the process repeats until termination. Under GRPO, the loss is a clipped surrogate computed from the terminal reward advantage. ECHO simply adds an auxiliary cross‑entropy loss LEnvL_{\text{Env}}LEnv​ on the observation tokens. Critically, the observation tokens are on‑policy: they come from the environment during the rollout, so they reflect the actual consequences of the model’s own actions in that trajectory. Unlike demonstrations from an expert policy, these targets are perfectly aligned with the distribution of states the current policy visits, providing dense supervision that reduces variance and accelerates learning.
The construction of LEnvL_{\text{Env}}LEnv​ deserves careful attention. For a trajectory with observation sequence O′O'O′, the loss is the standard log‑likelihood of predicting each token in O′O'O′ given the preceding context—but only for positions that belong to the observation, not the model’s own action tokens. Moreover, the authors report that excluding warning tokens (such as system‑generated cautionary prefixes) from the target is beneficial: these tokens often carry little informational content and can bias the model toward passive responses. This filtering is an important practical detail; it ensures that the dense supervision focuses on semantically meaningful prediction of actual environmental state transitions.
With this design, the unified ECHO objective becomes
LECHO=LGRPO+λ LEnvL_{\text{ECHO}} = L_{\text{GRPO}} + \lambda \, L_{\text{Env}}LECHO​=LGRPO​+λLEnv​
where LGRPOL_{\text{GRPO}}LGRPO​ is the standard group‑relative policy gradient loss and λ\lambdaλ balances the two terms. Because both losses are computed from the same forward pass (the model generates actions, receives observations, and computes log‑probabilities over the full sequence), the auxiliary supervision comes with minimal computational overhead. The gradient from LEnvL_{\text{Env}}LEnv​ flows back through the action tokens as well, so the model learns to produce actions that anticipate the subsequent observation—effectively internalizing a predictive model of the environment.
The empirical impact is striking. On TerminalBench‑2.0, applying ECHO to Qwen3‑8B and Qwen3‑14B agents yields a 1.9× and 2.1× improvement in pass@1, respectively, compared to vanilla GRPO. These gains are robust across tasks, and the approach also recovers a large fraction of the benefit that previously required initializing from expert‑supervised fine‑tuning (SFT): on internal benchmarks, ECHO alone attains up to 104% of the demonstration‑initialization gap, meaning it can surpass what was previously achievable only by providing expert trajectories as a starting point. This suggests that dense environment feedback effectively replaces the need for costly human‑labeled demonstrations while staying on‑policy.
Even more provocatively, the verifier‑free adaptation experiments (highlighted in the previous section) demonstrate that LEnvL_{\text{Env}}LEnv​ by itself—without any reward signal—can improve the agent’s performance on out‑of‑distribution tasks. The model is simply trained to predict the environment’s responses to its own actions, and self‑consistency or a success heuristic is used for evaluation. That such a pure self‑supervised objective leads to gains underscores the richness of the learning signal hidden in ordinary environment interactions. This blurs the line between RL and imitation from self‑generated data, opening a door to continual, unsupervised improvement of deployed agents.
The visual that follows serves as a consolidated summary of ECHO’s contributions. It presents a clean two‑column table: the left column lists the five numbered insights—dense on‑policy supervision, the unified echo equation, pass@1 gains, recovery of expert‑SFT benefit, and verifier‑free self‑improvement—while the right column gives a concise detail for each, such as the loss coverage over all observation tokens or the multiplicative improvements on TerminalBench. Beneath the table, a highlighted box captures the broader implication in a single italic paragraph: agent RL has been overlooking an immense supervisory source in the observable consequences of its own actions, and ECHO shows that even a simple next‑token prediction objective on terminal observations can yield dense, transferable, and often reward‑free gains.
That box, along with the centrally displayed equation LECHO=LGRPO+λLEnvL_{\text{ECHO}} = L_{\text{GRPO}} + \lambda L_{\text{Env}}LECHO​=LGRPO​+λLEnv​, distills the lecture’s parting message. The equation is not merely a loss formula; it represents a conceptual shift—from treating an environment as a black‑box that returns a reward at the end, to treating it as a teacher that provides immediate correction tokens at every step. By embracing that teacher, ECHO recovers the benefits of expert data without requiring it, and accelerates RL for terminal tasks beyond what sparse rewards alone can achieve.