Odysseus: Stable RL Training of VLMs for Long-Horizon Game Decision-Making - FeynmanWiki

MACHINE LEARNING, REINFORCEMENT LEARNING - 45 MIN READ

Odysseus: Stable RL Training of VLMs for Long-Horizon Game Decision-Making

1. Long-Horizon VLM Decision-Making in Super Mario Land

If you’ve ever watched a frontier vision-language model try to play a platformer for more than a few seconds, you know the arc is predictable: a promising first jump, a moment of hesitation, then a dead-end pattern of button mashing or total paralysis. The core challenge isn’t visual recognition—it’s coherent, extended decision-making under sparse feedback. In Super Mario Land, a single misstep can cascade into failure dozens of frames later, and the reward (finishing the level) arrives only at the very end, if at all. Turning a VLM into an agent that can sustain precise, reactive control for thousands of timesteps requires rethinking how we train these models, and Odysseus sets out to do exactly that by combining reinforcement learning with a lightweight critic architecture.
The task is deceptively hard. The agent sees raw RGB frames of the Game Boy screen at 60 fps and must output one of a small set of button combinations: move right, jump, dash, or some chord thereof. The physical world of Mario—gravity, momentum, enemy AI patterns—is never explicitly communicated to the model; it must infer the latent state from raw pixels and a brief textual instruction. Small errors compound: a slightly mistimed jump over a Goomba may cause the agent to land on it, lose a life, and then struggle to recover its previous progress. This is the textbook definition of a long-horizon control problem, where the temporal gap between an action and its consequences suffocates naive credit assignment.
Why use a VLM at all? Historically, deep RL with compact CNNs has performed admirably on Atari games and even some platformers. But those agents are brittle, trained from scratch for every new game, and they cannot leverage natural language instructions, commonsense reasoning, or cross-game priors. A VLM, especially one pretrained on web-scale image–text data, brings a rich semantic understanding of objects (a pipe, a Koopa, a block) and a semblance of affordance reasoning. The hope is that this prior knowledge can dramatically improve sample efficiency and enable the kind of generalization that makes an agent play a user-described novel level without retraining. The challenge is that VLMs are enormous, their internal representations are not directly shaped for sequential decision-making, and fine-tuning them on a per-task basis with RL can be catastrophically unstable if done carelessly.
The standard RL formulation for such a game casts the VLM as a stochastic policy $\pi(a_t \mid o_t, \text{instruction})$ that maps the current observation $o_t$ and the human-readable goal to a distribution over discrete actions $a_t$. The environment then transitions to $o_{t+1}$ and emits a scalar reward $r_t$. In Super Mario Land, the reward signal is engineered to be slightly denser than the final win (typically a small positive reward for progressing rightward and a penalty upon death), but it is still extremely noisy. With a purely policy-gradient approach like REINFORCE, the gradient estimator has high variance because the action log-probability is weighted by the entire future cumulative reward, an estimate that fluctuates wildly across the long horizon. This is why a value function, or critic, becomes indispensable: it provides a learned baseline that reduces variance and enables more stable, sample-efficient updates.
Enter PPO (Proximal Policy Optimization), the workhorse of modern RL. PPO clips policy updates to stay within a trust region, but it relies on an advantage estimate $A_t = Q^{\pi}(s_t,a_t) - V^{\pi}(s_t)$. For VLMs, the computation of $V^{\pi}$ is not trivial. You could augment the VLM with a value head that shares its visual backbone, but that couples the critic’s stability to the same gigantic transformer that needs to be updated cautiously. Odysseus instead pairs the VLM policy with a lightweight, turn-level CNN critic that operates on compact summaries of the game state. This separation decouples the heavy visual-language reasoning (where a few updates can destabilize the policy) from the fast, low-variance value estimation needed for stable PPO. The critic is quick to train and can be aggressively updated to provide reliable advantage signals, while the VLM policy is updated more conservatively, sometimes only on actions where the estimated advantage is positive, a trick we’ll unpack later.
The visual below captures the high-level architecture of this decision-making loop. It depicts an agent observing a screen from Super Mario Land, processing it through a VLM module that outputs both an action and, optionally, a textual reasoning trace, while a separate CNN critic consumes a compressed visual representation (or a dedicated feature map) to estimate the state’s value. The diagram emphasizes the two-stream design: the heavy VLM for policy and the lightweight critic for advantage estimation, with the environment feedback flowing back into both. Importantly, the VLM’s policy is not constantly retrained from scratch; it builds on pretrained knowledge, and the RL pipeline is carefully gated to preserve that knowledge while adapting to the game’s physics and control demands. This separation is the key that makes long-horizon RL with VLMs feasible, avoiding the variance explosion that plagues critic-free methods and the slow, unstable fine-tuning that comes from coupling the critic too tightly to the huge model.

2. Failure Modes of Frontier VLMs

The previous section established that Super Mario Land is a demanding testbed, with its carefully spaced platforms, hidden enemies, and unforgiving timer demanding precise, long-horizon decision-making. Yet anyone who has spent a Sunday afternoon with an emulator knows that the opening levels are far from impossible—for human players. So one might reasonably expect that frontier vision–language models, trained on millions of internet-scale images and texts, could at least make a respectable attempt. The reality, as the current section reveals, is sobering: when dropped into the game and asked to act turn-by-turn, even the most capable off-the-shelf VLMs fail utterly to complete the first five levels.
Quantitatively, the gap between expectation and outcome is stark. Across four leading models—GPT‑5.4, GLM‑4.6V, LLaVA‑Next‑34B, and Qwen‑VL‑Max—the average progress measured in in-game meters before the agent’s final death remains well below 300 m. With a single level spanning roughly 550 m, none of these models manages even half of a level. The highest figure, 240 m for GPT‑5.4, represents a sequence of failed attempts that never break out of the early game. The other models fare worse, with LLaVA‑Next‑34B averaging just 180 m. This is a clean, numerical demonstration that vision–language capability alone does not translate into effective control when the environment demands persistence and precise coordination over dozens of consecutive decisions.
Why is performance so poor? The root causes are not esoteric; they fall into two broad categories that anyone watching a replay can recognize. Perception errors occur when the VLM misreads the visual scene. A stationary background pipe might be interpreted as a harmless decoration, causing Mario to walk straight into it and lose momentum. An enemy goomba can be ignored entirely if the model’s attention is drawn to a more salient but irrelevant shape in the sky. Gaps between platforms are frequently misjudged, with the model either leaping far too early or not at all, sentencing the character to a death pit. These aren’t random pixel-level mistakes—they reflect a fundamental brittleness in visual grounding when the model lacks experience in the specific physics and level geometry of this domain.
Compounding the problem are timing errors. Even when the object is perceived correctly, the action may be mistimed relative to the game’s dynamics. A jump initiated a fraction of a second too soon clears a pit but lands on a descending platform that has already moved; a jump delayed by a similar margin sends Mario into the gap. Over long stretches where acceleration must be held for a precise interval to bridge a wide gap, frozen VLMs exhibit erratic throttle control, often releasing the run button prematurely and dying inches from safety. These errors are especially insidious because they are not obviously “wrong” in a single frame—a screenshot might look perfectly reasonable, and only the instant-by-instant action sequence reveals the flaw.
Where perception and timing mistakes become catastrophic is in their accumulation. A single mis‑step might be survivable, but over a 100‑turn trajectory—the typical number of discrete actions needed to cross a level—the probability that the agent avoids all failures becomes vanishingly small. Worse, the game provides only a sparse final reward upon completing the level or a negative reward upon death. With no intermediate shaping signal, the VLM receives essentially no feedback to correct its behavior; a model that runs into the first pipe because it looked like background never learns to treat that pipe as an obstacle. This sparse credit‑assignment problem means that frozen models cannot adapt, and their zero‑shot performance plateaus at a level far below what is needed for progress.
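As a rough illustration of how quickly per-turn errors compound, the sketch below assumes (purely for the sake of arithmetic) that each turn avoids a fatal mistake independently with a fixed probability. Neither the independence assumption nor the specific success rates come from the source; they are hypothetical values chosen to show the shape of the decay over a 100-turn trajectory.

```python
def survival_probability(per_turn_success: float, num_turns: int) -> float:
    """Chance of making no fatal error across num_turns independent turns."""
    return per_turn_success ** num_turns


# Hypothetical per-turn success rates; 100 turns matches the figure in the text.
for p in (0.999, 0.99, 0.95):
    print(f"per-turn success {p}: survive 100 turns with prob "
          f"{survival_probability(p, 100):.4f}")
```

Under these assumptions, a model that is right 99% of the time per turn still finishes a 100-turn level only about 37% of the time, and at 95% per-turn accuracy the chance drops below 1%.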
These empirical findings carry a clear implication: for long-horizon, visually grounded decision-making, we cannot rely on the perceptual and reasoning abilities that impress us in single‑image benchmarks. Instead, we need to embed the VLM in a training loop where it can learn from its own mistakes within the environment. That is, successful decision‑making demands environment‑specific training via reinforcement learning that can shape behaviour over extended horizons, gradually correcting the perceptual and timing errors that plague pretrained models.
A compact visual summary of these failures—the quantitative progression gap alongside the qualitative error types—is shown below. On the left, a bar chart places the four models’ average progress against a dashed line representing the level completion threshold, making instantly visible just how far short they fall. On the right, annotated screenshots capture the essence of the two dominant failure modes: one frame circling a pipe that the model treated as background, another with a timing arrow showing that a jump triggered too late caused Mario to fall into a gap. Beneath the images, a callout reinforces the core insight: perception and timing errors compound over long horizons, yielding a near‑zero credit signal that no frozen model can overcome. Together, the chart and the screen captures turn an abstract failure into a lesson that motivates the RL‑based approach developed in the remainder of this lecture.

3. POMDP Formulation and Interaction Protocol

The failure modes of critic-free methods in long-horizon tasks are not mysterious: without dense, informative feedback, policy gradients wander. To create a training environment where a VLM can steadily improve, the Odysseus system grounds the agent’s learning in a formal Partially Observable Markov Decision Process (POMDP) whose design choices – from observation space to reward signal – are optimized for stable RL from vision and language.
The game is cast as a POMDP $\langle \mathcal{S}, \mathcal{A}, \Omega, P, O, R \rangle$, acknowledging that the agent never sees the full game engine internals.
States $\mathcal{S}$ contain every detail the simulator tracks (positions, velocities, enemy status, etc.), but they are hidden from the agent.
Actions $\mathcal{A}$ are discrete button combinations: up to two simultaneous presses chosen from seven directional and action buttons. This yields a manageable 15-action space that still supports expressive control (e.g., right + jump).
Observations $\Omega$ at turn $t$ are the raw pixel frame rendered by the engine, concatenated with a textual instruction (like “Go to the flagpole”). The combination of vision and language forces the agent to ground its decisions in real visual context while keeping the objective explicit.
Because the true state is never revealed, the agent must infer the evolving situation from a stream of high‑dimensional observations – a much harder perception problem than in typical symbolic POMDP benchmarks, and precisely the setting where pre‑trained VLM backbones can bring strong inductive biases.
The transition function $P(s_{t+1} \mid s_t, a_t)$ is deterministic but augmented with frame-skip mechanics to guarantee that the effect of an action is manifest in the next observation. For most actions, the environment advances 5 game frames per turn, enough to register a visible step without losing fine control. For Jump actions, however, the skip is 15 frames, matching the duration of a full jump arc. This keeps the agent from seeing multiple intermediate frames where Mario is still airborne with no new agency, simplifying temporal credit assignment and eliminating “stutter” in the observation stream.
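The frame-skip rule can be sketched as a thin wrapper around an emulator. This is a minimal illustration, not the project's code: `advance_frame` is a hypothetical callback standing in for whatever emulator API is used, treating `"A"` as the jump button is an assumption, and so is holding the same buttons down for every skipped frame.

```python
from typing import Callable, FrozenSet

DEFAULT_SKIP = 5   # frames advanced per turn for most actions (from the text)
JUMP_SKIP = 15     # frames advanced per turn for jump actions (from the text)


def frames_for_action(buttons: FrozenSet[str], jump_button: str = "A") -> int:
    """Return how many emulator frames one turn should span."""
    return JUMP_SKIP if jump_button in buttons else DEFAULT_SKIP


def step_turn(advance_frame: Callable[[FrozenSet[str]], None],
              buttons: FrozenSet[str]) -> int:
    """Advance the (hypothetical) emulator by one turn; return frames consumed."""
    n_frames = frames_for_action(buttons)
    for _ in range(n_frames):
        advance_frame(buttons)  # assumption: buttons stay held for the whole turn
    return n_frames
```

With this shape, the agent is only queried once per turn, so a jump occupies a single decision point rather than several observations in which no new input would matter.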
The reward is deliberately dense and task-agnostic:
$$R(s_t, a_t) = r_t = x_{t+1} - x_t$$
where $x_t$ is a scalar measuring forward progress (e.g., horizontal position in a side-scrolling game). Every action yields an immediate, continuous signal that guides the agent toward movement that progresses toward the goal. Sparse “level complete” bonuses are insufficient in a long-horizon setting with hundreds or thousands of turns; the dense progress reward turns every step into a learning opportunity, giving PPO’s value estimator something concrete to predict and enabling stable advantage estimation. This formulation also naturally ties to the auto-curriculum: tasks where the agent cannot advance (e.g., blocked by an obstacle) automatically yield low rewards, steering the training distribution toward solvable challenges.
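A small worked example of the progress reward may help. The positions below are made-up numbers, and `progress_rewards` is a hypothetical helper name; the only thing taken from the text is the formula, reward equals the change in forward progress between consecutive turns.

```python
def progress_rewards(x_positions):
    """Per-turn rewards r_t = x_{t+1} - x_t from a list of progress values."""
    return [x_next - x for x, x_next in zip(x_positions, x_positions[1:])]


# Hypothetical progress trace: advance, advance, stall, get pushed back.
xs = [0, 3, 7, 7, 5]
rewards = progress_rewards(xs)
print(rewards)        # one reward per turn
print(sum(rewards))   # undiscounted total equals xs[-1] - xs[0]
```

Two properties follow directly from the formula: moving backward produces a negative reward, and the undiscounted sum of rewards telescopes to the net distance covered, so the dense signal agrees with overall progress rather than competing with it.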
The interaction protocol enforces a turn-based loop with explicit reasoning. At turn $t$, the VLM receives the observation $o_t$ (frame + instruction) and must generate three segments before declaring an action. They are produced in order as a structured Chain-of-Thought:
$\langle\text{perception}\rangle$: The model describes visually grounded entities and their spatial relationships (“a gap to the right, an enemy approaching from the left”).
$\langle\text{reasoning}\rangle$: It strategizes the next move, connecting perception to the instruction (“need to jump over the gap while avoiding the enemy”).
$\langle\text{answer}\rangle$: Finally, it outputs a discrete button combination $a_t$ in a machine-parsable format (e.g., “RIGHT + A”).
This three-part decomposition serves two purposes. It regularizes the model’s decision process, forcing it to ground its actions in observable evidence, and it makes the final action extraction robust: the environment simply parses the $\langle\text{answer}\rangle$ block without relying on brittle prompt-engineering or hidden state.
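Extracting the action from the structured output might look like the sketch below. Several details are assumptions rather than facts from the source: the segments are taken to be delimited by XML-style `<answer>...</answer>` tags, the seven button names in `VALID_BUTTONS` are hypothetical placeholders, and returning `None` for malformed output is one possible policy among many.

```python
import re
from typing import FrozenSet, Optional

# Assumption: the answer segment is wrapped in XML-style tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)

# Hypothetical names for the seven buttons; the source does not list them.
VALID_BUTTONS = frozenset({"UP", "DOWN", "LEFT", "RIGHT", "A", "B", "START"})


def parse_action(response: str,
                 valid: FrozenSet[str] = VALID_BUTTONS,
                 max_buttons: int = 2) -> Optional[FrozenSet[str]]:
    """Return the pressed buttons, or None if the answer block is unusable."""
    match = ANSWER_RE.search(response)
    if match is None:
        return None
    parts = [p.strip().upper() for p in match.group(1).split("+")]
    buttons = frozenset(p for p in parts if p)
    # Reject empty answers, chords larger than allowed, and unknown buttons.
    if not buttons or len(buttons) > max_buttons or not buttons <= valid:
        return None
    return buttons
```

For example, `parse_action("<perception>gap ahead</perception><reasoning>jump</reasoning><answer>RIGHT + A</answer>")` yields the set containing `RIGHT` and `A`, while a response with no answer block yields `None`, which the caller could map to a no-op or a format penalty.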
The diagram that follows encapsulates this back-and-forth in a single glance. On the left, the Environment box houses the game engine, emitting the observation $o_t$ (visualized as a screenshot with an instruction text bubble) and the reward $r_t$. On the right, the Agent side shows the VLM receiving $o_t$ and internally traversing the three Chain-of-Thought stages, finally returning the action $a_t$ to the environment. Arrows trace the loop: observation and reward flow from environment to agent; the chosen action flows back; and the turn repeats. Annotations highlight the frame-skip rule and the turn’s temporal boundary. The sparse set of labels, $\langle\text{perception}\rangle \rightarrow \langle\text{reasoning}\rangle \rightarrow \langle\text{answer}\rangle$, reinforces the structured reasoning that is the agent’s core operational contract, while the dense reward equation at the bottom reminds us that every step yields a measurable learning signal.

4. RL Objective and Critic-Free Methods

With the observation-action protocol firmly established, we now turn to the problem that sits at the heart of Odysseus: designing a reinforcement learning objective that can turn a pre-trained vision-language model into a competent, long-horizon game player. The agent must map a partially observable history of screenshots and text into a series of discrete actions (the button combinations defined in the previous section) so as to maximise the cumulative reward over an entire episode, often spanning dozens or even hundreds of turns. The natural starting point is the standard policy optimisation objective of reinforcement learning:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right], \qquad R(\tau) = \sum_{t=1}^{T} \gamma^{\,t-1} r_t,$$
where $\pi_\theta$ denotes the VLM policy parameterised by $\theta$, $\gamma \in (0, 1]$ is the discount factor, the trajectory $\tau = (o_1, a_1, \dots, o_T, a_T)$ follows the POMDP interaction from the previous section, and $r_t$ is the scalar reward awarded at turn $t$. In game environments rewards are often sparse (e.g., completing a level or finishing a quest) and may become available only after tens of successful intermediate steps, which makes credit assignment especially difficult.
A straightforward way to obtain an unbiased estimate of the policy gradient is the REINFORCE estimator:
$$\nabla_\theta J \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid o_t^{(i)}\big)\, G_t^{(i)},$$
where $G_t^{(i)}$ is the Monte-Carlo return from time $t$ onward. While this estimator is unbiased, its variance scales with the length and stochasticity of the trajectory. For sequences of tens or hundreds of turns, the signal-to-noise ratio becomes so poor that learning grinds to a halt. In modern fine-tuning of language and vision-language models, it is common to replace the raw return with an advantage estimate $\hat{A}_t$ obtained by subtracting a baseline. Critic-based schemes such as the generalised advantage estimator (GAE) take that baseline from a learned value function; critic-free alternatives use something simpler, such as a running average of past returns or the mean return of a group of rollouts. These critic-free baselines do not depend on the state and therefore cannot adapt to the agent’s current situation; they remove only a fraction of the variance.
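The quantities in the estimator above can be computed in a few lines. This is a sketch with hypothetical helper names, assuming the return-to-go uses the same discount factor as the objective and that the critic-free baseline is a single constant.

```python
def returns_to_go(rewards, gamma=1.0):
    """G_t = r_t + gamma * r_{t+1} + ... for every turn t of one trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):  # sweep backward through time
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def centre(returns, baseline):
    """Critic-free advantage: subtract one state-independent constant."""
    return [g - baseline for g in returns]


G = returns_to_go([1.0, 0.0, 2.0], gamma=0.5)
print(G)               # [1.5, 1.0, 2.0]
print(centre(G, 1.5))  # [0.0, -0.5, 0.5]
```

Because the baseline is the same number at every turn, it shifts all weights equally and cannot tell a promising situation from a hopeless one, which is the limitation the paragraph above describes.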
Odysseus’s design process began with a thorough examination of critic-free reinforcement learning methods applied directly to the VLM. Natural candidates include Group Relative Policy Optimization (GRPO), used in recent reasoning-model training, and PPO variants that employ a global or per-game average reward as baseline. These schemes have shown promise for short-horizon alignment tasks (e.g., summarisation, preference tuning) or for RL problems where the token horizon is small, but they falter when the decision span stretches to hundreds of steps. The primary failure modes are clear:
High variance in advantage estimates. Without a learned value function, the advantage is effectively a centred return that must absorb the full episode’s randomness. Its variance grows with the horizon length (roughly linearly if per-turn reward noise is independent), making it hard to disentangle the effect of a single turn-level action from the cumulative noise.
Poor credit assignment. In a long game, a successful completion may be preceded by dozens of mundane navigation steps. Critic‑free methods have difficulty distinguishing which actions were truly instrumental, often reinforcing irrelevant or even harmful patterns.
Task‑scale mismatch. When the agent is trained on a mixture of games with vastly different reward magnitudes and episode lengths, a single global baseline cannot normalise the returns across tasks. This leads to gradient interference and the dominance of easier or higher‑reward tasks, causing the multi‑task policy to collapse.
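To make the critic-free setup concrete, here is a GRPO-style group-relative advantage in miniature. It is a sketch, not the configuration evaluated in the source: normalising by the population standard deviation and the `eps` constant are assumptions, and the episode returns are invented.

```python
from statistics import mean, pstdev


def group_relative_advantages(episode_returns, eps=1e-8):
    """Normalise each rollout's return against its group's mean and spread."""
    mu = mean(episode_returns)
    sigma = pstdev(episode_returns)  # assumption: population std
    return [(r - mu) / (sigma + eps) for r in episode_returns]


# Hypothetical returns for four rollouts of the same level.
advantages = group_relative_advantages([120.0, 80.0, 200.0, 100.0])
for i, adv in enumerate(advantages):
    # Every turn inside rollout i would be weighted by this one number.
    print(f"rollout {i}: advantage {adv:+.3f}")
```

The weakness listed above is visible in the shape of the output: there is one advantage per rollout, so a decisive jump and a hundred routine steps in the same episode receive identical credit.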
These deficiencies are especially damaging for a VLM that must learn to chain visual perception, language instructions, and low‑level motor actions over extended contexts. Early experiments in the Odysseus project confirmed that even with careful reward normalisation and clipping, a critic‑free PPO agent learned erratically. The win rate on the suite of Atari, Minecraft, and custom mini‑games plateaued far below acceptable performance, and the training curves exhibited wild oscillations that made checkpoint selection unreliable. Worse, the agent would often over‑optimise a single game while completely forgetting others—a classic symptom of a high‑variance policy gradient that fails to provide balanced feedback across tasks.
The key insight, therefore, is not that critic-free RL is inherently hopeless, but that for long-horizon visually grounded control, a more sophisticated variance-reduction mechanism is needed. The policy gradient itself,
$$\nabla_\theta J = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid o_t)\, \big(Q^{\pi}(o_t,a_t) - V^{\pi}(o_t)\big)\right],$$
weights each log-probability gradient by the gap between an action-value function and a state-value function. Any baseline that does not depend on the action leaves the gradient unbiased, and the more closely it tracks $V^{\pi}(o_t)$, the more variance it tends to remove. A global average return is a very crude approximation of $V^{\pi}$; a state-aware critic can do dramatically better. By conditioning on the turn-level representation (the sequence of visual and textual tokens at time $t$), a learned critic can predict the expected future return from that point, yielding an advantage estimate $\hat{A}_t = G_t - \hat{V}(o_t)$ with significantly lower variance. This observation directly motivates the subsequent component of Odysseus: a lightweight, turn-level CNN critic that shares visual representations with the VLM but is cheap to train and integrate.
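A toy comparison shows why a state-aware baseline helps. Everything here is illustrative: the two state labels and their returns are invented, and a per-state average stands in for the learned critic $\hat{V}(o_t)$ that a real system would train.

```python
from collections import defaultdict
from statistics import mean, pvariance


def advantages_global(samples):
    """Subtract one global average return from every sample."""
    baseline = mean(g for _, g in samples)
    return [g - baseline for _, g in samples]


def advantages_per_state(samples):
    """Subtract a per-state average (a tabular stand-in for a critic)."""
    by_state = defaultdict(list)
    for state, g in samples:
        by_state[state].append(g)
    value = {s: mean(gs) for s, gs in by_state.items()}
    return [g - value[s] for s, g in samples]


# (state, return) pairs: "open_field" is a good spot, "pit_edge" a bad one.
samples = [("open_field", 10), ("open_field", 12), ("pit_edge", 0), ("pit_edge", 2)]
print(pvariance(advantages_global(samples)))     # 26
print(pvariance(advantages_per_state(samples)))  # 1
```

With the global baseline, most of the advantage magnitude just reflects which state the agent happened to be in; the per-state baseline removes that component and leaves only the part attributable to what happened afterward.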
The visual below distils this conceptual transition into a single clear diagram. On the left side, a VLM policy interacts with the game environment, producing actions and collecting rewards over a long trajectory. The upper branch illustrates the critic‑free gradient estimation: raw returns (or a simple global baseline) are fed back to the log‑probability gradient, resulting in a noisy, high‑variance signal that can destabilise training across diverse games. On the lower branch, a critic module takes the intermediate turn‑level features and computes a state‑dependent value, leading to a low‑variance advantage estimate and a smooth, stable policy update. The contrast between the jagged, high‑variance arrow of the critic‑free path and the clean, steady arrow of the critic‑based path visually reinforces why state‑aware variance reduction is not an optional luxury but a necessary ingredient for long‑horizon VLM control. It also foreshadows how the subsequent lightweight CNN critic design will preserve the efficiency gains of using a large pre‑trained VLM while taming the variance that would otherwise paralyse learning.

5. PPO with a Lightweight Turn-Level CNN Critic

In the previous discussion we examined critic-free policy gradient methods applied to VLM agents (approaches like REINFORCE or the more recent GRPO) and saw how they can, in principle, optimise language model policies from sparse task rewards. Yet in the long-horizon game settings that Odysseus targets, these methods hit a fundamental wall: credit assignment across tens or hundreds of turns becomes extremely noisy, the variance of the gradient estimator explodes, and training collapses or stagnates without careful regularisation. The core difficulty is that a critic-free method summarises a long multi-turn trajectory with a single scalar return, and the corresponding policy gradient weights every turn’s action by that raw return, irrespective of the true contribution of each turn. Without a baseline that distinguishes “this turn was actually the key” from “we were already doomed regardless,” the learning signal is too coarse.
The standard answer in deep RL is to introduce a critic: a function approximator that predicts the expected future return from the current state, called the state-value function $V(s)$. With an estimate $\hat{V}(s_t)$ of the value at turn $t$, we can compute an advantage for each action as the difference between the actual return $R_t$ experienced after turn $t$ and the predicted baseline:
$$A_t = R_t - \hat{V}(s_t).$$
Subtracting this baseline does not bias the expected gradient, because $\hat{V}(s_t)$ depends only on the state and not on the action taken, and it often reduces variance dramatically. In modern actor-critic methods like PPO, the advantage is fed into a clipped surrogate objective that stabilises the policy update, making it the workhorse of many RL breakthroughs. The question, then, is not whether to use a critic, but how to design one that is compatible with a massive VLM without introducing a prohibitive computational burden or interfering with the pre-trained language model’s reasoning.
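For readers who have not seen it written out, the clipped surrogate for a single turn can be sketched as follows. The clip range of 0.2 is a commonly used default and an assumption here, not a value reported in the source; `ppo_clipped_term` is a hypothetical helper name.

```python
import math


def ppo_clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO objective: min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|o) / pi_old(a|o)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# The policy made a good action 1.5x more likely: the gain is capped at 1.2.
print(ppo_clipped_term(math.log(1.5), 0.0, advantage=1.0))
# The policy made a bad action 1.5x more likely: the full penalty applies.
print(ppo_clipped_term(math.log(1.5), 0.0, advantage=-1.0))
```

Taking the minimum makes the objective pessimistic: the policy gets no extra credit for moving far beyond the clip range in a favourable direction, but it is still fully penalised for moving in an unfavourable one. The quality of the advantage passed in therefore matters a great deal, which is where the critic comes in.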
Using the VLM itself to output a value alongside the action text would be the obvious route—but it is costly and problematic. The VLM generates tokens autoregressively; adding a separate value head on top of its final hidden state would require backpropagation through the whole transformer stack and could distort the rich linguistic representations. Moreover, the value prediction would be entangled with the language generation, making it hard to train independently and risking catastrophic forgetting of the pre-trained knowledge. Finally, the VLM itself is already too heavy to run a separate value inference at every step of the inner RL loop without dramatically slowing wall-clock time, which matters in interactive game environments.
Odysseus sidesteps these issues by pairing the VLM with a lightweight turn-level CNN critic. This critic is a minuscule neural network, completely separate from the VLM, that operates on a compact summary of the agent’s recent turn history. Concretely, after each turn the environment produces a small tensor that captures the action taken and the resulting observation (or a learned embedding thereof). The critic takes a fixed-length window of the last $T$ such tensors, stacks them into a 2D input (turns $\times$ features), and processes them with 1D convolutions along the turn axis. The architecture is essentially a small temporal CNN, with just a few thousand parameters, that outputs a scalar value estimate for the current turn. Because it works at the turn granularity rather than per-token, its prediction aligns exactly with the decision points of the VLM, and its computational cost is negligible compared to the VLM forward pass.
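The shape of such a critic can be sketched as a forward pass in plain Python. This is only an illustration of "1D convolution along the turn axis, then a scalar head": the layer sizes, the mean-pooling step, the ReLU, and the random weights are all assumptions, and a real implementation would use a deep-learning framework and trained parameters.

```python
import random
from typing import List

Matrix = List[List[float]]  # rows are turns, columns are features


def conv1d_over_turns(window: Matrix, kernels: List[Matrix],
                      biases: List[float]) -> Matrix:
    """Valid 1D convolution along the turn axis, followed by ReLU.

    Each kernel has shape (kernel_width x features). The window must contain
    at least kernel_width turns.
    """
    width = len(kernels[0])
    out = []
    for start in range(len(window) - width + 1):
        row = []
        for kernel, bias in zip(kernels, biases):
            acc = bias
            for dt in range(width):
                for f, w in enumerate(kernel[dt]):
                    acc += w * window[start + dt][f]
            row.append(max(acc, 0.0))  # ReLU
        out.append(row)
    return out


def critic_value(window: Matrix, kernels: List[Matrix], biases: List[float],
                 head_w: List[float], head_b: float) -> float:
    """Scalar value estimate for the current turn."""
    hidden = conv1d_over_turns(window, kernels, biases)
    # Mean-pool each channel over the remaining turn positions (an assumption).
    pooled = [sum(col) / len(hidden) for col in zip(*hidden)]
    return sum(w * h for w, h in zip(head_w, pooled)) + head_b


rng = random.Random(0)
T, F, C, K = 8, 4, 3, 3  # turns, features, channels, kernel width (hypothetical)
window = [[rng.uniform(-1, 1) for _ in range(F)] for _ in range(T)]
kernels = [[[rng.uniform(-1, 1) for _ in range(F)] for _ in range(K)] for _ in range(C)]
biases = [0.0] * C
head_w = [rng.uniform(-1, 1) for _ in range(C)]
value = critic_value(window, kernels, biases, head_w, 0.0)
print(value)
```

Even at this toy size the parameter count is tiny (here 3 kernels of 3 × 4 weights plus biases and a 3-weight head), which is the property the text relies on.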
The beauty of this design is that it fully decouples value estimation from language generation. The VLM’s language-modelling stack is never touched by value regression and receives only a policy gradient signal filtered through the critic’s advantage; the critic, in turn, can be trained rapidly with a simple mean-squared error loss against Monte Carlo returns or TD targets, much like the value network in a conventional actor-critic system. The CNN’s convolutions naturally capture local temporal patterns (for example, a sequence of failed attempts followed by a clever change of strategy), giving it the ability to recognise which contexts tend to lead to high final returns, even if the exact reasoning behind the VLM’s actions is opaque to the critic. This separation also allows the team to iterate on the critic’s design (e.g., adding attention, changing the history length) without touching the bulky VLM.
The resulting PPO loop is straightforward: in each turn, the VLM generates an action, the environment executes it and returns a reward, and the critic’s value estimate is recorded. After collecting a batch of trajectories, the critic is updated to minimise its prediction error against the return target (Monte Carlo or bootstrapped). Meanwhile, the VLM is updated using the PPO clipped objective, where each turn’s advantage is the observed discounted sum of future rewards minus the critic’s value estimate. This immediately stabilises training: the critic provides a dense, per-turn assessment of the action’s quality, turning a sparse terminal reward into a sequence of local advantage signals that tell the VLM which specific turns were above average under the current policy and, crucially, which were not.
The accompanying diagram consolidates this architecture. It illustrates the training loop where the VLM (shown as a large, language-savvy model) interacts with the game environment one turn at a time. Parallel to it, the lightweight turn-level CNN critic reads a compact history tensor, depicted as a small grid of past action-observation features, and outputs a scalar value $V_t$. The advantage, computed by comparing this value with the actual return, is then fed back into the PPO update for the VLM. The contrast between the heavy VLM block and the tiny CNN block visually reinforces the idea that a critic need not be complex to be effective; its job is simply to learn, from repetition, which turn contexts predict success. This diagram also sets the stage for the next refinement: how this advantage is filtered to keep only positive experiences, a technique we explore in the following section.

6. Turn-Level Advantage and Positive-Advantage Filtering

Building on the per-turn value estimates produced by the lightweight CNN critic, the next algorithmic ingredient in Odysseus is the computation of turn-level advantages: quantities that tell the policy learner how much better (or worse) its actual actions turned out to be compared to the critic’s baseline prediction. For each turn $t$ in a trajectory we have the reward-to-go $\hat{R}_t$ and the critic’s value estimate $V_\phi(o_t)$. Their difference yields the raw advantage:
$$\tilde{A}_t = \hat{R}_t - V_\phi(o_t).$$
A positive $\tilde{A}_t$ signals that the policy outperformed expectations on that turn, while a negative value indicates the outcome fell below the baseline. This raw signal, however, is extremely noisy in long-horizon game environments because the variance of a sequence of cumulative rewards can be large and the critic itself is still learning. If we were to feed these unprocessed advantages directly into a policy gradient update, a single turn with an outlier magnitude could dominate the gradient step and cause the training to swing wildly.
Standard Proximal Policy Optimization normalizes advantages across a batch of collected turns to stabilize training. Concretely, each raw advantage is divided by the empirical standard deviation of the raw advantages in the batch,
$$\hat{A}_t = \frac{\tilde{A}_t}{\sigma\bigl(\{\tilde{A}_{t'}\}\bigr)}.$$
This scaling ensures that the advantages have unit standard deviation, which keeps the gradient updates from a PPO‑clipped surrogate loss roughly comparable in size across iterations and reduces the need for delicate, per‑task learning‑rate tuning. Yet, when training a VLM agent on game trajectories that span dozens or hundreds of turns, the simple global normalization still exposes the policy to a regime where many turns carry a negative advantage—often the majority in early exploration—and those negative signals can push the update towards undoing good behaviours that the critic has not yet learned to accurately value.
Odysseus addresses this by introducing positive‑advantage filtering, a simple but empirically decisive modification that zeros out negative advantages before normalization and then re‑normalizes using only the positive subset:
$$\hat{A}_t = \frac{\max(\tilde{A}_t,\,0)}{\sigma\bigl(\{\tilde{A}_{t'} : \tilde{A}_{t'} > 0\}\bigr)}.$$
The motivation is twofold. First, in long‑horizon tasks a turn where the actual return fell short of the critic’s estimate might be a genuine mistake, but it is equally often a stochastic outcome or a consequence of exploratory actions that are necessary for discovering winning strategies. Penalizing those turns with a strongly negative normalised advantage can produce large gradient steps that destabilise the VLM’s delicate instruction‑following and visual reasoning patterns. Second, by concentrating the PPO update solely on the positive‑advantage turns—the moments where the policy did better than the baseline—the training signal becomes denser and more directly relevant. The policy is explicitly reinforced for actions that led to higher‑than‑expected returns, while simply ignoring the sub‑baseline turns rather than trying to suppress them. This asymmetric filtering acts as a soft trust‑region mechanism that complements the PPO objective’s own clipping, and it reduces the variance of the policy gradient without discarding entire trajectories (the critic still learns from all turns via its value regression loss).
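The two normalization schemes can be compared side by side in a few lines. This is a minimal sketch under stated assumptions: it uses the population standard deviation, and it falls back to a scale of 1 when the standard deviation is zero (for example, a single positive turn); the text does not specify either detail, and the function names are hypothetical.

```python
import statistics
from typing import List


def normalize_advantages(raw: List[float]) -> List[float]:
    """Divide every raw advantage by the batch standard deviation."""
    sigma = statistics.pstdev(raw)
    scale = sigma if sigma > 0 else 1.0  # assumption: no scaling if sigma is 0
    return [a / scale for a in raw]


def positive_advantage_filter(raw: List[float]) -> List[float]:
    """Zero out negative advantages, then scale by the std of the positive subset."""
    positives = [a for a in raw if a > 0]
    if not positives:
        return [0.0 for _ in raw]  # nothing beat the baseline in this batch
    sigma = statistics.pstdev(positives)
    scale = sigma if sigma > 0 else 1.0  # assumption: same fallback as above
    return [max(a, 0.0) / scale for a in raw]


raw = [2.0, -1.0, 4.0, -3.0]
standard = normalize_advantages(raw)
filtered = positive_advantage_filter(raw)
print(standard)
print(filtered)
```

In the filtered output the two sub-baseline turns contribute exactly zero to the policy loss, while the critic can still regress on all four turns.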
The practical effect is a more stable learning curve and faster convergence, especially when the original PPO updates with a large VLM backbone would otherwise oscillate between re‑learning and forgetting due to noisy negative‑advantage gradients. The technique can be viewed as a form of reward‑shaping on the advantage side: it encourages the agent to repeat what worked without explicitly punishing what didn’t, which is particularly suitable for games where optimal play consists of a sparse sequence of correct decisions interspersed with many neutral or low‑reward actions.
The visual below encapsulates this three‑part advantage processing pipeline. It places the raw advantage equation in a first box, then the standard normalised version in a second, and finally highlights the positive‑advantage filtering equation in a green‑bordered box to emphasise its added benefit. A small callout on the right succinctly notes the empirical motivation: avoiding large destabilising gradient steps from negative‑advantage turns and focusing updates on outperforming actions. Seen this way, the diagram turns a handful of formulas into a clear recipe: compute, clip negativity, re‑normalize—and let the policy learn from what it did right.

7. Adapted PPO for Long-Horizon VLM Training

After establishing turn-level advantages and the intuition behind ignoring negative advantage steps, the natural question is: how do these pieces come together into a stable reinforcement learning algorithm for vision-language models (VLMs)? Standard policy-gradient methods, even modern ones like PPO, were designed for neural network policies with relatively low-dimensional continuous action spaces or small discrete action sets. Adapting them to the autoregressive token-generation process of a VLM — where an “action” is a variable-length sequence of sub-word tokens — introduces daunting variance and the risk of catastrophic forgetting in the pretrained model. The Adapted PPO used in Odysseus is not a radical redesign; it is a careful composition of a lightweight critic, a filtered training set, and a policy objective that respects the VLM’s autoregressive nature, all tuned to make long-horizon language-guided control feasible without reward hacking or mode collapse.
A first critical ingredient is the turn-level CNN critic. In a long-horizon game, the return of a whole trajectory is noisy and the credit assignment across hundreds of decisions is fragile. Instead of relying on a purely reward-based critic-free method, Odysseus introduces a small convolutional neural network that takes the VLM’s frozen visual embeddings as input and predicts a scalar value $V(s_t)$ for each turn $t$. Because the visual encoder is shared with the VLM, the critic can be very lean (effectively just a few convolutional and linear layers on top of cached image features), and it can be trained with simple mean-squared error on the observed turn-level Monte Carlo returns or TD($\lambda$) targets. This critic serves two pivotal roles: it provides a baseline to reduce the variance of the policy gradient estimate (via the advantage $A_t = R_t - V(s_t)$), and it indirectly regularizes the VLM’s policy updates by making under-performing action sequences clearly identifiable.
The policy update itself follows a PPO-style clipped surrogate objective, but applied to each turn-level token generation block. For a given game state, the VLM produces an action by sampling a sequence of tokens $(a_1, a_2, \dots, a_K)$. The log-probability of that entire action is the sum of token-level log-probabilities under the VLM’s autoregressive decoder. The PPO ratio is the exponential of the difference between the current policy’s action log-probability and the log-probability recorded under the policy that originally collected the trajectory; this ratio multiplies the filtered turn-level advantage inside the clipped surrogate. Clipping limits how much the policy can diverge from the behavior policy used during data collection, safeguarding the delicate pretrained weights. Importantly, this objective skips any per-token advantage signals: it treats the entire turn-level decision as the atomic unit of optimization.
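A turn-level version of the clipped surrogate can be sketched as follows. The clipping range of 0.2 is a commonly used PPO default rather than a value stated in the text, and the function names and log-probabilities are hypothetical.

```python
import math
from typing import List


def turn_log_prob(token_log_probs: List[float]) -> float:
    """Log-probability of a whole turn: the sum over its generated tokens."""
    return sum(token_log_probs)


def clipped_turn_objective(new_token_lps: List[float],
                           old_token_lps: List[float],
                           advantage: float,
                           clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate for one turn (to be maximised)."""
    # Ratio between current policy and the data-collecting policy for this turn.
    ratio = math.exp(turn_log_prob(new_token_lps) - turn_log_prob(old_token_lps))
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Pessimistic bound: take the smaller of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


old_lps = [-1.0, -0.5]   # token log-probs recorded at rollout time
new_lps = [-0.8, -0.4]   # token log-probs under the updated policy
objective = clipped_turn_objective(new_lps, old_lps, advantage=2.0)
print(objective)
```

Here the turn-level ratio is about 1.35, so the clip caps its contribution at 1.2 times the advantage. One advantage value is shared by every token in the turn, which is what treating the turn as the atomic unit amounts to in practice.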
Now we fold in the positive-advantage filtering described earlier: a trajectory step $(s_t, a_t, A_t)$ is included in the PPO batch only if $A_t > 0$. This simple rule, combined with the clipped objective, creates a robust self-stabilizing dynamic. When the agent is performing well, the critic’s expectations are already high, so only turns that beat that raised baseline have positive advantages, and the training set shrinks to a handful of particularly promising or curious transitions. The VLM stays calibrated to its current strength and is not punished for tiny dips in return. Conversely, when the agent is struggling, many turns will yield negative advantages, and those are simply discarded: the policy avoids reinforcing mistakes and instead waits for the critic to identify steps that did better than expected. The result is an implicit curriculum where the VLM progressively refines its highest-value action patterns and gradually expands the region of positive advantage, all without manual scheduling.
The visual below consolidates this adapted PPO pipeline into a single glanceable diagram. It shows the VLM policy as the actor that maps images and text history to an action sequence, the lightweight turn-level CNN critic that estimates state values from the VLM’s frozen visual features, and the advantage filter marked with a “+” icon that separates positive-advantage turns from negative ones. The flow traces from environment rollouts through the critic’s value estimate to the filtered replay buffer, then back to the PPO update on the VLM. This diagram highlights the crucial decoupling: the heavy VLM only receives gradient updates for steps that the critic deems worth learning from, while the cheap critic continuously adapts to the changing policy. It also emphasizes that the entire training loop operates at the turn level, aligning the advantage computation, filtering, and policy objective to the natural decision granularity of language-based game interaction.
With this adapted PPO in place, the training of VLMs for long-horizon games becomes not only feasible but remarkably sample-efficient. The critic steadies the learning signal, the positive-advantage filter protects the pretrained knowledge, and the VLM can gradually internalize rich strategies without the brittleness that plagues critic-free, sparse-reward setups. The next section will compare this approach directly to critic-free methods, illuminating the precise stability and generalization gains that make such an adaptation necessary.

8. Critic-Free vs

In the previous section, we walked through the core loop that turns a pretrained vision-language model into a game-playing agent under PPO. That loop produces sequences of text actions, computes returns from sparse victory-or-defeat signals, and updates the model with a carefully adapted objective. But buried inside that loop is a design tension that can make or break training on long-horizon tasks: how do we compute the advantage, the signal that tells the policy which specific moves pushed the outcome above or below expectation? The answer splits neatly into two families: critic-free methods that compare raw Monte Carlo returns against a simple, state-independent baseline, and critic-based methods that learn a value function on the side. In long-horizon VLM control, leaning too heavily on the critic-free path leads to failure modes that a lightweight turn-level critic resolves cleanly.
To see why, recall that a VLM playing, say, a multi‑room puzzle game might generate 50 to 200 discrete text actions before a single reward arrives. A critic-free advantage estimate would look something like  
$$\hat{A}_t = R_t - b,$$
where $R_t$ is the total discounted return from step $t$ onward and $b$ is a baseline (often a moving average of past returns). Because the VLM’s action space is open-vocabulary and the environment dynamics are highly stochastic, each episode’s return is a random variable with enormous variance. The Monte Carlo return $R_t$ inherits all the noise accumulated over dozens of steps, including exploration noise, irrelevant good luck later in the trajectory, and the inherent randomness of the game itself. Subtracting a static baseline barely dents that variance. The result is a gradient estimate that swings wildly from update to update, making learning slow and fragile. Worse, when the reward is strictly zero-one (win/lose), most episodes produce the same return, and the handful that differ create a sparse, high-variance signal that is entirely incommensurate with the per-action credit-assignment problem.
This high variance manifests as two concrete failure modes. Failure mode 1: credit misattribution. When a lucky chain of mediocre moves happens to end in victory, the critic-free advantage treats every action in the episode as positively contributing. The VLM receives reinforcing gradients for actions that were, at best, harmless. It soon over‑commits to local patterns that correlate with random success rather than genuine skill, and performance plateaus or collapses once the luck runs out. Failure mode 2: catastrophic unlearning. Because the advantage is computed from a single noisy roll‑out and no value prediction constrains what the VLM should expect from a state, the policy can drift catastrophically within a handful of batches. A suddenly high return on a mediocre trajectory can heavily skew the next policy update, erasing previous progress. In long‑horizon games where each exploratory episode takes real world‑time (or expensive GPU time), this instability forces a painfully small learning rate, and the project may never converge within a reasonable compute budget.
These failure modes naturally motivate a turn-level CNN critic. The idea is to attach a small, separate convolutional network that takes the raw visual observation from the game and outputs a scalar value estimate $V(s_t)$. This critic is trained with a simple mean-squared-error regression toward the empirical return:
$$\mathcal{L}_{\text{critic}} = \frac{1}{N} \sum_{t} \bigl( V(s_t) - R_t \bigr)^2.$$
Crucially, the critic uses only the state—not the VLM’s token stream—so it can be kept extremely lightweight. A few convolutional layers followed by a linear head run at a fraction of the cost of the VLM’s forward pass, yet they capture enough of the visual structure to predict expected future rewards reliably. This small critic becomes a learned baseline, and the advantage is computed as  
$$\hat{A}_t = R_t - V(s_t),$$
or, for even lower variance, with Generalized Advantage Estimation (GAE) over the critic’s own temporal-difference residuals. By conditioning the baseline on the actual state, the critic directly absorbs the portion of return variance that is predictable from the observation—the context, the agent’s location, visible enemies, inventory, etc. The remaining advantage noise is therefore drastically smaller, and the policy updates become more focused.
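For reference, Generalized Advantage Estimation can be sketched as a backward recursion over the critic's temporal-difference residuals $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The values of `gamma` and `lam` below are common defaults and an assumption here, as is treating the value after the final turn as zero.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 0.99, lam: float = 0.95,
                   last_value: float = 0.0) -> List[float]:
    """A_t = sum_k (gamma * lam)^k * delta_{t+k}, computed backwards in time."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value  # value after the final turn (0 for a finished episode)
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.5, 0.8]  # hypothetical critic predictions
monte_carlo_like = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
one_step_td = gae_advantages(rewards, values, gamma=1.0, lam=0.0)
print(monte_carlo_like, one_step_td)
```

Setting `lam=1.0` recovers the Monte Carlo advantage $R_t - V(s_t)$, while `lam=0.0` uses only the one-step residual; intermediate values trade variance against reliance on the critic's accuracy.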
But there is a second subtlety that makes the critic particularly effective for VLM policies: positive‑advantage filtering. Because the VLM already starts from a pretrained understanding of language and visual grounding, its initial actions are rarely uniformly random. Many of its plausible outputs lie near semantic “null actions”—utterances that do not change the game state meaningfully but are still grammatical. A standard PPO update would apply both positive and negative advantage signals, pushing the VLM away from neutral or harmless actions just because they coincided with a poor outcome elsewhere in the episode. Positive‑advantage filtering eliminates this destructive learning: we mask out updates for any action whose advantage is below zero. In practice, we only let the policy reinforce moves that the critic judged to be better than expected, and ignore moves that were worse. This asymmetry preserves the VLM’s pretrained language priors where the critic sees no strong signal, while still driving improvement on the critical few decisions that unlock progress.
The visual below summarizes this contrast. On the left, the critic-free regime shows a single return block propagating high-variance reward directly into the policy update, with arrows that suggest noisy gradients and the risk of credit misattribution across a long chain of actions. On the right, the critic-based regime inserts a compact CNN block that reads the visual state and produces a learned value $V(s)$. The flow then follows a two-stage logic: the critic delivers a low-variance baseline that grounds the advantage computation, and a positive-advantage gate lets only superior actions update the VLM. The diagram visually isolates the problem (critic-free methods accumulate noise) and the solution (a lightweight critic and a simple filtering rule convert an unstable long-horizon optimization into a stable training recipe that respects the VLM’s pretrained knowledge).

9. VLM-based RL vs

Building on the previous investigation of critic‑free instability, we now examine a more fundamental question: how much does a pretrained vision‑language model (VLM) improve the sample complexity and action‑space design burden in long‑horizon game decision‑making? The answer turns out to be dramatic—roughly a factor of two in environment steps and the complete elimination of hand‑crafted action discretizations. This section contrasts a VLM‑based policy (the Odysseus agent, using a vision‑language backbone with a turn‑level PPO critic and positive‑advantage filtering) against a classical CNN‑based policy trained from scratch with PPO on the same environment. The comparison exposes both the brittleness of naive deep RL on large action spaces and the unique efficiency of language‑grounded visual priors.
The game environment, a long‑horizon Minecraft‑style task, presents a raw action space of 22 discrete key‑and‑mouse combinations. Classical deep RL with a CNN policy and PPO utterly fails when trained directly on this full set: the learning signal is too sparse, the exploration too difficult, and the policy never leaves its starting region. The response, in standard RL fashion, is to engineer a reduced, more manageable action space. By hand‑selecting 8 semantically meaningful combos (move forward, turn, attack, etc.), the classical agent finally begins to learn. Yet even then, progress is painfully slow. The agent requires about twice as many environment samples to match the forward progress achieved by the VLM agent using the full 22‑combo set. 
Why does the VLM succeed where hand-crafting cannot? The pretrained vision-language model $\pi_\theta$ embodies a rich set of visual semantics and commonsense priors: it already knows what an open door looks like, which actions tend to align with visual goals like navigation or interaction, and how language instructions map to spatial behaviors. This dramatically reduces the exploration burden: the VLM is not discovering the meaning of pixels from scratch; it is recognizing its situation and selecting plausible actions from a library it already internalized. In effect, the VLM prior substitutes for the expensive search that manual action engineering only partially alleviates.
Both agents share a common RL scaffolding: they use a turn-level PPO algorithm with a learned critic $V_\phi$ to estimate the advantage $\tilde{A}_t$ at each decision turn. However, the VLM agent additionally employs positive-advantage filtering, keeping only advantages where $\tilde{A}_t > 0$ and effectively zeroing out negative ones (using $\max(\tilde{A}_t, 0)$). This simple filter further stabilizes training by preventing the VLM from unlearning useful priors when PPO would otherwise push down the probability of an action that merely coincided with a momentarily poor outcome. The classical CNN-based agent does not benefit from such filtering; it relies wholly on the reduced action set to regulate gradient variance.
The practical upshot is a two‑pronged qualitative advantage:
Elimination of manual action‑space design. The VLM directly consumes the 22‑combo raw space, removing the need for an RL engineer to pre‑select a handful of “good” actions.  
Approximately 2× sample efficiency gains. As shown in the progression curves, the VLM reaches a comparable average forward progress $(x_T - x_0) \approx 0.8$ in half the environment steps required by the best classical setup.
The accompanying Figure 4b distills these findings into a clean line plot. The x‑axis tracks total environment steps in millions, the y‑axis the average progress. A solid blue curve for the VLM agent (full 22‑combo set, positive‑advantage filtering) ascends steeply and levels off around a progress value of 0.8. An orange dashed curve for the classical CNN policy with 8 engineered actions rises far more sluggishly, taking roughly twice as many steps to approach the same performance plateau. A flat gray line near zero labeled “Classical RL (22 combos, no engineering)” marks the complete failure mode without action reduction. An arrow annotation explicitly calls out the ~2× sample gap, and the legend at top‑left keeps the comparison clear. The dull grid lines keep focus on the curves themselves, letting the eye immediately read the central story: VLM priors, coupled with a stable critic and a simple advantage filter, not only unlock the full action space but also slash the required environment interactions by half.

10. Odysseus Pipeline: Lightweight SFT Initialization

The previous sections have painted a stark picture: applying critic‑free RL directly to a small vision‑language model for long‑horizon game play is like asking a student to solve calculus before they can read. The policy gradient signal is too noisy, the credit assignment across hundreds of steps is too diffuse, and the model’s latent visual representations are simply not yet rich enough to support the structured chain‑of‑thought that Odysseus demands. The solution begins not with a better RL algorithm, but with a carefully designed warm‑up: a lightweight Supervised Fine‑Tuning (SFT) stage that teaches the VLM to see and reason in the game domain, without yet asking it to act optimally.
This SFT stage is built from a modest but information-dense dataset. The Odysseus team recorded walkthrough videos of human-like play across 10 different game levels, then sampled approximately 5,000 representative frames from those recordings. Each frame was paired not just with an action label, but with a full chain-of-thought annotation generated by a powerful teacher model, GPT-o3. The teacher’s output follows a disciplined three-part structure: a ⟨perception⟩ block describing what is on the screen, a ⟨reasoning⟩ block that interprets the game state and plans the next move, and an ⟨answer⟩ block that selects the discrete action. This yields a supervised dataset $\mathcal{D}_{\text{SFT}} = \{(o, y_{\text{teacher}})\}$ where each observation $o$ is labeled with the rich textual reasoning sequence, not merely with a class label.
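As an illustration of how such three-part annotations might be validated before training, the sketch below parses one teacher output. The XML-like tag syntax, the function name, and the example text are all assumptions made for this sketch; the exact serialization used by the Odysseus team is not given here.

```python
import re
from typing import Dict, Optional

SECTIONS = ("perception", "reasoning", "answer")


def parse_annotation(text: str) -> Optional[Dict[str, str]]:
    """Return the three blocks of a teacher annotation, or None if any is missing."""
    parsed = {}
    for name in SECTIONS:
        # Assumed format: <name> ... </name>, possibly spanning several lines.
        match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
        if match is None:
            return None
        parsed[name] = match.group(1).strip()
    return parsed


# Hypothetical annotation, written only for this example.
example = """
<perception>Mario stands left of a gap; an enemy approaches from the right.</perception>
<reasoning>Jumping now clears the gap before the enemy arrives.</reasoning>
<answer>jump</answer>
"""
record = parse_annotation(example)
print(record["answer"])
```

A check like this would let malformed teacher outputs be dropped before they enter the supervised dataset.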
The base VLM, Qwen3‑VL‑8B‑Instruct, is then fine‑tuned on this dataset using the standard cross‑entropy objective:
$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(o,y) \sim \mathcal{D}_{\text{SFT}}} \log \pi_\theta(y \mid o).$$
Notice what this loss does: it encourages the model to reproduce the teacher’s full chain‑of‑thought, token by token, given the visual observation. By backpropagating through the language‑modeling head and the vision encoder, the model learns to ground its visual features in the game’s entities, layouts, and affordances, while also absorbing the causal reasoning patterns typical of the domain. Crucially, the loss treats the answer tokens as just another part of the output sequence; it does not explicitly optimize for decision accuracy over a long horizon. There is no environment reward signal, no advantage estimation, and no exploration.
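Written out token by token, under the usual autoregressive factorization, the same loss reads as follows. The symbols $y_k$ (the $k$-th token of the teacher sequence), $y_{<k}$ (the tokens before it), and $|y|$ (the sequence length) are introduced here for illustration and do not appear in the original formula.

```latex
\log \pi_\theta(y \mid o) = \sum_{k=1}^{|y|} \log \pi_\theta\bigl(y_k \mid y_{<k},\, o\bigr),
\qquad
\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(o,y) \sim \mathcal{D}_{\text{SFT}}}
  \sum_{k=1}^{|y|} \log \pi_\theta\bigl(y_k \mid y_{<k},\, o\bigr).
```

Because the answer tokens are typically only a few of the $|y|$ terms in the sum, most of the gradient comes from the perception and reasoning text, which is consistent with the stage's focus on grounding rather than action optimality.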
The beauty of this SFT step lies in its deliberate lightweight nature: it uses only a few thousand frames, a single frozen teacher, and standard autoregressive training. The objective is not to teach the model which action to pick in every situation—that is left to the subsequent RL stage with its carefully constructed CNN critic. Instead, SFT gives the VLM a reliable perceptual vocabulary and a rudimentary strategic intuition. After SFT, the model already “knows” that a red potion restores health, that an approaching shadow indicates an enemy, and that a locked door sometimes requires a key hidden in an adjacent room. It lacks, however, the fine‑grained optimization needed to chain dozens of correct actions together under resource constraints and stochasticity.
From an RL perspective, this SFT‑initialized model is a dramatically better starting point than a raw pretrained VLM. The policy no longer wastes millions of environment steps just learning to parse the pixels or to output structurally valid chain‑of‑thought. Instead, PPO can immediately focus on refining action quality, with the critic providing stable value estimates and the positive‑advantage filter preventing destructive updates. The SFT loss also implicitly shapes the early policy distribution: it prevents the model from drifting into degenerate language modes (e.g., repetitive or nonsensical reasoning) during the first RL updates, a common failure mode of critic‑free approaches.
The accompanying diagram turns this sequential narrative into an intuitive visual pipeline. From left to right, it traces the data journey: walkthrough videos are sampled into a pool of frames (colored blue for observation data), which are fed to the GPT-o3 teacher model (orange) that outputs the structured chain-of-thought. The paired observations and teacher texts then flow into a green-tinted SFT block, which optimizes the base Qwen3-VL-8B-Instruct using cross-entropy loss, ultimately producing the fine-tuned VLM $\pi_{\theta}^{\text{init}}$. A callout beneath the output emphasizes the scope of what this stage does (perception and game knowledge, not action optimality), encapsulating the deliberately narrow remit of the SFT before the heavy lifting of RL begins. The color-coded separation of data, teacher, and training process makes it instantly clear that SFT is a distillation of human-like reasoning, not a prescription for optimal sequential choices.
This pipeline choice embodies a recurring lesson in deep RL for language‑based agents: separate representation learning from control learning, especially when the control signal is scarce and delayed. By investing a few thousand teacher‑labeled frames into SFT, Odysseus buys a stability margin that pays compound interest during the subsequent RL training. In the next section, we will see how the RL stage itself is further stabilized by an auto‑curriculum strategy that dynamically reweights training tasks, ensuring that the agent never plateaus on simple levels nor gets stuck on impossibly hard ones.

11. Auto-Curriculum via Inverse Trajectory Weighting

After initializing the VLM agent with lightweight supervised fine-tuning, we turn to reinforcement learning to push its decision-making further. The training environment in Odysseus comprises a collection of $K$ distinct game levels, each presenting its own dynamics and strategic demands. A natural first impulse is to sample these levels uniformly when constructing rollout batches for PPO. Yet this seemingly neutral strategy hides a pernicious bias: not all levels are equally difficult, and difficulty manifests not just in success rate but in the length of the resulting trajectories. Easier levels allow the agent to survive for many turns before the episode terminates, while harder levels often end quickly in failure. Under uniform sampling, a batch of fixed total size will therefore contain vastly more turns from the easy levels than from the hard ones. Because the PPO loss is summed or averaged over all timesteps, the easy levels dominate the gradient signal, effectively starving the agent of practice on the challenges that need it most.
This imbalance is especially damaging in long‑horizon tasks where the VLM must learn to chain many correct actions. If the agent rarely encounters a difficult early‑game situation because those episodes are short, it never learns to escape it; meanwhile it refines already‑mastered easy stages to the point of overfitting. The consequence is a policy that looks polished on simple levels but collapses under the slightest pressure. Uniform sampling thus undermines the whole purpose of multi‑task training: to produce a single agent that is robust across the full difficulty spectrum. The problem is structural, not cosmetic, and it calls for an automatic mechanism that adapts the training distribution to the agent’s current capabilities without manual reward tuning or hand‑picked curricula.
The Odysseus framework addresses this with an auto‑curriculum driven by inverse trajectory weighting. After each batch of rollouts, we compute a simple statistic for every level $k$: the average trajectory length
$$N_k = \frac{1}{M_k}\sum_{m=1}^{M_k} \ell(\tau_{k,m}),$$
where $M_k$ is the number of episodes collected from level $k$ in the batch, and $\ell(\tau)$ counts the number of turns (actions taken) in trajectory $\tau$. This number $N_k$ acts as a dynamic proxy for the current difficulty experienced by the agent on that level: easier levels produce larger $N_k$, harder levels smaller $N_k$.
The core insight is that we want each level to contribute roughly equally to the total number of training turns in the next batch, so that the agent does not coast on its strengths. To achieve this, we set the sampling weight for level $k$ proportional to the inverse of its average length:
$$\omega_k = \frac{1/N_k}{\sum_{j=1}^{K} 1/N_j}.$$
With these normalized weights, the next batch is constructed by sampling levels according to $\{\omega_k\}$. Levels with shorter trajectories (harder) receive higher sampling probabilities, while those with long, easy rollouts are down‑weighted. The mechanism is purely data‑driven, requiring no prior knowledge about level difficulty; it simply observes what the agent is experiencing and rebalances accordingly.
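The two formulas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the episode data, level names, and function names are hypothetical, and every level is assumed to appear at least once in the batch with a positive length.

```python
from collections import defaultdict


def average_lengths(episodes):
    """episodes: iterable of (level_id, num_turns) pairs.
    Returns N_k, the mean trajectory length per level."""
    totals, counts = defaultdict(int), defaultdict(int)
    for level, turns in episodes:
        totals[level] += turns
        counts[level] += 1
    return {level: totals[level] / counts[level] for level in totals}


def inverse_length_weights(avg_len):
    """omega_k = (1 / N_k) / sum_j (1 / N_j)."""
    inverse = {level: 1.0 / n for level, n in avg_len.items()}
    normalizer = sum(inverse.values())
    return {level: v / normalizer for level, v in inverse.items()}


# Hypothetical rollout batch: (level, turns survived).
episodes = [("easy", 110), ("easy", 130), ("medium", 60), ("hard", 15), ("hard", 25)]

N = average_lengths(episodes)
omega = inverse_length_weights(N)

# Expected turns contributed per sampled episode: omega_k * N_k.
expected_turns = {level: omega[level] * N[level] for level in N}
print(omega)
print(expected_turns)
```

In this toy batch the product of sampling weight and average length is the same for every level, which is the balancing property the weighting is meant to achieve: each level is expected to contribute an equal number of turns to the next batch.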
This auto‑curriculum is applied continuously throughout training. As the agent improves, the lengths $N_k$ shift, and the weights $\omega_k$ adapt. An initially hard level may become easier for the VLM over time, leading to longer trajectories and a corresponding decrease in its sampling probability. Conversely, if the agent forgets or a new challenge emerges, the weight rises again. The process thus implements a natural, self‑correcting curriculum that keeps the training signal diverse and prevents any single level from monopolizing the optimizer’s attention. It works in concert with the stable PPO updates described earlier, ensuring that the advantages computed by the lightweight CNN critic are grounded in a balanced replay of experiences.
Crucially, inverse trajectory weighting avoids the pitfall of naïve weighting schemes that might, for instance, simply reweight by the inverse of success rate. A level could be hard yet still yield long episodes if the agent gets stuck in loops rather than failing instantly; trajectory length captures both the time‑cost and the effective learning signal density—shorter episodes concentrate the agent’s mistakes into fewer timesteps, making them information‑rich. The scheme also implicitly handles the fact that episode length and return are correlated in game‑like environments: longer survival usually means higher reward, so down‑weighting long trajectories prevents reward‑hacking behaviors that might stall training on easy stages.
The visual below distills this adaptive loop into a compact flowchart. On the left, a rollout batch groups trajectories by level, with easy (long) runs shown in blue and hard (short) runs in red. For each level, an average length $N_k$ is computed; this is the key statistic that feeds into the weight update. The central block then applies the inverse formula $\omega_k \propto 1/N_k$ and normalizes across all levels. The resulting sampling probabilities drive the construction of the next batch, where the harder levels appear more frequently, and the cycle repeats. Color coding and arrow flow make it clear that the curriculum is a closed‑loop feedback process: the agent’s own experience reshapes its future training data, balancing the multi‑task landscape without any external intervention.

12. Odysseus Training Results: Outperforming Frontier Models

After establishing an auto-curriculum that balances task exposure by inverse trajectory weighting, the central question becomes whether the full Odysseus training recipe actually translates into superior performance on the long-horizon game tasks themselves. The training setup combines a lightweight initial supervised fine‑tuning phase (Section 10) that improves visual grounding without altering decision behavior, followed by a stable PPO loop augmented with a turn‑level CNN critic and positive‑advantage filtering to overcome the critic‑free RL failure modes identified earlier. It is one thing to design these components in isolation; it is another to see them cohere into a policy that reasons over hundreds of turns without collapsing.
The headline result is striking: Odysseus achieves an average progress of 1512 across the suite of procedurally generated games. To appreciate the magnitude, one must compare against both the base vision‑language model and the strongest publicly available frontier models. The base model, Qwen3‑VL‑8B‑Instruct, tops out at an average progress of only 270 when evaluated under the same protocol. That means Odysseus improves upon its own backbone by a factor of
$$\frac{1512}{270} \approx 5.6\times,$$
a leap that signals not marginal tuning but a qualitative shift in the agent’s ability to chain subtasks and recover from errors. Even the best frontier model tested, GLM‑4.6V, reaches just 513, while GPT‑5.4 trails at 310. Against the strongest frontier competitor, Odysseus delivers a
$$\frac{1512}{513} \approx 2.95\times$$
improvement. These numbers are not merely large in relative terms; they represent the difference between an agent that consistently reaches the later stages of the game and one that stalls early, often failing to navigate the first divergent branch.
What makes these gains credible is the care taken to isolate the contributions of each training phase. A simple baseline that applies only the lightweight SFT step (Odysseus‑SFT) achieves an average progress of 262—virtually indistinguishable from the base model’s 270 and actually a slight regression. This confirms our earlier hypothesis: SFT alone can sharpen perceptual alignment, but it does not imbue the model with the long‑horizon credit assignment required to act purposefully across dozens of steps. The policy is still making blind guesses with better‑looking inputs, and without reward‑guided optimization it cannot learn to chain actions that only pay off in the distant future.
Removing the SFT stage entirely and applying the RL pipeline directly to the base model (Odysseus‑Zero) yields an average progress of 1355, already a 5.0× improvement over the base model and a 2.64× gain over the best frontier model. This confirms that the RL‑centric design, particularly the stable CNN critic and positive‑advantage filtering, can autonomously discover temporally extended strategies. Yet the gap between Odysseus‑Zero (1355) and full Odysseus (1512) reveals a synergy: the initial SFT phase provides a perception‑aligned initialization that makes the RL optimization landscape more benign, enabling the agent to explore more efficiently and converge to an even higher plateau. The auto‑curriculum further ensures that the agent’s improving competence is continuously challenged rather than wasted on tasks already mastered, creating a virtuous cycle that the pure RL setting cannot replicate as rapidly.
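The ratios quoted in this section follow directly from the reported progress scores. The short script below recomputes them; the dictionary keys are just labels chosen here for illustration, while the numbers are the ones stated in the text.

```python
# Average progress scores as reported in this section.
progress = {
    "Odysseus": 1512,
    "Odysseus-Zero": 1355,
    "GLM-4.6V": 513,
    "GPT-5.4": 310,
    "Base (Qwen3-VL-8B-Instruct)": 270,
    "Odysseus-SFT": 262,
}


def ratio(model, reference):
    """How many times higher `model` scores than `reference`."""
    return progress[model] / progress[reference]


base = "Base (Qwen3-VL-8B-Instruct)"
comparisons = [
    ("Odysseus", base),
    ("Odysseus", "GLM-4.6V"),
    ("Odysseus-Zero", base),
    ("Odysseus-Zero", "GLM-4.6V"),
    ("Odysseus", "Odysseus-Zero"),
]
for model, reference in comparisons:
    print(f"{model} vs {reference}: {ratio(model, reference):.2f}x")
```

The last comparison isolates what the SFT initialization adds on top of RL alone, which is modest next to the gain that RL itself delivers over the base model.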
The visual below compacts these comparisons into a single glance. A clean table places the models in descending order of performance, with the Odysseus row distinctly emphasised to draw attention to the leading 1512. The ablation rows are presented with an italicised, indented style that visually separates them as controlled experiments while preserving the logical flow. Directly beneath the table, short bullet points reinforce the three key takeaways: the 5.6× and 2.95× improvements, the ineffectiveness of SFT alone, and the complementary benefit of combining SFT with RL. This arrangement transforms a dense set of numbers into a hierarchy of evidence, making it immediately clear that the full Odysseus pipeline does not merely match hand‑coded heuristics or off‑the‑shelf large models but fundamentally redefines what is possible for a vision‑language‑model agent in a long‑horizon interactive environment.

13. Generalization and Capability Retention

The remarkable in-distribution performance reported in the previous section – where Odysseus surpasses even the most capable frontier models – raises an immediate and more consequential question: have we simply taught a VLM to follow a brittle script inside a narrow set of training levels, or have we instilled a genuinely transferable decision-making competence? In any practical deployment of an RL-trained agent, out-of-distribution robustness is the true litmus test. This section dissects the generalization behavior of Odysseus along three carefully designed axes, while also verifying that the RL fine-tuning does not come at the cost of the base model’s rich multimodal understanding. The results reveal that the training pipeline achieves a rare equilibrium: strong improvement on the target task without overfitting, and without eroding the general knowledge that makes large VLMs invaluable in the first place.
Three generalization regimes are evaluated, each increasing in difficulty and distance from the training distribution. The first, off‑policy states, probes the agent’s behavior when the game is paused and resumed from states generated by an entirely different policy – states the VLM never visited during its own rollouts. Because the training curriculum uses PPO with a positive-advantage filter, the critic learns a value landscape that generalizes beyond the trajectories sampled by the current actor. Off‑policy evaluation thus tests whether the value function and the policy can handle the natural distribution shift that arises when a human or another agent intervenes, or when online exploration drifts. On ten held-out states per training level, Odysseus improves over the base VLM by an average of 32.2%. This indicates that the learned strategies are not fragile memorizations of specific state-action mappings, but rather encode robust, state-aware heuristics.
The second axis, unseen levels, removes entire levels from the training set – new geometry, new object layouts, new puzzles. An agent that overfits to level-specific patterns would collapse here. Instead, Odysseus achieves a 41.5% improvement compared to the zero-shot performance of the base VLM. This is a striking gain: it means the RL training has cultivated a level-agnostic grasp of the game’s underlying mechanics. The lightweight turn-level CNN critic, trained on visual observations from the VLM’s own embeddings, appears to abstract spatial relationships and causal structures that transfer seamlessly to novel configurations. This kind of structural generalization is the hallmark of a policy that has internalized the logical constraints of the environment, rather than merely mimicking observed sequences.
The most ambitious generalization test is cross‑game transfer. The team took an Odysseus policy trained exclusively on one game and evaluated it, without any further fine‑tuning, on Super Mario Bros. – a classic side-scrolling platformer with entirely different dynamics, objectives, and visual aesthetic. On 32 unseen Mario levels, the agent improved average progress by 23.1% over the base VLM. While this is a lower absolute gain than within the original game, the very existence of a positive transfer is remarkable: the VLM’s visual chain-of-thought reasoning, combined with the critic-guided action selection, discovers strategies (like forward exploration and obstacle avoidance) that transcend the original training domain. Such cross-game generalization hints that RL training can bootstrap domain-transferable planning skills, not just task-specific reflexes.
As impressive as these generalization gains are, a separate and equally critical concern is capability retention – the agent must not lose its general multimodal intelligence while acquiring embodied control. Catastrophic forgetting is a well-known pathology when fine-tuning large pretrained models with sequential optimization. The authors monitored Odysseus on three challenging multimodal benchmarks: MMMU (massive multi-discipline multimodal understanding), MathVision (visual math reasoning), and RealWorldQA (grounded question answering about real-world images). The scores tell a clear story: MMMU drops imperceptibly from 68.2 to 67.9, MathVision from 54.1 to 53.8, and RealWorldQA from 72.4 to 71.8. These differences are well within the noise of benchmark evaluation and confirm that RL training for decision-making is essentially orthogonal to the knowledge encoded during pretraining. The VLM’s semantic comprehension, reasoning, and world knowledge remain intact.
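The size of these benchmark shifts is easy to check by hand. The snippet below recomputes the absolute and relative drops from the scores quoted above; the function name is a label chosen here, and no new data is introduced.

```python
# (score before RL, score after RL) as quoted in the text.
scores = {
    "MMMU": (68.2, 67.9),
    "MathVision": (54.1, 53.8),
    "RealWorldQA": (72.4, 71.8),
}


def retention_report(before, after):
    """Return (absolute drop in points, relative drop in percent)."""
    drop = before - after
    return drop, 100.0 * drop / before


report = {name: retention_report(b, a) for name, (b, a) in scores.items()}
for name, (drop, rel) in report.items():
    print(f"{name}: -{drop:.1f} points ({rel:.2f}% relative)")
```

Every benchmark moves by less than one point and by less than one percent of its original score.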
The visual below consolidates these findings into a side-by-side comparison that makes the dual success of the training pipeline immediately legible. On the left, a grouped bar chart contrasts the base model (blue bars) with Odysseus (green bars) across the three generalization tasks, with the percentage improvements prominently labeled. The bars for off‑policy states, unseen levels, and cross‑game transfer all show a significant green lift, reinforcing that the agent has not merely overfit to its training environments. On the right, a compact table lists the multimodal benchmark names and the nearly identical scores of the two models, with a caption underscoring the central message: “RL for decision-making leaves general VLM abilities intact.” Together, the chart and table transform abstract numbers into a single, unambiguous takeaway: the Odysseus training recipe produces agents that are both more skillful and still broadly intelligent.

14. Summary: Key Ingredients for Stable Long-Horizon VLM RL

Stepping back from the empirical results on generalization and capability retention, it becomes clear that the remarkable stability of Odysseus in long-horizon game tasks is not a fortuitous property of one design choice, but the outcome of four carefully engineered components working in concert. Each piece addresses a distinct failure mode that plagues critic‑free reinforcement learning when applied to vision‑language models (VLMs) operating over hundreds of turns. To appreciate the synthesis, it helps to recall the central challenge: VLMs produce token‑level autoregressive decisions that are embedded in a high‑dimensional, partially observable environment, making return estimation extremely noisy. Without a reliable baseline, policy gradients explode or vanish, and the agent quickly forgets its pretrained capabilities. These are exactly the problems that motivated the earlier sections on critic‑free failures and SFT‑only brittleness.
The first ingredient, a turn‑level CNN critic $V_\phi(o_t^{\text{image}})$, directly compensates for this variance. Instead of trying to estimate value from the VLM’s internal state or token‑level features, Odysseus trains a tiny 8‑million‑parameter convolutional network that ingests only the visual observation at each turn. This critic produces a scalar value estimate used to compute a per‑turn advantage $\tilde{A}_t = \hat{R}_t - V_\phi(o_t)$, where $\hat{R}_t$ is the return measured from that turn onward. Because the critic operates on a fixed‑size visual signal and is independent of the language stream, it provides a stable, low‑variance baseline that is easy to train and does not interfere with the VLM’s representation learning. The small size also means the critic can be updated with standard regression losses without dominating the compute budget, effectively decoupling value prediction from policy improvement. In effect, the VLM is free to focus on generating good action tokens while the critic silently absorbs the credit assignment noise that would otherwise corrupt the policy gradient.
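To make the per‑turn advantage concrete, the sketch below computes returns‑to‑go and subtracts a baseline. It is a minimal illustration, not the paper's code: the CNN critic is replaced by a hard‑coded list of value predictions, the rewards are made up, and the discount factor `gamma` (defaulting to 1.0, an undiscounted sum) is an assumption, since the text does not specify one.

```python
def returns_to_go(rewards, gamma=1.0):
    """R_hat_t: (optionally discounted) sum of rewards from turn t onward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def turn_advantages(returns, values):
    """A_tilde_t = R_hat_t - V(o_t), one scalar per turn."""
    return [r - v for r, v in zip(returns, values)]


def critic_mse(returns, values):
    """Standard regression loss for fitting the critic to observed returns."""
    return sum((r - v) ** 2 for r, v in zip(returns, values)) / len(returns)


# Hypothetical 4-turn episode: per-turn rewards and stand-in critic outputs.
rewards = [1.0, 0.0, 2.0, 0.0]
values = [2.5, 2.5, 1.0, 0.5]

returns = returns_to_go(rewards)
advantages = turn_advantages(returns, values)
critic_loss = critic_mse(returns, values)
print(returns, advantages, critic_loss)
```

Note that the advantage is computed once per turn rather than once per generated token, so every token of an action inherits the same scalar signal.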
Building on the reliable advantage signal, positive‑advantage filtering addresses a subtler problem: even with a good critic, the raw advantage can be negative for many turns, especially in long trajectories where most decisions lead to ordinary progress rather than decisive gains. Naively including these negative advantages in a PPO‑style objective can cause the policy to actively unlearn reasonable behaviors, as the gradient pushes the model away from actions whose advantage is only slightly below zero. Odysseus solves this by applying the transformation $\hat{A}_t = \frac{\max(\tilde{A}_t, 0)}{\sigma(\{\tilde{A}_{t'}:\tilde{A}_{t'} > 0\})}$: it clips all negative advantages to zero and then divides each positive advantage by the standard deviation of all positive advantages in the mini‑batch. This normalization keeps the scale of positive gradients consistent across turns and episodes, while completely ignoring turns where the agent performed worse than the critic’s baseline. The intuition is that the VLM already has a strong pretrained prior; we only want to reinforce when it does something exceptionally good, not penalize it for every ordinary or below‑average move. This stabilizes training, prevents gradient explosion, and dramatically reduces forgetting of general capabilities.
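The filtering rule above is short enough to sketch directly. The version below is an illustrative reading of the formula, not the authors' implementation: the use of the population standard deviation, the fallback to a scale of 1.0 when fewer than two distinct positive advantages exist, and the function name are all assumptions made here.

```python
import statistics


def positive_advantage_filter(raw_advantages, eps=1e-8):
    """Zero out negative advantages, then divide by the standard deviation
    of the strictly positive ones in the mini-batch."""
    positives = [a for a in raw_advantages if a > 0]
    # Assumption: population std; the text does not say which estimator is used.
    sigma = statistics.pstdev(positives) if len(positives) >= 2 else 0.0
    # Assumption: skip normalization when sigma is degenerate to avoid dividing by ~0.
    scale = sigma if sigma > eps else 1.0
    return [max(a, 0.0) / scale for a in raw_advantages]


# Hypothetical raw per-turn advantages from one mini-batch.
raw = [0.5, -0.5, 1.0, -0.5, 1.5]
filtered = positive_advantage_filter(raw)
print(filtered)
```

Turns with negative raw advantage end up with exactly zero weight, so they contribute nothing to the policy gradient, while the surviving positive advantages have unit standard deviation.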
The third pillar, lightweight SFT initialization, is subtle but essential. Before any RL, the VLM is fine‑tuned on a mere $\sim 5{,}000$ frames of expert gameplay, but critically, this supervised fine‑tuning is restricted to perception and state description in a chain‑of‑thought format; it does not ask the model to output control actions. The objective is to inject strong visual grounding into the VLM’s internal representations without locking in any particular policy. Starting RL from a model that can already recognize game elements, read UI text, and reason about spatial relationships means the policy search begins in a region of the weight space where basic competence is guaranteed. The small dataset size ensures that the SFT does not over‑fit the model to mimicking the expert’s decisions, which would be detrimental in long‑horizon tasks where the optimal strategy may differ from demonstrations. Instead, the VLM enters RL with a rich perceptual scaffold, enabling the critic and policy to quickly align on meaningful features rather than wasting samples on learning to see.
Finally, the auto‑curriculum $\omega_k \propto 1/N_k$ balances the multi‑level training regime. In a game with many levels of varying difficulty, naive uniform sampling of levels would bias the policy toward easy levels, whose long trajectories contribute most of the turns in a batch, leading to neglect of harder levels where episodes end quickly. By weighting each level inversely to its average trajectory length, the auto‑curriculum ensures that the agent spends proportionally more effort on the challenging levels that form the eventual benchmark. The weighting is recomputed periodically from the agent’s own current performance, so as the policy improves on a level and its trajectories lengthen, that level’s weight falls. The result is a dynamic curriculum that prevents catastrophic forgetting of difficult skills while still allowing mastery of easier ones. This mechanism acts as a load balancer, making the overall RL update distribution representative of the true difficulty spectrum and thereby enabling the policy to generalize across levels without hand‑tuned scheduling.
Together, these four ingredients create a positive feedback loop: the critic provides a clean advantage signal, the filtering focuses updates on beneficial decisions, the SFT initialization ensures the policy begins with usable percepts, and the auto‑curriculum keeps the entire level set in play. The result is stable RL training of a VLM for tasks exceeding 100 consecutive turns, with roughly a $3\times$ gain in game progress over the strongest frontier model and a $5.6\times$ gain over its own base model, all without sacrificing the general reasoning and chat capabilities the pretrained VLM started with.
The table that follows distills each component into a compact reference, listing its role and key effect alongside the equations that define its operation. Scanning from top to bottom, you can trace the flow of the Odysseus design: a lightweight vision‑only critic (component 1) computes low‑variance advantages using $\tilde{A}_t = \hat{R}_t - V_\phi(o_t)$; the positive‑advantage filter (2) then normalizes these to $\hat{A}_t = \frac{\max(\tilde{A}_t,0)}{\sigma(\{\tilde{A}_{t'}:\tilde{A}_{t'}>0\})}$, cutting out harmful gradients; the SFT initialization (3) injects visual grounding without policy bias; and the auto‑curriculum (4) dynamically weights levels as $1/N_k$ to maintain a balanced training diet. The final row underscores the synergy: together they unlock long‑horizon RL for VLMs that would otherwise collapse. As we move into the concluding section, we will examine what this framework leaves unresolved and what open questions remain for building truly generalist game‑playing agents.

15. Conclusions and Open Questions

The previous section assembled the building blocks of stable long‑horizon RL for vision‑language model (VLM) agents: a lightweight turn‑level critic, a careful clipping of the advantage signal, a lightweight SFT initialization, and a training curriculum that keeps the agent balanced across tasks. Now we step back to consolidate the core insights that emerged from the entire Odysseus study and to outline the open questions that will shape the next generation of grounded, language‑conditional policy learning.
The central finding is that critic‑free REINFORCE‑style methods, which rely solely on Monte Carlo returns, collapse when the horizon stretches to hundreds of turns. Without a learned baseline, the variance of the return grows quadratically with the horizon, making it impossible to distinguish a good action from a lucky sequence. Odysseus replaces that fragile estimator with a turn‑level CNN critic $V_\phi$ that processes the same visual observations the VLM sees. This critic is cheap to train (only a few convolutional layers on top of a frozen image encoder) and provides a state‑dependent baseline that dramatically reduces gradient variance. However, a critic alone is not enough. Even with an accurate baseline, standard PPO will suppress actions that produced a negative advantage, effectively un‑learning good behaviors that happened to occur in a bad rollout. The team therefore clamps negative advantage estimates to zero via positive‑advantage filtering: $\hat{A}_t \leftarrow \max(\hat{A}_t, 0)$. This simple rule transforms the policy gradient into a conservative, purely reinforcing update: the agent only changes its distribution toward actions that outperformed the critic’s expectation. No good action is ever penalized for bad luck, and the policy converges reliably even when early episodes are dominated by noise.
The second pillar is the prior already baked into the VLM. A frozen or fine‑tuned pretrained VLM $\pi_\theta$ brings strong semantic grounding: it knows what a “door,” “enemy,” or “key” looks like from internet‑scale pretraining, and it can reason about multi‑step goals expressed in natural language. When compared with a classical CNN‑only policy trained from scratch on the same frames, the VLM‑based agent achieves roughly twice the sample efficiency, measured by the number of environment interactions needed to reach a given success rate. This is not just a matter of better representations; the VLM also provides a functional action space of natural language commands (e.g., “jump right,” “pick up key,” “open door”), which is far more structured than a 4‑way discrete D‑Pad. The inductive bias of language acts as a scaffold, letting the policy land in a region of the policy space where reasonable sequences are already plausible.
The third major ingredient is the recipe that packages these components into a stable training loop: the Odysseus framework. It initializes the VLM via lightweight supervised fine‑tuning on a handful of successful trajectories, then deploys an auto‑curriculum that reweights levels inversely to their average trajectory length so that the agent always trains on a balanced mixture of mastered and still‑difficult levels. This automatic task scheduling prevents catastrophic forgetting and forces the agent to keep rehearsing easier skills while concentrating practice on its frontier. Across a suite of long‑horizon platformer tasks, Odysseus surpasses frontier closed‑source models (prompted zero‑shot) and strong supervised baselines by a factor of three to six in average game progress.
Despite these successes, the work leaves several critical challenges. The current advantage estimation operates at the turn level, aggregating all token predictions within a single action command into one scalar reward. That granularity misses the possibility of fine‑grained token‑level advantage estimation: a long action phrase like “walk to the ladder and climb” contains sub‑actions that may be beneficial or detrimental in different context windows. Assigning credit at the token level could unlock precise timing and richer multi‑step planning, but it introduces a much harder optimization landscape where the advantage of each token is conditioned on all preceding tokens. Equally pressing is the question of action‑space coverage. The discrete language‑command head works beautifully for platformers, but how do we extend the recipe to continuous control (e.g., steering and throttle) or hybrid spaces that combine symbolic commands with analog parameters? More fundamentally, the entire framework relies on dense progress‑oriented rewards—the agent receives a positive signal almost every turn—and it is unclear how to cope with sparse feedback that arrives only after hundreds of steps. New forms of intrinsic motivation or reward shaping that respect the VLM’s semantic knowledge become essential for even longer horizons.
Finally, the entire evaluation is confined to 2D platform environments. Generalizing beyond platformers—into rich 3D worlds and ultimately real‑world embodied tasks—introduces a step change in visual complexity, physical dynamics, and safety constraints. Whether the same combination of a turn‑level critic, positive‑advantage filtering, and a pretrained VLM can be ported directly, and whether it maintains its sample efficiency, is an open empirical question.
The visual below synthesizes these threads. It draws the stable closed‑loop RL cycle that makes Odysseus work: an observation flows into the VLM policy, which issues a language action; the environment returns a reward, and the turn‑level critic estimates the advantage. The positive‑advantage filter then gates the PPO update, ensuring that only beneficial moves are reinforced. On the right, a set of callout boxes frames the open questions around token‑level credit assignment, continuous control, sparse rewards, and broader generalization—less as a list of failures, and more as the research frontiers that the field must now address. Together, the loop and the questions capture both the practical recipe and the scientific gaps that define the state of long‑horizon VLM RL today.