World Action Models are Zero-shot Policies: The DreamZero Approach - FeynmanWiki

World Action Models are Zero-shot Policies: The DreamZero Approach

1. The Generalization Gap in Robot Manipulation

The past few years have seen a dramatic leap in robotic manipulation, driven largely by the integration of large pretrained vision–language models (VLMs) with robot action spaces. Models like RT‑2, GR00T N1, and π0.5 — collectively called Vision‑Language‑Action (VLA) models — encode images and language instructions into a shared semantic space and directly output discrete or continuous action tokens. This marriage of Internet‑scale visual and textual knowledge with embodiment seems to unlock a host of tasks that were previously out of reach. A robot can now be told “move the coke can to Taylor Swift” and, using its VLM priors, locate the can, recognize the iconic singer’s face on a poster, and execute a reaching‑and‑placing trajectory. The success here hinges on semantic understanding: language specifies what to manipulate and where to place it, and the robot’s pre‑trained visual backbone supplies the object recognition, spatial reasoning, and scene grounding necessary to translate that command into a standard sequence of grasps and moves.
Yet when the same robot is asked to untie a shoelace, shake hands, or peel a fruit, it flounders. The VLA model, despite having seen millions of images of shoelaces and hands in its pretraining data, fails catastrophically at the physical execution. Its fingers misalign, forces are misjudged, and the motion looks nothing like the intended skill. The root cause is subtle but fundamental: semantic priors do not equal physical execution competence. Language instructions that describe how to move — the fine‑grained geometry of deformation, the interplay of contact forces, the rhythm of a coordinated bimanual gesture — probe a part of the skill space that is almost entirely absent from typical robot datasets. Those datasets consist overwhelmingly of rigid‑body tabletop pick‑and‑place tasks, where the mapping from language to action can be learned with high‑level motion primitives. They rarely contain the low‑level motor patterns for dexterous, contact‑rich, or dynamic motions. Consequently, a VLA model’s internal representation of “untie” is a linguistic concept divorced from the physical sequence of forces and finger trajectories that would actually untie a knot.
This gap is not merely a matter of data scale; it is an architectural misalignment. A VLA model, at its core, learns a function from (observation, instruction) to the next action (or action chunk). During training, it sees action sequences that belong to a narrow support in the space of all possible robot behaviors. At test time, when the instruction demands a skill outside that support, the model has no mechanism to imagine how the world would evolve under its own actions. It can hallucinate a plausible semantic plan (“grasp the lace, pull it…”), but its action generator, conditioned only on the current frame and a language embedding, cannot extrapolate to the unfamiliar sensorimotor consequences. The failure is one of dynamics generalization: the model has no internal representation of $p(\text{future video} \mid \text{current observation}, \text{action sequence}, \text{instruction})$, so it cannot assess whether its predicted actions will actually bring about the desired visual outcome.
The key insight that motivates the DreamZero approach is therefore as crisp as it is urgent: a model that only directly maps language to actions cannot extrapolate to novel physical dynamics. To close this generalization gap, we need a world model — a generative system that learns the joint evolution of video observations and robot actions. The formal target is the full conditional distribution
$$p(o_{l:l+H},\, a_{l:l+H} \mid o_{0:l},\, c),$$
where $o_{0:l}$ is a short history of observed frames, $c$ is a language command (or more generally a task specification), and the model is asked to produce a future video clip $o_{l:l+H}$ of length $H$ together with the corresponding action trajectory $a_{l:l+H}$ that would realize it. By training a model to generate videos conditioned on actions (and, symmetrically, to infer the actions that connect two frames), the system internalizes the physical cause-effect relationships of the world. Once the model can “see” in imagination the result of an action sequence, it can be used as a zero-shot policy: given a command $c$ and the current observation, we sample an action trajectory from the conditional distribution (for instance, by fixing the first frame and allowing the model to autoregressively hallucinate the future video and actions). The resulting actions can then be executed on the real robot, even if that exact motion was never seen during robot training, because the world model has acquired a transferable understanding of how objects and hands interact from large-scale video data, just as language models acquire reasoning from text.
The visual below brings this entire narrative into a single, compact diagram. The left panel depicts a clear VLA success: a robot arm holding a Coke can and placing it beside a photo of Taylor Swift, with a green check mark labeled Semantic understanding (VLM priors). The right panel shows the bitter failure: a tangled, misaligned attempt to untie a shoelace, marked with a red cross and Physical execution (novel motion). A dashed arrow intended to connect these two regimes is broken by a prominent red barrier labeled “Generalization Gap.” Behind that barrier, a stylized world-model icon (a globe with an eye) suggests the missing piece. At the bottom of the slide, the central equation $p(o_{l:l+H}, a_{l:l+H} \mid o_{0:l}, c)$ sits as the mathematical encapsulation of what must be learned. The diagram thus functions as a visual mnemonic: language-to-action mappings succeed when the instruction stays within the data-supported “semantic” region, but they shatter when novel motion physics is required; a world action model that jointly generates video and action sequences promises to bridge that chasm by grounding language in a generative dynamics of perception and action.

2. VLA vs

The instinctive answer to the generalization gap described earlier has been to scale up multimodal policies. Vision-Language-Action (VLA) models, inspired by the success of large vision-language models, attempt to absorb internet-scale visual and semantic knowledge and directly regress action commands. A VLA embodies a policy of the form
$$\pi(a_t \mid o_t, c),$$
where the current observation $o_t$ and a language instruction $c$ are encoded by a frozen or lightly fine-tuned VLM backbone, and a lightweight action head outputs the next motor command $a_t$. The appeal is immediate: the policy inherits the representational richness of a model pre-trained on billions of image–text pairs, promising to recognise objects, scenes, and task intent without exhaustive robot-specific collection.
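The reactive form of this policy can be made concrete with a minimal sketch. Everything here is a toy stand-in: `toy_backbone` is a hypothetical placeholder for a VLM encoder, the weights are arbitrary, and the policy returns a single deterministic action rather than a sample from a distribution.

```python
def toy_backbone(observation, instruction):
    """Hypothetical stand-in for a VLM encoder: returns a small feature list."""
    mean_pixel = sum(observation) / len(observation)  # crude image statistic
    text_len = float(len(instruction.split()))        # crude text statistic
    return [mean_pixel, text_len, 1.0]                # last entry acts as a bias


def action_head(features, weights):
    """Linear head: one output per action dimension."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def vla_policy(observation, instruction, weights):
    """Reactive policy: one feed-forward pass from (o_t, c) to a_t.

    Nothing in this function predicts what the scene will look like next.
    """
    return action_head(toy_backbone(observation, instruction), weights)


weights = [[0.5, 0.0, 0.1], [0.0, 0.25, -0.1]]  # 2-D action, 3-D features
action = vla_policy([0.2, 0.4, 0.6], "close the drawer", weights)
```

The point of the sketch is structural: the only inputs are the current frame and the instruction, so any knowledge of dynamics has to be baked implicitly into the weights.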
Yet that same inheritance introduces a structural mismatch. VLMs are trained on static frames and text; they lack any built-in notion of temporal coherence or physical causality. When a VLA maps a single RGB image to a continuous action, it must implicitly infer dynamics—how objects will move, how contacts evolve, how inertia and friction act—from a snapshot. The model is forced to compress all that information into a single latent vector, with no mechanism to simulate the future. Consequently, VLAs learn brittle correlations between image appearance and action labels. They need dense action supervision: every training sample must contain a (frame, action) pair, typically captured via teleoperation. The very signal on which they rely—human-collected action labels—is expensive, noisy, and severely limits the diversity of motions they can experience during training.
The crux of the problem is that a VLA never learns a forward dynamics model. It never predicts what the world will look like one second later if the robot moves its end-effector in a certain way. Without such a model, generalization to a novel embodiment or an unfamiliar object demands that the network somehow extrapolate a direct observation-to-action mapping without understanding the intervening physical process. That is a fragile hope at best.
The World Action Model (WAM) introduced by DreamZero re-frames the problem entirely. Instead of learning a reactive policy, DreamZero learns a joint generative model over future video frames and future actions conditioned on history. The central object is the conditional distribution
$$\pi_{\theta}\big(o_{\ell:\ell+H},\, a_{\ell:\ell+H} \mid o_{0:\ell},\, c,\, q_\ell\big),$$
where $o_{0:\ell}$ denotes a short clip of past observations, $c$ is a language command, and $q_\ell$ captures proprioceptive state at time $\ell$. The model jointly predicts the next $H$ video frames $o_{\ell+1},\dots,o_{\ell+H}$ and the corresponding action sequence $a_{\ell+1},\dots,a_{\ell+H}$. The key insight is that future frames constitute a visual plan: a sequence of images showing how the scene should evolve. Given that imagined future, extracting the actions amounts to solving an inverse dynamics problem: inferring the motor commands that would produce the observed visual changes. Because the model is forced to paint the entire moving picture before it commits to an action, it learns a deep, causal understanding of how the physical world responds to intervention.
DreamZero implements this WAM with a video diffusion backbone pre-trained on vast corpora of web videos. That pre-training imbues the model with a rich generative prior over how real-world imagery transforms over time: objects fall, fluids slosh, hands push, shadows shift. When the model later observes a handful of robot-specific frames, it can condition on that history and on the task instruction to generate physically coherent future rollouts. Critically, the training objective does not require extra action labels beyond what is already present in any robot dataset. Every consecutive pair of frames implicitly contains the action that caused the change; the diffusion model learns to associate that visual transition with the corresponding control signal. This means that the model can harvest powerful physical intuition from every frame pair, even from data collected for other tasks or from other robots, effectively turning passive observation logs into a self-supervised dynamics curriculum.
The contrast between the two paradigms is sharp. VLA policies are high-capacity static mappers that view the future as a black box. A WAM treats the future as a visual entity to be explicitly imagined. The accompanying diagram places these approaches side by side for comparison. On the left, the VLA path is linear and compact: an observation and a language bubble feed into a frozen VLM encoder; an action head emits a single action token. The annotation highlights its essence: direct action mapping, no dynamics. On the right, the WAM panel unfolds a richer pipeline. A short video history, the instruction, and proprioceptive state all stream into a warm-orange video diffusion backbone. From it emerge two parallel outputs: a filmstrip of predicted future frames and a sequence of action vectors. The feedback loop underneath—a dashed curve returning predicted frames as history for the next chunk—captures the closed-loop autoregressive rollout that turns generation into a continuous policy. The visual makes plain that WAM replaces the blind, single-step mapping of a VLA with a generative process that explicitly simulates what will happen next, then decides what to do.
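The closed-loop rollout described for the WAM panel can be sketched on a toy problem. This is an illustration under strong assumptions, not DreamZero's implementation: the "scene" is a single number (an object position), `toy_wam` is a hypothetical hand-written model, and the environment simply adds each action to the position.

```python
def toy_wam(history, goal, horizon):
    """Hypothetical world action model on a 1-D scene.

    Returns `horizon` predicted frames (positions) together with the
    actions (position deltas) that would realize them.
    """
    current = history[-1]
    step = (goal - current) / 2.0 / horizon  # cover half the remaining gap per chunk
    frames = [current + step * (k + 1) for k in range(horizon)]
    actions = [step] * horizon
    return frames, actions


def rollout(start, goal, horizon=4, chunks=10):
    """Generate a chunk, execute its actions, feed the frames back as history."""
    history = [start]
    position = start  # true state of the toy environment
    for _ in range(chunks):
        frames, actions = toy_wam(history, goal, horizon)
        for a in actions:       # execute the action chunk
            position += a
        history.extend(frames)  # predicted frames become history for the next chunk
    return position, history


final_position, history = rollout(start=0.0, goal=1.0)
```

In this toy the predicted frames match the executed motion exactly, so feeding predictions back is harmless; on a real robot the prediction and the actual observation can diverge, which is why the choice of what to feed back matters.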

3. World Action Model (WAM) Concept

Vision–language–action (VLA) models map an observation stream directly to a sequence of motor commands. While this can work well when the training data covers the desired motion repertoire, the mapping becomes brittle as soon as the robot faces a novel physical maneuver—something as simple as pushing an object from an unfamiliar angle or executing a multi‑step rearrangement with unobserved intermediate poses. The root cause is that the model never learns why the world should respond a certain way; it only learns that these pixels usually lead to those joint velocities. Generalization to unseen motions therefore requires an internal model of the world that can be mentally simulated, allowing the agent to imagine how a scene ought to evolve and then back out the actions that make that evolution happen. This is exactly the role of a World Action Model (WAM).
A WAM is a joint generative model over future video frames $\mathbf{o}_{l:l+H}$ and actions $\mathbf{a}_{l:l+H}$, conditioned on a visual history $\mathbf{o}_{0:l}$, a language instruction $c$, and the current proprioceptive state $\mathbf{q}_l$. Formally, we write
$$\pi_0\big(\mathbf{o}_{l:l+H},\, \mathbf{a}_{l:l+H} \mid \mathbf{o}_{0:l},\, c,\, \mathbf{q}_l\big)$$
where the horizon $H$ spans the next several time steps. By modelling vision and action together, the distribution captures both what the future should look like and how to get there, rather than collapsing the two into a single opaque mapping.
The key structural insight is that this joint distribution factorises into two complementary components. The chain rule alone yields a video term and an action term that still depends on the language instruction. Assuming that future actions are conditionally independent of the instruction given the full observation trajectory and the current proprioception (intuitively, the language tells us what to visualize, and the actions are recovered from that visualization), the instruction drops out of the action term and we can write the following decomposition:
$$
\pi_0\big(\mathbf{o}_{l:l+H},\, \mathbf{a}_{l:l+H} \mid \mathbf{o}_{0:l},\, c,\, \mathbf{q}_l\big)
\;=\;
\underbrace{\pi_0\big(\mathbf{o}_{l:l+H} \mid \mathbf{o}_{0:l},\, c,\, \mathbf{q}_l\big)}_{\text{Video prediction}}
\;
\underbrace{\pi_0\big(\mathbf{a}_{l:l+H} \mid \mathbf{o}_{0:l+H},\, \mathbf{q}_l\big)}_{\text{Inverse dynamics model (IDM)}}.
$$
The first factor, the video prediction module, takes past observations, the language command, and the proprioceptive state, and outputs a plausible future video sequence: an implicit visual plan. It answers the question, “What should the scene look like over the next $H$ frames?” Notably, the language instruction only feeds into this predictor; the model uses the instruction to shape the desired visual outcome, not to directly constrain the actions. The second factor, the inverse dynamics model (IDM), receives the entire observation sequence up to time $l+H$ together with the current proprioceptive state and predicts the action chunk $\mathbf{a}_{l:l+H}$ that would produce exactly that visual evolution. Because the IDM conditions on the realised future frames, it can be completely language-agnostic: the high-level intent is already encoded in the visual plan.
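The two steps behind the factorisation can be written out explicitly. The first line is the chain rule and holds for any joint distribution; the second line is where the conditional-independence assumption on the instruction $c$ enters. This is a sketch of the reasoning in the document's notation, not a derivation quoted from the source.

$$
\begin{aligned}
\pi_0\big(\mathbf{o}_{l:l+H},\, \mathbf{a}_{l:l+H} \mid \mathbf{o}_{0:l},\, c,\, \mathbf{q}_l\big)
&= \pi_0\big(\mathbf{o}_{l:l+H} \mid \mathbf{o}_{0:l},\, c,\, \mathbf{q}_l\big)\,
   \pi_0\big(\mathbf{a}_{l:l+H} \mid \mathbf{o}_{0:l+H},\, c,\, \mathbf{q}_l\big)
   && \text{(chain rule)} \\
&= \pi_0\big(\mathbf{o}_{l:l+H} \mid \mathbf{o}_{0:l},\, c,\, \mathbf{q}_l\big)\,
   \pi_0\big(\mathbf{a}_{l:l+H} \mid \mathbf{o}_{0:l+H},\, \mathbf{q}_l\big)
   && \text{(actions independent of } c \text{ given } \mathbf{o}_{0:l+H},\, \mathbf{q}_l\text{)}
\end{aligned}
$$

Here $\mathbf{o}_{0:l+H}$ is simply the history $\mathbf{o}_{0:l}$ joined with the predicted clip $\mathbf{o}_{l:l+H}$, which is why the action term conditions on the full trajectory.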
This separation confers a powerful training advantage: we can learn a general understanding of physical interaction through video prediction, while the IDM learns a low‑level mapping from visual change to action, both from the same demonstration data. Training can proceed in two stages (first the video predictor, then the IDM on top of frozen features) or end‑to‑end. The DreamZero framework opts for end‑to‑end training, because back‑propagating the IDM loss through the video predictor encourages the predictor to produce frames that are not only visually realistic but also maximally informative for the downstream action inference—a tight vision‑action alignment that pays off when the model is later asked to generalize.
The accompanying slide image crystallizes this definition into a single diagram. It shows the joint distribution as a central equation, then the factorised form with visible underbraces labelling the “Video prediction” and the “Inverse dynamics model (IDM)” parts. Bullet points summarise that the first factor plans the visual future and the second recovers the actions, with a final annotation highlighting DreamZero’s end‑to‑end training choice. This compact visual serves as a quick reference that the remainder of the lecture will build upon—reminding us that a WAM’s power lies not in a single monolithic mapping, but in the deliberate split between what the world should become and how the robot should act to get there.

4. Why WAMs?

The previous section introduced the concept of a world action model (WAM) – an agent that learns to imagine both future video frames and the actions that bring them about. But why advocate for this new family of models when modern vision-language-action (VLA) policies have already demonstrated impressive performance on dozens of robotic tasks? The answer lies in a fundamental limitation: VLAs, no matter how large their training corpora, remain reactive mappings from observations directly to actions. They do not model how the world evolves under their own interventions, and that absence cripples them the moment they encounter a physical situation that deviates from the distribution of their training episodes.
Consider a VLA trained on thousands of demonstrations of pick-and-place tasks. When shown a novel object – say an irregularly shaped piece of fruit – the model may produce a plausible grasping action, but it has no way to anticipate whether that action will succeed, slip, or damage the object. It cannot simulate “what if I squeeze slightly harder?” because its forward-inference pipeline is a single feed-forward pass: $a = \pi(o)$. This brittleness is especially acute for novel physical motions – actions or contact sequences that rarely appear in the data – making VLAs poor zero-shot generalizers outside their narrow experience.
DreamZero reframes the problem by swapping the direct policy for a generative model of joint video-action trajectories. Instead of learning $p(a \mid o)$, the model learns the distribution
$$p_\theta(\mathbf{x}_{1:T}, \mathbf{a}_{0:T-1} \mid \mathbf{x}_0),$$
where $\mathbf{x}_t$ is the frame at time $t$ and $\mathbf{a}_t$ the action that transitions from frame $t$ to $t+1$. During training, the objective maximizes the autoregressive likelihood of future frames and actions conditioned on the past:
$$\mathcal{L}_{\text{WAM}} = -\mathbb{E}_{(\mathbf{x},\mathbf{a})}\sum_{t} \log p_\theta(\mathbf{x}_{t+1}, \mathbf{a}_t \mid \mathbf{x}_{\leq t}, \mathbf{a}_{<t}).$$
The model’s causal architecture processes a sequence of frame patches and action tokens interleaved, forcing it to represent the consequences of actions in pixel space. This design yields a tight coupling: the predicted video is inseparable from the actions that produced it.
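The interleaving and the causal constraint can be illustrated with a short sketch. Two simplifications are assumptions of this example rather than details from the source: each frame is reduced to a single token instead of many patches, and the order within a step is frame first, then action.

```python
def interleave(frame_tokens, action_tokens):
    """Build the sequence x_0, a_0, x_1, a_1, ..., x_T from T+1 frames and T actions."""
    if len(frame_tokens) != len(action_tokens) + 1:
        raise ValueError("expected one more frame than actions")
    sequence = []
    for t, action in enumerate(action_tokens):
        sequence.append(("frame", t, frame_tokens[t]))
        sequence.append(("action", t, action))
    sequence.append(("frame", len(action_tokens), frame_tokens[-1]))
    return sequence


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


seq = interleave(["x0", "x1", "x2"], ["a0", "a1"])
mask = causal_mask(len(seq))
```

With this ordering, the token for frame `x1` can attend to `x0` and `a0` but not to `a1` or `x2`, so each prediction depends only on the past, as the autoregressive likelihood requires.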
This joint modeling connects directly to inverse dynamics. In a standard forward model $p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, \mathbf{a}_t)$, recovering the action from a desired transition requires solving an optimization problem $\mathbf{a}_t^* = \arg\max_{\mathbf{a}_t} p(\mathbf{x}_{t+1}^{\text{goal}} \mid \mathbf{x}_t, \mathbf{a}_t)$. A WAM, by learning the joint distribution, implicitly encodes an inverse dynamics model $p(\mathbf{a}_t \mid \mathbf{x}_t, \mathbf{x}_{t+1})$ as a byproduct of the generative process. If you can “dream” a realistic future frame sequence that arranges the world into a target configuration, the model can read off the corresponding action tokens directly from the same autoregressive decoding. The action becomes a latent variable inferred from the video plan, not an externally prescribed label.
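The difference between the two routes to an action can be seen on a toy system. The dynamics here (the action is a displacement of a scalar state) and the candidate grid are assumptions chosen for illustration; a learned inverse dynamics model would replace the closed-form subtraction.

```python
def forward_model(x, a):
    """Toy deterministic dynamics: the action is a displacement."""
    return x + a


def action_by_search(x, x_goal, candidates):
    """Recover the action through the forward model by searching candidates.

    Minimizing squared error is equivalent to maximizing a Gaussian
    likelihood centered on the forward prediction.
    """
    return min(candidates, key=lambda a: (forward_model(x, a) - x_goal) ** 2)


def inverse_dynamics(x, x_next):
    """Toy inverse dynamics: read the action off the transition directly."""
    return x_next - x


candidates = [i / 10.0 for i in range(-10, 11)]  # -1.0, -0.9, ..., 1.0
a_search = action_by_search(0.2, 0.7, candidates)
a_idm = inverse_dynamics(0.2, 0.7)
```

Both routes agree on this example, but the search needs one forward-model evaluation per candidate, while the inverse dynamics route is a single evaluation once the target frame is known.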
This property is the engine behind zero-shot generalization. A robot with a novel morphology – say a dual-arm setup instead of a single arm – brings a different action space, but the visual consequences of moving an end-effector remain broadly similar. By simply swapping in the new action token vocabulary and fine-tuning the action embedding layer (or even performing few-shot adaptation), the WAM can still generate physically plausible videos and thereby induce sensible actions without exhaustive task-specific retraining. The world knowledge remains largely intact because the heavy lifting is done by the shared video-diffusion backbone.
The visual below captures this contrast in a single clean diagram. On one side, a VLA model appears as a brittle pipeline: an observation enters, an action exits, with no feedback loop or reasoning about the future. On the other side, the DreamZero dream loop is depicted as a cycle where the model generates a sequence of imagined frames and interleaved actions, allowing it to plan by sampling alternative futures. The loop symbolizes the core insight: a world action model does not merely predict; it simulates. That simulation turns the model into a zero-shot policy, capable of adapting to new bodies and new tasks simply by changing the dream it pursues.

5. DreamZero Architecture Overview

Building a policy that can control a robot across tasks and embodiments without any in-domain training data demands a model that already knows an enormous amount about how the physical world behaves. The insight of World Action Models (WAMs) is that a large video diffusion model, pretrained on vast and diverse video, already possesses rich priors over plausible futures—how objects move, how hands interact, how scenes evolve. The challenge lies in steering that visual generation with minimal robot-specific information so that the resulting video corresponds not just to any plausible future, but to one where the robot successfully executes the given instruction. DreamZero’s architecture answers this challenge with remarkable economy: it grafts a thin control layer onto a frozen video backbone, treating the world model as a zero-shot policy.
The design rests on a simple but powerful rule: freeze the video model, and only learn what the robot truly needs. DreamZero builds on Wan2.1‑I2V‑480P, a 14B‑parameter image‑to‑video diffusion model that excels at generating temporally coherent clips from a single starting frame. Its internal representations encode a wealth of visual dynamics—shadows, contact points, object permanence—that would be prohibitively expensive to learn from scratch on robot data. By keeping the video backbone frozen, the architecture preserves these generalization capabilities intact. The only new parameters are a tiny state encoder for the robot’s proprioception and a lightweight action encoder/decoder pair that map between the diffusion latents and the action space. In total, the task‑specific components amount to a few million parameters—negligible relative to the 14B‑parameter vision model.
Three information streams converge to condition the diffusion process, each through a dedicated encoder. The visual history $o_0$ (a single image capturing the current scene) passes through the pretrained variational autoencoder (VAE) that was used during the video model’s original training. The VAE compresses the image into latent codes $z_0$, a low-dimensional representation that the diffusion transformer already understands natively. The language instruction $c$ (e.g., “close the drawer”) is encoded by the model’s own frozen text encoder, providing a semantic goal that aligns with the model’s image-language priors. Finally, the proprioceptive state $q_l$ (joint positions, gripper status, or end-effector pose) goes through a small learned multi-layer perceptron, the only component that translates raw robot measurements into a representation consumable by the video model. This separation ensures that the visual and linguistic understanding remains frozen and generic, while the robot-specific bottleneck is as lean as possible.
These encoded signals flow into an autoregressive diffusion transformer (DiT) $u_\theta$. At its core, $u_\theta$ predicts a joint velocity field for both the future video latents and the normalized actions:
$$(\hat{v}^z,\, \hat{v}^a) = u_\theta(z_0,\, c,\, q_l).$$
The term $\hat{v}^z$ is a denoising velocity in the latent space of the VAE; $\hat{v}^a$ is the corresponding velocity in a normalized action space. By predicting velocities rather than absolute states, the model can operate in a flow-matching framework that will be detailed in the next section. These predicted velocities are then integrated numerically over the diffusion timesteps to produce clean future latents $z_{l:l+H}$ and a sequence of actions $a_{l:l+H}$ for a horizon $H$. Conceptually, the model is performing a form of inverse dynamics: given “what the world will look like” and “what the robot is currently doing,” it infers the actions that bring that world about.
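The numerical integration step can be sketched with a fixed-step Euler solver. This is a toy under stated assumptions: `oracle_velocity` is a hypothetical stand-in for the learned network that already knows the target, and the time convention (noise at $t=0$, clean sample at $t=1$) is one common flow-matching choice that the source does not specify.

```python
def oracle_velocity(x, t, target):
    """Hypothetical stand-in for u_theta: velocity of a straight path to `target`."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_integrate(x_start, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x_start)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # largest t used is 1 - dt, so (1 - t) never reaches zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.3, -1.2, 0.8]    # starting point, playing the role of a noisy latent
target = [1.0, 0.0, -0.5]   # clean latent or normalized action the path leads to
sample = euler_integrate(noise, lambda x, t: oracle_velocity(x, t, target), num_steps=8)
```

In the real model the same loop would update video latents and actions together, since a single network call returns both velocity components.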
The action decoder deserves a brief spotlight. In the diagram we will show it as a block labeled “Action Decoder” that receives the action velocities $\hat{v}^a$ and yields the final action trajectory $\hat{a}_{l:l+H}$. In practice, this decoder is just a linear or small MLP head that maps the normalized action predictions back to the robot’s specific actuation space: joint angles, gripper width, or Cartesian deltas. Because action spaces vary across embodiments, keeping the decoder separate and small makes it easy to swap when moving to a different robot. The vast majority of the computation happens in the shared, frozen video backbone, which naturally handles visual variation and scene semantics regardless of the downstream embodiment.
A subtle but critical detail is the multi‑view handling. Many robotic setups use two or more cameras, such as left and right wrist cameras on a bimanual platform. DreamZero avoids adding any new encoders or fusion layers. Instead, the frames from all cameras are concatenated side‑by‑side into a single input image, effectively treating them as a wide panoramic view. The VAE and the subsequent DiT process this larger image just as they would a single camera frame. This preserves the model’s ability to use spatial relationships between views while introducing zero additional parameters. It also means the same architecture can handle an arbitrary number of cameras as long as they fit within the image canvas, a nice property for cross‑embodiment deployment.
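The side‑by‑side concatenation described above can be illustrated with a minimal NumPy sketch. The function name `concat_views`, the `(H, W, C)` array layout, and the toy frame sizes are assumptions made for illustration; the source does not specify how frames are resized or ordered on the canvas.

```python
import numpy as np

def concat_views(frames):
    """Tile same-height camera frames side by side into one wide image."""
    heights = {f.shape[0] for f in frames}
    channels = {f.shape[2] for f in frames}
    if len(heights) != 1 or len(channels) != 1:
        raise ValueError("all views must share height and channel count")
    # Axis 1 is width for (H, W, C) arrays, so views sit left to right.
    return np.concatenate(frames, axis=1)

# Hypothetical example: three cameras producing tiny 4x6 RGB frames.
left = np.zeros((4, 6, 3), dtype=np.uint8)
right = np.full((4, 6, 3), 255, dtype=np.uint8)
head = np.full((4, 6, 3), 128, dtype=np.uint8)

canvas = concat_views([left, right, head])
print(canvas.shape)  # (4, 18, 3)
```

Because the result is just a wider image, it can be handed to the same image encoder as a single‑camera frame, which is the property the paragraph above relies on.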
The entire pipeline is remarkably simple, yet it achieves state‑of‑the‑art generalization precisely because it respects the division of labor between the visual world model and the robot‑specific policy. The visual below consolidates this architecture into a single diagram. On the left, the three inputs—visual history, language, and proprioception—flow through their respective encoders. Blue boxes mark the frozen components (VAE, text encoder, video diffusion backbone), while red boxes mark the few trained modules (state encoder, action decoder, latent update). All signals converge onto the central autoregressive DiT, which outputs predicted velocity fields for video latents and actions. Finally, those velocities are integrated into clean future latents and executable action trajectories. This picture is not merely a summary; it makes visible the design principle that DreamZero gets generalization from the large frozen model and embodiment‑awareness from the tiny learned additions.

6. Training Objective: Flow Matching with Joint Denoising

Building a model that can imagine future video frames and simultaneously decide what actions to take is a deceptively hard learning problem. The architecture described previously gives us a powerful backbone—a transformer that ingests visual context, language instructions, and proprioceptive state. But the critical question remains: how do we train a single network to produce both high-dimensional pixel-like latents and low-dimensional continuous control signals in a coherent, temporally aligned fashion? The answer in DreamZero lies in a flow matching objective with joint denoising, a formulation that treats video prediction and action generation not as separate tasks but as a unified modeling problem over a shared stochastic process.
Flow matching, and specifically the rectified flow variant used here, rethinks generation as learning a velocity field that transports samples from a simple noise distribution to the data distribution along straight-line paths. For a given training chunk of $K$ future frames, we start with clean latent-action pairs $(\mathbf{z}^{(k)}_1, \mathbf{a}^{(k)}_1)$ that come from the real trajectory. We also sample independent Gaussian noise $\mathbf{z}^{(k)}_0, \mathbf{a}^{(k)}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The key insight is to define a linear interpolation between noise and clean data for each modality using the same scalar timestep $t_k$, drawn uniformly from $(0,1)$:
$$\mathbf{z}^{(k)}_{t_k} = t_k \mathbf{z}^{(k)}_1 + (1-t_k) \mathbf{z}^{(k)}_0, \qquad \mathbf{a}^{(k)}_{t_k} = t_k \mathbf{a}^{(k)}_1 + (1-t_k) \mathbf{a}^{(k)}_0.$$
This shared $t_k$ couples the two modalities: at a given noise level, both latents and actions are corrupted by the same fractional amount. Doing so forces the model to learn a joint representation where visual futures and control signals evolve together, rather than drifting into independent—and possibly contradictory—predictions.
From this interpolated state, we define the target velocity vector that the model must predict. The true velocity is simply the difference between the clean sample and the noise, which points directly from the corrupted state back toward the data:
$$\mathbf{v}_k = [\mathbf{z}^{(k)}_1 - \mathbf{z}^{(k)}_0,\; \mathbf{a}^{(k)}_1 - \mathbf{a}^{(k)}_0].$$
If the model can accurately estimate this vector field at every intermediate timestep, then during inference we can start from pure noise and follow the predicted velocities numerically (e.g., via an Euler solver) to arrive at realistic, coordinated video–action trajectories. The network $u_\theta$ receives the concatenated noisy state $[\mathbf{z}^{(k)}_{t_k}, \mathbf{a}^{(k)}_{t_k}]$, along with conditioning signals (the visual context chunk $\mathcal{C}_k$, the language embedding $c$, the proprioceptive history $\mathbf{q}_k$, and the timestep itself) and outputs the predicted velocity $\mathbf{v}_{\text{pred}}$.
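To make the interpolation and the target velocity concrete, here is a small NumPy sketch. The vector sizes and the helper names `interpolate` and `target_velocity` are illustrative assumptions; only the two formulas come from the text. The last lines check the property that justifies Euler sampling: following the true velocity for the remaining time $1 - t_k$ lands exactly on the clean sample.

```python
import numpy as np

def interpolate(clean, noise, t):
    """Point on the straight line from noise (t=0) to clean data (t=1)."""
    return t * clean + (1.0 - t) * noise

def target_velocity(clean, noise):
    """Constant velocity along that line: clean sample minus noise."""
    return clean - noise

rng = np.random.default_rng(0)
z1, a1 = rng.standard_normal(8), rng.standard_normal(3)  # clean latent / action (toy sizes)
z0, a0 = rng.standard_normal(8), rng.standard_normal(3)  # independent Gaussian noise
t_k = rng.uniform(0.0, 1.0)                              # one timestep shared by both modalities

z_t, a_t = interpolate(z1, z0, t_k), interpolate(a1, a0, t_k)
v_k = np.concatenate([target_velocity(z1, z0), target_velocity(a1, a0)])

# Following the true velocity for the remaining time (1 - t_k) recovers the data.
x_t = np.concatenate([z_t, a_t])
x_1 = x_t + (1.0 - t_k) * v_k
print(np.allclose(x_1, np.concatenate([z1, a1])))  # True
```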
The training loss is a weighted mean squared error between the predicted and true velocity, averaged over all $K$ chunks in the trajectory and over sampled timesteps:
$$\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{z},\mathbf{a},\{t_k\}} \left[ \frac{1}{K} \sum_{k=1}^K w(t_k)\, \| u_\theta([\mathbf{z}^{(k)}_{t_k}, \mathbf{a}^{(k)}_{t_k}]; \mathcal{C}_k, c, \mathbf{q}_k, t_k) - \mathbf{v}_k \|^2 \right].$$
The weighting function $w(t_k)$ plays an important practical role. Its purpose is to reduce the loss contribution from nearly pure‑noise states, where predicting a precise velocity is nearly impossible, and to focus the model’s capacity on the more informative intermediate‑ and low‑noise regimes. It is typically described as proportional to $1/t_k$ with $t_k$ read as the noise level; under the interpolation convention above, where $t_k = 0$ is pure noise and $t_k = 1$ is clean data, the noise level is $1 - t_k$, so the down‑weighted samples are those with small $t_k$. Empirically, this weighting stabilizes training and yields better final generation quality.
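The loss can be sketched for a single trajectory as follows. This is a simplified illustration, not the authors' training code: the `model` callable takes only the noisy state and the timestep (the conditioning on $\mathcal{C}_k$, $c$ and $\mathbf{q}_k$ is omitted), the expectation is approximated by one sample per chunk, and the weighting function is a pluggable parameter whose constant default is an assumption.

```python
import numpy as np

def joint_flow_matching_loss(model, z1, a1, rng, weight_fn=lambda t: 1.0):
    """z1: (K, Dz) clean latents, a1: (K, Da) clean actions, one row per chunk."""
    num_chunks = z1.shape[0]
    total = 0.0
    for k in range(num_chunks):
        z0 = rng.standard_normal(z1[k].shape)      # noise for the video latent
        a0 = rng.standard_normal(a1[k].shape)      # noise for the action
        t_k = rng.uniform(0.0, 1.0)                # shared timestep for both modalities
        x_t = np.concatenate([t_k * z1[k] + (1 - t_k) * z0,
                              t_k * a1[k] + (1 - t_k) * a0])
        v_k = np.concatenate([z1[k] - z0, a1[k] - a0])   # target velocity
        v_pred = model(x_t, t_k)
        total += weight_fn(t_k) * np.sum((v_pred - v_k) ** 2)
    return total / num_chunks

# Toy usage with a placeholder model that always predicts zero velocity.
rng = np.random.default_rng(0)
z1 = rng.standard_normal((4, 8))
a1 = rng.standard_normal((4, 3))
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss = joint_flow_matching_loss(zero_model, z1, a1, rng)
print(loss > 0)  # True
```

Swapping `weight_fn` is enough to experiment with different timestep weightings without touching the rest of the objective.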
Why is this flow matching formulation superior to, say, a standard diffusion objective that predicts the noise itself? In diffusion, the target is the added noise $\epsilon$, whose scaling with $1/\sqrt{1-\bar{\alpha}_t}$ can become unstable near $t=1$, and the denoising process often requires many steps. Rectified flow predicts the velocity directly, which corresponds to a straight-line path that can be integrated in far fewer steps, sometimes even a single step with a good enough approximation. For robotics, this translates into faster, more consistent closed-loop inference, crucial for real-time control. Moreover, by jointly denoising latents and actions with a shared timestep, the model internalizes the natural causal relation: actions cause changes in the visual scene, so their trajectories must be interlocked. If we used independent noise schedules, the two signals could quickly desynchronize, harming both prediction accuracy and downstream control performance.
The visual below offers a compact summary of this flow. It places the interpolation equations centrally, showing how the same $t_k$ mixes noise and clean data for both $\mathbf{z}$ and $\mathbf{a}$. An arrow leads from the interpolation to the target velocity definition, emphasizing that the true “direction” is the straight line from noise to clean data. The loss equation is then highlighted with a colored border, underscoring its role as the quantitative training signal; inside it, the explicit dependence on the shared timestep and the weighting factor $w(t_k)$ is visible. A small note box at the bottom calls out the $1/t_k$ weighting heuristic and the practical benefit of co-evolving latents and actions. Together, these elements distill the mathematical core of DreamZero’s objective into a single glance, reinforcing the idea that joint flow matching is the engine that turns a world action model into a zero-shot policy.

7. Teacher Forcing and Chunk-wise Denoising

If you’ve followed the previous section on flow matching with joint denoising, you know that DreamZero learns to transform noise into coherent video–action velocities. But having a powerful generative model isn’t enough: we need it to function as a policy that can be rolled out autoregressively over time while avoiding the usual pitfall of compounding errors. This is where teacher forcing with a careful chunk-wise setup becomes essential.
The core insight is that during training, the model should be conditioned on clean, ground-truth context, never on its own potentially flawed predictions. For each chunk $k$, the context $\mathcal{C}_k$ is defined as the set of all previous clean chunks:
$$\mathcal{C}_k = \{ (\mathbf{z}_1^j, \mathbf{a}_1^j) \}_{j=1}^{k-1}$$
Here every $(\mathbf{z}_1^j, \mathbf{a}_1^j)$ is a chunk of $K=2$ latent frames from the past episode, taken at the clean (noise‑free) level. This clean context acts like an oracle history that tells the model exactly what happened before, a luxury that we can only afford in the training phase.
To make the model respect the sequential nature of the task, DreamZero’s DiT backbone $u_\theta$ is fed the noisy current chunk $[\mathbf{z}_{t_k}^k, \mathbf{a}_{t_k}^k]$ together with the clean context, but attention is constrained by a causal QKV mask. In the multi‑head self‑attention layers, each noisy token from chunk $k$ can attend only to tokens from chunks $j < k$ (the clean context) and to itself; it is forbidden from looking at any future chunk or even at other tokens within the same noisy chunk that are later in sequence. This causal mask is visually akin to a lower‑triangular attention pattern, where the upper triangle for future segments is zeroed out.
Why is this so important? Because during training, the model never sees its own predictions as context. It always learns to predict the velocity toward the clean chunk from clean history, which keeps the learning signal stable. Teacher forcing by itself does not remove “exposure bias”: a model trained only on clean history can still drift if it must rely on its own imperfect outputs at inference time. DreamZero handles that mismatch at inference, by refreshing the context with real observations (Section 8). The causal mask additionally prevents any information leakage from the current noisy region back into the clean history, enforcing a strict temporal ordering. In effect, the architecture is trained to perform autoregressive generation without experiencing the consequences of its own errors, a classic teacher‑forcing trick that stabilizes learning.
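A minimal sketch of such a mask, assuming tokens are laid out in temporal order and tagged with a chunk index (the function name `chunk_causal_mask` and the two‑tokens‑per‑chunk layout are illustrative choices, not details from the source):

```python
import numpy as np

def chunk_causal_mask(chunk_ids):
    """Boolean mask: entry [i, j] is True when query token i may attend to key token j."""
    ids = np.asarray(chunk_ids)
    pos = np.arange(len(ids))
    earlier_chunk = ids[None, :] < ids[:, None]
    same_chunk_not_later = (ids[None, :] == ids[:, None]) & (pos[None, :] <= pos[:, None])
    return earlier_chunk | same_chunk_not_later

chunk_ids = [0, 0, 1, 1, 2, 2]          # three chunks of two tokens each
mask = chunk_causal_mask(chunk_ids)

# Apply the mask to (here: uniform) attention scores before the softmax.
scores = np.zeros((6, 6))
masked = np.where(mask, scores, -np.inf)
weights = np.exp(masked) / np.exp(masked).sum(axis=1, keepdims=True)
print(mask.astype(int))
```

With tokens in temporal order the mask reduces to the familiar lower‑triangular pattern; writing it in terms of chunk indices makes it easy to relax the within‑chunk rule if a different attention pattern is wanted.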
The chunk size is fixed at $K=2$ latent frames, which corresponds to a specific temporal step in the compressed latent space. The number of chunks $M$ in an episode can vary up to $4$, giving the model a visual context of roughly $6.6$ seconds of real‑world video. This relatively short window forces the policy to focus on recent, actionable history rather than trying to memorize long‑range correlations that may not generalize.
Concretely, for each noisy–clean pair, the model predicts a conditional velocity vector:
$$\mathbf{v}_\text{pred}^k = u_\theta\big([\mathbf{z}_{t_k}^k, \mathbf{a}_{t_k}^k]; \, \mathcal{C}_k, \mathbf{c}, \mathbf{q}_k, t_k \big)$$
where $\mathbf{c}$ is the language (task) embedding, $\mathbf{q}_k$ is the proprioceptive state for chunk $k$, and $t_k \sim \mathcal{U}(0,1)$ is the flow‑matching timestep. The loss is then computed between $\mathbf{v}_\text{pred}^k$ and the true velocity $\mathbf{v}^k = (\mathbf{z}_1^k - \mathbf{z}_0^k, \mathbf{a}_1^k - \mathbf{a}_0^k)$, exactly as described in the flow‑matching objective.
At inference time, the setup flips: there is no ground‑truth future, so the context $\mathcal{C}_k$ is built autoregressively from the chunks that have already been processed. As Section 8 explains, the visual part of that context is refreshed with real observations rather than the model’s imagined latents. To avoid recomputing the keys and values for the entire history at every step, DreamZero reuses a key–value cache ($\mathcal{KV}$) across chunks. This makes closed‑loop control possible at 7 Hz, a detail we’ll unpack in the next section.
The visual below consolidates these design decisions into a compact diagram. On the left you see a stack of green boxes, each labeled with a pair $(\mathbf{z}_1^j, \mathbf{a}_1^j)$ for $j=1,\dots,k-1$; these are the clean context chunks. To their right, a single grey block represents the current noisy chunk $[\mathbf{z}_{t_k}^k, \mathbf{a}_{t_k}^k]$, created by noising the clean chunk with $t_k \sim \mathcal{U}(0,1)$. A blue Transformer block receives both inputs, but the clean history enters through a causal attention mask (shown as a lower‑triangular matrix diagram) that forces each row $k$ to attend only to columns $j < k$. The Transformer’s output, an orange arrow pointing right, is the predicted velocity $\mathbf{v}_\text{pred}^k$, which is then compared against the ground‑truth velocity via a loss. Small annotations remind us of the key parameters: $K=2$, $M \le 4$ yielding a ~6.6 s context, and the note that inference reuses the same architecture with past chunks as $\mathcal{C}_k$ and a KV cache for speed. This snapshot captures why DreamZero can train to be a zero‑shot policy without ever seeing its own future mistakes during learning.

8. Closed-Loop Inference with KV Cache

When a world action model learns to plan actions by imagining future video frames, the training recipe is inherently teacher forced: during optimization the model sees ground-truth observation sequences and is trained to predict the next chunk of video and actions conditioned on the real history. This works beautifully in a supervised setting, but at test time the model must generate its own visual future—and if left unchecked, the inevitable small errors in each predicted frame compound into wild hallucinations that quickly render the resulting action plan useless. The DreamZero inference algorithm sidesteps this compounding-error problem by operating in a closed‑loop regime, where real sensory observations are injected back into the model’s context after every action chunk, resetting the world state to reality and preventing the imagined video from drifting away.
The core challenge is to maintain an autoregressive generation process that conditions each new chunk on all previous observations, but to avoid ever feeding the model its own predicted video latents, which are the source of error accumulation. DreamZero solves this by maintaining a key–value (KV) cache that stores the attended features of all past real observations. The cache is prefilled with encoded latents from the initial observation window $o_{0:l}$ and thereafter is extended exclusively with fresh, real observations that arrive after the robot executes a chunk of actions. The predicted video latent from the denoising process (the model’s imagination of what should happen) is deliberately discarded; it never pollutes the cache. The result is a tight feedback loop: the model is always grounded in the true physical state of the world, yet it still benefits from the rich visual planning capabilities of a diffusion-based video generator.
At the heart of the inference loop is a flow‑matching denoiser that operates on a joint latent variable $x$ concatenating the next video chunk tokens $z_0^k$ and the corresponding action tokens $a_0^k$. In each planning chunk $k$ the algorithm starts from scratch with pure Gaussian noise
$$x_0 = [z_0^k, a_0^k] \sim \mathcal{N}(0, I),$$
then iteratively refines this latent vector through $N$ denoising steps using the learned velocity field $u_\theta$. The denoising step is conditioned not only on the task prompt $c$ and the proprioceptive state $q_l$, but crucially on the entire KV cache of real history:
$$v = u_\theta\bigl(x, t_i, c, q_l, \mathcal{KV}\bigr),$$
where $t_i = (i-1)/N$ schedules the noise level. A simple Euler integration $x \leftarrow x + v \, dt$ (with step size $dt = 1/N$) evolves the latent toward a clean sample. This formulation elegantly folds the inverse‑dynamics problem into the generative model: the denoiser simultaneously produces a plausible future video snippet and the actions that would cause it, all while respecting the constraints imposed by the physical past stored in the KV cache.
After the denoising sweep, the clean action tokens are extracted from the latent vector, and the robot executes them asynchronously, meaning the model does not block on action completion but continues to plan the next chunk while the robot moves. Immediately upon receiving the next real observation, the visual encoder transforms it into latents $z_{\text{real}}$, and these are appended to the KV cache, replacing the imagined video tokens. The key insight is that the predicted video latent $z_{\text{clean}}$ is never used to condition future chunks; only the fresh, real observation enters the cache. This closed‑loop injection ensures that the model’s world model never has to rely on its own flawed predictions, which could otherwise snowball into a completely fictitious state after a handful of chunks.
The autoregressive loop continues for $M = \lceil H/K \rceil$ chunks, where $H$ is the total horizon and $K$ the chunk size. Because each denoising pass starts from pure noise and is conditioned on the full cache, the model can recover gracefully from unexpected real-world outcomes: a nudge, a slipped grasp, or an unforeseen obstacle will simply appear in the next real observation and immediately inform the subsequent planning chunk. There is no need for an explicit replanning module or a separate feedback controller; the architecture naturally turns perception into action in a continuous feedback cycle.
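The procedure described in this section can be sketched in a few lines of Python. Everything outside the text's formulas is a stand‑in: `velocity_fn`, `encode` and `step_robot` are hypothetical callables, the cache is a plain list of encoded observations rather than per‑layer keys and values, and execution is synchronous here for simplicity even though the text describes asynchronous execution. The toy velocity field at the bottom is chosen only so that the Euler loop has a known answer.

```python
import math
import numpy as np

def euler_denoise(velocity_fn, x0, n_steps, c, q_l, kv_cache):
    """Integrate the velocity field from noise (t=0) toward a clean sample (t=1)."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(1, n_steps + 1):
        t_i = (i - 1) / n_steps
        x = x + velocity_fn(x, t_i, c, q_l, kv_cache) * dt
    return x

def closed_loop_rollout(initial_obs, c, q_l, horizon, chunk_size, n_steps,
                        velocity_fn, encode, step_robot, latent_dim, action_dim, rng):
    kv_cache = [encode(o) for o in initial_obs]            # prefill with real observations
    actions = []
    for _ in range(math.ceil(horizon / chunk_size)):
        x0 = rng.standard_normal(latent_dim + action_dim)  # fresh noise for every chunk
        x = euler_denoise(velocity_fn, x0, n_steps, c, q_l, kv_cache)
        a_clean = x[latent_dim:]          # x[:latent_dim], the imagined video, is dropped
        new_obs = step_robot(a_clean)     # execute the chunk, then read the real camera
        kv_cache.append(encode(new_obs))  # only real observations enter the cache
        actions.append(a_clean)
    return actions, kv_cache

# Toy setup: a velocity field that points straight at a fixed target sample.
target = np.arange(5, dtype=float)        # 3 latent dims followed by 2 action dims
toy_velocity = lambda x, t, c, q_l, kv: (target - x) / (1.0 - t)
actions, cache = closed_loop_rollout(
    initial_obs=[np.zeros(3)], c="close the drawer", q_l=np.zeros(7),
    horizon=6, chunk_size=2, n_steps=4,
    velocity_fn=toy_velocity, encode=lambda o: o, step_robot=lambda a: np.ones(3),
    latent_dim=3, action_dim=2, rng=np.random.default_rng(0))
print(len(actions), len(cache))  # 3 4
```

The cache grows by exactly one real observation per chunk, and the predicted video latents never appear in it, which is the closed‑loop property the section argues for.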
The accompanying diagram condenses this entire inference procedure into a clean pseudocode block. The function header DREAMZERO_INFERENCE(o_{0:l}, c, q_l, H) is highlighted, and the structured indentation makes the two‑phase loop—prefill then chunk‑wise autoregression—immediately legible. Inside the inner denoising loop the flow‑matching velocity call and Euler step are displayed with terse clarity. A small annotation bubble drawn beside the cache‑update line reads “closed‑loop,” drawing attention to the pivotal moment when reality overrides imagination. Beneath the code box, two bullet points remind the reader that the predicted video latent is discarded and that this closed‑loop strategy prevents compounding video‑prediction errors—a compact summary of the argument that the preceding paragraphs have built.

9. The Reactivity Gap: Why WAMs Are Slow

In the previous section, we saw how DreamZero leverages an autoregressive transformer with a persistent KV cache to perform closed‑loop inference: the model generates action chunks conditioned on past video frames and robot states, enabling the kind of reactive, history‑aware policy needed for real‑world manipulation. Yet moving from a conceptual closed‑loop design to live robot control exposes a harsh reality: the raw latency of world‑action model inference is orders of magnitude too slow for physical deployment. Even with caching, a naive DreamZero rollout on a single GPU requires roughly 5.7 seconds to produce a single action chunk—a duration that dwarfs the chunk’s own execution window and freezes any robot trying to use it.
To appreciate the severity of this gap, recall the inference pipeline. DreamZero is built around a large diffusion transformer (DiT) with about 14 billion parameters. During action generation, the model iteratively refines a noisy action trajectory through a learned denoising process, much like a standard diffusion model, but conditioned on visual and proprioceptive history. Each denoising step involves one full forward pass of that massive transformer, and the standard setup uses 16 denoising steps per chunk. Each such forward pass costs approximately 350 ms on a modern accelerator, owing to the huge model size and the overhead of attending to high‑dimensional multimodal embeddings. Multiply by 16 steps, and you are already at 5.6 seconds. Adding a modest 0.1 s for KV‑cache input/output and framework overhead pushes the naive total to 5.7 s.
That 5.7‑second latency is catastrophic for two intertwined reasons. First, it is purely sequential: the robot cannot begin executing the new chunk until the entire denoising chain finishes, so the arm effectively stalls while the model thinks. Second, the robot’s own execution rate is far faster. DreamZero plans in fixed action horizons, typically 48 steps of low‑level joint commands spanning 1.6 seconds at a 30 Hz control frequency. Once the chunk is ready, the robot streams those 48 actions over the next 1.6 s, after which it needs the next chunk immediately to avoid jerky stops. Consequently, the model’s inference latency must be comfortably lower than that 1.6 s window; in practice, to maintain smooth, reactive motion, the pipeline demands a per‑chunk latency below roughly 200 ms. That margin allows the planner to absorb variability in compute time, communication delays, and simple safety checks. But at 5.7 s, naive inference overshoots the target by more than a factor of 28. The reactivity gap is staggering: in the 1.6 s the robot needs to execute an entire action chunk, the model completes only about four of the 16 denoising steps (at 350 ms each) for its successor.
We can decompose the bottleneck into three principal components, each of which must be tackled by any practical speedup scheme:
Iterative denoising — 16 serial forward passes, each through the 14 B DiT. Removing or drastically reducing the number of steps is the most direct lever, but it must be done without destroying the quality of the planned actions.
Massive per‑step cost — each 350 ms forward pass is dominated by the immense parameter count and the large context size (visual tokens plus state history). Even if we cut the number of steps, a single pass would still be unacceptably slow.
Sequential execution — the robot waits for the full chunk. Any acceleration that is still per‑chunk will still leave the robot idle; a truly reactive system requires interleaving or prediction while acting.
Putting these numbers together gives the core equation that motivates the entire next stage of DreamZero’s architecture:
$$\text{Speedup required} \;=\; \frac{T_{\text{naive}}}{T_{\text{target}}} \;=\; \frac{5.7\ \text{s}}{0.2\ \text{s}} \;=\; 28.5\times$$
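The arithmetic behind these figures is simple enough to check directly. The snippet below only restates the numbers quoted in this section (16 steps, 350 ms per pass, 0.1 s overhead, 48 actions at 30 Hz, 200 ms target); the variable names are illustrative, and milliseconds are kept as integers to avoid floating‑point noise.

```python
steps_per_chunk = 16
ms_per_forward_pass = 350
overhead_ms = 100                      # KV-cache I/O and framework overhead

naive_ms = steps_per_chunk * ms_per_forward_pass + overhead_ms
print(naive_ms)                        # 5700

actions_per_chunk = 48
control_hz = 30
chunk_window_ms = actions_per_chunk * 1000 // control_hz
print(chunk_window_ms)                 # 1600

target_ms = 200
speedup_required = naive_ms / target_ms
print(speedup_required)                # 28.5
```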
A 28‑fold acceleration is not an incremental optimization target; it mandates a fundamental rethinking of how the diffusion model produces actions under a real‑time budget. The upcoming DreamZero‑Flash design achieves precisely this by decoupling noise schedules and exploiting a streaming inference regime—but before we dive into that technical remedy, it is worth pausing to let the scale of the problem sink in. The visual below distills the latency breakdown and the speedup factor into a single, stark comparison. The table lists each latency component and its contribution, contrasting the naive 5.7 s total against the 0.2 s real‑time target, while the equation at the bottom encapsulates the 28.5× gap in a form that echoes the central challenge of deploying large generative models on physical hardware. Seeing these numbers side‑by‑side makes it clear why a world‑action model that generates excellent policies in simulation can still be unusable on a real robot unless its inference is re‑engineered from the ground up.

10. DreamZero-Flash: Decoupled Noise Schedules

When we last examined the reactivity gap, it became clear that conventional world-action models incur a steep penalty at the control interface: generating video frames forces the model to commit a substantial fraction of the denoising budget to pixel-space synthesis before the first action vector even begins to crystallize. The natural engineering instinct is to shorten the denoising trajectory—run fewer steps across the entire joint input—so that the network produces a usable action faster. Armed with the unified noise schedule from the initial DreamZero formulation, one might simply reduce the number of Euler steps from, say, four to one. Yet this naive acceleration collapses action quality. The reason illuminates a subtle distributional mismatch in the training of dual-modal diffusion.
In a typical joint diffusion, every sample index $k$ within a context window receives the same noise level. That is, both the predicted video latent $\mathbf{z}^k$ and the corresponding action chunk $\mathbf{a}^k$ are corrupted to an identical timestep $t_k$, drawn uniformly from $(0,1)$. During multi-step inference, the cascaded denoising ensures that the video and action pathways evolve in lockstep: they see a spectrum of coupling, from pure noise to clean signal. When you compress this to a single forward pass, the model is forced to jump from $t=0$ (pure noise) to $t=1$ (clean prediction) in one shot. But in training it almost never encountered the situation where the video branch is heavily corrupted while the action branch is simultaneously almost clean. That particular cross-modal state, with high noise on the visual side coupled with low noise on the action side, lies outside the manifold explored by a uniform joint schedule. The consequence is that a single-step model hallucinates action snippets that are plausible in isolation but disconnected from the visual goal, erasing the careful alignment that zero-shot policies demand.
DreamZero-Flash surgically resolves this mismatch by introducing decoupled noise schedules. During training, the timestep for the video component is deliberately biased toward high noise, while the action timestep remains uniform. Concretely, for each context index $k$ we sample an auxiliary variable $\eta \sim \text{Beta}(\alpha, \beta)$ with hyperparameters chosen so that $\alpha > \beta$ (for instance, $\alpha=7, \beta=1$, giving $\mathbb{E}[\eta] = 0.875$). We then set the video timestep as
$$t_{\text{video},k} = 1 - \eta,$$
which concentrates probability mass near zero—intuitively, the video stream is almost always “early” in the diffusion process, hence drenched in noise. The action timestep remains governed by a standard uniform:
$$t_{\text{action},k} \sim U(0,1).$$
From these per-modality timesteps we construct the noisy inputs via the standard forward noising formulas:
$$\mathbf{z}^k_{t_{\text{video},k}} = t_{\text{video},k} \mathbf{z}^k_1 + (1 - t_{\text{video},k}) \mathbf{z}^k_0, \qquad \mathbf{a}^k_{t_{\text{action},k}} = t_{\text{action},k} \mathbf{a}^k_1 + (1 - t_{\text{action},k}) \mathbf{a}^k_0.$$
The training objective becomes a simple modification of the flow-matching loss, where the model $\mathbf{u}_\theta$ now receives two separate timestep scalars (one for video, one for actions) concatenated as an additional conditioning vector:
$$\mathcal{L}(\theta) = \mathbb{E} \Bigg[ \frac{1}{K} \sum_{k=1}^{K} w(t_k) \Big\| \mathbf{u}_\theta\!\left( \begin{bmatrix}\mathbf{z}^k_{t_{\text{video},k}}\\ \mathbf{a}^k_{t_{\text{action},k}}\end{bmatrix};\, \mathcal{C}_k, c, \mathbf{q}_k, \begin{bmatrix}t_{\text{video},k} \\ t_{\text{action},k}\end{bmatrix} \right) - \mathbf{v}^k \Big\|^2 \Bigg],$$
where $\mathbf{v}^k = [\mathbf{z}^k_1 - \mathbf{z}^k_0; \mathbf{a}^k_1 - \mathbf{a}^k_0]$ is the target velocity. The crucial insight is that the biased sampling actively exposes the model to the exact regime that single-step inference requires: the video branch is at a low timestep (heavily corrupted), while the action branch is at various levels, including nearly clean. The Beta distribution effectively stretches the denoising curriculum so that the network must learn to infer crisp actions from severely degraded visual context. In probabilistic terms, we are densifying the cross-modal joint density in the region $(t_{\text{video}} \ll t_{\text{action}})$, which was sparsely visited before.
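The sampling scheme is easy to simulate. The sketch below draws decoupled timesteps with the example values $\alpha=7$, $\beta=1$ and builds the noisy inputs and target velocity from them. The function names, array shapes and sample count are illustrative assumptions; only the sampling rules and interpolation formulas come from the text.

```python
import numpy as np

def sample_decoupled_timesteps(num_chunks, rng, alpha=7.0, beta=1.0):
    """Video timesteps biased toward 0 (high noise); action timesteps uniform."""
    eta = rng.beta(alpha, beta, size=num_chunks)
    t_video = 1.0 - eta
    t_action = rng.uniform(0.0, 1.0, size=num_chunks)
    return t_video, t_action

def noise_inputs(z1, a1, t_video, t_action, rng):
    """Interpolate each modality with its own timestep; return inputs and target velocity."""
    z0 = rng.standard_normal(z1.shape)
    a0 = rng.standard_normal(a1.shape)
    z_t = t_video[:, None] * z1 + (1.0 - t_video[:, None]) * z0
    a_t = t_action[:, None] * a1 + (1.0 - t_action[:, None]) * a0
    v = np.concatenate([z1 - z0, a1 - a0], axis=1)
    return z_t, a_t, v

rng = np.random.default_rng(0)
t_video, t_action = sample_decoupled_timesteps(100_000, rng)
frac_video_noisier = np.mean(t_video < t_action)
print(t_video.mean(), t_action.mean(), frac_video_noisier)

z_t, a_t, v = noise_inputs(rng.standard_normal((4, 8)), rng.standard_normal((4, 3)),
                           t_video[:4], t_action[:4], rng)
```

With these hyperparameters the video timestep averages about 0.125, and the video branch is noisier than the action branch in roughly 87.5% of samples (since $P(t_{\text{video}} < t_{\text{action}}) = \mathbb{E}[\eta]$ when the action timestep is uniform), compared with never under a shared timestep.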
This design yields a remarkable practical payoff. At deployment time, we can run a single Euler step from $t=0$ to $t=1$ for the actions, while the video component is clamped to a partially noisy state or optionally refined with a few inexpensive operations (the latter is tolerated because video quality is not the bottleneck for control frequency). Because training has already saturated the model on the $t_{\text{video}} \approx 0.125$ condition, the action prediction at inference is statistically indistinguishable from a multi-step denoised action; quality is retained while effective action latency is halved compared to the four-step regime. The maneuver does not require architectural surgery; it is a pure data-schedule intervention that exploits the asymmetry between the two modalities’ tolerance to noise. Video, being a conditioning signal for the policy, can remain stochastic as long as the recovery of the action trajectory is accurate and temporally coherent.
A diagrammatic view captures this asymmetry succinctly. On the left, the training setup contrasts two vertical noise bars: the video bar is almost entirely filled, dominated by high noise due to the Beta-bias, while the action bar exhibits a clean, uniform mix. A small schematic of the skewed Beta(7,1) density feeds into the video timestep, underlining that most training samples force the model to resolve actions from a barely recognizable visual scaffold. On the right, the inference panel shows a single-step leap: the action bar transitions from full gray to solid green in one stride, whereas the video bar retains its partial noise, reflecting the reality that visual perfection is not required for robust control. The sequence of arrows from the training diagram to the inference result reinforces the core message: decoupling noise schedules during training aligns the model’s experience with the one-step regime, so latency drops relative to multi-step denoising without sacrificing the action prediction fidelity that defines a zero-shot policy.

11. Inference Speedup Stack (38×)

With a decoupled noise schedule, DreamZero‑Flash already shortens the denoising chain to a handful of steps, but the raw compute cost of even a short chain on a contemporary GPU still sits far above the tight latency budget of a real‑time control loop. A world action model that must run at the robot’s natural update rate (commonly 5–10 Hz) cannot afford amortized inference times measured in seconds. To close the gap from a promising research prototype to a deployable zero‑shot policy, the DreamZero team engineered a holistic inference speedup stack that delivers a cumulative 38× throughput gain relative to the original video‑action diffusion baseline, ultimately achieving stable closed‑loop control at 7 Hz. Understanding this stack means appreciating how system‑level pragmatics and model‑level algorithmic shortcuts can compound without sacrificing the generative quality that makes zero‑shot generalization possible.
The optimization journey begins with the observation that the most compute‑intensive component of the DreamZero architecture is the autoregressive video‑conditioned denoiser. Each denoising step queries a large transformer to refine the action chunk while cross‑attending to the history of video latent features. If every step recomputes the full attention over the same conditioning context, the overhead swells linearly with the number of denoising steps. Attention caching is therefore the first low‑hanging fruit: the keys and values derived from the static video conditioning are computed once per control cycle and reused across all denoising steps. In practice, this avoids roughly half the FLOPs inside the cross‑attention blocks because the context embeddings never change between steps. Combining caching with FlashAttention‑2 kernels eliminates the memory‑bound bottleneck, giving a straightforward 2× latency reduction on modern hardware.
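The caching idea can be illustrated with a minimal NumPy sketch. It is not DreamZero's implementation: the single-head layout, the class name `CachedCrossAttention`, the toy update rule inside `denoise`, and all tensor sizes are assumptions made for illustration. The point is only that the context keys and values are projected once per control cycle and reused across denoising steps without changing the result.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CachedCrossAttention:
    """Hypothetical single-head cross-attention with a reusable K/V cache."""

    def __init__(self, d_model, rng):
        scale = 1.0 / np.sqrt(d_model)
        self.w_q = rng.standard_normal((d_model, d_model)) * scale
        self.w_k = rng.standard_normal((d_model, d_model)) * scale
        self.w_v = rng.standard_normal((d_model, d_model)) * scale
        self.d_model = d_model
        self.cache = None
        self.kv_projections = 0  # how many times the context was projected

    def project_context(self, context):
        self.kv_projections += 1
        return context @ self.w_k, context @ self.w_v

    def __call__(self, queries, context, use_cache=True):
        if use_cache:
            if self.cache is None:
                self.cache = self.project_context(context)
            k, v = self.cache
        else:
            k, v = self.project_context(context)
        q = queries @ self.w_q
        weights = softmax(q @ k.T / np.sqrt(self.d_model))
        return weights @ v

def denoise(attn, action_tokens, context, steps, use_cache):
    """Toy refinement loop standing in for the real denoiser."""
    attn.cache = None  # a new control cycle invalidates the old context
    x = action_tokens
    for _ in range(steps):
        x = x + 0.1 * attn(x, context, use_cache=use_cache)
    return x

rng = np.random.default_rng(0)
context = rng.standard_normal((64, 32))   # static video conditioning tokens
actions = rng.standard_normal((8, 32))    # noisy action chunk tokens

cached = CachedCrossAttention(32, np.random.default_rng(1))
uncached = CachedCrossAttention(32, np.random.default_rng(1))
out_cached = denoise(cached, actions, context, steps=8, use_cache=True)
out_uncached = denoise(uncached, actions, context, steps=8, use_cache=False)
```

Both runs produce the same action tokens, but the cached variant projects the context once instead of once per step, which is where the saving inside the cross-attention blocks comes from.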
But even with cached context, the absolute step count matters greatly. The decoupled noise schedule from the previous section already slashes the number of denoising passes from a default 1000‑step DDPM schedule down to 8–16 steps without visible degradation of action quality. That single algorithmic redesign contributes a ~4× speedup. Yet the denoiser itself, in its original training precision and eager execution mode, still leaves abundant room for acceleration. Moving the entire inference graph to mixed precision (FP16) with selective INT8 quantization of the heaviest linear layers—while keeping the diffusion latents in bfloat16—yields another factor of 2.5× in raw matmul throughput on tensor cores, without any noticeable loss of action accuracy. The quantized weights are pre‑calibrated on a small validation set, and the straight‑through gradient that was used during training ensures the model is inherently robust to numerical noise.
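To make the INT8 step concrete, here is a small sketch of symmetric per-tensor weight quantization. The source does not specify the scheme, so the max-abs scale used here is an assumption standing in for the calibration on a validation set; the function names are hypothetical.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float weight matrix from the INT8 codes."""
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))        # stand-in for a heavy linear layer
x = rng.standard_normal((4, 64))         # stand-in for activations

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

max_weight_err = np.abs(w - w_hat).max()                            # bounded by scale / 2
rel_output_err = np.linalg.norm(x @ w - x @ w_hat) / np.linalg.norm(x @ w)
```

Round-to-nearest keeps every weight within half a quantization step of its original value, which is why the heaviest linear layers can often tolerate this treatment. Whether action accuracy survives in practice still has to be checked empirically.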
Beyond data-type optimization, a combination of kernel fusion and graph compilation tightens the execution. The DreamZero denoiser contains many small operations (layer normalization, activation functions, residual additions) that, executed naively, waste the GPU's memory bandwidth and add scheduling overhead. By hand-writing custom CUDA kernels for the fused group-norm-and-SiLU pattern and then applying torch.compile to the entire autoregressive loop with full-graph capture, the framework cuts kernel launch overhead by over 80%. The result is roughly a 1.8× wall-clock speedup. Moreover, within the closed-loop rollout, the model can reuse the KV cache not only across denoising steps but also across consecutive control cycles when the video context overlaps; a sliding-window caching scheme lifts the amortized speed to a cumulative 38× compared to the unoptimized research code.
Importantly, none of these optimizations are independent; their gains partially multiply but also exhibit diminishing returns because some bottlenecks shift from compute to memory to launch overhead as each is removed. The final stack, carefully profiled on the target deployment GPU, balances these factors. The visual that follows captures the essence of the 38× speedup as a layered bar chart, where each horizontal segment corresponds to one optimization tier—attention caching, step reduction via decoupled noise, quantization, and kernel+graph optimizations—annotated with its approximate contribution. Together they form a pragmatic recipe that transforms a video‑conditioned diffusion model from a heavy academic artifact into a policy that can react to novel scenes seven times each second.
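As a sanity check on how the tiers compound, the approximate factors quoted above can simply be multiplied. The per-tier numbers come from this section's prose; treating them as independent multipliers, and reading the 38× as applying to the same control-cycle metric as the 7 Hz figure, are simplifying assumptions.

```python
import math

# Approximate per-tier factors quoted in the text (not measured values).
tiers = {
    "attention caching + FlashAttention-2": 2.0,
    "step reduction via decoupled noise": 4.0,
    "mixed precision + INT8 quantization": 2.5,
    "kernel fusion + graph compilation": 1.8,
}

product = math.prod(tiers.values())   # naive multiplicative estimate
reported_total = 38.0                 # cumulative figure quoted in the text
residual = reported_total / product   # left over for sliding-window cache reuse and rounding

optimized_hz = 7.0
baseline_seconds_per_cycle = reported_total / optimized_hz  # implied unoptimized latency

print(f"product of quoted tiers: {product:.1f}x")
print(f"residual factor to reach 38x: {residual:.3f}")
print(f"implied baseline latency: {baseline_seconds_per_cycle:.2f} s per cycle")
```

The four quoted tiers multiply to 36×, so the remaining factor of about 1.06 is what cross-cycle cache reuse and rounding need to account for. The implied baseline of roughly 5.4 seconds per cycle is consistent with the earlier remark that unoptimized inference was measured in seconds.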

12. Zero-shot Generalization to Unseen Tasks and Environments

The previous section demonstrated how DreamZero achieves a 38× inference speedup through an optimized inference stack, enabling closed-loop control at 7 Hz. Speed alone, however, offers little value if the actions themselves are brittle. The defining promise of world action models is that they can act in worlds they have never physically visited. This section tests that claim directly: can DreamZero perform tasks that involve entirely new verbs, objects, and motion patterns, without any task-specific finetuning?
Vision-Language-Action models (VLAs) have dominated recent robot learning by directly mapping observations and language instructions to motor commands. Their training objective is to maximize the likelihood of expert actions conditioned on vision and text. When confronted with a task whose motion vocabulary lies outside the training distribution—say, “slide the mug across the table” when the robot has only seen picking and placing—a VLA must extrapolate a control policy from a latent space shaped entirely by labeled action data. The result is often catastrophic: the model produces arbitrary joint movements, because it has no mechanism to imagine the physical consequences of a motion before executing it. DreamZero sidesteps this brittleness by decoupling the “what” from the “how.” It first predicts a future video of the desired outcome, then extracts actions via an inverse dynamics model that is trained only on universal physical interaction data. This means that even if a task verb has never been coupled with a specific robot embodiment, the system can still generate a plausible video of the goal (thanks to its pretrained video model’s semantic and physical knowledge) and then faithfully realize it.
The joint video-action generation objective underlies this capability. DreamZero models the conditional distribution $p(v_{1:T}, a_{1:T} \mid o_0, l)$, where $v_{1:T}$ is a sequence of future frames, $a_{1:T}$ the action chunk, $o_0$ the initial observation, and $l$ the language instruction. By factorizing this as $p(v_{1:T} \mid o_0, l) \cdot p(a_{1:T} \mid v_{1:T}, o_0)$, the system first samples a plausible video rollout from the world model and then conditions action selection on that rollout. The inverse dynamics model $p(a_t \mid o_t, o_{t-1})$ is learned from a diverse corpus of robot play data that covers a wide range of physical contact events but not specific task semantics. Because physical dynamics (friction, object displacement, collision) are largely invariant across tasks, this module generalizes well as long as the video it receives is physically coherent. In zero-shot scenarios, the video model may occasionally produce implausible sequences, but the inverse model will still attempt to track them; thus any performance gap originates upstream in video quality, not in action decoding.
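The two-stage factorization can be sketched in a toy world where an "observation" is just a 2D position and an action is a displacement. Everything here is a hypothetical stand-in: `toy_video_model` replaces the video prior, `toy_inverse_dynamics` replaces the learned inverse dynamics, and the additive environment is an assumption chosen so the behaviour is easy to verify.

```python
import numpy as np

def toy_video_model(o0, goal, horizon):
    """Stand-in for p(v_{1:T} | o_0, l): straight-line 'frames' from o0 to the goal."""
    alphas = np.linspace(0.0, 1.0, horizon + 1)[1:, None]
    return (1 - alphas) * o0 + alphas * goal

def toy_inverse_dynamics(o_prev, o_next):
    """Stand-in for p(a_t | o_t, o_{t-1}): here the action is the displacement."""
    return o_next - o_prev

def actions_from_frames(o0, frames):
    """Decode one action per consecutive pair of frames, starting from o0."""
    prev = np.vstack([o0, frames[:-1]])
    return toy_inverse_dynamics(prev, frames)

def execute(o0, actions):
    """Toy environment with dynamics o_{t+1} = o_t + a_t."""
    return o0 + np.cumsum(actions, axis=0)

o0 = np.array([0.0, 0.0])
goal = np.array([1.0, 2.0])

frames = toy_video_model(o0, goal, horizon=4)
actions = actions_from_frames(o0, frames)
trajectory = execute(o0, actions)

# Corrupt one imagined frame to mimic a video-generation artifact.
bad_frames = frames.copy()
bad_frames[1] += np.array([0.5, -0.5])
bad_trajectory = execute(o0, actions_from_frames(o0, bad_frames))
```

In this toy setting the executed trajectory reproduces whatever frames were imagined, including the corrupted one. That mirrors the claim above that errors originate in the video stage rather than in action decoding.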
The experimental validation was designed to stress-test exactly this decoupling. Two distinct robot embodiments—the AgiBot G1 arm and the DROID-Franka platform—were evaluated on a suite of manipulation tasks split into “seen” and “unseen” categories. The ten seen tasks involve novel combinations of environments and objects but use motions that appear in the training corpus. The ten unseen tasks are strictly zero-shot: they demand new verbs (e.g., “tilt the cup,” “scoop the beans”) that were never paired with robot actions during training. This split exposes whether a model merely composes existing skills or truly invents new physical behaviors. Performance was measured by average task progress, a continuous score from 0% (failure) to 100% (complete success), judged by a motion-capture system and human verification.
DreamZero's results on the AgiBot G1 immediately separate it from the VLA baselines. On seen tasks, DreamZero achieved a mean progress of 62.2%, while the best pretrained VLA (π0.5) reached only 27.4%. The gap persisted on unseen tasks: DreamZero managed 39.5% versus the VLA's 16.3%. VLAs trained from scratch on the same data collapsed to near 0% on unseen tasks, confirming that action-only supervision provides no inductive bias for composing new motions. On the DROID-Franka platform, which has different kinematics and gripper dynamics, DreamZero attained 49% on unseen tasks compared to 33% for the VLA, a substantial margin, though these numbers are not directly comparable to the AgiBot results because the platform poses different control challenges. Crucially, DreamZero's drop from seen to unseen was about 23 percentage points on AgiBot, while the VLA's drop was 11 points; this might misleadingly suggest that the VLA degrades less. In reality, the VLA's seen-task performance was already poor, so its residual variance is compressed. DreamZero's high ceiling and graceful degradation reflect a genuine zero-shot reasoning capability, not a floor effect.
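A short calculation shows why the absolute drop is a misleading yardstick. It uses only the AgiBot G1 percentages reported above; the "retention" ratio (unseen score divided by seen score) is an illustrative metric introduced here, not one taken from the paper.

```python
# AgiBot G1 task-progress percentages as reported in the text.
results = {
    "DreamZero": {"seen": 62.2, "unseen": 39.5},
    "pi0.5 (VLA)": {"seen": 27.4, "unseen": 16.3},
}

def summarize(scores):
    """Return (absolute drop in points, fraction of seen-task progress retained)."""
    drop = scores["seen"] - scores["unseen"]
    retention = scores["unseen"] / scores["seen"]
    return round(drop, 1), round(retention, 3)

summary = {name: summarize(scores) for name, scores in results.items()}
unseen_ratio = results["DreamZero"]["unseen"] / results["pi0.5 (VLA)"]["unseen"]

for name, (drop, retention) in summary.items():
    print(f"{name}: drop {drop} points, retains {retention:.1%} of seen-task progress")
print(f"DreamZero / VLA on unseen tasks: {unseen_ratio:.2f}x")
```

DreamZero loses more points in absolute terms (22.7 versus 11.1) yet retains a slightly larger share of its seen-task progress (about 63.5% versus 59.5%), and it remains roughly 2.4 times ahead on unseen tasks.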
The diagnostic that clinches the argument is the failure analysis. Detailed inspection of rollouts showed that DreamZero's execution errors stemmed almost entirely from video generation artifacts: frames where an object temporarily vanishes, a gripper penetrates a table, or a liquid fails to flow naturally. In every such case, the inverse dynamics model tracked the flawed video with high fidelity, producing actions that were physically correct with respect to the hallucinated visual input. When a stronger video backbone (as discussed earlier) was substituted, task progress improved proportionally. Conversely, no amount of action-level tuning in VLAs could overcome the inability to imagine the desired motion in the first place. This finding solidifies the central thesis: world action models inherit zero-shot generalization from their video generation foundation, and their action extraction is near-optimal given the predicted video.
The visual below summarizes these comparisons as a grouped bar chart. The chart organizes results by embodiment and task category. For the AgiBot G1, two groups—Seen Tasks and Unseen Tasks—each contain a blue bar for DreamZero and an orange bar for the best VLA. The heights mirror the reported percentages, immediately conveying the 2–3× advantage of DreamZero. The DROID-Franka cluster repeats the same color coding for unseen tasks. A dashed horizontal line at 0% marks the scratch VLA collapse, reinforcing that from-scratch action-only learning fails completely on novel verbs. An inset annotation points to DreamZero’s bars and notes that remaining errors trace back to video generation quality, not action inference. This compact diagram transforms the numerical evidence into a single visual argument: zero-shot generalization is a property of the world model, and DreamZero leverages it to act where VLAs cannot.

13. Cross-Embodiment Transfer and Few-shot Adaptation

The previous section demonstrated that DreamZero, pretrained on a single embodiment (AgiBot) with language-annotated demonstrations, can handle a broad set of unseen tasks and environments—achieving 38.3% average task progress across nine novel scenarios. That baseline already reflects the power of a world action model (WAM) to generalise when the training distribution is diverse enough. But real-world deployment demands more than generalising within the same robot morphology. A truly capable policy should absorb visual information from entirely different bodies—other robots, or even humans—and quickly adapt to a new embodiment with minimal interaction. The results on cross-embodiment transfer and few-shot adaptation make a compelling case that DreamZero’s architecture is uniquely suited to this challenge.
The key mechanism lies in the decoupled nature of the world action model. At its core, WAM is a video predictor: given past observed frames $o_{<t}$ and a language instruction $l$, it autoregressively forecasts future frames $\hat{o}_{t:t+H}$. Actions, when provided, condition this generation via cross-attention, but they are not required for the model to imagine plausible visual futures. Consequently, the world model can be trained on video-only demonstrations (sequences where actions are missing) and still absorb the visual dynamics, object interactions, and task semantics present in those recordings. This is fundamentally different from a standard VLA (Vision–Language–Action) model, which would be blind to data that lacks action labels.
Suppose we have a new embodiment, YAM, with different kinematics and a different control interface. We record 12–20 minutes of video demonstrations of various tasks, but without any corresponding action signals. By co-training DreamZero on a 1:1 mixture of the original AgiBot action-labelled data and the YAM video-only data, the model’s latent space begins to encode the visual appearance of task progress on the YAM body—how its gripper approaches an object, how it pours a liquid, or how it opens a drawer. The action head remains supervised only on the AgiBot portion, yet the shared visual backbone and the autoregressive video prior now carry a much richer notion of what successful behaviour looks like regardless of embodiment. When we later query this co-trained model on the YAM robot, the implicit inverse dynamics—the mapping from the predicted visual trajectory to the actions that would bring it about—can exploit this strong visual prior to produce better actions, even though the model has never seen a single YAM action during training.
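The co-training recipe described above can be sketched as a loss in which every sample supervises video prediction but only action-labelled samples supervise the action head. This is a minimal illustration under stated assumptions: mean squared error stands in for the actual diffusion objective, and the function names, tensor shapes, and NaN placeholders for missing labels are all hypothetical.

```python
import numpy as np

def sample_mixed_batch(rng, batch_size):
    """1:1 mixture: half the samples carry action labels, half are video-only."""
    has_action = np.zeros(batch_size, dtype=bool)
    has_action[: batch_size // 2] = True
    rng.shuffle(has_action)
    return has_action

def cotraining_loss(video_pred, video_target, action_pred, action_target, has_action):
    """Video loss over all samples; action loss only where labels exist."""
    video_loss = float(np.mean((video_pred - video_target) ** 2))
    if has_action.any():
        diff = action_pred[has_action] - action_target[has_action]
        action_loss = float(np.mean(diff ** 2))
    else:
        action_loss = 0.0
    return video_loss + action_loss, video_loss, action_loss

rng = np.random.default_rng(0)
has_action = sample_mixed_batch(rng, batch_size=8)

video_pred = rng.standard_normal((8, 16))
video_target = rng.standard_normal((8, 16))
action_pred = rng.standard_normal((8, 7))
action_target = rng.standard_normal((8, 7))
action_target[~has_action] = np.nan   # video-only samples have no action labels

total, video_loss, action_loss = cotraining_loss(
    video_pred, video_target, action_pred, action_target, has_action
)
```

Because the action term indexes only the labelled rows, the missing labels never reach the loss, while the video term still receives signal from every sample in the batch.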
The quantitative evidence underscores the effectiveness. Co-training with YAM video-only data lifts task progress from the baseline 38.3% to 55.4%. Remarkably, substituting human video demonstrations (with their dramatic morphological gap) yields a near-equivalent 54.3%. The fact that human video, with its five-fingered hands and entirely different joint configurations, can almost match robot–robot transfer suggests that the world model primarily captures high-level visual dynamics of objects and environments, not low-level kinematics. The human data provides a strong prior about how a cup should tilt, where an item should be placed, or how a cloth should be folded—information that seamlessly transfers to the robotic execution policy.
The morphological gap is not fully bridged, of course: the model still needs to learn the specific action mapping of the new body. That is where few-shot embodiment adaptation shines. Starting from the AgiBot-pretrained WAM, the team post-trained on only 30 minutes of YAM play data (11 tasks performed naturally, without costly task-specific labelling). After this brief adaptation phase, the policy retained its robust language following and, crucially, generalised to novel objects that were never seen during those 30 minutes of play. The sample efficiency arises because the world model's video predictor already knows what should happen; the fine-tuning merely teaches the minimal inverse dynamics required to translate that internal visual plan into the new embodiment's action space. In essence, the model has learned an implicit inverse dynamics $a_t = f(o_{<t}, l, \hat{o}_{t:t+H})$ where the heavy lifting is done by the predicted future frames $\hat{o}$, and only a lightweight adjustment is needed to connect those predictions to the new joint space.
The visual below takes these abstract mechanisms and grounds them in the concrete experimental outcomes we just discussed. It lays out the core transfer pipeline—pretraining on AgiBot, co-training with video from a different source, and a separate few‑shot adaptation branch—and then solidifies the numbers with a comparison table and a succinct callout. The table side‑by‑side for YAM and human sources, with the bold performance jumps, reinforces the message that world‑model priors from video alone dramatically close the gap toward fully trained in‑domain policies. The lower callout for the 30‑minute adaptation highlights that it is the implicit inverse dynamics, nurtured by the video prediction objective, that makes such rapid embodiment switching possible. Together, the diagram serves as a compact summary of why DreamZero’s decoupled architecture turns video‑only data from any source into a zero‑shot policy amplifier.

14. Ablations: Data Diversity, Scale, and Architecture

The previous section established that DreamZero can transfer its video-based planning strategy across robot morphologies and adapt to new tasks with just a handful of demonstrations. These are remarkable feats for a single learned world-action model, but they immediately raise a practical question: what actually makes this possible? Is it the sheer amount of training data, the choice of architecture, or something more subtle about how the model is trained? To dissect these factors, the authors run a systematic set of ablations on a pared-down benchmark — PnP Easy tasks — training each variant for 50,000 steps with a batch size of 32. The results paint a clear picture: the success of DreamZero rests on two pillars, data diversity and autoregressive modeling, while scaling model capacity alone is a surprisingly weak lever when the data itself is ill-structured.
The first ablation tackles data diversity. Training a 14B-parameter DreamZero model on 500 hours of repetitive demonstration data (sampled from a narrow distribution of behaviors) yields a task progress of only 33%. In contrast, the same model trained on a diverse set of 500 hours — covering a wide variety of objects, scenes, and motion patterns — reaches a solid 50% progress. Why does this gap exist? DreamZero learns a joint distribution over future frames and actions, which implicitly encodes an inverse dynamics model: given the current observation and a candidate future frame, it must infer the action that bridges them. When the training data is repetitive, the mapping from a desired visual change to an action is under-constrained and ambiguous; the model never sees enough distinct state-action pairings to learn a robust general inverse dynamics. Diverse data, on the other hand, exposes the model to myriad situations where the same visual subgoal demands different actions depending on context, forcing it to build richer conditional representations. In short, repetitive data starves the inverse dynamics model of the variety it needs.
The second factor is model scale. Moving from a 5B-parameter to a 14B-parameter DreamZero (both trained on diverse data) lifts task progress from 21% to 50%. The smaller model generates videos with noticeable visual artifacts and hallucinations — objects morphing, grippers disappearing, or physically impossible transitions — which corrupt the planning process. The larger model produces cleaner future frames, which in turn yield more reliable action inference. This is a standard scaling effect: more capacity reduces the video prediction error, and better visual foresight directly improves the zero-shot policy. However, note that even the 14B model only reaches 50% in this controlled setting, underscoring that capacity helps but does not magically solve the problem without the right data recipe.
The third ablation contrasts the autoregressive (AR) architecture of DreamZero with a bidirectional (BD) attention mask over the video-action sequence. Both architectures achieve 50% task progress — no accuracy difference — but the autoregressive version runs 3 to 4 times faster during inference. This speedup comes from KV caching: because AR generation only attends to past tokens, the key-value representations can be cached and reused, avoiding quadratic recomputation at each generation step. Bidirectional models must reprocess the entire sequence for every new token, making real-time control infeasible. Moreover, the AR model produces noticeably smoother motions; its causal inductive bias aligns more naturally with the sequential decision problem, preventing the planner from “looking ahead” and cheating on future frames.
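The reason KV caching is available to the causal model can be shown in a few lines of NumPy. The sketch below is a generic single-head illustration, not DreamZero's attention code; queries, keys, and values are passed in directly rather than produced by learned projections, which is a simplification.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention_full(q, k, v):
    """Recompute attention over the whole sequence with a causal mask."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def causal_attention_cached(q, k, v):
    """Process one token at a time, appending its key/value to a cache."""
    d = q.shape[1]
    k_cache, v_cache, outputs = [], [], []
    for t in range(q.shape[0]):
        k_cache.append(k[t])
        v_cache.append(v[t])
        keys = np.stack(k_cache)      # all keys seen so far
        values = np.stack(v_cache)    # all values seen so far
        weights = softmax(q[t] @ keys.T / np.sqrt(d))
        outputs.append(weights @ values)
    return np.stack(outputs)

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 16))
k = rng.standard_normal((6, 16))
v = rng.standard_normal((6, 16))

full = causal_attention_full(q, k, v)
cached = causal_attention_cached(q, k, v)
```

The two functions return the same outputs, because under a causal mask the result for an earlier token never depends on later ones, so earlier keys and values can be stored and reused. With a bidirectional mask every output depends on every token, so adding a token invalidates all previous results.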
Perhaps the most striking finding sits in a small callout box below the main table: Vision-Language-Action (VLA) models fail completely on heterogeneous data. In the same PnP Easy setting, both 5B and 14B VLA variants achieve 0% task progress when trained on diverse demonstrations. This is a sobering result. It confirms that simply scaling a VLA — even to the same parameter count as DreamZero — cannot overcome the fundamental difficulty of modeling highly heterogeneous action distributions. VLAs map raw observations directly to motor commands without an explicit world-modeling component; when the data mixes wildly different behaviors and embodiments, the action prediction head collapses, unable to disentangle the many-to-many relationship between observations and actions. DreamZero’s world-action factorization, which separates visual foresight from action inference, proves essential for digesting this diversity.
The visual that accompanies this section distills these ablation results into a clean, at-a-glance table. Each of the three ablation factors gets two sub-rows (the alternative condition and the baseline), with the best-performing condition bolded for immediate comparison. A fourth column, “Key Insight,” spells out the why behind each number: the need for diverse state-action correspondences, the role of scaling in reducing hallucinations, and the speed advantage of autoregressive generation. Below the table, a bordered callout box with a red left edge draws the eye to the devastating VLA failure: “5B and 14B VLAs achieve 0% task progress on diverse demonstrations, confirming that capacity alone cannot overcome the difficulty of modelling heterogeneous action distributions.” This layout allows the reader to absorb the empirical story in seconds — the two winning conditions (diverse data + AR architecture) standing out against the weaker alternatives, and the VLA collapse serving as a sharp reminder that architecture and representation matter far more than raw parameter count when the data is messy.

15. Lessons, Future Directions, and Open Challenges

After dissecting the individual knobs—data diversity, scale, and architectural choices—it is time to step back and ask what the whole DreamZero experiment tells us about world action models (WAMs) as a policy class. The ablation studies confirmed that performance lives and dies by the richness of the training data and that scale matters, but the deeper picture is more provocative: a model trained simply to imagine future scenes and the actions that cause them can emerge as a surprisingly generalist robotic policy. That shift from bespoke action-labeled datasets to large, diverse video corpora has implications that go well beyond a single benchmark.
The central lesson is that joint video-action prediction functions as an implicit inverse dynamics model. When the network is forced to predict the next observation $\mathbf{o}_{t+1}$ alongside the action $\mathbf{a}_t$, it must learn a consistent mapping from current state and desired future state to the action that bridges them. Crucially, this connection is learned not from explicit goal–action pairs but from observing how the world changes under continuous motion. In practice, the model becomes adept at answering the question: “What action would carry me from where I am to the future frame I just hallucinated?” That is the essence of zero-shot policy extraction: no task-specific reward or hand-crafted planner is needed; the world model itself generates a sequence of imagined futures, and the action head converts each imagined transition into a motor command. This is why diverse, non-repetitive data is so critical: only when the training videos contain a rich soup of motion patterns can the implicit inverse dynamics generalize to unseen behaviors and environments.
The empirical record justifies the excitement. In zero-shot generalization tests, DreamZero attains 62% task progress on seen task families and 40% on entirely unseen tasks, while the best Vision-Language-Action (VLA) baseline manages only 27% and 16% respectively. These numbers reflect more than raw performance; they reveal a qualitative difference in how the system transfers. A VLA that maps language instructions to actions struggles when a command requires a motion never seen during training, because the action space is gated by the narrow distribution of its annotated demonstrations. DreamZero’s world-action model, by contrast, can imagine a future that satisfies the language goal and then invert that imagined trajectory into actions—even if that precise motion never appeared in its training set. The result is a policy that can climb stairs, slide objects, or reconfigure furniture in novel layouts without having been explicitly taught to do so.
Equally striking is the model’s capacity for cross-embodiment transfer. DreamZero leverages video-only demonstrations—from human hands or from robots with different kinematics—and translates them into effective behavior on a target robot. There is no need for paired action labels on the source data; the world model simply observes the visual motion and learns to reproduce it through its own body. This bootstrapping effect was shown to boost performance when human videos are added to the training mix, and even when only robot videos from another embodiment are available. The model implicitly disentangles the “what” (the visual trajectory) from the “how” (the specific motor commands), enabling a primitive form of cross-morphology imitation. Moreover, with just 30 minutes of free play data on the target setup, DreamZero adapts rapidly to follow language commands and generalize to novel objects—a property that positions WAMs as strong candidates for fast deployment in new environments.
All of this, however, is gated by one bottleneck: video generation fidelity. Action extraction in DreamZero is a deterministic, faithful conversion of predicted future frames into motor signals. If the imagined video diverges from a physically plausible sequence—objects warp, contacts are missed, occlusions hallucinate—the corresponding actions will be erroneous. The policy’s accuracy is therefore bounded by the raw realism of its generative rollouts. This explains why the ablations found that improved video prediction quality (through larger models, richer data, or temporal smoothing) directly raised task success. It also hints that the path forward lies less in better action heads and more in ever higher-fidelity world simulators learned from pixels.
With those takeaways in mind, the open challenges form a natural roadmap. First, we lack scaling laws for WAMs: how do model size, data volume, and compute interplay to determine video quality and downstream policy performance? Answering this is essential to invest resources efficiently. Second, the internet is brimming with egocentric human video, but harnessing it for robot control requires bridging the embodiment gap and inferring plausible actions from observation alone; world action models are promising, but we need reliable ways to map human motion onto robot affordances. Third, inference speed remains a practical hurdle—DreamZero’s original pipeline demands high-end GPUs, and even the optimized DreamZero-Flash variant pushes toward real-time 7 Hz; closing the gap to the sub‑10 ms loops needed for reactive manipulation is non-trivial on consumer hardware. Fourth, while autoregressive rollout works for moderate horizons, truly long-horizon reasoning likely calls for hierarchical planners (System 2) or drastically longer context windows to avoid compounding errors. Finally, the generalization‑dexterity trade-off remains unsolved: policies that excel at broad, open-vocabulary tasks often fumble on high‑precision insertion or in-hand manipulation, while expert demonstrations for dexterity sacrifice the diversity that fuels generalization. Finding the sweet spot will define the next generation of WAMs.
The visual that accompanies this closing section distills these twin perspectives into a clear contrast. On one side, the key takeaways crystallize as settled insights: the implicit inverse dynamics, the zero-shot and cross-embodiment numbers, the few-shot agility, and the primacy of video quality. On the other, the open challenges are set out as questions needing answers, each a frontier that must be crossed for world action models to mature from promising demos into robust everyday tools. Together they form not a triumphal conclusion but a checkpoint—a candid snapshot of what is now understood and what remains to be invented.

2. VLA vs

The instinctive answer to the generalization gap described earlier has been to scale up multimodal policies. Vision-Language-Action (VLA) models, inspired by the success of large vision-language models, attempt to absorb internet-scale visual and semantic knowledge and directly regress action commands. A VLA embodies a policy of the form
$$\pi(a_t \mid o_t, c)$$
where the current observation $o_t$ and a language instruction $c$ are encoded by a frozen or lightly fine-tuned VLM backbone, and a lightweight action head outputs the next motor command $a_t$. The appeal is immediate: the policy inherits the representational richness of a model pre-trained on billions of image–text pairs, promising to recognise objects, scenes, and task intent without exhaustive robot-specific collection.
Yet that same inheritance introduces a structural mismatch. VLMs are trained on static frames and text; they lack any built-in notion of temporal coherence or physical causality. When a VLA maps a single RGB image to a continuous action, it must implicitly infer dynamics—how objects will move, how contacts evolve, how inertia and friction act—from a snapshot. The model is forced to compress all that information into a single latent vector, with no mechanism to simulate the future. Consequently, VLAs learn brittle correlations between image appearance and action labels. They need dense action supervision: every training sample must contain a (frame, action) pair, typically captured via teleoperation. The very signal on which they rely—human-collected action labels—is expensive, noisy, and severely limits the diversity of motions they can experience during training.
The crux of the problem is that a VLA never learns a forward dynamics model. It never predicts what the world will look like one second later if the robot moves its end-effector in a certain way. Without such a model, generalization to a novel embodiment or an unfamiliar object demands that the network somehow extrapolate a direct observation-to-action mapping without understanding the intervening physical process. That is a fragile hope at best.
The World Action Model (WAM) introduced by DreamZero re-frames the problem entirely. Instead of learning a reactive policy, DreamZero learns a joint generative model over future video frames and future actions conditioned on history. The central object is the conditional distribution
$$\pi_{\theta}\big(o_{\ell:\ell+H},\, a_{\ell:\ell+H} \;\mid\; o_{0:\ell},\, c,\, q_\ell\big),$$
where $o_{0:\ell}$ denotes a short clip of past observations, $c$ is a language command, and $q_\ell$ captures proprioceptive state at time $\ell$. The model jointly predicts the next $H$ video frames $o_{\ell+1},\dots,o_{\ell+H}$ and the corresponding action sequence $a_{\ell+1},\dots,a_{\ell+H}$. The key insight is that future frames constitute a visual plan: a sequence of images showing how the scene should evolve. Given that imagined future, extracting the actions amounts to solving an inverse dynamics problem, that is, inferring the motor commands that would produce the observed visual changes. Because the model is forced to paint the entire moving picture before it commits to an action, it learns a deep, causal understanding of how the physical world responds to intervention.
DreamZero implements this WAM with a video diffusion backbone pre-trained on vast corpora of web videos. That pre-training imbues the model with a rich generative prior over how real-world imagery transforms over time: objects fall, fluids slosh, hands push, shadows shift. When the model later observes a handful of robot-specific frames, it can condition on that history and on the task instruction to generate physically coherent future rollouts. Critically, the training objective does not require extra action labels beyond what is already present in any robot dataset. Every consecutive pair of frames implicitly contains the action that caused the change; the diffusion model learns to associate that visual transition with the corresponding control signal. This means that the model can harvest powerful physical intuition from every frame pair, even from data collected for other tasks or from other robots, effectively turning passive observation logs into a self-supervised dynamics curriculum.
The contrast between the two paradigms is sharp. VLA policies are high-capacity static mappers that view the future as a black box. A WAM treats the future as a visual entity to be explicitly imagined. The accompanying diagram places these approaches side by side for comparison. On the left, the VLA path is linear and compact: an observation and a language bubble feed into a frozen VLM encoder; an action head emits a single action token. The annotation highlights its essence: direct action mapping, no dynamics. On the right, the WAM panel unfolds a richer pipeline. A short video history, the instruction, and proprioceptive state all stream into a warm-orange video diffusion backbone. From it emerge two parallel outputs: a filmstrip of predicted future frames and a sequence of action vectors. The feedback loop underneath—a dashed curve returning predicted frames as history for the next chunk—captures the closed-loop autoregressive rollout that turns generation into a continuous policy. The visual makes plain that WAM replaces the blind, single-step mapping of a VLA with a generative process that explicitly simulates what will happen next, then decides what to do.

3. World Action Model (WAM) Concept

Vision–language–action (VLA) models map an observation stream directly to a sequence of motor commands. While this can work well when the training data covers the desired motion repertoire, the mapping becomes brittle as soon as the robot faces a novel physical maneuver—something as simple as pushing an object from an unfamiliar angle or executing a multi‑step rearrangement with unobserved intermediate poses. The root cause is that the model never learns why the world should respond a certain way; it only learns that these pixels usually lead to those joint velocities. Generalization to unseen motions therefore requires an internal model of the world that can be mentally simulated, allowing the agent to imagine how a scene ought to evolve and then back out the actions that make that evolution happen. This is exactly the role of a World Action Model (WAM).
A WAM is a joint generative model over future video frames $\mathbf{o}_{l:l+H}$ and actions $\mathbf{a}_{l:l+H}$, conditioned on a visual history $\mathbf{o}_{0:l}$, a language instruction $c$, and the current proprioceptive state $\mathbf{q}_l$. Formally, we write
$$\pi_0\big(\mathbf{o}_{l:l+H}, \,\mathbf{a}_{l:l+H} \mid \mathbf{o}_{0:l},\, c,\,\mathbf{q}_l\big)$$
where the horizon $H$ spans the next several time steps. By modelling vision and action together, the distribution captures both what the future should look like and how to get there, rather than collapsing the two into a single opaque mapping.
The key structural insight is that this joint distribution factorises into two complementary components. Because future actions are conditionally independent of the language instruction given the full observation trajectory and the current proprioception—intuitively, the language tells us what to visualize, and the actions are recovered from that visualization—we can write the following exact decomposition:
$$\pi_0\big(\mathbf{o}_{l:l+H}, \,\mathbf{a}_{l:l+H} \mid \mathbf{o}_{0:l},\, c,\,\mathbf{q}_l\big) \;=\; \underbrace{\pi_0\big(\mathbf{o}_{l:l+H} \mid \mathbf{o}_{0:l},\, c,\,\mathbf{q}_l\big)}_{\text{Video prediction}} \; \underbrace{\pi_0\big(\mathbf{a}_{l:l+H} \mid \mathbf{o}_{0:l+H},\,\mathbf{q}_l\big)}_{\text{Inverse dynamics model (IDM)}}.$$
The first factor, the video prediction module, takes past observations, the language command, and the proprioceptive state, and outputs a plausible future video sequence: an implicit visual plan. It answers the question, “What should the scene look like over the next $H$ frames?” Notably, the language instruction only feeds into this predictor; the model uses the instruction to shape the desired visual outcome, not to directly constrain the actions. The second factor, the inverse dynamics model (IDM), receives the entire observation sequence up to time $l+H$ together with the current proprioceptive state and predicts the action chunk $\mathbf{a}_{l:l+H}$ that would produce exactly that visual evolution. Because the IDM conditions on the realised future frames, it can be completely language-agnostic: the high-level intent is already encoded in the visual plan.
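As a worked check, the factorisation can be reached in two steps. The first line is the ordinary chain rule and holds for any joint distribution; the second line is where the conditional-independence assumption stated in the text is used to drop $c$ from the action factor. The step labels are added here for illustration.

```latex
\begin{aligned}
\pi_0(\mathbf{o}_{l:l+H}, \mathbf{a}_{l:l+H} \mid \mathbf{o}_{0:l}, c, \mathbf{q}_l)
&= \pi_0(\mathbf{o}_{l:l+H} \mid \mathbf{o}_{0:l}, c, \mathbf{q}_l)\,
   \pi_0(\mathbf{a}_{l:l+H} \mid \mathbf{o}_{0:l+H}, c, \mathbf{q}_l)
   && \text{(chain rule, exact)} \\
&= \pi_0(\mathbf{o}_{l:l+H} \mid \mathbf{o}_{0:l}, c, \mathbf{q}_l)\,
   \pi_0(\mathbf{a}_{l:l+H} \mid \mathbf{o}_{0:l+H}, \mathbf{q}_l)
   && \text{(if } \mathbf{a}_{l:l+H} \perp c \mid \mathbf{o}_{0:l+H}, \mathbf{q}_l \text{)}
\end{aligned}
```

The decomposition is therefore exact only to the extent that the observed future frames already carry everything the instruction would tell the action predictor.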
This separation confers a powerful training advantage: we can learn a general understanding of physical interaction through video prediction, while the IDM learns a low‑level mapping from visual change to action, both from the same demonstration data. Training can proceed in two stages (first the video predictor, then the IDM on top of frozen features) or end‑to‑end. The DreamZero framework opts for end‑to‑end training, because back‑propagating the IDM loss through the video predictor encourages the predictor to produce frames that are not only visually realistic but also maximally informative for the downstream action inference—a tight vision‑action alignment that pays off when the model is later asked to generalize.
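The two-factor structure can be illustrated with a deliberately tiny example. The sketch below is not DreamZero code: it assumes a hypothetical one-dimensional world in which an observation is a position and an action is a displacement, and the function names `predict_video`, `inverse_dynamics`, and `wam_sample` are invented for illustration. It only shows the order of operations: plan the observations first, then recover the actions from them.

```python
from typing import List, Tuple


def predict_video(history: List[float], goal: float, horizon: int) -> List[float]:
    """Toy 'video predictor': plan observations that move linearly to the goal."""
    start = history[-1]
    return [start + (goal - start) * (h + 1) / horizon for h in range(horizon)]


def inverse_dynamics(observations: List[float]) -> List[float]:
    """Toy IDM: in this world an action is the gap between consecutive frames."""
    return [b - a for a, b in zip(observations, observations[1:])]


def wam_sample(history: List[float], goal: float, horizon: int) -> Tuple[List[float], List[float]]:
    # First factor: the goal (standing in for the instruction) shapes the plan.
    future = predict_video(history, goal, horizon)
    # Second factor: actions are read off the frames; the goal is not passed in.
    actions = inverse_dynamics([history[-1]] + future)
    return future, actions


future, actions = wam_sample(history=[0.0, 0.5], goal=2.5, horizon=4)
print(future)
print(actions)
```

Note that `inverse_dynamics` never sees the goal, which mirrors the language-agnostic role the text assigns to the IDM.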
The accompanying slide image crystallizes this definition into a single diagram. It shows the joint distribution as a central equation, then the factorised form with visible underbraces labelling the “Video prediction” and the “Inverse dynamics model (IDM)” parts. Bullet points summarise that the first factor plans the visual future and the second recovers the actions, with a final annotation highlighting DreamZero’s end‑to‑end training choice. This compact visual serves as a quick reference that the remainder of the lecture will build upon—reminding us that a WAM’s power lies not in a single monolithic mapping, but in the deliberate split between what the world should become and how the robot should act to get there.

4. Why WAMs?

The previous section introduced the concept of a world action model (WAM) – an agent that learns to imagine both future video frames and the actions that bring them about. But why advocate for this new family of models when modern vision-language-action (VLA) policies have already demonstrated impressive performance on dozens of robotic tasks? The answer lies in a fundamental limitation: VLAs, no matter how large their training corpora, remain reactive mappings from observations directly to actions. They do not model how the world evolves under their own interventions, and that absence cripples them the moment they encounter a physical situation that deviates from the distribution of their training episodes.
Consider a VLA trained on thousands of demonstrations of pick-and-place tasks. When shown a novel object, say an irregularly shaped piece of fruit, the model may produce a plausible grasping action, but it has no way to anticipate whether that action will succeed, slip, or damage the object. It cannot simulate “what if I squeeze slightly harder?” because its forward-inference pipeline is a single feed-forward pass: $a = \pi(o)$. This brittleness is especially acute for novel physical motions (actions or contact sequences that rarely appear in the data), making VLAs poor zero-shot generalizers outside their narrow experience.
DreamZero reframes the problem by swapping the direct policy for a generative model of joint video-action trajectories. Instead of learning $p(a \mid o)$, the model learns the distribution
$$p_\theta(\mathbf{x}_{1:T}, \mathbf{a}_{0:T-1} \mid \mathbf{x}_0),$$
where $\mathbf{x}_t$ is the frame at time $t$ and $\mathbf{a}_t$ the action that transitions from frame $t$ to $t+1$. During training, the objective minimizes the negative autoregressive log-likelihood of future frames and actions conditioned on the past:
$$\mathcal{L}_{\text{WAM}} = -\mathbb{E}_{(\mathbf{x},\mathbf{a})}\sum_{t} \log p_\theta(\mathbf{x}_{t+1}, \mathbf{a}_t \mid \mathbf{x}_{\leq t}, \mathbf{a}_{<t}).$$
The model’s causal architecture processes a sequence of frame patches and action tokens interleaved, forcing it to represent the consequences of actions in pixel space. This design yields a tight coupling: the predicted video is inseparable from the actions that produced it.
This joint modeling connects directly to inverse dynamics. In a standard forward model $p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, \mathbf{a}_t)$, recovering the action from a desired transition requires solving an optimization problem $\mathbf{a}_t^* = \arg\max_{\mathbf{a}_t} p(\mathbf{x}_{t+1}^\text{goal} \mid \mathbf{x}_t, \mathbf{a}_t)$. A WAM, by learning the joint distribution, implicitly encodes an inverse dynamics model $p(\mathbf{a}_t \mid \mathbf{x}_t, \mathbf{x}_{t+1})$ as a byproduct of the generative process. If you can “dream” a realistic future frame sequence that arranges the world into a target configuration, the model can read off the corresponding action tokens directly from the same autoregressive decoding. The action becomes a latent variable inferred from the video plan, not an externally prescribed label.
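The difference between the two routes can be made concrete with a toy example. The sketch below assumes a hypothetical scalar world where the next state is the current state plus the action; `forward_model`, `action_by_search`, and `action_by_idm` are illustrative names, not part of DreamZero. With a forward model, the action has to be found by scoring candidates; with an inverse dynamics model, it is read off the transition in one call.

```python
def forward_model(x: float, a: float) -> float:
    """Toy deterministic forward model: the action is added to the state."""
    return x + a


def action_by_search(x: float, x_goal: float, candidates: list) -> float:
    """Forward-model route: score every candidate action and keep the best one."""
    return min(candidates, key=lambda a: abs(forward_model(x, a) - x_goal))


def action_by_idm(x: float, x_next: float) -> float:
    """Inverse-dynamics route: map the observed transition straight to an action."""
    return x_next - x


candidates = [i / 10 for i in range(-10, 11)]  # -1.0 to 1.0 in steps of 0.1
a_search = action_by_search(1.0, 1.3, candidates)
a_idm = action_by_idm(1.0, 1.3)
print(a_search, a_idm)
```

The search route needs one forward-model evaluation per candidate and is limited by the candidate grid, whereas the inverse route costs a single evaluation.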
This property is the engine behind zero-shot generalization. A robot with a novel morphology – say a dual-arm setup instead of a single arm – brings a different action space, but the visual consequences of moving an end-effector remain broadly similar. By simply swapping in the new action token vocabulary and fine-tuning the action embedding layer (or even performing few-shot adaptation), the WAM can still generate physically plausible videos and thereby induce sensible actions without exhaustive task-specific retraining. The world knowledge remains largely intact because the heavy lifting is done by the shared video-diffusion backbone.
The visual below captures this contrast in a single clean diagram. On one side, a VLA model appears as a brittle pipeline: an observation enters, an action exits, with no feedback loop or reasoning about the future. On the other side, the DreamZero dream loop is depicted as a cycle where the model generates a sequence of imagined frames and interleaved actions, allowing it to plan by sampling alternative futures. The loop symbolizes the core insight: a world action model does not merely predict; it simulates. That simulation turns the model into a zero-shot policy, capable of adapting to new bodies and new tasks simply by changing the dream it pursues.

5. DreamZero Architecture Overview

Building a policy that can control a robot across tasks and embodiments without any in-domain training data demands a model that already knows an enormous amount about how the physical world behaves. The insight of World Action Models (WAMs) is that a large video diffusion model, pretrained on vast and diverse video, already possesses rich priors over plausible futures—how objects move, how hands interact, how scenes evolve. The challenge lies in steering that visual generation with minimal robot-specific information so that the resulting video corresponds not just to any plausible future, but to one where the robot successfully executes the given instruction. DreamZero’s architecture answers this challenge with remarkable economy: it grafts a thin control layer onto a frozen video backbone, treating the world model as a zero-shot policy.
The design rests on a simple but powerful rule: freeze the video model, and only learn what the robot truly needs. DreamZero builds on Wan2.1‑I2V‑480P, a 14B‑parameter image‑to‑video diffusion model that excels at generating temporally coherent clips from a single starting frame. Its internal representations encode a wealth of visual dynamics—shadows, contact points, object permanence—that would be prohibitively expensive to learn from scratch on robot data. By keeping the video backbone frozen, the architecture preserves these generalization capabilities intact. The only new parameters are a tiny state encoder for the robot’s proprioception and a lightweight action encoder/decoder pair that map between the diffusion latents and the action space. In total, the task‑specific components amount to a few million parameters—negligible relative to the 14B‑parameter vision model.
Three information streams converge to condition the diffusion process, each through a dedicated encoder. The visual history $o_0$, a single image capturing the current scene, passes through the pretrained variational autoencoder (VAE) that was used during the video model’s original training. The VAE compresses the image into latent codes $z_0$, a low-dimensional representation that the diffusion transformer already understands natively. The language instruction $c$ (e.g., “close the drawer”) is encoded by the model’s own frozen text encoder, providing a semantic goal that aligns with the model’s image-language priors. Finally, the proprioceptive state $q_l$ (joint positions, gripper status, or end-effector pose) goes through a small learned multi-layer perceptron, the only component that translates raw robot measurements into a representation consumable by the video model. This separation ensures that the visual and linguistic understanding remains frozen and generic, while the robot-specific bottleneck is as lean as possible.
These encoded signals flow into an autoregressive diffusion transformer (DiT) $u_\theta$. At its core, $u_\theta$ predicts a joint velocity field for both the future video latents and the normalized actions:
$$(\hat{v}^z,\, \hat{v}^a) = u_\theta(z_0,\, c,\, q_l).$$
The term $\hat{v}^z$ is a denoising velocity in the latent space of the VAE; $\hat{v}^a$ is the corresponding velocity in a normalized action space. By predicting velocities rather than absolute states, the model can operate in a flow-matching framework that will be detailed in the next section. These predicted velocities are then integrated numerically over the diffusion timesteps to produce clean future latents $z_{l:l+H}$ and a sequence of actions $a_{l:l+H}$ for a horizon $H$. Conceptually, the model is performing an implicit form of inverse dynamics: given “what the world will look like” and “what the robot is currently doing,” it infers the actions that bring that world about.
The action decoder deserves a brief spotlight. In the diagram we will show it as a block labeled “Action Decoder” that receives the action velocities $\hat{v}^a$ and yields the final action trajectory $\hat{a}_{l:l+H}$. In practice, this decoder is just a linear or small MLP head that maps the normalized action predictions back to the robot’s specific actuation space: joint angles, gripper width, or Cartesian deltas. Because action spaces vary across embodiments, keeping the decoder separate and small makes it easy to swap when moving to a different robot. The vast majority of the computation happens in the shared, frozen video backbone, which naturally handles visual variation and scene semantics regardless of the downstream embodiment.
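The last step of that mapping, from a normalized action space back to robot units, can be sketched as follows. The text does not specify the normalization scheme, so this example assumes a per-dimension min-max scaling to $[-1, 1]$; the function name `denormalize` and the two-dimensional embodiment with its limits are hypothetical.

```python
from typing import List


def denormalize(actions: List[List[float]], low: List[float], high: List[float]) -> List[List[float]]:
    """Map each action dimension from [-1, 1] back to its [low, high] range."""
    return [
        [lo + (a + 1.0) * 0.5 * (hi - lo) for a, lo, hi in zip(step, low, high)]
        for step in actions
    ]


# Hypothetical embodiment: one joint angle (radians) and one gripper width (metres).
low, high = [-3.14, 0.0], [3.14, 0.08]
normalized_chunk = [[0.0, -1.0], [1.0, 1.0]]  # two time steps, two dimensions
robot_actions = denormalize(normalized_chunk, low, high)
print(robot_actions)
```

Swapping embodiments then amounts to supplying a different pair of limit vectors (and, in the learned case, a different small head) while the rest of the pipeline is unchanged.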
A subtle but critical detail is the multi‑view handling. Many robotic setups use two or more cameras, such as left and right wrist cameras on a bimanual platform. DreamZero avoids adding any new encoders or fusion layers. Instead, the frames from all cameras are concatenated side‑by‑side into a single input image, effectively treating them as a wide panoramic view. The VAE and the subsequent DiT process this larger image just as they would a single camera frame. This preserves the model’s ability to use spatial relationships between views while introducing zero additional parameters. It also means the same architecture can handle an arbitrary number of cameras as long as they fit within the image canvas, a nice property for cross‑embodiment deployment.
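The side-by-side concatenation is simple enough to show directly. The sketch below represents each frame as a list of pixel rows; `concat_views` is an illustrative helper, and the requirement that all views share the same height is an assumption of this example rather than something stated in the text.

```python
from typing import List

Frame = List[List[int]]  # a frame is a list of rows, each row a list of pixel values


def concat_views(views: List[Frame]) -> Frame:
    """Concatenate same-height camera frames side by side into one wide frame."""
    heights = {len(view) for view in views}
    if len(heights) != 1:
        raise ValueError("all views must share the same height")
    height = heights.pop()
    # For each row index, join that row from every view, left to right.
    return [sum((view[r] for view in views), []) for r in range(height)]


left = [[1, 2], [3, 4]]   # 2x2 toy frame from a hypothetical left wrist camera
right = [[5, 6], [7, 8]]  # 2x2 toy frame from a hypothetical right wrist camera
wide = concat_views([left, right])
print(wide)
```

Because the result is just a wider image, the downstream encoder needs no new parameters, which is the property the paragraph above highlights.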
The entire pipeline is remarkably simple, yet it achieves state‑of‑the‑art generalization precisely because it respects the division of labor between the visual world model and the robot‑specific policy. The visual below consolidates this architecture into a single diagram. On the left, the three inputs—visual history, language, and proprioception—flow through their respective encoders. Blue boxes mark the frozen components (VAE, text encoder, video diffusion backbone), while red boxes mark the few trained modules (state encoder, action decoder, latent update). All signals converge onto the central autoregressive DiT, which outputs predicted velocity fields for video latents and actions. Finally, those velocities are integrated into clean future latents and executable action trajectories. This picture is not merely a summary; it makes visible the design principle that DreamZero gets generalization from the large frozen model and embodiment‑awareness from the tiny learned additions.

6. Training Objective: Flow Matching with Joint Denoising

Building a model that can imagine future video frames and simultaneously decide what actions to take is a deceptively hard learning problem. The architecture described previously gives us a powerful backbone—a transformer that ingests visual context, language instructions, and proprioceptive state. But the critical question remains: how do we train a single network to produce both high-dimensional pixel-like latents and low-dimensional continuous control signals in a coherent, temporally aligned fashion? The answer in DreamZero lies in a flow matching objective with joint denoising, a formulation that treats video prediction and action generation not as separate tasks but as a unified modeling problem over a shared stochastic process.
Flow matching, and specifically the rectified flow variant used here, rethinks generation as learning a velocity field that transports samples from a simple noise distribution to the data distribution along straight-line paths. For a given training chunk of $K$ future frames, we start with clean latent-action pairs $(\mathbf{z}^{(k)}_1, \mathbf{a}^{(k)}_1)$ that come from the real trajectory. We also sample independent Gaussian noise $\mathbf{z}^{(k)}_0, \mathbf{a}^{(k)}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The key insight is to define a linear interpolation between noise and clean data for each modality using the same scalar timestep $t_k$, drawn uniformly from $(0,1)$:
$$\mathbf{z}^{(k)}_{t_k} = t_k \mathbf{z}^{(k)}_1 + (1-t_k) \mathbf{z}^{(k)}_0, \qquad \mathbf{a}^{(k)}_{t_k} = t_k \mathbf{a}^{(k)}_1 + (1-t_k) \mathbf{a}^{(k)}_0.$$
This shared $t_k$ couples the two modalities: at a given noise level, both latents and actions are corrupted by the same fractional amount. Doing so forces the model to learn a joint representation where visual futures and control signals evolve together, rather than drifting into independent—and possibly contradictory—predictions.
From this interpolated state, we define the target velocity vector that the model must predict. The true velocity is simply the difference between the clean sample and the noise, which points directly from the corrupted state back toward the data:
$$\mathbf{v}_k = [\mathbf{z}^{(k)}_1 - \mathbf{z}^{(k)}_0,\; \mathbf{a}^{(k)}_1 - \mathbf{a}^{(k)}_0].$$
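This target is simply the time derivative of the linear interpolation path. Differentiating the interpolation with respect to $t_k$, with the clean sample and the noise held fixed, gives:

```latex
\frac{d}{dt_k}\,\mathbf{z}^{(k)}_{t_k}
  = \frac{d}{dt_k}\Big(t_k\,\mathbf{z}^{(k)}_1 + (1-t_k)\,\mathbf{z}^{(k)}_0\Big)
  = \mathbf{z}^{(k)}_1 - \mathbf{z}^{(k)}_0,
\qquad
\frac{d}{dt_k}\,\mathbf{a}^{(k)}_{t_k}
  = \mathbf{a}^{(k)}_1 - \mathbf{a}^{(k)}_0 .
```

The derivative does not depend on $t_k$, which is what makes each noise-to-data path a straight line with constant velocity.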
If the model can accurately estimate this vector field at every intermediate timestep, then during inference we can start from pure noise and follow the predicted velocities numerically (e.g., via an Euler solver) to arrive at realistic, coordinated video-action trajectories. The network $u_\theta$ receives the concatenated noisy state $[\mathbf{z}^{(k)}_{t_k}, \mathbf{a}^{(k)}_{t_k}]$, along with conditioning signals (the visual context chunk $\mathcal{C}_k$, the language embedding $c$, the proprioceptive history $\mathbf{q}_k$, and the timestep itself), and outputs the predicted velocity $\mathbf{v}_{\text{pred}}$.
The training loss is a weighted mean squared error between the predicted and true velocity, averaged over all $K$ chunks in the trajectory and over sampled timesteps:
$$\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{z},\mathbf{a},\{t_k\}} \left[ \frac{1}{K} \sum_{k=1}^K w(t_k)\, \| u_\theta([\mathbf{z}^{(k)}_{t_k}, \mathbf{a}^{(k)}_{t_k}]; \mathcal{C}_k, c, \mathbf{q}_k, t_k) - \mathbf{v}_k \|^2 \right].$$
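A single-chunk, single-sample version of this objective can be sketched in a few lines. This is an illustration under simplifying assumptions, not the DreamZero training code: latents and actions are short Python lists, the conditioning signals are omitted, the weighting function defaults to a constant (an assumption made here), and `oracle` is a hypothetical stand-in for $u_\theta$ that knows the clean data.

```python
import random


def interpolate(clean, noise, t):
    """Straight-line mix: t = 1 gives clean data, t = 0 gives pure noise."""
    return [t * c + (1.0 - t) * n for c, n in zip(clean, noise)]


def fm_loss(predict, z1, a1, weight=lambda t: 1.0, rng=random):
    """Weighted squared error between predicted and target velocity for one chunk."""
    t = rng.random()                                  # one timestep shared by both modalities
    z0 = [rng.gauss(0.0, 1.0) for _ in z1]            # Gaussian noise for the latents
    a0 = [rng.gauss(0.0, 1.0) for _ in a1]            # Gaussian noise for the actions
    x_t = interpolate(z1, z0, t) + interpolate(a1, a0, t)  # concatenated noisy state
    target = [c - n for c, n in zip(z1 + a1, z0 + a0)]     # clean minus noise
    pred = predict(x_t, t)
    return weight(t) * sum((p - v) ** 2 for p, v in zip(pred, target))


z1, a1 = [0.2, -0.4, 1.0], [0.5, -0.5]   # toy clean latents and actions
data = z1 + a1


def oracle(x_t, t):
    """Exact velocity when the clean sample is known: (x1 - x_t) / (1 - t)."""
    return [(d - x) / (1.0 - t) for d, x in zip(data, x_t)]


loss_oracle = fm_loss(oracle, z1, a1, rng=random.Random(0))
loss_zero = fm_loss(lambda x_t, t: [0.0] * len(x_t), z1, a1, rng=random.Random(0))
print(loss_oracle, loss_zero)
```

The oracle drives the loss to numerical zero, while a predictor that always outputs zero velocity does not, which is a quick sanity check that the target and the interpolation are consistent with each other.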
The weighting function $w(t_k)$ plays an important practical role. It is typically chosen to shrink the loss contribution from nearly pure-noise states, where predicting a precise velocity is nearly impossible, and to focus the model’s capacity on the more informative intermediate- and low-noise regimes. The $1/t_k$ form quoted for this weighting has that effect only if $t_k$ is read as the noise level; under the interpolation defined above, $t_k = 0$ is pure noise and $t_k = 1$ is clean data, so the noise level is $1 - t_k$ and the matching form would be $w(t_k) \propto 1/(1 - t_k)$. Empirically, this weighting stabilizes training and yields better final generation quality.
Why is this flow matching formulation superior to, say, a standard diffusion objective that predicts the noise itself? In diffusion, the target is the added noise $\epsilon$; the associated scaling by $1/\sqrt{1-\bar{\alpha}_t}$ can become unstable near $t=1$, and the denoising process often requires many steps. Rectified flow predicts the velocity directly, which corresponds to a straight-line path that can be integrated in far fewer steps, sometimes even a single step at some cost in accuracy. For robotics, this translates into faster, more consistent closed-loop inference, crucial for real-time control. Moreover, by jointly denoising latents and actions with a shared timestep, the model internalizes the natural causal relation: actions cause changes in the visual scene, so their trajectories must be interlocked. If we used independent noise schedules, the two signals could quickly desynchronize, harming both prediction accuracy and downstream control performance.
The visual below offers a compact summary of this flow. It places the interpolation equations centrally, showing how the same $t_k$ mixes noise and clean data for both $\mathbf{z}$ and $\mathbf{a}$. An arrow leads from the interpolation to the target velocity definition, emphasizing that the true “direction” is the straight line from noise to clean data. The loss equation is then highlighted with a colored border, underscoring its role as the quantitative training signal; inside it, the explicit dependence on the shared timestep and the weighting factor $w(t_k)$ is visible. A small note box at the bottom calls out the $1/t_k$ weighting heuristic and the practical benefit of co-evolving latents and actions. Together, these elements distill the mathematical core of DreamZero’s objective into a single glance, reinforcing the idea that joint flow matching is the engine that turns a world action model into a zero-shot policy.

7. Teacher Forcing and Chunk-wise Denoising

If you’ve followed the previous section on flow matching with joint denoising, you know that DreamZero learns to transform noise into coherent video–action velocities. But having a powerful generative model isn’t enough: we need it to function as a policy that can be rolled out autoregressively over time while avoiding the usual pitfall of compounding errors. This is where teacher forcing with a careful chunk-wise setup becomes essential.
The core insight is that during training, the model should be conditioned on clean, ground-truth context, never on its own potentially flawed predictions. For each chunk $k$, the context $\mathcal{C}_k$ is defined as the set of all previous clean chunks:
$$\mathcal{C}_k = \{ (\mathbf{z}_1^j, \mathbf{a}_1^j) \}_{j=1}^{k-1}$$
Here every $(\mathbf{z}_1^j, \mathbf{a}_1^j)$ is a chunk of $K=2$ latent frames from the past episode, taken at the clean (noise-free) level. This clean context acts like an oracle history that tells the model exactly what happened before, a luxury that we can only afford in the training phase.
To make the model respect the sequential nature of the task, DreamZero’s DiT backbone $u_\theta$ is fed the noisy current chunk $[\mathbf{z}_{t_k}^k, \mathbf{a}_{t_k}^k]$ together with the clean context, but attention is constrained by a causal QKV mask. In the multi-head self-attention layers, each noisy token from chunk $k$ can attend only to tokens from chunks $j < k$ (the clean context) and to itself; it is forbidden from looking at any future chunk or even at other tokens within the same noisy chunk that are later in sequence. This causal mask is visually akin to a lower-triangular attention pattern, where the upper triangle for future segments is zeroed out.
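The masking rule as worded above (earlier chunks plus the token itself) can be written down directly. The sketch below is a literal reading of that sentence, not the actual DreamZero mask: `chunk_causal_mask` is an illustrative helper, and it applies the same rule to every chunk rather than distinguishing clean context tokens from the noisy current chunk.

```python
from typing import List


def chunk_causal_mask(num_chunks: int, tokens_per_chunk: int) -> List[List[bool]]:
    """allowed[i][j] is True when query token i may attend to key token j."""
    n = num_chunks * tokens_per_chunk
    chunk = [i // tokens_per_chunk for i in range(n)]  # chunk index of each token
    # A token sees every token in strictly earlier chunks, plus itself.
    return [[chunk[j] < chunk[i] or i == j for j in range(n)] for i in range(n)]


mask = chunk_causal_mask(num_chunks=3, tokens_per_chunk=2)
for row in mask:
    print("".join("x" if ok else "." for ok in row))
```

Printing the mask shows a block lower-triangular pattern with a diagonal inside each chunk; no entry above the main diagonal is ever allowed.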
Why is this so important? Because during training, the model never sees its own predictions as context. It always learns to predict the clean velocity from clean history, which keeps the learning signal stable. The flip side is the familiar “exposure bias” problem: a policy trained only on clean history has never practised recovering from its own imperfect outputs, so it can drift if those outputs are fed back as context at inference time. Teacher forcing does not remove that mismatch; the closed-loop inference scheme in Section 8 handles it by grounding the context in real observations. The causal mask additionally prevents any information leakage from the current noisy region back into the clean history, enforcing a strict temporal ordering. In effect, the architecture is trained to perform autoregressive generation without experiencing the consequences of its own errors, a classic teacher-forcing trick that stabilizes learning.
The chunk size is fixed at $K=2$ latent frames, which corresponds to a specific temporal step in the compressed latent space. The number of chunks $M$ in an episode can vary up to $4$, giving the model a visual context of roughly $6.6$ seconds of real-world video. This relatively short window forces the policy to focus on recent, actionable history rather than trying to memorize long-range correlations that may not generalize.
Concretely, for each noisy-clean pair, the model predicts a velocity vector conditioned on the context, the instruction, the proprioceptive state, and the noise level:
$$\mathbf{v}_\text{pred}^k = u_\theta\big([\mathbf{z}_{t_k}^k, \mathbf{a}_{t_k}^k]; \, \mathcal{C}_k, \mathbf{c}, \mathbf{q}_k, t_k \big)$$
where $\mathbf{c}$ is the language (task) embedding, $\mathbf{q}_k$ is the proprioceptive state for chunk $k$, and $t_k \sim \mathcal{U}(0,1)$ is the noise level. The loss is then computed between $\mathbf{v}_\text{pred}^k$ and the true velocity $\mathbf{v}^k = (\mathbf{z}_1^k - \mathbf{z}_0^k, \mathbf{a}_1^k - \mathbf{a}_0^k)$, exactly as described in the flow-matching objective.
At inference time, the setup changes: ground-truth context from a dataset is no longer available, so $\mathcal{C}_k$ must be built up chunk by chunk as the rollout proceeds. As the next section describes, DreamZero fills it with real observations that arrive after each executed action chunk rather than with its own predicted video latents. To avoid recomputing the keys and values for the entire history at every step, DreamZero reuses a key-value cache ($\mathcal{KV}$) across chunks. This caching is part of what makes closed-loop control possible at 7 Hz.
The visual below consolidates these design decisions into a compact diagram. On the left you see a stack of green boxes, each labeled with a pair $(\mathbf{z}_1^j, \mathbf{a}_1^j)$ for $j=1,\dots,k-1$; these are the clean context chunks. To their right, a single grey block represents the current noisy chunk $[\mathbf{z}_{t_k}^k, \mathbf{a}_{t_k}^k]$, created by noising the clean chunk with $t_k \sim \mathcal{U}(0,1)$. A blue Transformer block receives both inputs, but the clean history enters through a causal attention mask (shown as a lower-triangular matrix diagram) that forces each row $k$ to attend only to columns $j < k$. The Transformer’s output, an orange arrow pointing right, is the predicted velocity $\mathbf{v}_\text{pred}^k$, which is then compared against the ground-truth velocity via a loss. Small annotations remind us of the key parameters: $K=2$, $M \le 4$ yielding a ~6.6 s context, and the note that inference reuses the same architecture with $\mathcal{C}_k$ filled in during the rollout and a KV cache for speed. This snapshot captures why DreamZero can be trained as a zero-shot policy without ever seeing its own future mistakes during learning.

8. Closed-Loop Inference with KV Cache

When a world action model learns to plan actions by imagining future video frames, the training recipe is inherently teacher forced: during optimization the model sees ground-truth observation sequences and is trained to predict the next chunk of video and actions conditioned on the real history. This works beautifully in a supervised setting, but at test time the model must generate its own visual future—and if left unchecked, the inevitable small errors in each predicted frame compound into wild hallucinations that quickly render the resulting action plan useless. The DreamZero inference algorithm sidesteps this compounding-error problem by operating in a closed‑loop regime, where real sensory observations are injected back into the model’s context after every action chunk, resetting the world state to reality and preventing the imagined video from drifting away.
The core challenge is to maintain an autoregressive generation process that conditions each new chunk on all previous observations, but to avoid ever feeding the model its own predicted video latents, which are the source of error accumulation. DreamZero solves this by maintaining a key-value (KV) cache that stores the attended features of all past real observations. The cache is prefilled with encoded latents from the initial observation window $o_{0:l}$ and thereafter is extended exclusively with fresh, real observations that arrive after the robot executes a chunk of actions. The predicted video latent from the denoising process (the model’s imagination of what should happen) is deliberately discarded; it never pollutes the cache. The result is a tight feedback loop: the model is always grounded in the true physical state of the world, yet it still benefits from the rich visual planning capabilities of a diffusion-based video generator.
At the heart of the inference loop is a flow-matching denoiser that operates on a joint latent variable $x$ concatenating the next video chunk tokens $z_0^k$ and the corresponding action tokens $a_0^k$. In each planning chunk $k$ the algorithm starts from scratch with pure Gaussian noise
$$x_0 = [z_0^k, a_0^k] \sim \mathcal{N}(0, I),$$
then iteratively refines this latent vector through $N$ denoising steps using the learned velocity field $u_\theta$. The denoising step is conditioned not only on the task prompt $c$ and the proprioceptive state $q_l$, but crucially on the entire KV cache of real history:
$$v = u_\theta\bigl(x, t_i, c, q_l, \mathcal{KV}\bigr),$$
where $t_i = (i-1)/N$ for $i = 1,\dots,N$ schedules the noise level. A simple Euler integration $x \leftarrow x + v \, dt$ with step size $dt = 1/N$ evolves the latent toward a clean sample. This formulation folds the inverse-dynamics problem into the generative model: the denoiser simultaneously produces a plausible future video snippet and the actions that would cause it, all while respecting the constraints imposed by the physical past stored in the KV cache.
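The Euler sweep itself is short. The sketch below follows the schedule given above, $t_i = (i-1)/N$ with a uniform step of $1/N$ (the step size is an assumption consistent with that schedule), and replaces the learned network with a hypothetical `toy_velocity` that points straight at a known target; conditioning on $c$, $q_l$, and the KV cache is left out.

```python
import random


def euler_sample(velocity, dim, num_steps, rng):
    """Integrate dx/dt = velocity(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]    # start from pure Gaussian noise
    dt = 1.0 / num_steps
    for i in range(1, num_steps + 1):
        t = (i - 1) / num_steps                      # t_i = (i - 1) / N
        v = velocity(x, t)
        x = [xj + vj * dt for xj, vj in zip(x, v)]   # x <- x + v * dt
    return x


target = [0.3, -1.2, 0.7]  # stand-in for a clean [video latent, action] vector


def toy_velocity(x, t):
    """Stand-in for u_theta: the straight-line velocity toward a known target."""
    return [(g - xj) / (1.0 - t) for g, xj in zip(target, x)]


sample = euler_sample(toy_velocity, dim=3, num_steps=4, rng=random.Random(0))
print(sample)
```

With an exact straight-line velocity, the sweep lands on the target regardless of the number of steps, which illustrates why rectified-flow models can tolerate a small $N$; a learned field is only approximately straight, so in practice more steps trade latency for accuracy.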
After the denoising sweep, the clean action tokens are extracted from the latent vector, and the robot executes them asynchronously, meaning the model does not block on action completion but continues to plan the next chunk while the robot moves. Immediately upon receiving the next real observation, the visual encoder transforms it into latents $z_{\text{real}}$, and these are appended to the KV cache, replacing the imagined video tokens. The key insight is that the predicted video latent $z_{\text{clean}}$ is never used to condition future chunks; only the fresh, real observation enters the cache. This closed-loop injection ensures that the model’s world model never has to rely on its own flawed predictions, which could otherwise snowball into a completely fictitious state after a handful of chunks.
The autoregressive loop continues for $M = \lceil H/K \rceil$ chunks, where $H$ is the total horizon and $K$ the chunk size. Because each denoising pass starts from pure noise and is conditioned on the full cache, the model can recover gracefully from unexpected real-world outcomes: a nudge, a slipped grasp, or an unforeseen obstacle will simply appear in the next real observation and immediately inform the subsequent planning chunk. There is no need for a separate replanning module or a hand-written feedback controller; the architecture itself closes the loop, turning perception into action in a continuous feedback cycle.
The accompanying diagram condenses this entire inference procedure into a clean pseudocode block. The function header DREAMZERO_INFERENCE(o_{0:l}, c, q_l, H) is highlighted, and the structured indentation makes the two‑phase loop—prefill then chunk‑wise autoregression—immediately legible. Inside the inner denoising loop the flow‑matching velocity call and Euler step are displayed with terse clarity. A small annotation bubble drawn beside the cache‑update line reads “closed‑loop,” drawing attention to the pivotal moment when reality overrides imagination. Beneath the code box, two bullet points remind the reader that the predicted video latent is discarded and that this closed‑loop strategy prevents compounding video‑prediction errors—a compact summary of the argument that the preceding paragraphs have built.
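The loop described in this section can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DreamZero's implementation: `toy_velocity` is a hypothetical stand-in for the learned velocity field $u_\theta$, latents are short Python lists, the prompt and low-level conditioning are omitted, `observe` is a hypothetical callback returning the next real latent, and execution is synchronous rather than asynchronous.

```python
import math
import random

VIDEO_DIM, ACTION_DIM = 4, 2  # toy sizes, not DreamZero's real latent shapes


def toy_velocity(x, t, kv_cache):
    """Hypothetical stand-in for u_theta(x, t_i, c, q_l, KV)."""
    last_real = kv_cache[-1]
    # Toy target: the last real latent as "video", its first entries as "action".
    target = last_real + last_real[:ACTION_DIM]
    # Straight-line flow toward the target; t < 1 always holds for t_i = (i - 1) / N.
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]


def run_inference(initial_latents, observe, horizon, chunk_size, num_steps, seed=0):
    rng = random.Random(seed)
    kv_cache = [list(z) for z in initial_latents]      # prefill with real history
    executed = []
    dt = 1.0 / num_steps
    for _ in range(math.ceil(horizon / chunk_size)):   # M = ceil(H / K) chunks
        # Every chunk restarts from pure Gaussian noise over [video, action].
        x = [rng.gauss(0.0, 1.0) for _ in range(VIDEO_DIM + ACTION_DIM)]
        for i in range(1, num_steps + 1):
            t = (i - 1) / num_steps                    # t_i = (i - 1) / N
            v = toy_velocity(x, t, kv_cache)
            x = [xi + vi * dt for xi, vi in zip(x, v)]  # Euler step
        actions = x[VIDEO_DIM:]   # the predicted video part x[:VIDEO_DIM] is discarded
        executed.append(actions)
        kv_cache.append(list(observe(actions)))  # closed loop: only the real latent is cached
    return executed, kv_cache


executed, cache = run_inference(
    [[0.1, 0.2, 0.3, 0.4]],
    lambda a: [a[0], a[1], 0.0, 0.0],  # hypothetical "real observation" encoder
    horizon=10, chunk_size=4, num_steps=4,
)
print(len(executed), len(cache))  # 3 4
```

The point of the sketch is the last line of the outer loop: the cache grows only with latents returned by `observe`, never with the model's own predicted video.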

9. The Reactivity Gap: Why WAMs Are Slow

In the previous section, we saw how DreamZero leverages an autoregressive transformer with a persistent KV cache to perform closed-loop inference: the model generates action chunks conditioned on past video frames and robot states, enabling the kind of reactive, history-aware policy needed for real-world manipulation. Yet moving from a conceptual closed-loop design to live robot control exposes a harsh reality: the raw latency of world-action model inference is more than an order of magnitude too slow for physical deployment. Even with caching, a naive DreamZero rollout on a single GPU requires roughly 5.7 seconds to produce a single action chunk, several times longer than the chunk’s own execution window, so a robot relying on it would stall between chunks.
To appreciate the severity of this gap, recall the inference pipeline. DreamZero is built around a large diffusion transformer (DiT) with about 14 billion parameters. During action generation, the model iteratively refines a noisy action trajectory through a learned denoising process, much like a standard diffusion model, but conditioned on visual and proprioceptive history. Each denoising step involves one full forward pass of that massive transformer, and the standard setup uses 16 denoising steps per chunk. Each such forward pass costs approximately 350 ms on a modern accelerator, owing to the huge model size and the overhead of attending to high-dimensional multimodal embeddings. Multiply by 16 steps, and denoising alone takes 5.6 seconds. Adding a modest 0.1 s for KV-cache input/output and framework overhead pushes the naive total to 5.7 s.
That 5.7-second latency is catastrophic for two intertwined reasons. First, it is purely sequential: the robot cannot begin executing the new chunk until the entire denoising chain finishes, so the arm effectively stalls while the model thinks. Second, the robot consumes actions far faster than the model produces them. DreamZero plans in fixed action horizons, typically 48 steps of low-level joint commands spanning 1.6 seconds at a 30 Hz control frequency. Once the chunk is ready, the robot streams those 48 actions over the next 1.6 s, after which it needs the next chunk immediately to avoid jerky stops. Consequently, the model’s inference latency must be comfortably lower than that 1.6 s window; in practice, to maintain smooth, reactive motion, the pipeline demands a per-chunk latency below roughly 200 ms. That margin allows the planner to absorb variability in compute time, communication delays, and simple safety checks. But at 5.7 s, naive inference overshoots the target by more than a factor of 28. The reactivity gap is staggering: in the 1.6 s the robot needs to execute an entire action chunk, the model completes only about four of the 16 denoising steps (at 350 ms each) for its successor.
We can decompose the bottleneck into three principal components, each of which must be tackled by any practical speedup scheme:
Iterative denoising — 16 serial forward passes, each through the 14 B DiT. Removing or drastically reducing the number of steps is the most direct lever, but it must be done without destroying the quality of the planned actions.
Massive per‑step cost — each 350 ms forward pass is dominated by the immense parameter count and the large context size (visual tokens plus state history). Even if we cut the number of steps, a single pass would still be unacceptably slow.
Sequential execution — the robot waits for the full chunk. Any acceleration that is still per‑chunk will still leave the robot idle; a truly reactive system requires interleaving or prediction while acting.
Putting these numbers together gives the core equation that motivates the entire next stage of DreamZero’s architecture:
$$\text{Speedup required} \;=\; \frac{T_{\text{naive}}}{T_{\text{target}}} \;=\; \frac{5.7\ \text{s}}{0.2\ \text{s}} \;\approx\; 28.5\times$$
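The arithmetic behind this figure is easy to reproduce. The short Python sketch below uses only the numbers quoted in this section (16 steps, 350 ms per step, 0.1 s overhead, 48 actions at 30 Hz, a 0.2 s target); the function and key names are illustrative, not from the source.

```python
def latency_budget(steps=16, seconds_per_step=0.350, overhead_seconds=0.1,
                   actions_per_chunk=48, control_hz=30, target_seconds=0.2):
    """Recompute the naive latency, chunk duration, and required speedup."""
    naive = steps * seconds_per_step + overhead_seconds
    chunk = actions_per_chunk / control_hz
    return {
        "naive_latency_s": naive,                  # about 5.7 s
        "chunk_duration_s": chunk,                 # 1.6 s
        "required_speedup": naive / target_seconds,
        # Whole denoising steps that fit inside one chunk's execution window.
        "steps_finished_per_chunk": int(chunk // seconds_per_step),
    }


budget = latency_budget()
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

Changing `steps` or `seconds_per_step` shows how each lever moves the required speedup; for example, a four-step chain at the same per-step cost would still need 1.5 s per chunk.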
A 28‑fold acceleration is not an incremental optimization target; it mandates a fundamental rethinking of how the diffusion model produces actions under a real‑time budget. The upcoming DreamZero‑Flash design achieves precisely this by decoupling noise schedules and exploiting a streaming inference regime—but before we dive into that technical remedy, it is worth pausing to let the scale of the problem sink in. The visual below distills the latency breakdown and the speedup factor into a single, stark comparison. The table lists each latency component and its contribution, contrasting the naive 5.7 s total against the 0.2 s real‑time target, while the equation at the bottom encapsulates the 28.5× gap in a form that echoes the central challenge of deploying large generative models on physical hardware. Seeing these numbers side‑by‑side makes it clear why a world‑action model that generates excellent policies in simulation can still be unusable on a real robot unless its inference is re‑engineered from the ground up.

10. DreamZero-Flash: Decoupled Noise Schedules

When we last examined the reactivity gap, it became clear that conventional world-action models incur a steep penalty at the control interface: generating video frames forces the model to commit a substantial fraction of the denoising budget to pixel-space synthesis before the first action vector even begins to crystallize. The natural engineering instinct is to shorten the denoising trajectory—run fewer steps across the entire joint input—so that the network produces a usable action faster. Armed with the unified noise schedule from the initial DreamZero formulation, one might simply reduce the number of Euler steps from, say, four to one. Yet this naive acceleration collapses action quality. The reason illuminates a subtle distributional mismatch in the training of dual-modal diffusion.
In a typical joint diffusion, every sample index $k$ within a context window receives the same noise level. That is, both the predicted video latent $\mathbf{z}^k$ and the corresponding action chunk $\mathbf{a}^k$ are corrupted to an identical timestep $t_k$, drawn uniformly from $(0,1]$. During multi-step inference, the cascaded denoising ensures that the video and action pathways evolve in lockstep, moving together from pure noise to clean signal. When you compress this to a single forward pass, the model is forced to jump from $t=0$ (pure noise) to $t=1$ (clean prediction) in one shot. But under a shared timestep, training never presents the situation where the video branch is heavily corrupted while the action branch is simultaneously almost clean. That particular cross-modal state (high noise on the visual side coupled with low noise on the action side) lies outside the manifold explored by a uniform joint schedule. The consequence is that a single-step model hallucinates action snippets that are plausible in isolation but disconnected from the visual goal, erasing the careful alignment that zero-shot policies demand.
DreamZero-Flash resolves this mismatch by introducing decoupled noise schedules. During training, the timestep for the video component is deliberately biased toward high noise, while the action timestep remains uniform. Concretely, for each context index $k$ we sample an auxiliary variable $\eta \sim \text{Beta}(\alpha, \beta)$ with hyperparameters chosen so that $\alpha > \beta$ (for instance, $\alpha=7, \beta=1$, giving $\mathbb{E}[\eta] = \alpha/(\alpha+\beta) = 0.875$). We then set the video timestep as
$$t_{\text{video},k} = 1 - \eta,$$
which concentrates probability mass near zero. Intuitively, the video stream is almost always “early” in the denoising process and therefore heavily corrupted by noise. The action timestep remains governed by a standard uniform:
$$t_{\text{action},k} \sim U(0,1).$$
From these per-modality timesteps we construct the noisy inputs via the standard forward noising formulas:
$$\mathbf{z}^k_{t_{\text{video},k}} = t_{\text{video},k}\, \mathbf{z}^k_1 + (1 - t_{\text{video},k})\, \mathbf{z}^k_0, \qquad \mathbf{a}^k_{t_{\text{action},k}} = t_{\text{action},k}\, \mathbf{a}^k_1 + (1 - t_{\text{action},k})\, \mathbf{a}^k_0.$$
The training objective becomes a simple modification of the flow-matching loss, where the model $\mathbf{u}_\theta$ now receives two separate timestep scalars (one for video, one for actions) concatenated as an additional conditioning vector:
$$L(\theta) = \mathbb{E} \Bigg[ \frac{1}{K} \sum_{k=1}^{K} w(t_k) \Big\| \mathbf{u}_\theta\!\left( \begin{bmatrix}\mathbf{z}^k_{t_{\text{video},k}}\\ \mathbf{a}^k_{t_{\text{action},k}}\end{bmatrix};\, \mathcal{C}_k, c, \mathbf{q}_k, \begin{bmatrix}t_{\text{video},k} \\ t_{\text{action},k}\end{bmatrix} \right) - \mathbf{v}^k \Big\|^2 \Bigg],$$
where $\mathbf{v}^k = [\mathbf{z}^k_1 - \mathbf{z}^k_0;\ \mathbf{a}^k_1 - \mathbf{a}^k_0]$ is the target velocity. The crucial insight is that the biased sampling actively exposes the model to the exact regime that single-step inference requires: the video branch is at a low timestep (heavily corrupted), while the action branch is at various levels, including nearly clean. The Beta distribution effectively stretches the denoising curriculum so that the network must learn to infer crisp actions from severely degraded visual context. In probabilistic terms, we are densifying the cross-modal joint density in the region $(t_{\text{video}} \ll t_{\text{action}})$, which was sparsely visited before.
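The sampling scheme above can be sketched with the Python standard library alone. This is a minimal illustration assuming list-valued latents and the example hyperparameters $\alpha=7$, $\beta=1$; the function names are hypothetical, and the weighting $w(\cdot)$ and the network itself are left out.

```python
import random


def sample_decoupled_timesteps(rng, alpha=7.0, beta=1.0):
    """Video timestep biased toward noise, action timestep uniform."""
    eta = rng.betavariate(alpha, beta)  # E[eta] = alpha / (alpha + beta)
    t_video = 1.0 - eta                 # mass concentrated near 0 (high noise)
    t_action = rng.random()             # uniform on [0, 1)
    return t_video, t_action


def interpolate(noise, clean, t):
    """x_t = t * x_1 + (1 - t) * x_0, with x_0 the noise and x_1 the clean data."""
    return [t * c + (1.0 - t) * n for n, c in zip(noise, clean)]


def make_training_example(z_clean, a_clean, rng):
    """Build one noisy joint input, its timestep pair, and the target velocity."""
    z_noise = [rng.gauss(0.0, 1.0) for _ in z_clean]
    a_noise = [rng.gauss(0.0, 1.0) for _ in a_clean]
    t_video, t_action = sample_decoupled_timesteps(rng)
    noisy_input = (interpolate(z_noise, z_clean, t_video)
                   + interpolate(a_noise, a_clean, t_action))
    # Target velocity v = [z_1 - z_0 ; a_1 - a_0].
    target_velocity = ([c - n for n, c in zip(z_noise, z_clean)]
                       + [c - n for n, c in zip(a_noise, a_clean)])
    return noisy_input, (t_video, t_action), target_velocity


rng = random.Random(0)
pairs = [sample_decoupled_timesteps(rng) for _ in range(20000)]
mean_video = sum(tv for tv, _ in pairs) / len(pairs)
share = sum(tv < ta for tv, ta in pairs) / len(pairs)
print(f"mean t_video = {mean_video:.3f}, share with t_video < t_action = {share:.2f}")
```

The printed share shows how often a training sample lands in the region where video is noisier than the action, which a shared timestep never visits.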
This design yields a substantial practical payoff. At deployment time, we can run a single Euler step from $t=0$ to $t=1$ for the actions, while the video component is clamped to a partially noisy state or optionally refined with a few inexpensive operations (the latter is tolerated because video quality is not the bottleneck for control frequency). Because training has repeatedly exposed the model to the $t_{\text{video}} \approx 0.125$ condition, the action prediction at inference is statistically indistinguishable from a multi-step denoised action: quality is retained while effective action latency is halved compared to the four-step regime. The change requires no architectural surgery; it is purely a training-schedule intervention that exploits the asymmetry between the two modalities’ tolerance to noise. Video, being a conditioning signal for the policy, can remain stochastic as long as the recovery of the action trajectory is accurate and temporally coherent.
A diagrammatic view captures this asymmetry succinctly. On the left, the training setup contrasts two vertical noise bars: the video bar is almost entirely filled—dominated by high noise due to the Beta-bias—while the action bar exhibits a clean, uniform mix. A small schematic of the skewed Beta(7,1) density feeds into the video timestep, underlining that most training samples force the model to resolve actions from a barely recognizable visual scaffold. On the right, the inference panel shows a single-step leap: the action bar transitions from full gray to solid green in one stride, whereas the video bar retains its partial noise, reflecting the reality that visual perfection is not required for robust control. The sequence of arrows from the training diagram to the inference result reinforces the core message: decoupling noise schedules during training aligns the model’s experience with the one-step regime, making the reactive latency drop from multi-step denoising without sacrificing the action prediction fidelity that defines a zero-shot policy.

11. Inference Speedup Stack (38×)

With a decoupled noise schedule, DreamZero-Flash already trims the denoising chain to a handful of steps, but the raw compute cost of even a short chain on a contemporary GPU still sits far above the tight latency budget of a real-time control loop. A world action model that must run at the robot’s natural update rate (commonly 5–10 Hz) cannot afford amortized inference times measured in seconds. To close the gap from a promising research prototype to a deployable zero-shot policy, the DreamZero team engineered a holistic inference speedup stack that delivers a cumulative 38× throughput gain relative to the original video-action diffusion baseline, ultimately achieving stable closed-loop control at 7 Hz. Understanding this stack means appreciating how system-level pragmatics and model-level algorithmic shortcuts can compound without sacrificing the generative quality that makes zero-shot generalization possible.
The optimization journey begins with the observation that the most compute‑intensive component of the DreamZero architecture is the autoregressive video‑conditioned denoiser. Each denoising step queries a large transformer to refine the action chunk while cross‑attending to the history of video latent features. If every step recomputes the full attention over the same conditioning context, the overhead swells linearly with the number of denoising steps. Attention caching is therefore the first low‑hanging fruit: the keys and values derived from the static video conditioning are computed once per control cycle and reused across all denoising steps. In practice, this avoids roughly half the FLOPs inside the cross‑attention blocks because the context embeddings never change between steps. Combining caching with FlashAttention‑2 kernels eliminates the memory‑bound bottleneck, giving a straightforward 2× latency reduction on modern hardware.
But even with cached context, the absolute step count matters greatly. The decoupled noise schedule from the previous section reduces the number of denoising passes per chunk without visible degradation of action quality, and that single algorithmic redesign contributes a ~4× speedup. Yet the denoiser itself, in its original training precision and eager execution mode, still leaves abundant room for acceleration. Moving the entire inference graph to mixed precision (FP16) with selective INT8 quantization of the heaviest linear layers, while keeping the diffusion latents in bfloat16, yields another factor of 2.5× in raw matmul throughput on tensor cores, without any noticeable loss of action accuracy. The quantized weights are pre-calibrated on a small validation set, and the straight-through gradient that was used during training ensures the model is inherently robust to numerical noise.
Beyond data-type optimization, a combination of kernel fusion and graph compilation tightens the execution. The DreamZero denoiser contains many small operations (layer normalization, activation functions, residual additions) that, executed naively, waste the GPU’s memory bandwidth and scheduling capacity. By hand-writing custom CUDA kernels for the fused group-norm-and-SiLU pattern and then applying torch.compile to the entire autoregressive loop with full-graph capture, the framework cuts kernel launch overhead by over 80%. The result is roughly a 1.8× wall-clock speedup. Moreover, within the closed-loop rollout, the model can reuse the KV cache not only across denoising steps but also across consecutive control cycles when the video context overlaps; a sliding-window caching scheme lifts the amortized speed to a cumulative 38× compared to the unoptimized research code.
Importantly, none of these optimizations are independent; their gains partially multiply but also exhibit diminishing returns because some bottlenecks shift from compute to memory to launch overhead as each is removed. The final stack, carefully profiled on the target deployment GPU, balances these factors. The visual that follows captures the essence of the 38× speedup as a layered bar chart, where each horizontal segment corresponds to one optimization tier—attention caching, step reduction via decoupled noise, quantization, and kernel+graph optimizations—annotated with its approximate contribution. Together they form a pragmatic recipe that transforms a video‑conditioned diffusion model from a heavy academic artifact into a policy that can react to novel scenes seven times each second.
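As a rough sanity check, the per-tier factors quoted in this section can be multiplied out. The sketch below is arithmetic on the approximate figures in the text only; it assumes the tiers compose multiplicatively (which the text itself qualifies) and that the 38× is measured against the 5.7 s naive latency from Section 9, which the source does not state explicitly.

```python
import math

# Approximate per-tier factors as quoted in the text above.
TIER_SPEEDUPS = {
    "attention caching + FlashAttention-2": 2.0,
    "step reduction via decoupled noise": 4.0,
    "mixed precision + INT8 quantization": 2.5,
    "kernel fusion + graph compilation": 1.8,
}
CUMULATIVE_SPEEDUP = 38.0   # headline figure
NAIVE_LATENCY_S = 5.7       # assumption: baseline from Section 9


def stack_summary():
    product = math.prod(TIER_SPEEDUPS.values())
    latency = NAIVE_LATENCY_S / CUMULATIVE_SPEEDUP
    return {
        "product_of_tiers": product,
        # Factor left over for sliding-window cache reuse and rounding.
        "residual_factor": CUMULATIVE_SPEEDUP / product,
        "latency_after_stack_s": latency,
        "implied_rate_hz": 1.0 / latency,
    }


for name, value in stack_summary().items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the four tiers multiply to about 36×, and a 38× reduction of 5.7 s gives roughly 0.15 s per chunk, in line with the reported 7 Hz.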

12. Zero-shot Generalization to Unseen Tasks and Environments

The previous section demonstrated how DreamZero achieves a 38× inference speedup through an optimized inference stack, enabling closed-loop control at 7 Hz. Speed alone, however, offers little value if the actions themselves are brittle. The defining promise of world action models is that they can act in worlds they have never physically visited. This section tests that claim directly: can DreamZero perform tasks that involve entirely new verbs, objects, and motion patterns, without any task-specific finetuning?
Vision-Language-Action models (VLAs) have dominated recent robot learning by directly mapping observations and language instructions to motor commands. Their training objective is to maximize the likelihood of expert actions conditioned on vision and text. When confronted with a task whose motion vocabulary lies outside the training distribution—say, “slide the mug across the table” when the robot has only seen picking and placing—a VLA must extrapolate a control policy from a latent space shaped entirely by labeled action data. The result is often catastrophic: the model produces arbitrary joint movements, because it has no mechanism to imagine the physical consequences of a motion before executing it. DreamZero sidesteps this brittleness by decoupling the “what” from the “how.” It first predicts a future video of the desired outcome, then extracts actions via an inverse dynamics model that is trained only on universal physical interaction data. This means that even if a task verb has never been coupled with a specific robot embodiment, the system can still generate a plausible video of the goal (thanks to its pretrained video model’s semantic and physical knowledge) and then faithfully realize it.
The joint video-action generation objective underlies this capability. DreamZero models the conditional distribution $p(v_{1:T}, a_{1:T} \mid o_0, l)$, where $v_{1:T}$ is a sequence of future frames, $a_{1:T}$ the action chunk, $o_0$ the initial observation, and $l$ the language instruction. By factorizing this as $p(v_{1:T} \mid o_0, l) \cdot p(a_{1:T} \mid v_{1:T}, o_0)$, the system first samples a plausible video rollout from the world model and then conditions action selection on that rollout. The inverse dynamics model $p(a_t \mid o_t, o_{t-1})$ is learned from a diverse corpus of robot play data that covers a wide range of physical contact events but not specific task semantics. Because physical dynamics (friction, object displacement, collision) are largely invariant across tasks, this module generalizes well as long as the video it receives is physically coherent. In zero-shot scenarios, the video model may occasionally produce implausible sequences, but the inverse model will still attempt to track them; thus any performance gap originates upstream in video quality, not in action decoding.
The experimental validation was designed to stress-test exactly this decoupling. Two distinct robot embodiments—the AgiBot G1 arm and the DROID-Franka platform—were evaluated on a suite of manipulation tasks split into “seen” and “unseen” categories. The ten seen tasks involve novel combinations of environments and objects but use motions that appear in the training corpus. The ten unseen tasks are strictly zero-shot: they demand new verbs (e.g., “tilt the cup,” “scoop the beans”) that were never paired with robot actions during training. This split exposes whether a model merely composes existing skills or truly invents new physical behaviors. Performance was measured by average task progress, a continuous score from 0% (failure) to 100% (complete success), judged by a motion-capture system and human verification.
DreamZero’s results on the AgiBot G1 immediately separate it from the VLA baselines. On seen tasks, DreamZero achieved a mean progress of 62.2%, while the best pretrained VLA (π0.5) reached only 27.4%. In relative terms the gap widened slightly on unseen tasks: DreamZero managed 39.5% versus the VLA’s 16.3%, a ratio of about 2.4× compared with about 2.3× on seen tasks. VLAs trained from scratch on the same data collapsed to near 0% on unseen tasks, confirming that action-only supervision provides no inductive bias for composing new motions. On the DROID-Franka platform, which has different kinematics and gripper dynamics, DreamZero attained 49% on unseen tasks compared to 33% for the VLA, a substantial margin on a platform with its own control challenges. Crucially, DreamZero’s drop from seen to unseen was about 23 percentage points on AgiBot, while the VLA’s drop was 11 points; this might misleadingly suggest that the VLA degrades less. In reality, the VLA’s seen-task performance was already poor, so it had far less room to fall. DreamZero’s high ceiling and graceful degradation reflect a genuine zero-shot capability, not a floor effect.
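The difference between absolute and relative degradation is easy to check from the AgiBot G1 figures quoted above. The following Python sketch only re-derives quantities from those four percentages; the dictionary layout and function name are illustrative.

```python
# AgiBot G1 average task progress (%), as quoted in the text.
AGIBOT_RESULTS = {
    "DreamZero": {"seen": 62.2, "unseen": 39.5},
    "best pretrained VLA": {"seen": 27.4, "unseen": 16.3},
}


def generalization_summary(seen, unseen):
    """Absolute drop in points and the fraction of seen-task progress retained."""
    return {
        "absolute_drop_points": seen - unseen,
        "retained_fraction": unseen / seen,
    }


for model, scores in AGIBOT_RESULTS.items():
    summary = generalization_summary(scores["seen"], scores["unseen"])
    print(f"{model}: drop = {summary['absolute_drop_points']:.1f} points, "
          f"retained = {summary['retained_fraction']:.1%}")
```

DreamZero loses more points in absolute terms, yet it retains a slightly larger fraction of its seen-task progress than the VLA, which is why the smaller VLA drop should not be read as better robustness.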
The diagnostic that clinches the argument is the failure analysis. Detailed inspection of rollouts showed that DreamZero’s execution errors stemmed almost entirely from video generation artifacts: frames where an object temporarily vanishes, a gripper penetrates a table, or a liquid fails to flow naturally. In every such case, the inverse dynamics model tracked the flawed video with high fidelity, producing actions that were physically correct with respect to the hallucinated visual input. When a stronger video backbone (as discussed earlier) was substituted, task progress improved proportionally. Conversely, no amount of action-level tuning in VLAs could overcome the inability to imagine the desired motion in the first place. This finding solidifies the central thesis: world action models inherit zero-shot generalization from their video generation foundation, and their action extraction is near-optimal given the predicted video.
The visual below summarizes these comparisons as a grouped bar chart. The chart organizes results by embodiment and task category. For the AgiBot G1, two groups—Seen Tasks and Unseen Tasks—each contain a blue bar for DreamZero and an orange bar for the best VLA. The heights mirror the reported percentages, immediately conveying the 2–3× advantage of DreamZero. The DROID-Franka cluster repeats the same color coding for unseen tasks. A dashed horizontal line at 0% marks the scratch VLA collapse, reinforcing that from-scratch action-only learning fails completely on novel verbs. An inset annotation points to DreamZero’s bars and notes that remaining errors trace back to video generation quality, not action inference. This compact diagram transforms the numerical evidence into a single visual argument: zero-shot generalization is a property of the world model, and DreamZero leverages it to act where VLAs cannot.

13. Cross-Embodiment Transfer and Few-shot Adaptation

The previous section demonstrated that DreamZero, pretrained on a single embodiment (AgiBot) with language-annotated demonstrations, can handle a broad set of unseen tasks and environments. The reference point for this section is a baseline of 38.3% average task progress across nine novel scenarios. That baseline already reflects the power of a world action model (WAM) to generalise when the training distribution is diverse enough. But real-world deployment demands more than generalising within the same robot morphology. A truly capable policy should absorb visual information from entirely different bodies (other robots, or even humans) and quickly adapt to a new embodiment with minimal interaction. The results on cross-embodiment transfer and few-shot adaptation suggest that DreamZero’s architecture is well suited to this challenge.
The key mechanism lies in the decoupled nature of the world action model. At its core, WAM is a video predictor: given past observed frames $o_{<t}$ and a language instruction $l$, it autoregressively forecasts future frames $\hat{o}_{t:t+H}$. Actions, when provided, condition this generation via cross-attention, but they are not required for the model to imagine plausible visual futures. Consequently, the world model can be trained on video-only demonstrations (sequences where actions are missing) and still absorb the visual dynamics, object interactions, and task semantics present in those recordings. This is fundamentally different from a standard VLA (Vision–Language–Action) model, which would be blind to data that lacks action labels.
Suppose we have a new embodiment, YAM, with different kinematics and a different control interface. We record 12–20 minutes of video demonstrations of various tasks, but without any corresponding action signals. By co-training DreamZero on a 1:1 mixture of the original AgiBot action-labelled data and the YAM video-only data, the model’s latent space begins to encode the visual appearance of task progress on the YAM body—how its gripper approaches an object, how it pours a liquid, or how it opens a drawer. The action head remains supervised only on the AgiBot portion, yet the shared visual backbone and the autoregressive video prior now carry a much richer notion of what successful behaviour looks like regardless of embodiment. When we later query this co-trained model on the YAM robot, the implicit inverse dynamics—the mapping from the predicted visual trajectory to the actions that would bring it about—can exploit this strong visual prior to produce better actions, even though the model has never seen a single YAM action during training.
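One way to picture this co-training recipe is as a loss in which every sample contributes a video term but only action-labelled samples contribute an action term. The Python sketch below is an assumption-laden illustration, not DreamZero's training code: the per-sample squared errors are taken as given, the dictionary keys and both function names are hypothetical, and the real objective is the joint flow-matching loss described earlier.

```python
def mix_one_to_one(action_labelled, video_only):
    """Interleave the two sources so each contributes equally (a 1:1 mixture)."""
    batch = []
    for labelled, unlabelled in zip(action_labelled, video_only):
        batch.extend([labelled, unlabelled])
    return batch


def cotraining_loss(batch):
    """Video loss over every sample; action loss over action-labelled samples only."""
    video_terms = [s["video_sq_err"] for s in batch]
    action_terms = [s["action_sq_err"] for s in batch if s["has_actions"]]
    video_loss = sum(video_terms) / len(video_terms)
    # Video-only samples (e.g. YAM or human clips) add nothing to the action term.
    action_loss = sum(action_terms) / len(action_terms) if action_terms else 0.0
    return video_loss + action_loss


agibot = [{"video_sq_err": 1.0, "action_sq_err": 2.0, "has_actions": True}]
yam = [{"video_sq_err": 3.0, "action_sq_err": None, "has_actions": False}]
print(cotraining_loss(mix_one_to_one(agibot, yam)))  # 4.0
```

In this toy batch the video term averages over both samples (2.0) while the action term comes from the AgiBot sample alone (2.0), so the shared backbone still receives gradient signal from the unlabelled clip.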
The quantitative evidence underscores the effectiveness. Co-training with YAM video-only data lifts task progress from the baseline 38.3% to 55.4%. Remarkably, substituting human video demonstrations (with their dramatic morphological gap) yields a near-equivalent 54.3%. The fact that human video, with its five-fingered hands and entirely different joint configurations, can almost match robot–robot transfer suggests that the world model primarily captures high-level visual dynamics of objects and environments, not low-level kinematics. The human data provides a strong prior about how a cup should tilt, where an item should be placed, or how a cloth should be folded—information that seamlessly transfers to the robotic execution policy.
The morphological gap is not fully bridged, of course: the model still needs to learn the specific action mapping of the new body. That is where few-shot embodiment adaptation shines. Starting from the AgiBot-pretrained WAM, the team post-trained on only 30 minutes of YAM play data (11 tasks performed naturally, without costly task-specific labelling). After this brief adaptation phase, the policy retained its robust language following and, crucially, generalised to novel objects that were never seen during those 30 minutes of play. The sample efficiency arises because the world model’s video predictor already knows what should happen; the fine-tuning merely teaches the minimal inverse dynamics required to translate that internal visual plan into the new embodiment’s action space. In essence, the model has learned an implicit inverse dynamics $a_t = f(o_{<t}, l, \hat{o}_{t:t+H})$ where the heavy lifting is done by the predicted future frames $\hat{o}$, and only a lightweight adjustment is needed to connect those predictions to the new joint space.
The visual below takes these abstract mechanisms and grounds them in the concrete experimental outcomes we just discussed. It lays out the core transfer pipeline (pretraining on AgiBot, co-training with video from a different source, and a separate few-shot adaptation branch) and then backs it with a comparison table and a succinct callout. The table sets the YAM and human sources side by side, with the performance jumps in bold, reinforcing the message that world-model priors from video alone dramatically close the gap toward fully trained in-domain policies. The lower callout for the 30-minute adaptation highlights that it is the implicit inverse dynamics, nurtured by the video prediction objective, that makes such rapid embodiment switching possible. Together, the diagram serves as a compact summary of why DreamZero’s decoupled architecture turns video-only data from any source into a zero-shot policy amplifier.

14. Ablations: Data Diversity, Scale, and Architecture

The previous section established that DreamZero can absorb video-only data from other embodiments and adapt to a new robot with about 30 minutes of play data. These results raise a practical question: what actually makes this possible? Is it the sheer amount of training data, the choice of architecture, or something more subtle about how the model is trained? To dissect these factors, the authors run a systematic set of ablations on a pared-down benchmark (PnP Easy tasks), training each variant for 50,000 steps with a batch size of 32. The results paint a clear picture: data diversity and model scale both move task progress substantially, the autoregressive design matches bidirectional accuracy at a fraction of the inference cost, and scaling capacity alone does nothing for a VLA trained on the same heterogeneous data.
The first ablation tackles data diversity. Training a 14B-parameter DreamZero model on 500 hours of repetitive demonstration data (sampled from a narrow distribution of behaviors) yields a task progress of only 33%. In contrast, the same model trained on a diverse set of 500 hours — covering a wide variety of objects, scenes, and motion patterns — reaches a solid 50% progress. Why does this gap exist? DreamZero learns a joint distribution over future frames and actions, which implicitly encodes an inverse dynamics model: given the current observation and a candidate future frame, it must infer the action that bridges them. When the training data is repetitive, the mapping from a desired visual change to an action is under-constrained and ambiguous; the model never sees enough distinct state-action pairings to learn a robust general inverse dynamics. Diverse data, on the other hand, exposes the model to myriad situations where the same visual subgoal demands different actions depending on context, forcing it to build richer conditional representations. In short, repetitive data starves the inverse dynamics model of the variety it needs.
The second factor is model scale. Moving from a 5B-parameter to a 14B-parameter DreamZero (both trained on diverse data) lifts task progress from 21% to 50%. The smaller model generates videos with noticeable visual artifacts and hallucinations — objects morphing, grippers disappearing, or physically impossible transitions — which corrupt the planning process. The larger model produces cleaner future frames, which in turn yield more reliable action inference. This is a standard scaling effect: more capacity reduces the video prediction error, and better visual foresight directly improves the zero-shot policy. However, note that even the 14B model only reaches 50% in this controlled setting, underscoring that capacity helps but does not magically solve the problem without the right data recipe.
The third ablation contrasts the autoregressive (AR) architecture of DreamZero with a bidirectional (BD) attention mask over the video-action sequence. Both architectures achieve 50% task progress, so accuracy is unchanged, but the autoregressive version runs 3 to 4 times faster during inference. This speedup comes from KV caching: because AR generation only attends to past tokens, their key-value representations can be cached and reused instead of being recomputed at each generation step. A bidirectional model has no such shortcut, since every position attends to every other and the entire sequence must be reprocessed whenever it is extended, which makes real-time control far harder. Moreover, the AR model produces noticeably smoother motions; its causal inductive bias aligns more naturally with the sequential decision problem, preventing the planner from “looking ahead” and cheating on future frames.
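The KV-caching argument can be checked on a toy single-head attention layer. The sketch below is an illustration, not DreamZero's implementation: the `KVCache` class and function names are hypothetical, the inputs are random vectors rather than video or action latents, and a real model would cache per layer and per head. It shows that stepping through a sequence with cached keys and values reproduces full causal attention exactly.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention_full(q, k, v):
    """Recompute causal attention over the whole sequence in one pass."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # hide future positions
    return softmax(scores) @ v

class KVCache:
    """Stores past keys and values so each new step only attends, never recomputes."""
    def __init__(self, dim):
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))

    def step(self, q_new, k_new, v_new):
        self.keys = np.vstack([self.keys, k_new[None, :]])
        self.values = np.vstack([self.values, v_new[None, :]])
        scores = self.keys @ q_new / np.sqrt(q_new.shape[0])
        return softmax(scores) @ self.values

def max_difference(seq_len=6, dim=4, seed=0):
    """Largest gap between full recomputation and cached incremental attention."""
    rng = np.random.default_rng(seed)
    q, k, v = (rng.normal(size=(seq_len, dim)) for _ in range(3))
    full = causal_attention_full(q, k, v)
    cache = KVCache(dim)
    stepped = np.stack([cache.step(q[t], k[t], v[t]) for t in range(seq_len)])
    return float(np.abs(full - stepped).max())

print(max_difference())
```

The equivalence only holds because of the causal mask: with bidirectional attention, earlier outputs would change whenever a new position is appended, so nothing could be reused.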
Perhaps the most striking finding sits in a small callout box below the main table: Vision-Language-Action (VLA) models fail completely on heterogeneous data. In the same PnP Easy setting, both 5B and 14B VLA variants achieve 0% task progress when trained on diverse demonstrations. This is a sobering result. It confirms that simply scaling a VLA — even to the same parameter count as DreamZero — cannot overcome the fundamental difficulty of modeling highly heterogeneous action distributions. VLAs map raw observations directly to motor commands without an explicit world-modeling component; when the data mixes wildly different behaviors and embodiments, the action prediction head collapses, unable to disentangle the many-to-many relationship between observations and actions. DreamZero’s world-action factorization, which separates visual foresight from action inference, proves essential for digesting this diversity.
The visual that accompanies this section distills these ablation results into a clean, at-a-glance table. Each of the three ablation factors gets two sub-rows (the alternative condition and the baseline), with the best-performing condition bolded for immediate comparison. A fourth column, “Key Insight,” spells out the why behind each number: the need for diverse state-action correspondences, the role of scaling in reducing hallucinations, and the speed advantage of autoregressive generation. Below the table, a bordered callout box with a red left edge draws the eye to the devastating VLA failure: “5B and 14B VLAs achieve 0% task progress on diverse demonstrations, confirming that capacity alone cannot overcome the difficulty of modelling heterogeneous action distributions.” This layout allows the reader to absorb the empirical story in seconds — the two winning conditions (diverse data + AR architecture) standing out against the weaker alternatives, and the VLA collapse serving as a sharp reminder that architecture and representation matter far more than raw parameter count when the data is messy.
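For readers without the figure, the numbers quoted in this section can be collected in a few lines. The snippet below is simply a restatement of the reported task-progress figures with the per-factor differences computed; the dictionary layout and names are assumptions made for illustration.

```python
# Task progress (%) on PnP Easy, as quoted in the ablations above.
# Each entry is (weaker condition, stronger or baseline condition).
ABLATIONS = {
    "data diversity": (("repetitive, 500 h", 33), ("diverse, 500 h", 50)),
    "model scale": (("5B", 21), ("14B", 50)),
    "architecture": (("bidirectional", 50), ("autoregressive", 50)),
}
# VLA variants trained on the same diverse demonstrations.
VLA_ON_DIVERSE_DATA = {"5B": 0, "14B": 0}

def progress_gains(ablations):
    """Percentage-point gain of the stronger condition over the weaker one."""
    return {factor: better[1] - worse[1]
            for factor, (worse, better) in ablations.items()}

gains = progress_gains(ABLATIONS)
for factor, ((worse_name, worse), (better_name, better)) in ABLATIONS.items():
    print(f"{factor:15s} {worse_name} {worse}% -> {better_name} {better}% "
          f"({gains[factor]:+d} points)")
```

The architecture row shows a zero-point gain because its benefit is the 3 to 4 times faster inference, which this table of task progress does not capture.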

15. Lessons, Future Directions, and Open Challenges

After dissecting the individual knobs—data diversity, scale, and architectural choices—it is time to step back and ask what the whole DreamZero experiment tells us about world action models (WAMs) as a policy class. The ablation studies confirmed that performance lives and dies by the richness of the training data and that scale matters, but the deeper picture is more provocative: a model trained simply to imagine future scenes and the actions that cause them can emerge as a surprisingly generalist robotic policy. That shift from bespoke action-labeled datasets to large, diverse video corpora has implications that go well beyond a single benchmark.
The central lesson is that joint video-action prediction functions as an implicit inverse dynamics model. When the network is forced to predict the next observation $\mathbf{o}_{t+1}$ alongside the action $\mathbf{a}_t$, it must learn a consistent mapping from current state and desired future state to the action that bridges them. Crucially, this connection is learned not from explicit goal–action pairs but from observing how the world changes under continuous motion. In practice, the model becomes adept at answering the question: “What action would carry me from where I am to the future frame I just hallucinated?” That is the essence of zero-shot policy extraction: no task-specific reward or hand-crafted planner is needed; the world model itself generates a sequence of imagined futures, and the action head converts each imagined transition into a motor command. This is why diverse, non-repetitive data is so critical: only when the training videos contain a rich soup of motion patterns can the implicit inverse dynamics generalize to unseen behaviors and environments.
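One way to make the "implicit inverse dynamics" reading concrete is the chain-rule factorization of the joint prediction target. This is a sketch rather than the paper's stated objective: the conditioning symbol $c$ (standing for the language instruction) is an assumed name, and the real model may operate on longer windows of frames and actions than the single step written here.

```latex
p(\mathbf{o}_{t+1}, \mathbf{a}_t \mid \mathbf{o}_t, c)
  = \underbrace{p(\mathbf{o}_{t+1} \mid \mathbf{o}_t, c)}_{\text{visual foresight}}
    \;\underbrace{p(\mathbf{a}_t \mid \mathbf{o}_t, \mathbf{o}_{t+1}, c)}_{\text{inverse dynamics}}
```

Read this way, a model that fits the left-hand side well must also fit the second factor, which is the distribution over actions that carry the current observation to the imagined one.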
The empirical record justifies the excitement. In zero-shot generalization tests, DreamZero attains 62% task progress on seen task families and 40% on entirely unseen tasks, while the best Vision-Language-Action (VLA) baseline manages only 27% and 16% respectively. These numbers reflect more than raw performance; they reveal a qualitative difference in how the system transfers. A VLA that maps language instructions to actions struggles when a command requires a motion never seen during training, because the action space is gated by the narrow distribution of its annotated demonstrations. DreamZero’s world-action model, by contrast, can imagine a future that satisfies the language goal and then invert that imagined trajectory into actions, even if that precise motion never appeared in its training set. The result is a policy that can attempt manipulation skills it was never explicitly taught, including the kind of motions, such as untying a shoelace or shaking hands, on which the introduction showed VLAs floundering.
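The size of the gap is easier to judge as both an absolute and a relative difference. The short calculation below uses only the four percentages quoted above; the variable and function names are illustrative.

```python
# Task progress (%) quoted in the text for zero-shot generalization.
RESULTS = {
    "seen": {"DreamZero": 62, "best VLA": 27},
    "unseen": {"DreamZero": 40, "best VLA": 16},
}

def compare(results):
    """Absolute gap (percentage points) and ratio of DreamZero to the best VLA."""
    summary = {}
    for split, scores in results.items():
        dz, vla = scores["DreamZero"], scores["best VLA"]
        summary[split] = {"gap_points": dz - vla, "ratio": dz / vla}
    return summary

summary = compare(RESULTS)
for split, stats in summary.items():
    print(f"{split}: +{stats['gap_points']} points, {stats['ratio']:.2f}x")
```

On both splits DreamZero's task progress is a bit more than twice that of the best VLA baseline, so the advantage does not shrink when moving from seen to unseen tasks.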
Equally striking is the model’s capacity for cross-embodiment transfer. DreamZero leverages video-only demonstrations—from human hands or from robots with different kinematics—and translates them into effective behavior on a target robot. There is no need for paired action labels on the source data; the world model simply observes the visual motion and learns to reproduce it through its own body. This bootstrapping effect was shown to boost performance when human videos are added to the training mix, and even when only robot videos from another embodiment are available. The model implicitly disentangles the “what” (the visual trajectory) from the “how” (the specific motor commands), enabling a primitive form of cross-morphology imitation. Moreover, with just 30 minutes of free play data on the target setup, DreamZero adapts rapidly to follow language commands and generalize to novel objects—a property that positions WAMs as strong candidates for fast deployment in new environments.
All of this, however, is gated by one bottleneck: video generation fidelity. Action extraction in DreamZero is a deterministic, faithful conversion of predicted future frames into motor signals. If the imagined video diverges from a physically plausible sequence (objects warp, contacts are missed, occluded regions are hallucinated), the corresponding actions will be erroneous. The policy’s accuracy is therefore bounded by the raw realism of its generative rollouts. This explains why the ablations found that improved video prediction quality, whether through larger models or richer data, directly raised task progress. It also hints that the path forward lies less in better action heads and more in ever higher-fidelity world simulators learned from pixels.
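The claim that action quality is bounded by video quality can be seen in a toy closed loop. Everything here is a stand-in chosen for illustration: the state is a 2-D point rather than an image, the "world model" is an ideal straight-line plan plus Gaussian noise, and `extract_action` is a hypothetical inverse dynamics in which the action equals the displacement. None of this reflects DreamZero's actual networks.

```python
import numpy as np

def extract_action(current, imagined_next):
    """Toy inverse dynamics: in this point world the action is the displacement."""
    return imagined_next - current

def final_error(start, goal, steps, video_noise, seed=0):
    """Distance to goal after following noisy imagined futures for `steps` steps."""
    rng = np.random.default_rng(seed)
    state = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    for i in range(steps):
        remaining = steps - i
        ideal_next = state + (goal - state) / remaining
        # Imperfect "video prediction": the ideal future corrupted by noise.
        imagined_next = ideal_next + rng.normal(0.0, video_noise, size=state.shape)
        action = extract_action(state, imagined_next)
        state = state + action  # the environment executes the action exactly
    return float(np.linalg.norm(state - goal))

clean = final_error([0.0, 0.0], [1.0, 1.0], steps=10, video_noise=0.0)
noisy = final_error([0.0, 0.0], [1.0, 1.0], steps=10, video_noise=0.1)
print(f"perfect video: {clean:.2e}   noisy video: {noisy:.3f}")
```

Because the action is read directly off the imagined frame, any error in that frame is executed verbatim; in this toy the only way to reduce the final error is to reduce `video_noise`.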
With those takeaways in mind, the open challenges form a natural roadmap. First, we lack scaling laws for WAMs: how do model size, data volume, and compute interplay to determine video quality and downstream policy performance? Answering this is essential for allocating resources efficiently. Second, the internet is brimming with egocentric human video, but harnessing it for robot control requires bridging the embodiment gap and inferring plausible actions from observation alone; world action models are promising, but we need reliable ways to map human motion onto robot affordances. Third, inference speed remains a practical hurdle: DreamZero’s original pipeline demands high-end GPUs, and even the optimized DreamZero-Flash variant runs at roughly 7 Hz in real time; closing the gap to the sub‑10 ms loops needed for reactive manipulation is non-trivial on consumer hardware. Fourth, while autoregressive rollout works for moderate horizons, truly long-horizon reasoning likely calls for hierarchical planners (System 2) or drastically longer context windows to avoid compounding errors. Finally, the generalization‑dexterity trade-off remains unsolved: policies that excel at broad, open-vocabulary tasks often fumble on high‑precision insertion or in-hand manipulation, while expert demonstrations for dexterity sacrifice the diversity that fuels generalization. Finding the sweet spot will define the next generation of WAMs.
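The inference-speed gap in the third challenge can be put in numbers. The calculation below takes the 7 Hz and 10 ms figures from the text and assumes, as a simplification not stated in the source, that each control step requires one full model call; action chunking or asynchronous execution would change the budget.

```python
def period_ms(rate_hz):
    """Time available per control step, in milliseconds, at a given rate."""
    return 1000.0 / rate_hz

FLASH_RATE_HZ = 7.0        # DreamZero-Flash control rate quoted in the text
REACTIVE_BUDGET_MS = 10.0  # upper end of the "sub-10 ms" loop quoted in the text

flash_period = period_ms(FLASH_RATE_HZ)
required_speedup = flash_period / REACTIVE_BUDGET_MS

print(f"7 Hz period: {flash_period:.1f} ms")
print(f"speedup needed to reach a 10 ms loop: {required_speedup:.1f}x")
```

Under that one-call-per-step assumption, a 7 Hz loop leaves about 143 ms per step, so reaching a 10 ms loop would take a further speedup of roughly 14 times.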
The visual that accompanies this closing section distills these twin perspectives into a clear contrast. On one side, the key takeaways crystallize as settled insights: the implicit inverse dynamics, the zero-shot and cross-embodiment numbers, the few-shot agility, and the primacy of video quality. On the other, the open challenges are set out as questions needing answers, each a frontier that must be crossed for world action models to mature from promising demos into robust everyday tools. Together they form not a triumphal conclusion but a checkpoint—a candid snapshot of what is now understood and what remains to be invented.