Imitation Bootstrapped Reinforcement Learning (IBRL) - FeynmanWiki

MACHINE LEARNING, REINFORCEMENT LEARNING - 45 MIN READ

Imitation Bootstrapped Reinforcement Learning (IBRL)

1. Sparse-Reward Robotics: Why Vanilla RL Fails First

To understand why IBRL is needed at all, it helps to start with the uncomfortable baseline: in many robotic manipulation tasks, vanilla reinforcement learning does not merely learn slowly—it may fail to generate the first meaningful learning signal within the available interaction budget.
The core issue is that robot tasks often provide task-completion rewards rather than dense feedback. A policy may receive no reward for approaching the object, aligning the gripper, touching the handle, or partially lifting the cube. It receives reward only when the full task succeeds:
$$r_t = R(s_t,a_t)\in\{0,1\},\qquad R(s_t,a_t)=1 \text{ only after success.}$$
This is a very different regime from simulated locomotion benchmarks where every timestep may reward forward velocity, posture, or energy efficiency. In sparse-reward manipulation, almost every transition collected by an untrained agent looks identical from the reward perspective: failure, failure, failure, failure.
The difficulty is amplified by the action space. Robot actions are usually continuous and high-dimensional: joint velocities, end-effector deltas, gripper commands, or hybrid combinations of these. A common exploration rule perturbs the actor’s output with Gaussian noise,
$$\mathcal{A}=[-1,1]^d,\qquad a_t=\pi_\theta(s_t)+\epsilon,\quad \epsilon\sim\mathcal{N}(0,\sigma^2).$$
But in a high-dimensional continuous space, random noise is an extremely weak strategy for discovering a precise, temporally extended behavior. Opening a drawer, inserting a peg, lifting an object, or pressing a button may require many coordinated actions in sequence. The probability that a randomly initialized policy plus local Gaussian perturbations stumbles into a successful trajectory can be effectively negligible.
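A back-of-the-envelope calculation gives a feel for the scale. The sketch below is a toy model, not something taken from the IBRL paper: it assumes exploration is uniform over $[-1,1]^d$ (rather than Gaussian around the actor), and that success requires every one of `horizon` consecutive actions to land within `tol` of a required action on every dimension. The function name and all parameters are hypothetical.

```python
def chance_success_probability(tol: float, dim: int, horizon: int) -> float:
    """Probability that uniform random actions in [-1, 1]^dim stay within
    +/- tol of a required action on every dimension for `horizon` steps.

    Toy assumption: the tolerance interval lies fully inside [-1, 1], so the
    per-dimension hit probability is (2 * tol) / 2 = tol.
    """
    if not 0.0 < tol <= 1.0:
        raise ValueError("tol must be in (0, 1]")
    per_step = tol ** dim        # all dimensions must hit at one time step
    return per_step ** horizon   # and this must hold at every time step


for dim in (2, 4, 7):
    p = chance_success_probability(tol=0.2, dim=dim, horizon=10)
    print(f"dim={dim}: p(success by chance) = {p:.3e}")
```

Even with a generous tolerance, the probability shrinks exponentially in both the action dimension and the number of steps that must be right, which is why undirected noise rarely produces a first success.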
This is not just an exploration inconvenience. It directly corrupts the early learning dynamics of off-policy actor-critic methods such as TD3, DDPG-style algorithms, or SAC-like variants. These methods rely on a replay buffer $\mathcal{B}$ of observed transitions and train a critic using temporal-difference targets such as
$$y^{(j)}=r_t+\gamma Q_{\phi'}(s_{t+1},a'),\qquad r_t=0.$$
If the buffer contains only failures, then the immediate reward term is almost always zero. The target depends entirely on bootstrapping from another learned value estimate, which itself was trained mostly on zero-reward data. The result is a kind of self-referential emptiness: the critic is asked to distinguish good actions from bad actions before it has ever observed evidence of success.
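The effect is easy to reproduce in a tabular toy problem. The sketch below is illustrative only: it uses a hypothetical 4-state chain and plain tabular Q-learning over a fixed buffer, not a neural TD3 critic. With a zero-initialized table and only zero-reward transitions, every TD target is zero and the table never changes; adding a single successful transition is enough for value to propagate backward.

```python
import numpy as np


def fitted_q(buffer, n_states, n_actions, gamma=0.99, lr=0.5, sweeps=200):
    """Tabular Q-learning sweeps over a fixed replay buffer.

    buffer: list of (s, a, r, s_next, done) tuples.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next, done in buffer:
            bootstrap = 0.0 if done else gamma * q[s_next].max()
            q[s, a] += lr * (r + bootstrap - q[s, a])
    return q


# Hypothetical chain: action 1 moves right, action 0 stays in place.
# Reward 1 is given only for taking action 1 in the last state (state 3).
failures = [(s, 0, 0.0, s, False) for s in range(4)]
failures += [(s, 1, 0.0, s + 1, False) for s in range(3)]
success = [(3, 1, 1.0, 3, True)]

q_fail = fitted_q(failures, n_states=4, n_actions=2)
q_mixed = fitted_q(failures + success, n_states=4, n_actions=2)

print("failures only:\n", q_fail)
print("with one success:\n", q_mixed)
```

In the failure-only case the learned table gives the actor nothing to rank; in the mixed case, moving right is preferred in every state.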
In principle, bootstrapping can propagate reward backward from rare successful transitions. But that assumes those transitions exist in the replay buffer. When they do not, the critic may learn a nearly flat value landscape. The actor then optimizes against this critic and receives little useful directional information:
if all actions appear equally bad, the actor gradient is uninformative;
if value estimates are dominated by approximation noise, the actor may exploit critic errors;
if exploration remains local around a poor policy, the buffer continues to contain mostly failures.
This creates a feedback loop: bad exploration produces an uninformative critic, and an uninformative critic produces a bad policy, which then continues collecting unhelpful data.
Modern off-policy RL algorithms are powerful partly because they reuse experience efficiently. But reuse does not solve the problem when the experience itself lacks task-relevant signal. A replay buffer with one million failed grasp attempts may still contain very little information about how to grasp. It tells the agent what did not work, but not necessarily which infinitesimal changes would move it toward success—especially when failures are all assigned the same scalar reward.
Robotics makes this failure mode especially costly. In simulation, one might brute-force exploration by running thousands of environments in parallel for millions or billions of steps. On real robots, that is rarely acceptable. Hardware wears out, resets are slow, human supervision may be required, and safety constraints limit reckless exploration. Sparse-reward robotic RL therefore faces a severe mismatch: the algorithm may require many exploratory failures before learning starts, while the real system can only afford a small number.
This is the motivation for bringing demonstrations into the picture. But before discussing how to use them well, we need to be precise about what they are fixing. Demonstrations are not merely “extra data”; they provide successful trajectories that break the zero-reward deadlock. They give the critic examples of states and actions near success, and they give the actor behavioral structure that random exploration is unlikely to discover quickly.
The visual below compresses this failure mode into two coupled views. On the left is the robot’s interaction problem: many intermediate states receive $r_t=0$, while the green success condition is rare and hard to reach by chance. On the right is the learning system: a randomly initialized actor samples noisy continuous actions, the environment returns sparse rewards, and the critic update is dominated by targets with $r_t=0$.
The key takeaway is that vanilla RL can fail before the usual optimization story begins. The critic is not yet learning a useful value function, because the replay buffer has not captured successful behavior. The actor is not improving meaningfully, because the critic cannot tell it which continuous actions lead toward success. This is the specific bottleneck IBRL is designed to address.

2. Demonstrations Help—but Prior Integrations Are Brittle

The sparse-reward problem is not just that rewards are rare; it is that early learning has almost no useful signal to amplify. If a robot only receives reward after a successful insertion, grasp, or placement, then an untrained policy $\pi_\theta$ may spend millions of transitions exploring behaviors that are physically plausible but task-irrelevant. In that setting, demonstrations are an obvious source of structure: a small dataset $\mathcal{D}=\{\xi_1,\ldots,\xi_n\}$ contains trajectories that already reach the rewarding region of the state-action space.
But “using demonstrations” is not a single algorithmic choice. The brittle part is how the demonstrations are connected to reinforcement learning. A demonstration can be treated as a supervised learning target, as static replay data, or as a regularizer on the policy. Each choice helps with the sparse-reward cold start, but each also imposes a different failure mode.
The simplest option is behavior cloning. We train an imitation policy $\mu_\psi$ to reproduce demonstrator actions:
$$L(\psi)=\mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\|\mu_\psi(s)-a\|^2\right].$$
This can be surprisingly strong when the demonstrations are high-quality and the test-time state distribution stays close to the demonstration distribution. The problem is that robotic control rarely stays perfectly on-distribution. A tiny positioning error can push the gripper into states the demonstrator never visited; from there, the cloned policy has no mechanism for trial-and-error correction. Behavior cloning converts successful trajectories into a policy, but it does not convert deployment mistakes into learning signal.
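As a minimal sketch of the cloning objective, the snippet below fits a linear policy by least squares, which minimizes the same squared-error loss in closed form. Everything here is an illustrative assumption: the synthetic linear "expert", the dimensions, and the choice of a linear $\mu_\psi$. Practical robot policies are typically neural networks trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrations: 200 states of dimension 4, actions of dimension 2,
# produced by a linear "expert" plus small noise (an illustrative assumption).
true_w = rng.normal(size=(4, 2))
states = rng.normal(size=(200, 4))
actions = states @ true_w + 0.01 * rng.normal(size=(200, 2))

# Linear BC policy mu_psi(s) = s @ psi. Least squares minimizes the mean
# squared error between mu_psi(s) and the demonstrated action a.
psi, *_ = np.linalg.lstsq(states, actions, rcond=None)


def mu(s: np.ndarray) -> np.ndarray:
    """Cloned policy: maps a batch of states to predicted actions."""
    return s @ psi


bc_loss = np.mean(np.sum((mu(states) - actions) ** 2, axis=1))
print(f"BC loss on demonstrations: {bc_loss:.6f}")
```

The loss is small on the demonstration states, but nothing in this fit says how `mu` behaves on states far from `states`, which is exactly the off-distribution weakness described above.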
A second option is to place demonstrations directly into the replay buffer, as in reinforcement learning with prior data. Here, $\mathcal{D}$ contributes transitions $(s_t,a_t,r_t,s_{t+1})$ to the off-policy learner’s replay buffer $\mathcal{B}$. This is attractive because it fits naturally into algorithms like TD3, SAC, or DDPG: the critic can bootstrap from successful transitions before the learned actor discovers them independently.
However, this turns demonstrations into static evidence. Once stored in $\mathcal{B}$, a demonstration trajectory is just another set of tuples. The policy is not explicitly asked to preserve the demonstrator’s competence, nor is it given a live imitation mechanism that can help it choose good actions in difficult states. The demonstrations may improve value estimates, but the actor can still drift toward poor actions if the critic is inaccurate early on, especially under sparse rewards where value extrapolation is fragile.
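A minimal sketch of this option, assuming a simple first-in-first-out buffer: the `ReplayBuffer` class, its capacity, and the toy transitions are hypothetical and do not reflect any particular library or the paper's implementation. The point is only that demonstration transitions are stored and sampled exactly like agent experience.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# (state, action, reward, next_state, done)
Transition = Tuple[tuple, tuple, float, tuple, bool]


class ReplayBuffer:
    """FIFO replay buffer that can be pre-filled with demonstration data."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.storage: Deque[Transition] = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        # Oldest entries are dropped automatically once capacity is reached.
        self.storage.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self) -> int:
        return len(self.storage)


# Seed the buffer with two demonstration transitions, then add agent failures.
demos = [((0,), (1,), 0.0, (1,), False), ((1,), (1,), 1.0, (2,), True)]
buffer = ReplayBuffer(capacity=4)
for transition in demos:
    buffer.add(transition)
for step in range(3):
    buffer.add(((step,), (0,), 0.0, (step,), False))

print(len(buffer), buffer.sample(2))
```

In this plain FIFO variant the oldest demonstration has already been evicted after three online steps. Implementations that want demonstrations to persist usually keep them in a separate, protected store, which is a design choice outside this sketch.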
A third option is regularized fine-tuning. One first pretrains a policy from demonstrations, then continues RL while adding a behavior-cloning penalty:
$$L(\theta)+\alpha\lambda L_{\mathrm{BC}}.$$
This looks like a reasonable compromise: imitate enough to remain near the expert, but optimize reward enough to improve beyond it. The difficulty is that the compromise is controlled by coefficients such as $\alpha$ and $\lambda$, and those coefficients encode a global tradeoff. Too much cloning prevents improvement; too little cloning loses the benefit of demonstrations. Worse, the same policy is being asked to do two jobs at once: stay close to the demonstrator and discover better behavior through RL.
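A one-dimensional worked example shows how the coefficient fixes the outcome. The setup is a toy assumption, not the paper's objective: the RL loss is replaced by $(a-a_{\text{best}})^2$, the BC penalty by $(a-a_{\text{demo}})^2$, and a single weight `w` stands in for the product $\alpha\lambda$. Setting the derivative of the sum to zero gives the closed-form minimizer used below.

```python
def regularized_optimum(w: float, a_best: float = 1.0, a_demo: float = 0.0) -> float:
    """Minimizer of (a - a_best)^2 + w * (a - a_demo)^2 over scalar actions a.

    The first term stands in for the RL loss, the second for the BC penalty,
    and w plays the role of the combined coefficient alpha * lambda.
    Setting the derivative to zero gives a = (a_best + w * a_demo) / (1 + w).
    """
    if w < 0:
        raise ValueError("w must be non-negative")
    return (a_best + w * a_demo) / (1.0 + w)


for w in (0.0, 0.1, 1.0, 10.0, 100.0):
    print(f"w={w:>6}: optimal action = {regularized_optimum(w):.3f}")
```

With `w = 0` the actor ignores the demonstration entirely, and with large `w` it is pinned next to the demonstrated action even though a better one exists. The same `w` applies in every state, so one actor is always pulled between two targets.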
That single-policy tension is the key conceptual weakness. In many states, the imitation policy $\mu_\psi$ may be better than an early RL actor because it has seen successful behavior. In other states, $\mu_\psi$ may be wrong because of covariate shift, suboptimal demonstrations, or task variation, while the RL actor has learned a corrective maneuver. A global decision (“trust demonstrations” or “trust RL”) is too coarse. What we really want is a per-state decision: when both an imitation action $a^{IL}$ and a reinforcement-learning action $a^{RL}$ are available, use whichever appears better for the current state.
This is the motivation behind IBRL’s design. Instead of treating imitation as merely initialization, replay data, or a penalty term, IBRL treats imitation as an action proposer that can compete with the RL actor. The demonstrations are not frozen as past experience, and the actor is not forced to internalize all imitation behavior through a single regularized objective. The algorithm can ask a more local question: given this state, should we try the imitation action or the learned actor’s action?
The visual below compactly organizes this distinction. The three rows correspond to the standard ways demonstrations are commonly integrated: behavior cloning, replay-buffer prior data, and regularized fine-tuning. Each one uses $\mathcal{D}$ productively, but each also commits too early to a brittle interface between demonstrations and RL.
The important takeaway is the line underneath the comparison: need per-state choice, not global commitment. That sentence is the bridge to IBRL. The next step is to formalize how a hybrid IL/RL agent can propose both $a^{IL}$ and $a^{RL}$, evaluate them with learned critics, and bootstrap exploration from whichever action currently looks more promising.

3. IBRL in One Picture: Treat IL as an Action Proposer

The brittleness of earlier demonstration integrations suggests a slightly different question: instead of asking demonstrations to become the reinforcement learning policy, can we use them as a reliable source of candidate actions while still letting RL decide what is actually valuable? This is the central move in IBRL. The imitation policy is not treated as a constraint, a permanent regularizer, or a static replay artifact. It is treated as an action proposer.
Concretely, IBRL first trains a behavioral cloning policy $\mu_\psi$ from the demonstration dataset $\mathcal{D}$. After this pretraining step, $\mu_\psi$ is held fixed. During online RL, the agent also maintains a learned actor $\pi_\theta$, trained from experience in the replay buffer $\mathcal{B}$, together with a critic $Q_\phi$. At a state $s_t$, the two policies each propose an action:
$$a^{IL}=\mu_\psi(s_t),\qquad a^{RL}=\pi_\theta(s_t).$$
These are not averaged, mixed with a hand-tuned coefficient, or selected according to a fixed exploration schedule. Instead, they form a small candidate set:
$$C_{\theta,\psi}(s_t)=\{a^{IL},a^{RL}\}.$$
The key question becomes: which proposed action looks better under the current value estimate? IBRL answers this with critic-based arbitration. Using a target critic $Q_{\phi'}$, the algorithm chooses the action with the larger estimated return:
$$a^*=h_Q(s_t)=
\begin{cases}
a^{IL}, & Q_{\phi'}(s_t,a^{IL})\ge Q_{\phi'}(s_t,a^{RL})\\
a^{RL}, & Q_{\phi'}(s_t,a^{RL})>Q_{\phi'}(s_t,a^{IL}).
\end{cases}$$
This looks simple, but it changes the role of imitation in an important way. Behavioral cloning is good at producing plausible expert-like actions in states that resemble the demonstrations. Sparse-reward RL, meanwhile, is good at improving a policy once it can collect useful reward-bearing trajectories. IBRL tries to exploit both properties without forcing either one to do the other’s job. The imitation policy gives the agent a competent fallback proposal in many states; the RL actor remains free to discover actions that outperform the demonstrator.
This is different from simply initializing an actor with BC and then fine-tuning it. In pure BC initialization, the demonstrator’s influence can disappear quickly as the actor updates, especially under noisy sparse rewards. In BC-regularized fine-tuning, the demonstrator remains influential, but often as a penalty that can prevent meaningful improvement beyond the demonstrations. IBRL’s design is more modular: the imitation policy is fixed, but it is not necessarily obeyed. It proposes; the critic judges.
It is also different from replay-buffer demonstration methods. Demonstration transitions in the replay buffer help the critic and actor see successful behavior, but they do not directly answer the question, “What should I do in this new state right now?” A fixed imitation policy can generalize from demonstrations to produce an action at the current state $s_t$, even if that exact state was never present in $\mathcal{D}$. That makes the demonstrator an online source of action candidates, not just an offline source of transitions.
Of course, this arbitration only helps if the critic is informative enough. If $Q_{\phi'}$ is badly miscalibrated, it may prefer a poor RL action over a good imitation action, or cling too long to imitation because it underestimates novel RL proposals. This is why IBRL is usually implemented with the stabilizing machinery of modern off-policy actor-critic methods: target networks, replay buffers, clipped or ensembled critics, delayed actor updates, and careful bootstrapping. The method’s elegance depends on a subtle assumption: the critic does not need to be perfect, but it must be good enough to rank a small set of plausible candidates more often than not.
That small-set ranking perspective is the intuition worth holding onto. IBRL does not ask the critic to solve a huge continuous-action maximization problem from scratch at every state. Instead, it gives the critic two meaningful options: one from imitation and one from the learned RL actor. The hybrid policy is therefore greedy over a restricted candidate set:
$$h_Q(s_t)=\arg\max_{a\in C_{\theta,\psi}(s_t)} Q_{\phi'}(s_t,a).$$
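The restricted argmax is short enough to write out directly. The sketch below assumes the target critic is available as a plain callable `q_target(state, action) -> float`; that interface, the function name `select_action`, and the toy quadratic critic are all hypothetical. Ties go to the imitation action, matching the $\ge$ in the case definition above.

```python
from typing import Callable, Tuple

import numpy as np

QFunction = Callable[[object, np.ndarray], float]


def select_action(
    q_target: QFunction, state, a_il: np.ndarray, a_rl: np.ndarray
) -> Tuple[np.ndarray, str]:
    """Return (action, source) for the proposal with the higher Q estimate.

    Ties are resolved in favor of the imitation proposal.
    """
    q_il = q_target(state, a_il)
    q_rl = q_target(state, a_rl)
    if q_il >= q_rl:
        return a_il, "IL"
    return a_rl, "RL"


# Toy critic (hypothetical): values actions by closeness to a_good.
a_good = np.array([0.5, -0.2])


def q_toy(state, action) -> float:
    return -float(np.sum((action - a_good) ** 2))


a_il = np.array([0.4, -0.2])   # imitation proposal, close to a_good
a_rl = np.array([-1.0, 1.0])   # untrained actor proposal, far from a_good
action, source = select_action(q_toy, None, a_il, a_rl)
print(source, action)
```

Only two critic evaluations are needed per step, so the arbitration adds little cost on top of querying both policies.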
This is the conceptual bridge to the later algorithmic details. The actor $\pi_\theta$ will be trained to improve under the critic, the critic will be trained from replay, and the target networks will stabilize the bootstrapped targets. But the high-level behavior is already visible here: demonstrations keep injecting reasonable actions into the decision process, while RL supplies an adaptive competitor that can eventually win whenever it finds something better.
The visual below condenses this idea into a single pipeline. On the left, demonstrations train a fixed IL policy $\mu_\psi$, while online interaction trains the RL actor $\pi_\theta$ and critic $Q_\phi$ through the replay buffer $\mathcal{B}$. At each state, both policies propose actions, forming $C_{\theta,\psi}(s_t)$. The target critic then compares the two candidates and outputs the selected hybrid action $a^*=h_Q(s_t)$.
The important thing to notice is the separation of responsibilities. IL is not the final controller, and RL is not left unaided in the sparse-reward wilderness. IL proposes; RL proposes; the critic arbitrates. That is the core mechanism by which IBRL bootstraps sparse-reward robotic reinforcement learning while preserving the possibility of improving beyond the demonstrations.

4. Background: MDP, TD3-Style Actor-Critic, and BC

Once we view imitation not as a one-time pretraining trick but as an action proposer, it helps to slow down and name the machinery that IBRL is built on. The paper is not introducing a new reinforcement-learning formalism from scratch; it is carefully modifying a familiar off-policy actor-critic pipeline so that demonstrations can remain useful throughout learning, especially when sparse rewards make ordinary exploration painfully inefficient.
The environment is modeled as a Markov decision process, where at each time step the robot observes a state $s$, takes a continuous action $a$, transitions to a new state $s'$, and receives reward $r$. In sparse-reward manipulation, the reward is often nearly binary: most behavior receives no useful feedback, and only successful task completion gives a positive signal. This is the core difficulty. Even if the optimal policy is simple in hindsight, the learner may need to accidentally solve the task many times before the critic can assign meaningful value to the actions that caused success.
Off-policy actor-critic methods such as TD3 are attractive here because they reuse old experience from a replay buffer. Instead of discarding failed attempts, TD3 repeatedly trains a critic $Q(s,a)$ to estimate long-term return and an actor $\pi(s)$ to choose actions that the critic thinks are valuable. The critic is trained using Bellman-style bootstrapping: the value of the current action is regressed toward the immediate reward plus the estimated value of a future action. The actor is then updated to maximize the critic’s estimate:
$$\pi \leftarrow \arg\max_\pi Q(s,\pi(s)).$$
TD3 adds several stabilizing details that matter in continuous control. It uses two critics and takes the more pessimistic value estimate when forming targets, reducing overestimation bias. It uses target networks, slowly updated copies of the actor and critics, so that the bootstrap target does not move too violently. It also delays actor updates relative to critic updates, allowing the value estimates to become less noisy before the policy chases them. These details make TD3 much more stable than a naive deterministic actor-critic, but they do not solve the central sparse-reward problem: if the buffer contains mostly failures, the critic has little signal from which to learn.
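Two of these ingredients fit in a few lines of NumPy. The sketch below shows the pessimistic (clipped double-Q) target and the slow target-network update; it omits delayed actor updates and the rest of the training loop. The terminal mask `done` and the step size `tau` are standard in TD3 implementations but are not spelled out in the text above, so treat them, along with the function names, as assumptions of this sketch.

```python
import numpy as np


def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """TD target that bootstraps from the smaller of two target-critic estimates.

    done is 1.0 for terminal transitions, which disables bootstrapping.
    """
    q_min = np.minimum(q1_next, q2_next)
    return reward + gamma * (1.0 - done) * q_min


def polyak_update(target_params, online_params, tau=0.005):
    """Move each target parameter a small step toward its online counterpart."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, online_params)]


reward = np.array([0.0, 1.0])
done = np.array([0.0, 1.0])
q1_next = np.array([2.0, 5.0])
q2_next = np.array([1.0, 7.0])
target = clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.9)
print("targets:", target)

new_target = polyak_update([np.zeros(2)], [np.ones(2)], tau=0.1)
print("updated target params:", new_target[0])
```

The first transition bootstraps from the lower estimate (1.0 rather than 2.0); the second is terminal, so its target is just the reward.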
Behavior cloning attacks the problem from the opposite direction. Given demonstration pairs $(s,a_E)$, a supervised learner trains a policy $\pi_{\text{BC}}$ to imitate the expert action:
$$\min_\pi \; \mathbb{E}_{(s,a_E)\sim D_E}\left[\lVert \pi(s)-a_E\rVert^2\right].$$
This can produce competent behavior with very few demonstrations, particularly when the demonstrations cover the states encountered at test time. But BC has its own familiar failure mode: distribution shift. If the cloned policy makes a small mistake and visits a state not represented in the expert dataset, it may not know how to recover. In robotics, these small deviations compound quickly: a gripper approaches from a slightly wrong angle, an object moves unexpectedly, or contact dynamics differ from the demonstration trajectory.
So TD3 and BC have complementary weaknesses. TD3 can, in principle, improve beyond the demonstrator and recover from off-distribution states, but it may never discover reward in the first place. BC can immediately suggest plausible task-directed actions, but it has no built-in mechanism for improving from trial-and-error reward. IBRL is motivated by the observation that we should not have to choose between these modes too early.
A common compromise is to place demonstrations into the replay buffer or to add a BC regularization term during RL fine-tuning. These help, but they are somewhat indirect. Replay-buffer demonstrations influence the critic only through Bellman updates, and their effect may dilute as the buffer fills with agent experience. BC regularization keeps the actor near the demonstrator, but that can become a liability if the learned value function discovers a better action than the human provided, or if the robot must adapt to states outside the demonstration distribution.
IBRL’s background assumptions are therefore quite specific:
We have an off-policy continuous-control backbone, TD3-style.
We have a small set of expert demonstrations.
We can train or maintain an imitation policy from those demonstrations.
We trust the critic enough to compare candidate actions, but not enough to learn useful actions without help.
We want demonstrations to guide exploration and bootstrapping without permanently constraining the final policy.
This setup prepares the key move in the next section: instead of asking whether the actor or the imitation policy should dominate globally, IBRL asks a local greedy question at each state: which proposed action currently looks better under the critic? That question turns imitation from a static prior into an active candidate generator.
The visual below condenses these ingredients into a single map: the MDP interaction loop supplies experience, TD3 learns actor and critic functions from replay, and BC supplies a demonstration-trained policy that can propose meaningful actions even when rewards are sparse. The important point is not that these are separate modules, but that IBRL will wire them together through the critic’s action evaluation.
Reading the diagram as a dependency graph, the critic is the bridge between reinforcement learning and imitation. TD3 gives us the mechanism for estimating value; BC gives us a source of expert-like candidate actions; demonstrations seed the process with behavior that reaches rewarding states. IBRL’s contribution begins once we let these pieces compete and cooperate through greedy value comparisons rather than treating imitation as merely pretraining or regularization.

5. Deriving Actor Proposal: Greedy Choice Over Two Experts

Now that we have the ingredients—an MDP, a TD3-style actor-critic, and a behavior cloning policy—we can ask the central IBRL question: how should the robot choose actions while learning? In vanilla off-policy actor-critic, the answer is simple: execute the current actor plus exploration noise. In pure imitation, the answer is also simple: execute the cloned policy. IBRL’s first key move is to avoid committing to either one globally. Instead, at every state, it asks both policies for a proposal and lets the critic decide which one currently looks better.
This is a small change mechanically, but it changes the exploration story substantially. Sparse-reward robotic RL often fails not because the actor update is wrong in principle, but because the agent spends most of its early interaction budget taking actions that never reach informative reward. A behavior-cloned policy, even if imperfect, may already know how to reach meaningful parts of the state space. But blindly following it has the opposite problem: the learner may inherit demonstration limitations and fail to improve beyond the demonstrator. IBRL treats imitation as an action proposal mechanism, not as a permanent constraint.
At state $s_t$, IBRL queries two “experts.” The first is the fixed imitation policy, trained beforehand from demonstrations:
$$a^{IL}\sim \mu_\psi(s_t).$$
The second is the online RL actor, usually with exploration noise as in TD3:
$$a^{RL}=\pi_\theta(s_t)+\epsilon,\qquad \epsilon\sim\mathcal{N}(0,\sigma^2).$$
These two actions form a small candidate set:
$$C_{\theta,\psi}(s_t)=\{a^{IL},a^{RL}\}.$$
The important point is that IBRL is not mixing the two actions by averaging them. It is also not adding a behavior-cloning loss that continuously pulls the actor toward demonstrations. Instead, it performs a greedy arbitration: score both candidate actions under the current target critic, then execute the action with the larger predicted value:
$$a^*=\operatorname*{argmax}_{a\in C_{\theta,\psi}(s_t)}Q_{\phi'}(s_t,a).$$
This is the “actor proposal” rule. The term proposal matters: both $\mu_\psi$ and $\pi_\theta$ merely suggest actions. The critic decides which action is more promising according to the value function learned so far.
There is a subtle assumption hiding here: the critic must be good enough to compare two nearby or plausible actions. IBRL does not require the critic to solve the entire control problem from scratch at the beginning. It only asks for a local ranking between a demonstration-like action and an RL actor action. That is a weaker and often more stable demand than asking the actor to discover successful behavior through noise alone. Early in training, the imitation proposal may frequently receive higher value because it leads toward demonstrated success trajectories. Later, if the RL actor discovers improvements, the same rule allows the actor proposal to win.
This is what makes the method different from two common demonstration-based strategies:
Replay-buffer demonstrations help the critic and actor learn from expert transitions, but during environment interaction the policy may still explore poorly.
BC-regularized fine-tuning keeps the actor close to demonstrations, which can stabilize learning but may also suppress beneficial deviations.
IBRL actor proposal uses demonstrations at decision time, while still allowing the RL actor to take over whenever the critic believes it is better.
So the imitation policy becomes a bootstrap for exploration rather than a leash on the final policy.
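The arbitration rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `select_action` and `toy_q` are hypothetical names, the critic is a hand-written stand-in for $Q_{\phi'}$, and the two candidate vectors stand in for samples from $\mu_\psi$ and the noisy actor.

```python
from typing import Callable, Sequence, Tuple

Action = Sequence[float]

def select_action(q_target: Callable[[Sequence[float], Action], float],
                  state: Sequence[float],
                  a_il: Action,
                  a_rl: Action) -> Tuple[Action, str]:
    """Return the candidate with the larger target-critic value, plus its source."""
    q_il = q_target(state, a_il)
    q_rl = q_target(state, a_rl)
    # Ties go to the imitation proposal here; any tie-breaking rule would do.
    return (a_il, "IL") if q_il >= q_rl else (a_rl, "RL")

def toy_q(state, action):
    # Hypothetical critic: prefers actions close to the state vector.
    return -sum((a - s) ** 2 for a, s in zip(action, state))

state = [0.5, -0.2]
a_il = [0.4, -0.1]   # stand-in for a sample from the imitation policy
a_rl = [-0.9, 0.8]   # stand-in for the noisy RL actor output
chosen, source = select_action(toy_q, state, a_il, a_rl)
print(source)  # IL
```

Note that neither proposal is modified: the critic only ranks them, so the actor is never pulled toward the demonstration by this step.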
In practice, the greedy comparison must be protected against overestimation. TD3 already teaches us that a single learned critic can be optimistic about actions that exploit approximation error. If we use an ensemble of target critics, IBRL can make the comparison more conservative by scoring each candidate action with the minimum value over a selected critic subset $K$:
$$a^*=\operatorname*{argmax}_{a\in C_{\theta,\psi}(s_t)}\min_{i\in K}Q_{\phi'_i}(s_t,a).$$
This mirrors the clipped double-Q idea in TD3 and the more general conservative ensemble logic used in REDQ-style methods. The selected action is not the one that looks spectacular to one critic; it is the one that survives a pessimistic comparison across critics. That matters because the arbitration step directly affects data collection. A bad optimistic choice does not just create a poor target; it also sends the robot into unhelpful parts of the state space.
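A small sketch of the pessimistic comparison, under stated assumptions: actions are scalars, the two critics are hand-written toys (one with a spurious optimistic spike), and `select_conservative` is a hypothetical helper name. The subset $K$ is sampled once and shared by both candidates so that their scores are comparable.

```python
import random

def select_conservative(critics, state, candidates, subset_size=2, rng=random):
    """Pick the candidate whose minimum Q over a sampled critic subset K is largest."""
    subset = rng.sample(critics, subset_size)  # the subset K, shared by all candidates
    scores = [min(q(state, a) for q in subset) for a in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores

# Hypothetical critics: one sensible, one with an erroneous spike for large actions.
def q_base(state, action):
    return -(action - 0.3) ** 2

def q_optimistic(state, action):
    return 5.0 if action > 0.8 else -(action - 0.3) ** 2

a_il, a_rl = 0.25, 0.9
critics = [q_base, q_optimistic]

chosen, scores = select_conservative(critics, None, [a_il, a_rl], subset_size=2)
single_critic_choice = max([a_il, a_rl], key=lambda a: q_optimistic(None, a))
print(chosen, single_critic_choice)  # 0.25 0.9
```

The optimistic critic alone would send the agent after the spike at `0.9`; the minimum over both critics keeps the imitation-like action.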
The resulting behavior is an automatic schedule without an explicit schedule parameter. Early on, $\mu_\psi$ often wins because it proposes actions consistent with demonstrations. As $\pi_\theta$ improves, it can win more often, especially in states where the demonstrator is suboptimal or where reward feedback has revealed a better maneuver. In that sense, IBRL implements a greedy hybrid policy:
$$\text{execute the best currently scored proposal from imitation and reinforcement learning.}$$
The visual below condenses this arbitration mechanism. The state $s_t$ fans out into two candidate-generating branches: a fixed imitation policy branch and a noisy online RL actor branch. Their outputs are collected into $C_{\theta,\psi}(s_t)$, scored by the target critic, and reduced to a single executed action $a^*$.
The lower equations in the visual summarize the two versions of the selection rule: the basic target-critic argmax and the conservative ensemble variant. The main conceptual takeaway is that IBRL improves exploration not by replacing RL with imitation, but by letting imitation compete with RL at every interaction step under the critic’s current estimate of long-term value.

6. Deriving Bootstrap Proposal: Better Targets for Q-Learning

The same “choose the better expert” idea that helped us act more intelligently in the environment has a second, less obvious use: it can also make the critic learn from better imagined futures. In actor proposal, IBRL asked the critic to compare an imitation action and an RL actor action at the current state, then execute whichever looked more promising. But Q-learning is not only about what we do now. Every critic update also asks: what action will be taken next, and how valuable will that future be?
This matters because the TD target is the critic’s training signal. In a TD3-style algorithm, for a transition $(s_t,a_t,r_t,s_{t+1})$, the target is usually built by querying a target actor at the next state:
$$r_t+\gamma Q_{\phi'}\bigl(s_{t+1},\pi_{\theta'}(s_{t+1})\bigr).$$
This is a perfectly natural backup if we believe the future policy is just $\pi_{\theta'}$. But in IBRL, that is no longer the policy we actually intend to use. The deployed behavior is not “always follow the RL actor.” It is closer to a greedy hybrid controller: at each state, compare an imitation proposal and an RL proposal, then take the one with higher estimated value. If the target still bootstraps only through $\pi_{\theta'}$, the critic is being trained under the wrong assumption about future behavior.
The bootstrap proposal fixes this mismatch. At the next state $s_{t+1}$, IBRL constructs the same two candidates that it considers during interaction. The imitation policy proposes
$$a^{IL}_{t+1}\sim\mu_\psi(s_{t+1}),$$
while the RL side proposes a TD3-style target action with smoothing noise:
$$a^{RL}_{t+1}=\pi_{\theta'}(s_{t+1})+\operatorname{clip}(\epsilon,-c,c),\qquad\epsilon\sim\mathcal{N}(0,\sigma^2).$$
The noise term is inherited from TD3’s target policy smoothing. It prevents the critic from overfitting to narrow spikes in its own value landscape by evaluating actions in a small neighborhood around the target actor’s output. IBRL keeps that stabilizing trick, but expands the set of candidate future actions from one to two: the imitation proposal and the smoothed RL proposal.
The resulting target is:
$$y^{(j)}=r_t+\gamma\max_{a'\in\{a^{IL}_{t+1},a^{RL}_{t+1}\}}Q_{\phi'}(s_{t+1},a').$$
Conceptually, this is a Bellman backup for a future controller that says: “when I arrive at $s_{t+1}$, I will again choose the better of imitation and reinforcement learning.” That is the key shift. The critic is no longer trained to evaluate a purely learned actor; it is trained to evaluate the hybrid policy class that IBRL actually executes.
This can be especially important in sparse-reward robotics. Early in learning, $\pi_\theta$ may produce actions that are locally plausible but globally useless: small arm motions that never reach the object, grasps that miss by a few centimeters, or recovery behaviors that fail after a partial success. A behavioral cloning policy, although imperfect, may already know how to enter useful regions of the state-action space. If the TD target ignores that option, then even states near demonstrator trajectories may be bootstrapped through weak RL actions. The value estimate becomes pessimistic or noisy exactly where the agent needs reliable learning signal.
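The single-critic bootstrap target can be sketched as follows. This is an illustrative toy with scalar actions: `bootstrap_target` is a hypothetical name, the defaults `sigma=0.2` and `c=0.5` are assumed values rather than ones stated in the text, clipping the smoothed action to $[-1,1]$ is an added assumption, and terminal-state handling is omitted for brevity.

```python
import random

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def bootstrap_target(reward, next_state, q_target, target_actor, il_policy,
                     gamma=0.99, sigma=0.2, c=0.5, rng=random):
    """r + gamma * max over {IL proposal, smoothed target-actor proposal} of Q'."""
    a_il = il_policy(next_state)
    noise = clip(rng.gauss(0.0, sigma), -c, c)          # TD3-style smoothing noise
    a_rl = clip(target_actor(next_state) + noise, -1.0, 1.0)
    best_next_q = max(q_target(next_state, a_il), q_target(next_state, a_rl))
    return reward + gamma * best_next_q

# Hypothetical stand-ins: the critic likes actions equal to the state,
# the imitation policy is competent, and the early target actor is not.
q_target = lambda s, a: -(a - s) ** 2
il_policy = lambda s: s
target_actor = lambda s: -s

y = bootstrap_target(0.0, 0.5, q_target, target_actor, il_policy,
                     rng=random.Random(0))
td3_only = 0.0 + 0.99 * q_target(0.5, target_actor(0.5))
print(y > td3_only)  # True
```

In this toy case the actor-only backup is dragged down by the weak actor, while the hybrid backup propagates value through the imitation proposal.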
The max over proposals is not free of danger, though. A naïve maximization over learned Q-values can amplify overestimation: if one candidate action is assigned an erroneously high value, the backup will eagerly propagate that error. This is why the practical IBRL target uses an ensemble and scores each candidate conservatively. Instead of selecting by a single critic, it evaluates each action using the minimum over a subset $K$ of target critics:
$$y^{(j)}=r_t+\gamma\max_{a'\in\{a^{IL}_{t+1},a^{RL}_{t+1}\}}\left[\min_{i\in K}Q_{\phi'_i}(s_{t+1},a')\right].$$
There are two nested operations here with different purposes. The inner $\min$ is a conservative value estimate: it asks, “how good does this action look under the least optimistic critic?” The outer $\max$ is the hybrid policy improvement step: after scoring the IL and RL candidates conservatively, choose the better one. This combination preserves the spirit of greedy improvement while reducing the chance that the backup is driven by a single critic’s hallucination.
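A worked toy example makes the order of the two operations concrete. The lookup-table critics and the names `conservative_target` and `table_critic` are hypothetical; the numbers are chosen so that one critic badly overrates the RL candidate.

```python
def conservative_target(reward, next_state, candidates, critic_subset, gamma=0.99):
    """Inner min over critics per candidate, then outer max over candidates."""
    scores = {name: min(q(next_state, a) for q in critic_subset)
              for name, a in candidates.items()}
    winner = max(scores, key=scores.get)
    return reward + gamma * scores[winner], winner

def table_critic(table):
    # Hypothetical critic backed by a lookup table keyed on the action label.
    return lambda state, action: table[action]

critic_1 = table_critic({"a_il": 1.0, "a_rl": 3.0})   # overrates the RL action
critic_2 = table_critic({"a_il": 0.8, "a_rl": 0.2})
candidates = {"IL": "a_il", "RL": "a_rl"}

y, winner = conservative_target(0.0, None, candidates, [critic_1, critic_2])
print(winner)  # IL
```

Here the pessimistic scores are 0.8 for the IL candidate and 0.2 for the RL candidate, so the target backs up through the IL action. A max under `critic_1` alone would have used the inflated value 3.0.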
A useful way to compare the methods is:
Replay-buffer demonstrations add expert transitions, but the bootstrap target may still assume future actions come only from the RL actor.
BC-regularized fine-tuning keeps the actor near demonstrations, but it can constrain improvement even when the RL policy discovers better actions.
IBRL bootstrap proposal lets imitation and RL compete inside the Bellman target, so the critic learns values for a controller that can exploit either source at future states.
The visual below compresses this derivation into the main computational move. Vanilla TD3 has a single next-action branch: query the target actor and bootstrap through that action. IBRL replaces that single branch with two proposals at $s_{t+1}$: one from the imitation policy, one from the smoothed target actor. Both are scored by the target critic, and the target uses the better-scoring candidate.
The conservative ensemble version adds one more layer: each proposal is first evaluated by a pessimistic critic score, typically a minimum over target Q-functions, and only then compared. That is the practical recipe behind the bootstrap proposal: use imitation not only to explore now, but also to define stronger and more faithful TD targets for the future.

7. Theorem: Bootstrap Proposal Is a Bellman Backup for a Greedy Hybrid Policy

With the bootstrap proposal in place, the natural question is whether it is merely a useful engineering trick or whether it corresponds to a principled RL update. The key insight is that the proposal has a clean policy interpretation: it is exactly the Bellman backup you would write if the agent followed a greedy hybrid policy that can choose, at each demonstrated next state, between the learned actor’s action and the demonstrator’s action.
Recall the ordinary TD3-style target. Given a transition $(s,a,r,s')$, the critic is trained toward something like
$$y \;=\; r + \gamma Q_{\text{targ}}(s', \pi_{\text{targ}}(s')).$$
This assumes that, after landing in $s'$, the future will be generated by the current actor. But in sparse-reward manipulation, the current actor may still be very poor: it may not know how to grasp, align, insert, or recover. If we only bootstrap through that actor, then even demonstrated states can receive weak future-value estimates, because the target imagines that the agent will abandon the demonstrated behavior immediately after one step.
The bootstrap proposal changes the counterfactual future. Instead of asking only, “What if the actor acts next?”, it asks a more forgiving question:
What if, at the next demonstrated state, we could choose the better of the actor action and the demonstration action according to the critic?
That conceptual policy can be written as a hybrid selector. If $a_D(s')$ is the demonstrator’s next action and $\pi(s')$ is the learned actor’s action, define
$$\pi_H(s')=\arg\max_{a \in \{\pi(s'),\, a_D(s')\}}Q(s',a).$$
Then its Bellman backup is
$$r + \gamma Q(s', \pi_H(s'))=r + \gamma\max\Big(Q(s',\pi(s')),\;Q(s',a_D(s'))\Big).$$
That right-hand side is precisely the bootstrap proposal. So the theorem is not saying that IBRL magically solves continuous-action maximization. It is saying something more modest and useful: the target is the Bellman backup for a policy that is greedy over a small proposal set containing one RL action and one imitation action.
This distinction matters. A replay-buffer demonstration method usually helps by changing what states and transitions the critic sees. A behavior-cloning regularizer helps by pulling the actor toward demonstrated actions. The bootstrap proposal does something different: it changes the future policy assumed by the TD target. Demonstrations are not just samples from the past; they become candidate actions for future value propagation.
There are a few assumptions hidden in this clean interpretation. First, the demonstrator action must be available at the next state being backed up, which is naturally true for demonstration transitions where $s'$ has a recorded next action. Second, the critic’s ranking must be at least meaningful enough that “take the higher $Q$” is not systematically fooled. In practice, IBRL uses target networks and critic ensembles to make this selection less brittle, because a max over candidate actions can amplify overestimation error. Third, the hybrid policy is usually a training-time construct: the robot may not have a demonstrator action available at arbitrary deployment states.
The theorem also clarifies why suboptimal demonstrations are not necessarily fatal. If the demonstrator action is better than the actor’s current proposal, the backup can propagate value through the demonstrated continuation. If the actor has already discovered a better action, the max can ignore the demonstration. In that sense, the hybrid policy is not pure imitation and not pure reinforcement learning; it is a local greedy competition between the two.
The visual below should be read as a compact version of this argument. Starting from the next state $s'$, there are two proposed continuations: the actor proposal and the demonstration proposal. The critic acts as a gate, selecting whichever proposal has larger estimated value. That selected action is then used inside the Bellman target.
The important takeaway from the diagram is the equivalence: bootstrap proposal = Bellman backup under the greedy hybrid policy. This equivalence is what makes the method feel less like an ad hoc bonus for demonstrations and more like a particular form of approximate policy improvement, restricted to a small but strategically chosen set of actions.

8. Proof: Expand the Hybrid Policy Definition

Having stated the theorem, the proof is almost disappointingly simple—which is exactly why the interpretation is useful. The IBRL bootstrap target is not an ad hoc trick that “sometimes picks the demonstrator and sometimes picks the actor.” It is the ordinary one-step Bellman evaluation target for a particular deterministic policy: the greedy hybrid policy that looks at both imitation and RL candidate actions, then chooses whichever one the critic currently prefers.
To make that statement precise, freeze the objects used inside the target: the imitation policy $\mu_\psi$, the target actor $\pi_{\theta'}$, and the target critic $Q_{\phi'}$. This “fixed target” viewpoint matters because Bellman targets are evaluated with respect to a policy and value function held constant during the critic update. We are not differentiating through the candidate selection step here, and we are not claiming that the selected action is globally optimal over the continuous action space. The maximization is only over the candidate set
$$C_{\theta',\psi}(s_{t+1}),$$
which typically contains actions proposed by the RL actor and the imitation policy, possibly with small perturbations or multiple samples depending on the implementation.
The greedy hybrid policy is defined by choosing the candidate action with the largest target critic value:
$$h_{Q_{\phi'}}(s_{t+1})\in\operatorname*{argmax}_{a'\in C_{\theta',\psi}(s_{t+1})}Q_{\phi'}(s_{t+1},a').$$
This definition hides a few assumptions that are worth making explicit. If $C_{\theta',\psi}(s_{t+1})$ is finite, an argmax exists as long as the critic returns finite values. If multiple candidates tie, any tie-breaking rule gives a valid deterministic greedy hybrid policy. The proof does not depend on whether the winning action came from the behavioral cloning policy or the RL actor; it only depends on the fact that the chosen action attains the maximum among the candidates.
Now recall the usual one-step Bellman evaluation target for a deterministic policy $h$. Given a transition $(s_t,a_t,r_t,s_{t+1})$, the critic target is
$$r_t+\gamma Q_{\phi'}(s_{t+1},h(s_{t+1})).$$
If the policy being evaluated is the greedy hybrid policy $h_{Q_{\phi'}}$, this becomes
$$r_t+\gamma Q_{\phi'}\bigl(s_{t+1},h_{Q_{\phi'}}(s_{t+1})\bigr).$$
The only remaining step is substitution. Since $h_{Q_{\phi'}}(s_{t+1})$ was defined to be an action attaining the maximum critic value over the candidate set, we have
$$Q_{\phi'}\bigl(s_{t+1},h_{Q_{\phi'}}(s_{t+1})\bigr)=\max_{a'\in C_{\theta',\psi}(s_{t+1})}Q_{\phi'}(s_{t+1},a').$$
Plugging this equality into the Bellman target gives
$$r_t+\gamma\max_{a'\in C_{\theta',\psi}(s_{t+1})}Q_{\phi'}(s_{t+1},a'),$$
which is exactly the IBRL bootstrap proposal. In other words, the bootstrap target is simply asking: among the imitation and actor proposals available at the next state, which one does the target critic think leads to the best continuation value?
This is a subtle but important distinction from several nearby methods. Demonstration replay methods may put expert transitions into the replay buffer, but their backup still typically follows the learned actor alone. BC-regularized fine-tuning may keep the actor close to demonstrations, but its critic target is still usually tied to the actor’s next action. IBRL instead changes the backup policy itself: the critic is trained as if the next action were chosen by a greedy hybrid selector over both IL and RL candidates.
With ensembles, the same proof goes through after replacing the scalar critic by a conservative aggregate. For example, if a subset $K$ of target critics is used, the selected action may maximize
$$\min_{i\in K}Q_{\phi'_i}(s_{t+1},a'),$$
so the backup uses
$$r_t+\gamma\max_{a'\in C_{\theta',\psi}(s_{t+1})}\min_{i\in K}Q_{\phi'_i}(s_{t+1},a').$$
The proof is unchanged: define the greedy hybrid policy with respect to the conservative ensemble value, then evaluate that policy with a one-step Bellman target. The minimum over critics only changes the scoring function used to rank candidates; it does not change the logical structure of the derivation.
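The substitution step can also be checked numerically. The sketch below uses hypothetical toy critics and random scalar candidates, and compares the "policy form" (evaluate the action chosen by the greedy hybrid policy) with the "backup form" (take the max of conservative scores). All names are illustrative.

```python
import random

def make_critic(center):
    # Hypothetical frozen target critic over scalar states and actions.
    return lambda state, action: -(action - center * state) ** 2

def conservative_q(critics, state, action):
    return min(q(state, action) for q in critics)

def hybrid_policy(critics, state, candidates):
    # Greedy hybrid policy: argmax of the conservative score over the candidates.
    return max(candidates, key=lambda a: conservative_q(critics, state, a))

def check_identity(trials=1000, gamma=0.99, seed=0):
    rng = random.Random(seed)
    critics = [make_critic(c) for c in (0.2, 0.5, 0.9)]
    for _ in range(trials):
        r = rng.choice([0.0, 1.0])
        s_next = rng.uniform(-1.0, 1.0)
        candidates = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        policy_form = r + gamma * conservative_q(
            critics, s_next, hybrid_policy(critics, s_next, candidates))
        backup_form = r + gamma * max(
            conservative_q(critics, s_next, a) for a in candidates)
        if policy_form != backup_form:
            return False
    return True

print(check_identity())  # True
```

The two forms agree exactly because both evaluate the same scoring function at the same maximizing candidate, which is all the proof relies on.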
A compact proof ladder is helpful here because the argument is mostly about tracking one expression through a definition. The visual that follows should be read from top to bottom: first define the hybrid action as an argmax, then insert that action into the Bellman target, then replace its critic value by the corresponding maximum over the candidate set.
The key mental picture is that the blue-highlighted hybrid action and the blue-highlighted max expression are the same quantity viewed from two sides. One is the policy form—“choose this action.” The other is the backup form—“use the value of the best candidate.” IBRL’s bootstrap target is exactly the result of moving from the first form to the second by direct substitution.

9. Algorithm: IBRL with a TD3 Backbone

Having expanded the hybrid policy definition, we can now turn it into an actual learning algorithm. The key point is that IBRL is not a wholesale replacement for off-policy actor-critic learning. It is much more surgical: start with a TD3-like backbone, keep replay, target networks, critic regression, delayed actor updates, and conservative value aggregation, then replace exactly two places where TD3 normally trusts the learned actor alone.
Those two places are the places where the algorithm must answer the question: which action should we treat as greedy under the current value function?
In vanilla TD3, the answer is essentially always “the actor’s action,” possibly with target smoothing noise. During data collection, the actor proposes an action; during bootstrapping, the target actor proposes the next action used in the Bellman target. IBRL changes both decisions by letting the imitation policy compete with the RL policy. The learned critic ensemble then chooses between them.
The first ingredient is a behavior cloning policy trained from demonstrations:
$$L(\psi)=\mathbb{E}_{(s,a)\sim\mathcal{D}}\left\|\mu_\psi(s)-a\right\|_2^2.$$
This policy $\mu_\psi$ is not assumed to be optimal everywhere. That assumption would be too strong, especially in robotics, where demonstrations may be few, noisy, suboptimal, or only cover a narrow corridor of the state space. Instead, $\mu_\psi$ is treated as a candidate generator: it proposes actions that are often reasonable near demonstrated behavior, while the RL actor $\pi_\theta$ remains free to discover improvements.
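As a deliberately tiny illustration of the behavior cloning objective, the sketch below fits a one-parameter linear policy $\mu_w(s)=w\,s$ to synthetic demonstrations by gradient descent on the squared error. The dataset, the linear policy class, and the learning rate are all assumptions made for the example; a real imitation policy would be a neural network trained with an autodiff library.

```python
def bc_loss(w, dataset):
    """Mean squared error between the policy output w*s and the demonstrated action."""
    return sum((w * s - a) ** 2 for s, a in dataset) / len(dataset)

def bc_grad(w, dataset):
    """Gradient of bc_loss with respect to w."""
    return sum(2.0 * (w * s - a) * s for s, a in dataset) / len(dataset)

# Synthetic demonstrations from a hypothetical expert that acts as a = 2*s.
dataset = [(s, 2.0 * s) for s in (-1.0, -0.5, 0.0, 0.5, 1.0)]

w = 0.0
for _ in range(200):
    w -= 0.5 * bc_grad(w, dataset)   # plain gradient descent step

print(round(w, 6), bc_loss(w, dataset) < 1e-10)  # 2.0 True
```

After pretraining, the fitted policy is frozen and used only as a proposal source.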
At interaction time, IBRL forms two candidate actions:
$$a^{IL} \sim \mu_\psi(s_t),\qquad a^{RL} = \pi_\theta(s_t) + \epsilon,\qquad \epsilon \sim \mathcal{N}(0,\sigma^2).$$
Then it executes the one that looks better under a conservative critic estimate:
$$a_t=\operatorname*{argmax}_{a\in\{a^{IL},a^{RL}\}}\min_{i\in K} Q_{\phi'_i}(s_t,a).$$
This is the actor proposal substitution. TD3 would have executed the noisy actor action. IBRL instead asks: “Between the demonstrator-like action and the RL action, which one has higher pessimistic value?” The minimum over a sampled critic subset $K$ is important. Since Q-functions are learned from sparse, off-policy data, overestimation can easily make an untested RL action look attractive. The ensemble minimum acts as a guardrail: an action must look good to multiple critics before it is preferred.
The second substitution happens inside the Bellman target. Standard TD3 would construct a target roughly by evaluating the target critic at the target actor’s next action. IBRL again uses a hybrid candidate set:
$$C_{\theta',\psi}(s)=\{\mu_\psi(s), \pi_{\theta'}(s)\},$$
possibly with the same practical details as TD3-style smoothing or stochastic candidate generation. The target becomes
$$y^{(j)}=r_t^{(j)}+\gamma\max_{a'\in C_{\theta',\psi}(s_{t+1}^{(j)})}\min_{i\in K}Q_{\phi'_i}(s_{t+1}^{(j)},a').$$
This is the bootstrap proposal substitution. It matters because the critic is trained to predict returns under the greedy hybrid policy, not under the RL actor alone. If the imitation action is still better at the next state, the Bellman backup should reflect that. Otherwise, the critic may prematurely undervalue demonstrated recovery behavior or overcommit to a weak early actor.
Once the target is defined, the critic update is standard supervised temporal-difference regression:
$$L(\phi_i)=\frac{1}{N}\sum_{j=1}^{N}\left[y^{(j)}-Q_{\phi_i}(s_t^{(j)},a_t^{(j)})\right]^2,\qquad i=1,\ldots,E.$$
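The critic regression loss is an ordinary mean squared error computed separately for each ensemble member against the shared targets. A minimal sketch with hand-picked numbers (the name `critic_losses` and the values are illustrative only):

```python
def critic_losses(ensemble_predictions, targets):
    """One mean-squared TD error per critic, all regressing onto the same targets y."""
    n = len(targets)
    return [sum((y - q) ** 2 for q, y in zip(preds, targets)) / n
            for preds in ensemble_predictions]

targets = [1.0, 0.0]                  # y^(j) for a batch of N = 2 transitions
ensemble_predictions = [
    [0.5, 0.0],                       # Q_{phi_1}(s_t, a_t) on the batch
    [1.0, 1.0],                       # Q_{phi_2}(s_t, a_t) on the batch
]
losses = critic_losses(ensemble_predictions, targets)
print(losses)  # [0.125, 0.5]
```

The targets are treated as constants: no gradient flows through the candidate selection or the target networks.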
The actor update is also recognizable from TD3, except that the critic objective is usually evaluated pessimistically through the ensemble:
$$L(\theta)=-\frac{1}{N}\sum_{j=1}^{N}\min_{i\in K}Q_{\phi_i}\left(s_t^{(j)}, \pi_\theta(s_t^{(j)})\right).$$
So the RL actor is still trained to maximize Q. IBRL does not behavior-clone the actor forever, nor does it add a permanent BC penalty to the actor objective. This is an important distinction from BC-regularized fine-tuning. The imitation policy remains as a separate proposal mechanism, while the actor learns from value gradients and can eventually surpass the demonstrations.
Target networks are updated by the usual exponential moving average:
$$\phi'_i \leftarrow \rho \phi'_i + (1-\rho)\phi_i,\qquad\theta' \leftarrow \rho \theta' + (1-\rho)\theta.$$
This keeps the greedy hybrid targets from moving too abruptly. Without target networks, the max over candidates could amplify critic noise: a small transient overestimate could select the wrong candidate, which would then become part of the target, reinforcing the error. The TD3 machinery is therefore not incidental; it stabilizes the more aggressive hybrid backup.
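The exponential moving average above amounts to one line per parameter. In this sketch the parameters are plain lists of floats, `polyak_update` is a hypothetical helper name, and the default `rho=0.995` is an assumed typical value rather than one given in the text. The example uses `rho=0.5` so the arithmetic is easy to follow.

```python
def polyak_update(target_params, online_params, rho=0.995):
    """Return rho * target + (1 - rho) * online, elementwise."""
    return [rho * t + (1.0 - rho) * o for t, o in zip(target_params, online_params)]

target, online = [0.0, 1.0], [1.0, 1.0]
for _ in range(3):
    target = polyak_update(target, online, rho=0.5)
print(target)  # [0.875, 1.0]
```

With `rho` close to 1, the target networks trail the online networks slowly, which is what keeps the max over candidates from chasing transient critic noise.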
The resulting algorithm can be read as:
Pretrain an imitation proposal $\mu_\psi$ from demonstrations.
Seed the replay buffer with demonstration data.
Interact using a critic-greedy choice between IL and RL candidate actions.
Bootstrap using a critic-greedy choice between IL and target-RL candidate actions.
Train critics and actor with ordinary TD3-style losses and target-network updates.
The visual below condenses this into a pseudocode-style view. The two pale-blue lines are the only places where IBRL departs conceptually from TD3: action selection in the environment and action selection in the TD target. The yellow-highlighted lines are not unique to IBRL, but they are crucial stabilizers inherited from TD3.
Read the algorithm as a TD3 skeleton with a hybrid policy inserted at the two decision points where “the next greedy action” must be chosen. That is the main implementation takeaway: IBRL’s power comes less from adding demonstrations to replay, and more from letting demonstrations participate directly in the greedy improvement and Bellman backup steps.

10. Soft IBRL: Avoiding Candidate Masking

The TD3-style version of IBRL gives us a concrete recipe: maintain an RL actor, keep an imitation policy available as a second proposal, evaluate both with target critics, and use the best-looking candidate both for environment interaction and for bootstrapping. That greedy hybrid is appealing because it turns the imitation policy into a kind of safety net: if the learned actor is bad early in training, the demonstrator-like action can keep the agent near useful states. But there is a subtle failure mode hiding in the word best.
The problem is that the critic used to choose between candidates is itself still being learned. Early in training, $Q_{\phi'}(s,a)$ may rank actions incorrectly, especially under sparse rewards where most observed transitions carry no immediate learning signal. If the target critic initially assigns
$$Q_{\phi'}(s_t,a^{RL}) < Q_{\phi'}(s_t,a^{IL}),$$
then greedy IBRL will always select the imitation candidate at $s_t$. That sounds harmless if $a^{IL}$ is genuinely better, but it is dangerous if $a^{RL}$ is merely undervalued. Since the RL candidate is never executed, the replay buffer receives little or no evidence about where it leads, and the critic has no opportunity to correct its estimate. The policy that could eventually outperform the imitation behavior is effectively hidden behind the critic’s initial pessimism.
This is the candidate masking issue. It is not ordinary exploration noise over a continuous action space; it is a discrete masking effect caused by the hybrid proposal mechanism itself. Once the argmax chooses one candidate, the other candidate can lose learning support. In sparse-reward robotics, this matters because the interesting actions are often not locally rewarded. A grasp adjustment, a pre-contact motion, or an initially awkward recovery maneuver may look worse than a demonstration-like action until several steps later. A hard max can prevent those actions from ever generating the delayed evidence that would make them look good.
Soft IBRL replaces this brittle deterministic choice with a $Q$-weighted stochastic choice over the candidate set. Let $C_{\theta,\psi}(s)$ denote the candidate actions proposed at state $s$, typically including an imitation proposal $a^{IL}\sim \mu_\psi(\cdot\mid s)$ and an RL actor proposal $a^{RL}=\pi_\theta(s)$, possibly with exploration noise depending on the implementation. Instead of taking
$$\arg\max_{a\in C_{\theta,\psi}(s)} Q_{\phi'}(s,a),$$
Soft IBRL samples from the Boltzmann distribution
$$p_Q(a\mid s)=\frac{\exp(\beta Q_{\phi'}(s,a))}{\sum_{a'\in C_{\theta,\psi}(s)}\exp(\beta Q_{\phi'}(s,a'))},\qquad a\in C_{\theta,\psi}(s).$$
The inverse temperature $\beta$ controls how sharp the preference is. When $\beta$ is large, even a modest $Q$-advantage makes the distribution concentrate on the higher-valued candidate, recovering behavior close to greedy IBRL. When $\beta$ is smaller, the lower-valued candidate still receives nonzero probability. This does not mean Soft IBRL ignores the critic; it still biases selection toward actions that look better. The key difference is that critic preference becomes a probability, not a veto.
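The effect of $\beta$ is easy to see numerically. The sketch below computes the Boltzmann probabilities for two hypothetical candidate values; `boltzmann_probs` is an illustrative name, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the distribution.

```python
import math

def boltzmann_probs(q_values, beta):
    """Softmax of beta * Q over a finite candidate set."""
    m = max(q_values)                                    # shift for numerical stability
    weights = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

q_values = [1.0, 0.8]   # hypothetical Q(s, a_IL), Q(s, a_RL)
for beta in (0.0, 5.0, 50.0):
    p_il, p_rl = boltzmann_probs(q_values, beta)
    print(beta, round(p_il, 4), round(p_rl, 4))
```

At `beta = 0` the two candidates are equally likely. At `beta = 50` a 0.2 gap in $Q$ already gives the IL candidate more than 99.99% of the mass, which is effectively the greedy rule.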
The same softening applies in two places. For environment interaction, the actor proposal becomes
$$a^*\sim p_Q(\cdot\mid s_t),$$
so the behavior policy sometimes executes the IL action and sometimes the RL action, with probabilities shaped by the current target critic. For bootstrapping, the target action is sampled similarly at the next state:
$$a'\sim p_Q(\cdot\mid s_{t+1}),$$
and the critic target becomes
$$y^{(j)}=r_t+\gamma Q_{\phi'}(s_{t+1},a').$$
This is a small change syntactically, but it changes the learning dynamics. The bootstrap target no longer always backs up through the single candidate that currently wins the hard comparison. Instead, both proposals can contribute to temporal-difference learning over time, which reduces the chance that one branch of the hybrid policy is starved of updates.
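A sketch of the sampled target, assuming the target-critic values of the candidates at $s_{t+1}$ have already been computed. The function name and the toy values are hypothetical.

```python
import math
import random

def soft_bootstrap_target(reward, q_next, beta, gamma=0.99, rng=random):
    """Sample a candidate index from the Boltzmann distribution, then back up its Q."""
    m = max(q_next)
    weights = [math.exp(beta * (q - m)) for q in q_next]   # unnormalized p_Q
    idx = rng.choices(range(len(q_next)), weights=weights, k=1)[0]
    return reward + gamma * q_next[idx], idx

rng = random.Random(0)
q_next = [1.0, 0.0]          # hypothetical Q'(s', a_IL), Q'(s', a_RL)
picks = [soft_bootstrap_target(0.0, q_next, beta=1.0, rng=rng)[1]
         for _ in range(2000)]
frac_il = picks.count(0) / len(picks)
print(0.65 < frac_il < 0.80)  # True
```

With `beta = 1` the IL candidate is expected to be chosen about 73% of the time, so the RL branch still contributes to roughly a quarter of the backups instead of none.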
There is an important nuance here: Soft IBRL is not simply “more random exploration.” The randomness is structured around the candidate set and the critic’s relative preferences. If the imitation policy is clearly better, it will be sampled more often. If the RL actor begins to discover useful deviations, its probability increases automatically as the critic improves. In this sense, Soft IBRL preserves the main intuition of IBRL—use demonstrations to bootstrap sparse-reward RL—while avoiding the pathological all-or-nothing behavior of the hard maximum.
The tradeoff is controlled by $\beta$. Too large, and the method collapses back toward greedy masking. Too small, and the proposal distribution may become nearly uniform over candidates, weakening the benefit of critic-guided selection. In practice, $\beta$ is a knob for how much uncertainty we want to acknowledge in the critic’s ranking. Early in sparse-reward training, uncertainty is high, so maintaining support for both IL and RL candidates is often valuable; later, a sharper distribution can exploit more confidently learned value estimates.
The visual below condenses this logic into two contrasting parts. On the left is the failure case: a hard maximum sees the imitation candidate as higher value and repeatedly selects it, while the RL candidate receives no execution support. On the right, the Boltzmann distribution turns those two $Q$-values into sampling probabilities, so the lower-valued proposal is downweighted but not eliminated.
The bottom portion then mirrors the two algorithmic uses of the same idea: sample $a^*$ from $p_Q(\cdot\mid s_t)$ for interaction, and sample $a'$ from $p_Q(\cdot\mid s_{t+1})$ for the TD target. The small $\beta$-to-large $\beta$ slider is the conceptual summary: Soft IBRL interpolates between exploratory candidate support and the original greedy hybrid policy.

11. Architectural Additions: Actor Dropout and a Shallow ViT Critic

After softening the hybrid selection rule, it is tempting to think the remaining gains in IBRL come from yet another change to the policy logic. They do not. The core idea is still the same: maintain a fixed imitation proposer $\mu_\psi$, train an online RL actor $\pi_\theta$, and use the critic to decide how to combine their action proposals. The additions here are architectural rather than conceptual. They make the TD3-style online loop more stable, especially under sparse rewards and visual observations, while preserving the clean separation between imitation and reinforcement learning.
This distinction matters because robotic RL failures are often blamed on the high-level algorithm when the real bottleneck is more mundane: unstable value estimation, brittle actor updates, and visual encoders that overfit to a tiny stream of online experience. In IBRL, the demonstrations provide a good action prior, but the RL components still have to learn from sparse, delayed, and highly correlated data. If the actor or critic becomes unstable, the hybrid policy can select bad actions for the wrong reason: not because imitation was unhelpful, but because the learned $Q$-function is noisy.
The first addition is actor dropout inside the online actor $\pi_\theta$:
$$p_{\mathrm{drop}} = 0.5.$$
At first glance, dropout may seem odd in a deterministic actor-critic method such as TD3. TD3 typically learns a deterministic policy $a=\pi_\theta(s)$, while exploration is added externally through action noise. But in sparse-reward robotics, the actor is trained on a replay buffer that changes rapidly and may contain very few successful transitions early on. A fully deterministic actor can overfit to transient critic errors: if $Q_\phi$ briefly assigns high value to an accidental action pattern, the actor may chase it aggressively.
Dropout regularizes this process by forcing the actor to remain useful under many thinned internal subnetworks. Informally, instead of optimizing a single brittle mapping $s \mapsto a$, the actor update behaves more like optimizing an implicit ensemble of related policies. This can dampen the feedback loop between actor exploitation and critic overestimation. The benefit is especially relevant in IBRL because the actor is not solely responsible for competent behavior from the start: the imitation proposer can keep the system near reasonable actions while the online actor improves.
There are also limits to this trick. Dropout is not a substitute for a good critic, nor does it magically solve exploration. If applied too aggressively or inconsistently, it can make action outputs noisy in a way that harms precise manipulation. The point of the reported choice, $p_{\mathrm{drop}}=0.5$, is not that this value is theoretically universal, but that a simple regularizer can noticeably improve online actor learning with negligible computational overhead. It is an engineering choice that supports the hybrid policy rather than redefining it.
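The "thinned subnetwork" picture can be illustrated with a small Python sketch of inverted dropout on a toy actor head. The hidden activations, weights, and function names are hypothetical. The sketch also assumes dropout is switched off in evaluation mode, which is the usual convention; this section does not say whether IBRL keeps it active when acting.

```python
import math
import random

P_DROP = 0.5  # dropout rate reported in the text


def dropout(hidden, p_drop, training, rng):
    """Inverted dropout: zero each unit with probability p_drop and
    rescale the survivors so the expected activation is unchanged."""
    if not training or p_drop == 0.0:
        return list(hidden)
    keep = 1.0 - p_drop
    return [h / keep if rng.random() < keep else 0.0 for h in hidden]


def actor_output(hidden, weights, training, rng):
    """Toy actor head: dropout, linear layer, then tanh to stay in [-1, 1]."""
    thinned = dropout(hidden, P_DROP, training, rng)
    return math.tanh(sum(h * w for h, w in zip(thinned, weights)))


rng = random.Random(0)
hidden = [0.4, -0.2, 0.9, 0.1]   # hypothetical hidden activations
weights = [0.5, 1.0, -0.3, 0.8]  # hypothetical output weights

# Each training-mode call uses a different thinned subnetwork.
print([round(actor_output(hidden, weights, True, rng), 3) for _ in range(4)])
# Evaluation mode is deterministic.
print(round(actor_output(hidden, weights, False, rng), 3))
```

The spread of the training-mode outputs shows why too much dropout can hurt precise manipulation: the same features can map to visibly different actions from one forward pass to the next.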
The second addition concerns the critic for pixel observations. Instead of using the standard DrQ-style convolutional critic encoder, IBRL uses a shallow ViT-style critic encoder for $Q_\phi$. The pipeline is roughly:
$$\text{image} \;\to\; \text{overlapping patches} \;\to\; \text{one transformer layer} \;\to\; \text{fuse with action and proprioception} \;\to\; Q_{\phi_i}(s,a).$$
The motivation is subtle. A high-capacity visual model can be excellent for behavior cloning when trained offline on demonstrations, but online RL places different pressure on the representation. The critic must support bootstrapping, target networks, action-conditioned value prediction, and frequent updates on a nonstationary replay distribution. A shallow transformer-style encoder gives the critic enough spatial flexibility to process image patches while avoiding some of the instability and compute burden of a deeper visual backbone.
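As a rough illustration of the "overlapping patches" step, the following Python sketch slices a small 2-D image into patches whose stride is smaller than the patch size. The image size, patch size, and stride are hypothetical, and a real encoder would more likely produce patch embeddings with a strided convolution; this only shows how overlap changes the number of tokens.

```python
def extract_patches(image, patch, stride):
    """Return flattened patches from a 2-D image given as a list of rows.
    Patches overlap whenever stride < patch."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


# Hypothetical 8x8 single-channel image with distinct pixel values.
image = [[r * 8 + c for c in range(8)] for r in range(8)]

overlapping = extract_patches(image, patch=4, stride=2)
disjoint = extract_patches(image, patch=4, stride=4)
print(len(overlapping), "overlapping tokens vs", len(disjoint), "disjoint tokens")
```

Each flattened patch would then be projected to a token and passed through the single transformer layer before fusion with action and proprioception.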
This is where IBRL’s modularity becomes important. The imitation policy $\mu_\psi$ can use a strong behavior-cloning visual encoder trained on demonstration data. Meanwhile, the online RL actor $\pi_\theta$ and critic $Q_\phi$ can use architectures chosen for stable bootstrapped learning. There is no requirement that the IL proposer and RL components share weights or even use the same visual representation. In fact, forcing them to share a representation could entangle two very different learning problems:
Imitation learning benefits from supervised fitting to expert-like actions.
Reinforcement learning needs representations that support stable temporal-difference updates.
Hybrid IBRL benefits when these roles remain separable.
The ensemble choices also depend on the observation setting. For state-based inputs, the method can lean on a RED-Q style setup with an ensemble size such as
$$E = 5,$$
and a larger update-to-data ratio $G$. With pixel inputs, the system typically uses image augmentation and smaller ensembles, because visual critics are more expensive and more prone to representation drift. The design principle is not “make everything larger,” but rather allocate capacity where it improves stability without overwhelming online learning.
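A hedged sketch of what a RED-Q style target can look like is given below in Python. It assumes the target takes the minimum over a small random subset of the ensemble's next-state estimates; the subset size of 2, the value of $G$, the discount, and the Q-values are illustrative assumptions rather than settings confirmed by this section.

```python
import random

E = 5        # ensemble size mentioned in the text
SUBSET = 2   # assumption: number of critics in the random min-subset
G = 4        # hypothetical update-to-data ratio


def subset_min_target(reward, next_qs, gamma, subset_size, rng):
    """TD target that bootstraps from the minimum over a random subset
    of the ensemble's next-state Q estimates."""
    chosen = rng.sample(next_qs, subset_size)
    return reward + gamma * min(chosen)


rng = random.Random(0)
next_qs = [0.52, 0.47, 0.61, 0.50, 0.55]  # hypothetical Q_i(s', a') values
assert len(next_qs) == E

# With UTD ratio G, each environment step is followed by G critic updates,
# each drawing a fresh random subset.
targets = [subset_min_target(0.0, next_qs, 0.99, SUBSET, rng) for _ in range(G)]
print([round(t, 4) for t in targets])
```

Taking the minimum over a subset rather than the whole ensemble keeps the target pessimistic without being as conservative as a minimum over all $E$ critics.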
The visual summary below compactly organizes these additions around the unchanged IBRL loop. The actor-side change is isolated as dropout inside $\pi_\theta$, emphasizing that the hybrid action rule itself is not modified. The critic-side change is shown as a separate pixel pipeline: image patches pass through a shallow ViT encoder, are fused with action and proprioceptive information, and then feed multiple $Q_{\phi_i}(s,a)$ heads.
The bottom modularity strip is the key architectural lesson. The fixed IL proposer $\mu_\psi$ can remain a strong behavior-cloned visual policy, while the online RL components $\pi_\theta$ and $Q_\phi$ use shallower, more stable machinery. IBRL’s empirical strength comes partly from this restraint: it bootstraps from imitation without forcing imitation and RL to collapse into one shared architecture.

12. Experimental Design: Fair Baselines and Sparse Rewards

With the architecture now fixed—actor dropout, shallow ViT critic, shared normalization and augmentation choices—the next question is not whether IBRL has a clever implementation detail, but whether its use of demonstrations is experimentally separable from more familiar ways of mixing imitation and reinforcement learning. This is where the paper’s empirical design matters. In sparse-reward robotic RL, small differences in how demonstrations enter training can produce large differences in apparent performance, so the baselines need to be constructed carefully enough that the comparison is about the algorithmic idea, not an accidental advantage in data, architecture, or preprocessing.
The central difficulty is that all tasks use sparse binary rewards,
$$R:\mathcal{S}\times\mathcal{A}\to\{0,1\}.$$
That looks simple, but it is precisely what makes the setting hard. Most transitions provide no learning signal beyond “not yet.” In a manipulation task, the agent may need to reach, grasp, lift, align, and place before receiving a single positive reward. Off-policy algorithms such as TD3 can reuse data efficiently once rewarding trajectories exist in the replay buffer, but they are still vulnerable to the initial exploration bottleneck: if the policy almost never stumbles into success, bootstrapping has little useful value information to propagate.
Demonstrations are the obvious remedy, but the way they are used is subtle. A demonstration policy can help the agent reach meaningful parts of the state space, yet it can also constrain improvement if treated as something the RL policy must continue to imitate. This is the tension IBRL is designed around. It wants imitation to act as a proposal mechanism—a source of plausible actions—without turning the final RL policy into a permanently behavior-cloned controller.
That distinction motivates the baselines. A replay-buffer demonstration method, represented here by RLPD, injects demonstration transitions directly into off-policy learning: each training minibatch contains a fixed fraction of demonstration data, typically half of an $N$-sample batch. This gives the critic repeated access to expert-like transitions and can stabilize learning. But it does not directly ask, at action-selection time, “should I use the actor’s action or the demonstrator’s proposed action in this state?” Demonstrations are data, not live proposals.
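The fixed-fraction minibatch can be sketched in a few lines of Python. This is a simplified illustration, not RLPD's actual implementation: the buffer contents, the function name, and the choice to sample with replacement are assumptions.

```python
import random


def sample_mixed_batch(demo_buffer, online_buffer, batch_size,
                       demo_fraction=0.5, rng=random):
    """Draw a minibatch with a fixed fraction of demonstration transitions.
    Sampling is with replacement, which keeps the sketch simple."""
    n_demo = int(batch_size * demo_fraction)
    n_online = batch_size - n_demo
    batch = rng.choices(demo_buffer, k=n_demo) + rng.choices(online_buffer, k=n_online)
    rng.shuffle(batch)
    return batch


# Hypothetical transitions tagged by their source.
demos = [("demo", i) for i in range(20)]
online = [("online", i) for i in range(500)]

batch = sample_mixed_batch(demos, online, batch_size=8, rng=random.Random(0))
n_from_demos = sum(1 for source, _ in batch if source == "demo")
print(n_from_demos, "of", len(batch), "transitions come from demonstrations")
```

Note that nothing in this sampler touches action selection: the demonstrations shape the training batches, while the online actor alone decides what the robot does next.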
A behavior-cloning-regularized fine-tuning method, represented here by RFT, makes a different tradeoff. It first initializes the policy with BC, then continues RL while adding a supervised imitation penalty,
$$\alpha\lambda L_{\mathrm{BC}}.$$
This can prevent early collapse away from the demonstrator, which is useful when reward is sparse. But the same term can become a liability: if the demonstrator is imperfect, or if the optimal RL behavior differs from the demonstrations, the BC penalty may keep pulling the actor back toward suboptimal actions. In the hardest regimes, the question is not merely whether imitation helps exploration, but whether the agent can outgrow imitation.
IBRL occupies a more specific middle ground. A behavior cloning model $\mu_\psi$ is trained from demonstrations and then held fixed. Meanwhile, TD3 trains the actor $\pi_\theta$ and critic ensemble $Q_{\phi_i}$. At decision points and in TD targets, the algorithm compares candidate actions proposed by the learned actor and by $\mu_\psi$, using the critic to choose greedily between them. In other words, the demonstrator is not a permanent regularizer on the actor’s parameters; it is an alternative action source that the value function may accept or reject.
That is why the experimental controls are important. If IBRL used a stronger critic, better augmentation, different normalization, or a more favorable visual encoder, then gains could be misattributed to the hybrid IL/RL mechanism. The comparison is therefore designed to keep the online RL backbone and major architectural choices aligned wherever possible. The intended empirical question is narrow:
Does a fixed BC policy help more when used as an action proposer?
Does that avoid the brittleness of forcing the RL actor to imitate?
Can this still work when the demonstrator is imperfect and rewards are sparse?
The task selection probes this directly. The benchmark suite includes four Meta-World tasks and Robomimic Can and Square, all under sparse rewards. Meta-World provides controlled simulation coverage and enables comparison with an additional baseline, MoDem, where appropriate. Robomimic Can and especially Square are more demanding visual manipulation settings, where demonstrations are useful but not necessarily sufficient. The hardest tests push the central claim even further: Robomimic Square and the real-world Hang task combine imperfect imitation, difficult exploration, and sparse success signals.
The visual summary below condenses this experimental design into a table: the left column names the component or baseline, while the right column states exactly how it enters the comparison. The IBRL row is the key reference point—it separates BC as a fixed proposer from TD3 as the improving RL learner—while the RLPD and RFT rows clarify the two major alternative demonstration-integration strategies.
The amber callout at the bottom emphasizes the most diagnostic cases. If IBRL only helped when $\mu_\psi$ was already nearly perfect, the result would be much less interesting. The difficult Square and real-world Hang settings ask a stronger question: can imitation provide enough structure to bootstrap sparse-reward RL without becoming a cage that prevents the policy from improving beyond the demonstrations?

13. Simulation Results and Ablations: Where the Gains Come From

Having fixed the sparse-reward evaluation protocol, we can now ask the more interesting question: are the gains really coming from IBRL’s algorithmic choices, or are they an artifact of benchmark setup and baseline tuning? This is the point of the simulation study. Because all methods are evaluated under the same sparse success signal, with demonstrations available but no shaped reward crutches, differences in learning curves are much more informative. They tell us whether a method can turn a small amount of imitation knowledge into online improvement rather than merely replaying demonstrations or regularizing toward them.
The core empirical pattern is that IBRL improves sample efficiency not through a single trick, but through a consistent hybridization of imitation and reinforcement learning. The agent is not asked to discover successful behavior from scratch; it can act on proposals from both the learned RL actor and the imitation policy. But the critic is also trained under a matching assumption: its temporal-difference targets bootstrap through the same kind of hybrid controller that will be used in the future. This alignment matters. If exploration uses a hybrid policy but Bellman targets assume only the current RL actor, the critic may undervalue states from which the imitation branch would have recovered. Conversely, if the critic is optimistic about hybrid continuation but the behavior policy never actually uses it, the learned values become operationally misleading.
This is why the ablations are especially informative. Removing the actor proposal mainly damages early exploration. Sparse-reward robotics has a brutal cold-start problem: before the agent has seen enough successful transitions, gradients from the environment are rare and noisy. Demonstrations help populate the replay buffer, but they do not automatically make the online policy visit the same high-value regions. The actor proposal gives the learner a practical way to remain near the demonstrator’s support while still allowing RL actions to be tested and improved. Without it, the system more often drifts into unrewarding parts of the state-action space before the critic has learned enough to guide it back.
Removing the bootstrap proposal causes an even larger degradation, which is a useful clue about where the main benefit comes from. The issue is not only “which actions do we try now?” but also “what future behavior does the critic believe will be available?” In sparse tasks, a state can be valuable even if the current RL actor is not yet capable of completing the task from that state, provided the hybrid controller can complete it by leaning on imitation. The bootstrap proposal lets the target value reflect that future hybrid competence. Without it, TD learning becomes more conservative and less aligned with the actual policy improvement process: the critic evaluates future rollouts as if the imitation fallback were absent, even though the full IBRL agent is designed around precisely that fallback.
The third recurring ingredient is architectural stabilization, especially actor dropout. In demonstration-augmented RL, a subtle failure mode is over-specialization: the actor can latch onto narrow correlations from the demonstration distribution or become brittle as the critic changes. Dropout makes the policy update less deterministic in representation space, which tends to improve robustness and smooth the transition from imitation-supported behavior to autonomous RL improvement. The state-based Square result is a particularly sharp example: baselines that otherwise seem competitive fail without actor dropout, while IBRL continues to learn effectively. That suggests the method’s advantage is not merely visual representation quality; it is also about stabilizing the actor-critic dynamics.
The comparisons against strong baselines sharpen this story. RLPD benefits from demonstrations in replay, but its policy improvement is still largely governed by the learned critic and the actor it trains. It does not explicitly ask the critic to value the future hybrid controller. RFT, by contrast, can perform very well when the behavioral cloning regularizer is tuned appropriately, but that tuning is task-sensitive: too much regularization prevents improvement beyond the demonstrator, while too little loses the stabilizing imitation prior. IBRL’s result is important because it matches or exceeds these approaches without relying on a carefully tuned BC penalty such as $\alpha\lambda L_{\mathrm{BC}}$. The imitation component is not simply a loss term pulling the actor backward; it is embedded into action selection and value bootstrapping.
Across the simulated domains, the same hierarchy appears. On Meta-World, IBRL solves all tasks within roughly 40K interaction steps, outperforming RLPD and MoDem while remaining competitive with tuned RFT. On Robomimic Can and Square, IBRL again beats RLPD and RFT, with the separation especially large on Square. That larger gap is meaningful because Square is the kind of long-horizon, contact-rich manipulation task where sparse rewards make naive exploration particularly inefficient and where small compounding errors can destroy success. If a method’s benefit were only “better use of demonstrations in replay,” we would expect the advantage to be smaller on precisely these difficult settings. Instead, the advantage grows where hybrid exploration and hybrid bootstrapping should matter most.
There is also an architectural lesson in the critic encoder ablation. The shallow ViT critic works better for online RL than the commonly used DrQ-style critic, even though a stronger BC encoder may still be better for pure behavioral cloning. This is not contradictory. Pure BC rewards representation fidelity to expert actions under a fixed offline distribution. Online RL rewards representations that produce stable value estimates under shifting data, bootstrapped targets, and policy-induced distribution drift. A representation that is excellent for supervised imitation can still be poorly matched to the instability of actor-critic learning.
The visual below compactly organizes these findings as the paper wants us to interpret them: not as one headline curve, but as a set of mutually reinforcing evidence. The main performance plots establish that IBRL is stronger under the fair sparse-reward setup; the ablation plots then identify which mechanisms are responsible. The strongest drop comes from removing the bootstrap proposal, supporting the claim that TD target quality under the future hybrid controller is central. The actor proposal contributes earlier exploration, and actor dropout contributes stability.
Taken together, the simulation evidence argues that IBRL’s advantage is not just “more imitation” or “better RL.” It is the careful coupling of the two: imitation helps the agent reach meaningful parts of the task, and the critic is trained as if that hybrid competence will remain available in the future. That coupling is what makes the sparse reward informative early enough for online RL to become sample-efficient.

14. Real-World Results: Lift, Drawer, and Deformable Cloth Hang

The simulation ablations gave us a controlled view of why IBRL helps: the imitation policy is useful but imperfect, and the learned critic can increasingly decide when to trust it versus when to use an RL-improved action. The natural next question is whether that mechanism survives contact with the messier regime that motivated the method in the first place: physical robots, sparse success detectors, and distribution shift between demonstrations and online execution.
In the real-world experiments, the reward remains brutally sparse:
$$R:\mathcal{S}\times\mathcal{A}\to\{0,1\}.$$
That is, the robot typically receives no graded signal for “almost lifting,” “partially opening,” or “nearly hanging the cloth correctly.” It only receives success or failure. This matters because sparse rewards make online RL sample-inefficient even when the underlying behavior is not conceptually complicated. A robot can spend many trials visiting states that are locally informative to a human observer but indistinguishable to the reward function. Off-policy learning helps reuse data, but it does not magically create reward information where none was observed.
This is exactly where demonstrations are tempting. A small number of successful trajectories can place the agent near rewarding regions of the state-action space. In these experiments, the demonstration counts are deliberately modest: Lift uses $n=10$, while Drawer and Hang each use $n=30$. The important point is that the demonstrations are not treated as a complete solution. They are a prior over competent behavior, and IBRL’s bet is that this prior should remain available during online learning rather than being quickly overwritten or only passively stored in a replay buffer.
That distinction becomes sharper on real robots because failures are not merely random exploration errors. They often come from distribution shift: the camera view changes slightly, a drawer sticks differently, the cube starts at a different pose, or cloth deforms into a configuration not well represented in the demonstrations. A behavior cloning policy $\mu_\psi$ can be very strong on familiar states but brittle off the demonstration manifold. Conversely, a pure or mostly RL policy can in principle adapt, but it may waste many physical trials rediscovering the basic manipulation strategy.
IBRL’s hybrid policy is designed for this middle ground. The fixed imitation policy provides a reliable fallback proposal, while the actor learned by RL proposes alternatives. The critic then mediates between them through value estimates. Intuitively, the agent asks: does the learned actor appear better here, or should I use the demonstrator-like action? Because the critic is trained with TD bootstrapping on the growing online buffer $\mathcal{B}$, it can gradually learn about precisely those states where the demonstrations are insufficient.
This is different from two common demonstration-based baselines:
Replay-buffer demonstration methods can keep expert data available, but the policy may still drift toward poor actions if the critic does not consistently prefer the demonstration-like behavior in relevant states.
BC-regularized fine-tuning can prevent catastrophic drift, but the regularizer may also hold the policy too close to the demonstrator when adaptation is necessary.
IBRL is trying to avoid both extremes. It does not assume the demonstrator is always right, but it also does not force online RL to abandon the demonstrator before it has earned that freedom.
The real-world results are therefore most interesting not just as success percentages, but as evidence that this selection-and-bootstrapping mechanism remains useful under physical variability. IBRL reaches very high final evaluation rates on the rigid-object tasks: 100% on Lift, 95% on a harder Lift evaluation, 95% on Drawer, and 100% on the Drawer early-stop setting. These tasks already suggest that the method is not merely exploiting simulator regularities; it can use a small set of demonstrations to stabilize real online learning.
The deformable cloth Hang task is the more revealing stress test. Cloth manipulation is difficult because the state is high-dimensional, partially observed, and history-dependent: the same gripper motion can produce different outcomes depending on folds, tension, and contact geometry. Here the gap between methods is large: BC reaches 65%, RLPD 15%, RFT 35%, and IBRL 85%. Relative to RFT, the improvement is roughly
$$\frac{85\%}{35\%}\approx 2.4.$$
That ratio is not just a headline number. It reflects the practical advantage of keeping imitation as an actionable option during learning, rather than treating demonstrations only as initialization or static replay data.
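For completeness, the same arithmetic can be applied to every Hang baseline quoted above. The short Python snippet below uses only the four success rates reported in the text; the variable names are illustrative.

```python
# Final Hang success rates as quoted in the text.
hang_success = {"BC": 0.65, "RLPD": 0.15, "RFT": 0.35, "IBRL": 0.85}

# Ratio of IBRL's success rate to each baseline's success rate.
ratios = {name: hang_success["IBRL"] / rate
          for name, rate in hang_success.items() if name != "IBRL"}

for name, ratio in ratios.items():
    print(f"IBRL / {name}: {ratio:.2f}x")
```

These are ratios of final success rates only; they say nothing about how many online trials each method needed to get there.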
The visual below condenses these real-world findings into three pieces: the physical tasks and demonstration counts, the final IBRL evaluation rates across Lift, Drawer, and Hang, and the Hang-only comparison against the major baselines. Reading it from left to right, the key story is that the same sparse-reward setting is used across tasks, but the difficulty increases substantially as we move toward deformable cloth.
The bottom comparison is the most diagnostic part: Hang exposes the failure modes that are easier to miss on simpler tasks. BC can imitate when the world looks familiar, but cannot repair its own mistakes; RFT can adapt, but struggles to get enough successful sparse-reward signal; replay-based RL does not automatically solve the exploration problem. IBRL’s advantage is that it can exploit $\mu_\psi$ when the imitation action is still good, while using online TD learning over $\mathcal{B}$ to improve in the states where the demonstrations fall short.

15. Unifying Summary: Ways to Use Demonstrations in Online RL

After seeing the real-world lift, drawer, and cloth-hanging results, it is useful to step back from the benchmark details and ask a more structural question: what role are demonstrations actually playing in online reinforcement learning? Many methods “use demos,” but they use them in very different places in the learning loop. That distinction matters because sparse-reward robotic RL is not only hard because the agent lacks good initial data; it is hard because every subsequent choice—exploration, critic bootstrapping, actor updates, and representation learning—can drift away from the narrow set of behaviors that make reward discovery possible.
The simplest use of demonstrations is pure behavior cloning. We train a policy $\mu_\psi(a \mid s)$ on a demonstration dataset $\mathcal{D}$, then execute that policy directly. This is often a strong baseline in robotics because demonstrations encode contact-rich timing, object-relative geometry, and recovery behaviors that are difficult to rediscover from sparse rewards. But BC has a familiar weakness: it learns the expert’s conditional action distribution only on the demonstrated state manifold. Once small errors compound, the robot visits states where the supervised target is poorly defined. In other words, BC solves imitation under the demonstration distribution, not control under the induced closed-loop distribution.
Replay-buffer demonstration methods, such as RLPD-style approaches, take a different view. Demonstrations are treated as valuable off-policy data: initialize the replay buffer with $\mathcal{D}$, oversample demonstrations, and train an actor-critic algorithm using a mixture of demo and online transitions. This is appealing because it fits naturally into modern off-policy RL. The critic sees successful trajectories early, and the actor can in principle improve beyond the demonstrator. But the demonstrations remain static data. They help define value estimates, yet the online actor $\pi_\theta$ is still responsible for choosing actions during exploration:
$$a_t \sim \pi_\theta(s_t).$$
If $\pi_\theta$ is initially poor, the robot may still spend most of its online interaction in low-value regions, especially under sparse rewards. The demonstrations are in the buffer, but they are not necessarily present in the action selection rule that determines what data is collected next.
BC-regularized fine-tuning makes the imitation signal more active by adding a supervised penalty during RL. The actor is pretrained from demonstrations and then updated with a mixture of RL and BC objectives, often schematically resembling
$$L_{\mathrm{actor}} = L_{\mathrm{RL}} + \alpha \lambda L_{\mathrm{BC}}.$$
This can stabilize early learning and prevent the actor from immediately forgetting the demonstrator. The subtle failure mode is that the BC term can become either too weak or too strong. If too weak, the actor drifts into unproductive exploration. If too strong, the actor is over-constrained and cannot exploit reward feedback to improve beyond the demonstrations, adapt to environment variation, or choose slightly non-demonstrated actions that the critic has learned are better. The method introduces extra knobs—regularization weights, decay schedules, demo sampling ratios—that often matter as much as the algorithmic idea.
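The too-weak versus too-strong tension can be seen in a one-dimensional toy problem. The Python sketch below is an assumption-laden illustration, not the RFT loss itself: it replaces $L_{\mathrm{RL}}$ with a quadratic around a hypothetical critic-preferred action and $L_{\mathrm{BC}}$ with a quadratic around a hypothetical demonstrated action, and treats the product $\alpha\lambda$ as a single weight.

```python
def regularized_optimum(a_critic, a_demo, bc_weight):
    """Minimizer of (a - a_critic)**2 + bc_weight * (a - a_demo)**2.
    Setting the derivative to zero gives a weighted average of the two."""
    return (a_critic + bc_weight * a_demo) / (1.0 + bc_weight)


a_critic = 0.8  # hypothetical action the critic currently prefers
a_demo = 0.2    # hypothetical action the demonstrator would take

for bc_weight in (0.01, 1.0, 100.0):
    a_star = regularized_optimum(a_critic, a_demo, bc_weight)
    print(f"alpha*lambda={bc_weight:6.2f}  optimal action={a_star:.3f}")
```

With a small weight the actor essentially follows the critic, including its errors. With a large weight the actor is pinned near the demonstrated action even when the critic has found something better. Everything in between depends on a weight that has to be tuned per task.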
IBRL reframes the role of imitation more sharply. Instead of treating the imitation policy only as data or as a penalty, it treats it as a fixed action proposer. At each state, the agent considers a small candidate set $C_{\theta,\psi}(s)$, typically containing actions proposed by the learned RL actor $\pi_\theta$ and the imitation policy $\mu_\psi$. The critic then chooses among those proposals. In the greedy version, the online action is
$$a_t = \operatorname*{argmax}_{a\in C_{\theta,\psi}(s_t)} Q_{\phi'}(s_t,a).$$
This is the central conceptual move. The imitation policy does not have to be perfect, and the RL actor does not have to be trusted early. Instead, the critic arbitrates between them. Demonstrations become a persistent source of plausible actions throughout online learning, while RL remains free to discover improvements.
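A minimal Python sketch of the greedy rule follows. The scalar state, the two toy policies, and the toy critic are hypothetical stand-ins for $\pi_\theta$, $\mu_\psi$, and $Q_{\phi'}$, and exploration noise on the RL action is omitted for brevity.

```python
def greedy_ibrl_action(state, rl_actor, il_policy, critic):
    """Build the two-element candidate set and return the (source, action)
    pair with the highest critic value."""
    candidates = [("RL", rl_actor(state)), ("IL", il_policy(state))]
    return max(candidates, key=lambda c: critic(state, c[1]))


# Toy 1-D problem: the best action equals the state.
def toy_critic(state, action):
    return -(action - state) ** 2


def toy_il_policy(state):
    return 0.8 * state  # imperfect imitation: systematically undershoots


def toy_rl_actor(state):
    return 0.2  # barely trained actor: outputs a constant


for s in (1.0, 0.2):
    source, action = greedy_ibrl_action(s, toy_rl_actor, toy_il_policy, toy_critic)
    print(f"state={s:.1f} -> use {source} action {action:.2f}")
```

In this toy setup the imitation proposal wins in the first state and the RL proposal wins in the second, which is the per-state arbitration the text describes.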
The second part of the idea is equally important: IBRL uses the same hybrid proposal mechanism in the TD target. Rather than bootstrapping only from the target actor $\pi_{\theta'}$, the target considers the candidate actions proposed by both the target actor and the fixed imitation policy:
$$y = r_t + \gamma \max_{a'\in C_{\theta',\psi}(s_{t+1})} Q_{\phi'}(s_{t+1},a').$$
This aligns exploration and bootstrapping. The policy used to collect data and the policy implicit in the Bellman target are no longer mismatched in the usual way. If the demonstrator proposes a good recovery action at the next state, the critic can bootstrap through it; if the RL actor proposes a better action, the critic can prefer that instead. This is why IBRL is more than “BC plus RL.” It changes the effective policy class used by the critic: the agent behaves like a greedy hybrid over imitation and reinforcement-learning proposals.
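The effect on the target value can be sketched in Python by computing the hybrid target next to an actor-only target. The toy critic, the toy policies, and the discount value are hypothetical; terminal-state handling and ensembles are left out so the code mirrors the equation above.

```python
def hybrid_td_target(reward, next_state, target_actor, il_policy,
                     target_critic, gamma):
    """y = r + gamma * max over {target-actor action, IL action} of Q'(s', a')."""
    candidates = [target_actor(next_state), il_policy(next_state)]
    return reward + gamma * max(target_critic(next_state, a) for a in candidates)


def actor_only_td_target(reward, next_state, target_actor, target_critic, gamma):
    """Standard target that bootstraps through the target actor alone."""
    return reward + gamma * target_critic(next_state, target_actor(next_state))


# Toy stand-ins: value is highest when the action matches the state.
def toy_target_critic(state, action):
    return 1.0 - (action - state) ** 2


def toy_il_policy(state):
    return 0.8 * state  # imperfect but reasonable imitation action


def toy_target_actor(state):
    return 0.2  # weak actor early in training


y_hybrid = hybrid_td_target(0.0, 1.0, toy_target_actor, toy_il_policy,
                            toy_target_critic, gamma=0.99)
y_actor = actor_only_td_target(0.0, 1.0, toy_target_actor,
                               toy_target_critic, gamma=0.99)
print(f"hybrid target={y_hybrid:.4f}  actor-only target={y_actor:.4f}")
```

Because the hybrid target takes a maximum over a superset of the actor-only candidates, it can never be lower than the actor-only target under the same critic. Early in training, when the actor is weak, the gap is the value the imitation fallback contributes.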
There are still assumptions and limitations. The critic must be reliable enough to rank candidate actions, so Q-ensembles, target networks, conservative update choices, and architectural choices matter. A greedy max over a small candidate set can also behave like action masking: if an important action is not proposed by either $\mu_\psi$ or $\pi_\theta$, it cannot be selected. This motivates softer variants of IBRL, where actions are sampled from a critic-weighted proposal distribution $p_Q(a \mid s)$ rather than chosen by a hard argmax. Soft selection introduces its own knob, usually a temperature $\beta$, but it can reduce brittleness when Q-values are uncertain or when exploration requires occasionally trying lower-ranked proposals.
A useful way to summarize the landscape is to ask four questions for each method:
Where do demonstrations enter: data, regularization, or proposal generation?
Who chooses online actions: the BC policy, the RL actor, or the critic over a hybrid candidate set?
What policy is used inside the TD target?
What is the main new failure mode introduced by that choice?
From this perspective, IBRL’s distinctive contribution is compact: use imitation as an action proposer, and let the critic decide both what to execute and what to bootstrap from. This keeps the imitation and RL components modular. The BC policy can use architectures well suited to supervised imitation; the actor and critic can use architectures better suited to online RL; and in the real-world systems, additional design choices such as dropout, visual encoders, and critic ensembles can be layered on top without changing the core principle.
The visual below consolidates these distinctions into a single comparison. The rows are not merely a list of algorithms; they separate different philosophies for using $\mathcal{D}$. BC uses demonstrations as the entire policy. RLPD uses them as replay data. RFT uses them as a constraint on the actor. IBRL uses them as a standing proposal mechanism, and crucially applies that same mechanism to both action selection and temporal-difference bootstrapping.
The highlighted IBRL rows should be read as the punchline of the lecture. Greedy IBRL gives the cleanest mathematical form through the argmax action rule and hybrid Bellman target, while soft IBRL relaxes the hard decision into critic-weighted sampling. The final architecture-enhanced version reminds us that the real robotic performance also depends on implementation choices—modular networks, vision backbones, ensembles, and resets—but those are supporting mechanisms around the central idea: demonstrations are most powerful when they remain available as actions the critic can choose, not just examples the learner once saw.