Imitation Bootstrapped Reinforcement Learning (IBRL): Using Demonstrations in Exploration and Bootstrapping - FeynmanWiki

MACHINE LEARNING, REINFORCEMENT LEARNING - 45 MIN READ

1. Why sparse-reward robotics is hard

A useful way to think about sparse-reward robotics is that the task is conceptually simple but statistically brutal. In the idealized MDP view, we have a state space $\mathcal{S}$, a continuous action space $\mathcal{A} = [-1,1]^d$, and dynamics $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$. The challenge is not that the formalism is complicated; the challenge is that the reward signal is almost absent. When $R(s,a) \in \{0,1\}$ and the $1$ appears only at true success, most trajectories are indistinguishable from one another for long stretches of time.
That has a very specific consequence for reinforcement learning. If the policy starts out randomly, then the agent explores a huge continuous action space with no meaningful guidance. In discrete domains, random exploration can sometimes stumble into informative events often enough to get learning started. In robotics, however, the geometry is much harsher: each action is a vector in $[-1,1]^d$, and the “correct” combination of forces, torques, or end-effector motions may occupy an extremely small region of that space. The probability of accidentally producing a successful trajectory can be tiny.
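To get a feel for how tiny that probability can be, here is a rough back-of-the-envelope sketch. It rests on a deliberately crude toy model that is not taken from the paper: success is assumed to require every one of `horizon` consecutive actions to land in a box covering a fraction `frac` of each of the `d` action dimensions, with actions drawn uniformly from $[-1,1]^d$. The function name and all numbers are hypothetical.

```python
import math

def log10_success_probability(d, frac, horizon):
    """log10 of the chance that uniform random actions succeed at every step.

    Toy assumption: one step succeeds with probability frac**d, and steps are
    independent, so the whole episode succeeds with (frac**d)**horizon.
    Working in log space avoids floating-point underflow for long horizons.
    """
    if not 0.0 < frac <= 1.0:
        raise ValueError("frac must be in (0, 1]")
    return d * horizon * math.log10(frac)

# Half of each axis counts as "good", 50-step episode, growing action dimension.
table = {d: log10_success_probability(d, frac=0.5, horizon=50) for d in (1, 4, 7)}
```

Under these assumptions the success probability is on the order of $10^{-15}$ for $d=1$ and $10^{-105}$ for $d=7$. Real tasks are far more forgiving than this model, but the exponential dependence on both dimension and horizon is the point.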
When success is that rare, the critic has almost nothing to learn from. Temporal-difference methods need nontrivial targets, but if nearly every rollout receives zero reward, then the estimated returns collapse toward zero as well. Formally, the bootstrapping signal becomes uninformative: the agent cannot reliably distinguish a slightly better trajectory from a slightly worse one because both look like failure. This is one of the most frustrating failure modes in sparse-reward RL: the algorithm is working as designed, but the data distribution contains almost no evidence of progress.
The problem is amplified by the fact that robotics is a regime where sample efficiency is the binding constraint. We usually cannot depend on millions of parallel simulated episodes to brute-force our way into reward. Real robots are expensive, slow, and brittle; every rollout consumes time and may risk hardware wear. So even if random exploration is theoretically sufficient in the limit, it is often practically useless. The question is not whether learning could happen eventually, but whether it can happen before the robot runs out of data or patience.
There is also a subtler point hiding underneath the sparse-reward story: the transition dynamics may be deterministic, but determinism alone does not make exploration easy. A deterministic $\mathcal{T}$ means the next state is fully determined by the current state and action, yet that does not tell us which actions matter. Without a reward gradient, the policy has no direct signal pointing toward the narrow subset of state-action pairs that eventually lead to success. In other words, the environment is predictable, but not legible.
This is exactly the regime where demonstrations become useful. A demonstration can inject structure into exploration by revealing that successful trajectories exist at all, and by placing the agent in states that random rollouts almost never reach. But the important point is that demonstrations are not a replacement for RL here; they are a way to start RL when reward alone is too sparse to do the job. The later IBRL machinery builds on this idea, using demonstrations to guide both exploration and bootstrapping rather than treating them as a separate imitation-only solution.
The visual below condenses this failure mode into a compact picture. The left side captures the robotics intuition: many random trajectories wander around the workspace and die out with $\mathrm{reward} = 0$, while only a rare successful path reaches the goal and earns $\mathrm{reward} = 1$. The right side packages the formal setup ($\mathcal{S}$, $\mathcal{A}=[-1,1]^d$, $\mathcal{T}$, and $R(s,a)\in\{0,1\}$) so you can see why the learning problem is so brittle even before we introduce demonstrations.
Taken together, the diagram is meant to make one point feel inevitable: in sparse-reward robotics, the issue is not model capacity or optimization tricks, but data availability. If success is rare enough that random rollouts almost never touch it, then the first challenge is not improving the policy—it is simply producing any informative experience at all.

2. Why demonstrations help, and why they are still not enough

Once we admit that sparse-reward robotics is essentially a needle-in-a-haystack problem, demonstrations become the most obvious source of structure. A small set of expert trajectories $\mathcal{D}$ gives us state-action pairs $\xi = \{(s_0,a_0),\dots,(s_T,a_T)\}$ that reveal which parts of the state space matter and which actions are at least plausible. In that sense, imitation learning is not merely a warm start; it is a way to inject search bias into a problem whose native reward signal is otherwise too weak to shape behavior.
The standard first step is behavior cloning. We fit an imitation policy $\mu_\psi(a\mid s)$ by maximizing the likelihood of the demonstrated actions, equivalently minimizing
$$\mathcal{L}(\psi) = -\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log \mu_\psi(a\mid s)\big].$$
If $\mu_\psi$ is a Gaussian policy with fixed isotropic covariance, this reduces to a familiar regression objective: the deterministic mean output $\mu_\psi(s)$ is pushed toward the expert action. That observation is important because it clarifies what BC does and does not provide. It learns a local approximation of the expert’s action manifold, but it does not itself solve exploration, value estimation, or compounding error.
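The reduction to regression can be written out in one line. As an illustration, assume the policy is $\mu_\psi(a\mid s) = \mathcal{N}\bigl(a;\, \mu_\psi(s),\, \sigma^2 I\bigr)$ with a fixed standard deviation $\sigma$ and action dimension $d$ (the symbol $\sigma$ for the policy's spread is introduced here and is not part of the notation above). Then

$$-\log \mu_\psi(a\mid s) = \frac{1}{2\sigma^2}\,\bigl\|a - \mu_\psi(s)\bigr\|_2^2 + \frac{d}{2}\log\bigl(2\pi\sigma^2\bigr).$$

The second term does not depend on $\psi$, and the factor $1/(2\sigma^2)$ is a positive constant, so minimizing the negative log-likelihood over $\psi$ has the same minimizer as minimizing the mean squared error between $\mu_\psi(s)$ and the demonstrated action.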
A naive way to use these demonstrations in RL is simply to oversample them in the replay buffer $\mathcal{B}$. This helps because the critic and actor see more informative transitions, which can stabilize optimization when the environment reward is sparse or delayed. But the benefit is limited: replay is fundamentally retrospective. It reuses what the expert already did, rather than leveraging the imitation policy as a mechanism for generating better candidate actions in unfamiliar states. Put differently, oversampling changes the training distribution, not the interaction policy.
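One common way to implement oversampling is to reserve a fixed share of every training batch for demonstration transitions. The sketch below assumes that scheme; the function name, the `demo_fraction` parameter, and the tuple-based transition format are all hypothetical and chosen only for illustration.

```python
import random

def sample_mixed_batch(demos, online, batch_size, demo_fraction, rng):
    """Draw a training batch with a fixed share of demonstration transitions.

    Assumed scheme: demo_fraction of each batch comes from `demos` no matter
    how small that dataset is compared with the online experience.
    """
    n_demo = round(batch_size * demo_fraction)
    batch = [rng.choice(demos) for _ in range(n_demo)]
    batch += [rng.choice(online) for _ in range(batch_size - n_demo)]
    return batch

# Placeholder transitions tagged by origin; real ones would be (s, a, r, s') tuples.
demos = [("demo", i) for i in range(200)]
online = [("online", i) for i in range(100_000)]
batch = sample_mixed_batch(demos, online, batch_size=256, demo_fraction=0.5,
                           rng=random.Random(0))
```

With uniform sampling over both stores, demonstrations would make up roughly 0.2% of each batch in this example; the fixed share raises that to 50%. Note that nothing here touches the action the robot executes, which is the limitation the paragraph above points out.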
That distinction is crucial in robotics. If the policy is stuck in regions of the state space it already knows, then more replayed demonstrations can improve fit without improving coverage. The robot still has to discover how to deviate from the expert just enough to adapt to its own dynamics, stochasticity, and task-specific constraints. In the most optimistic case, demo replay gives the learner a safer gradient signal; in the pessimistic case, it merely makes the network better at memorizing expert behavior without meaningfully improving exploration.
A second common baseline is to keep training with reinforcement learning while adding a behavior cloning penalty, typically of the form $\alpha\lambda\mathcal{L}_{\mathrm{BC}}$. Here the policy is pulled toward the demonstrations while still being optimized for return. This is appealing because it seems to combine imitation and RL in a single objective. However, it introduces a delicate tradeoff: if the BC term is too weak, it does little to prevent drift and instability; if it is too strong, the policy becomes overly conservative and may never surpass the demonstrator. The optimization landscape becomes sensitive to $\alpha$ and $\lambda$, and that sensitivity can dominate the practical outcome.
There is also a deeper architectural issue. In many such setups, the imitation policy $\mu_\psi$ and the RL policy $\pi_\theta$ are tied too tightly, either by sharing most of their parameters or by forcing the same network to serve both roles. That coupling can be convenient, but it blurs two distinct functions: proposal and optimization. The expert model is best thought of as a source of candidate actions, while the RL policy should be free to reshape behavior according to return. If these roles are entangled, the learner may inherit the demonstrator’s bias without gaining a separate channel for exploration.
So the right lesson from demonstrations is subtle. They are undeniably helpful because they stabilize learning and reduce the effective search space, but standard replay or BC-regularized fine-tuning still treats them as static supervision. What is missing is a way to ask the imitation policy to actively propose actions during RL, instead of only serving as a dataset or a penalty. That gap is exactly what the visual below is meant to compactly summarize: the left side collects the two conventional ways demonstrations are injected, and the right side makes their limitation explicit. The contrast is less about whether demos help—they do—and more about how much of their structure survives once we move from imitation to exploration.
Seen in that light, the comparison table is not just a bookkeeping device. It distills the paper’s setup into a simple dichotomy: demonstrations can be reused as data, or they can regularize learning, but neither baseline fully turns them into an exploration tool. That is the conceptual stepping stone to IBRL’s core idea, where the imitation policy is elevated from a passive teacher into an active source of bootstrapped proposals.

3. Prior uses of reference policies in RL

Building on the demonstration policy from the previous section, it is useful to separate where a reference policy can help in an RL pipeline from what it cannot fix. In sparse-reward robotics, these distinctions matter because the failure mode is usually not a single bad update; it is a compounding absence of informative experience. If the agent never reaches a rewarding state, then the critic never sees a useful learning signal, the actor never gets a meaningful gradient, and training collapses into drifting around the initial policy class.
Historically, prior methods have used demonstrations or a reference policy in one of three partially overlapping ways. The first is exploration-only: use the demo policy to propose better actions during data collection, but once trajectories are in the buffer, the RL learner updates as usual from its own policy and its own targets. This can improve early visitation of useful states, but it leaves a mismatch: the policy that generated good behavior is not necessarily the policy that the critic is bootstrapping from later.
The second is regularization-only. Here the reference policy influences learning indirectly, typically through a behavior cloning or policy proximity penalty, so that the actor stays near demonstrated behavior:
$$\mathcal{L}_{\text{actor}} = \mathcal{L}_{\text{RL}} + \lambda \,\mathcal{L}_{\text{BC}}.$$
This can stabilize training, especially when rewards are noisy or sparse, but it does not fundamentally change the data distribution the critic sees. If exploration is still poor, the algorithm can remain trapped in a narrow region of state space, merely with a better-behaved policy inside that region.
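A one-dimensional toy problem makes the effect of the penalty weight visible. Assume, purely for illustration, a single state in which the RL term is $(a - a_{\text{best}})^2$ and the BC term is $(a - a_{\text{demo}})^2$; both quadratics and the names `a_best` and `a_demo` are hypothetical stand-ins, not quantities from the paper. Setting the derivative of the combined loss to zero gives the minimizer in closed form.

```python
def regularized_optimum(a_best, a_demo, lam):
    """Minimizer over a of (a - a_best)**2 + lam * (a - a_demo)**2.

    Derivative: 2 (a - a_best) + 2 lam (a - a_demo) = 0, which gives
    a = (a_best + lam * a_demo) / (1 + lam).
    """
    return (a_best + lam * a_demo) / (1.0 + lam)

# The return-optimal action is 1.0; the demonstrator's action is 0.0.
optima = {lam: regularized_optimum(1.0, 0.0, lam) for lam in (0.0, 1.0, 9.0)}
```

In this toy setting the learned action is 1.0 with no penalty, 0.5 with `lam = 1`, and 0.1 with `lam = 9`: the stronger the penalty, the closer the policy stays to the demonstrator and the further it stays from the action the RL term prefers.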
A third family is offline reuse or oversampling. Demonstrations are inserted into replay more frequently than ordinary experience, which helps keep successful trajectories visible to the learner. Yet this is still passive: the reference policy is not actively generating new actions, and the algorithm is not using the demo as a live decision rule. In robotics, that distinction is important because a static dataset cannot adapt to the exact state distribution induced by the current controller, contact dynamics, or initialization noise.
The key idea behind IBRL is to stop treating the reference policy as a one-role tool. Instead, the imitation policy $\mu_\psi$ is used both to propose actions during interaction and to participate in bootstrapping. That dual use matters because sparse-reward RL needs help in two places at once:
- Better exploration: get the agent into states where reward is possible.
- Better target construction: make the critic’s backups less dependent on fragile self-generated action choices.
This is why the paper frames the method as a comparison between two candidates at each state:
$$a_{\mathrm{IL}} = \mu_\psi(s_t), \qquad a_{\mathrm{RL}} = \pi_\theta(s_t) + \epsilon,$$
and then selects the better one according to the critic:
$$a^* = \operatorname*{argmax}_{a \in \{a_{\mathrm{IL}}, a_{\mathrm{RL}}\}} Q_\phi(s_t,a).$$
Conceptually, this is a very small action set, but it changes the learning dynamics substantially. The critic is no longer asked to search over the entire action space blindly; it only has to decide whether the imitation proposal or the current actor’s proposal looks more valuable. That restriction is useful precisely because the action space in robotics is continuous and high-dimensional, where exhaustive exploration is impossible.
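The selection rule is small enough to sketch directly. The version below treats the critic as an arbitrary callable `q_fn(state, action)`; that interface, the toy critic, and the tie-breaking toward the imitation action are all assumptions made for illustration rather than details from the paper.

```python
def select_action(q_fn, state, a_il, a_rl):
    """Return whichever of the two candidates the critic scores higher.

    Ties go to the imitation action; that choice is an assumption of this
    sketch, not something the text specifies.
    """
    return a_il if q_fn(state, a_il) >= q_fn(state, a_rl) else a_rl

def toy_q(state, action):
    # Hypothetical critic: the state encodes a goal offset, and actions that
    # move closer to that offset receive a higher value.
    return -sum((a - s) ** 2 for a, s in zip(action, state))

state = [1.0, 0.0]
early = select_action(toy_q, state, a_il=[0.9, 0.0], a_rl=[0.2, 0.2])
later = select_action(toy_q, state, a_il=[0.0, 0.0], a_rl=[0.8, 0.1])
```

In the first call the imitation proposal is closer to the goal and wins; in the second the RL proposal wins. The critic is only ever evaluated twice per step, regardless of the action dimension.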
There is also an important subtlety here: bootstrapping is only as good as the target action. If the backup action is drawn from a weak or poorly explored policy, the critic’s estimate can become overly pessimistic or overly optimistic depending on function approximation error and reward sparsity. IBRL reduces this risk by letting the reference policy influence the backup itself. In effect, demonstrations are not just a source of labels for the actor; they shape the Bellman targets that drive value learning.
The practical consequence is that reference policies are most effective when they are used in a closed loop:
1. they generate better trajectories,
2. those trajectories improve the critic’s estimates,
3. the improved critic helps choose better actions,
4. which in turn improves later exploration.
That loop is exactly what many earlier methods only partially realize. Some methods help with step 1 but not 2; others help with 2 but not 1. IBRL argues that the best sample efficiency comes from doing both.
A useful way to remember the comparison is that prior methods usually treat the demonstration as either data or regularization, while IBRL treats it as an active decision component. That is the conceptual bridge to the visual summary below: the left side collects the older one-sided uses of reference policies, while the right side shows IBRL’s two-stage role for $\mu_\psi$: first as a proposal during interaction, then as a candidate inside bootstrapped targets. The compact equations at the bottom reinforce the central mechanism: the critic does not blindly trust either source, but explicitly compares $a_{\mathrm{IL}}$ and $a_{\mathrm{RL}}$ and keeps whichever looks better under $Q_\phi$.
Seen this way, the diagram is not just a taxonomy of prior work. It is the argument in miniature: reference policies help most when they influence both exploration and bootstrapping, because sparse-reward robotics needs help generating experience and help learning from that experience.

4. RL and IL notation for the rest of the lecture

Building on the earlier discussion of reference policies, we now freeze the notation that will carry the rest of IBRL. This may look like bookkeeping, but it is actually an important modeling choice: once the symbols are fixed, we can cleanly separate what comes from reinforcement learning and what comes from imitation learning, which is exactly what the method later exploits.
The RL side is a standard deterministic actor-critic setup in the style of TD3 / DDPG. The actor $\pi_\theta$ maps a state to an action, and in deterministic form we write the output simply as $\pi_\theta(s)$. To reduce estimation bias and stabilize targets, the algorithm also maintains target networks $\pi_{\theta'}$ and $Q_{\phi'}$, which are slow-moving copies of the online actor and critic. The replay buffer $\mathcal{B}$ stores transitions $(s_t,a_t,r_t,s_{t+1})$, letting updates reuse past experience rather than relying only on the most recent trajectory.
From those objects, the one-step bootstrapped target is
$$y = r_t + \gamma Q_{\phi'}\bigl(s_{t+1},\pi_{\theta'}(s_{t+1})\bigr).$$
The intuition is classical: the observed reward $r_t$ contributes the immediate signal, and the second term asks the target critic to estimate the value of the next state under the target actor. The discount $\gamma$ determines how much the learner trusts that future value. The crucial assumption here is that the target networks are stable enough that this target is not wildly moving from one gradient step to the next; without that slow target mechanism, bootstrapping can become numerically brittle.
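As a minimal sketch of these two pieces, the code below computes the one-step target and shows one common way to keep target networks slow-moving, namely Polyak averaging with a small rate `tau`. The `done` flag, the `tau` parameter, and the representation of parameters as a flat list of floats are assumptions added for illustration; the text above only says that the targets are slow-moving copies.

```python
def td_target(reward, next_state, gamma, target_actor, target_critic, done=False):
    """y = r + gamma * Q_target(s', pi_target(s')).

    The `done` flag (no bootstrapping past a terminal state) is a common
    implementation detail assumed here, not something stated in the text.
    """
    if done:
        return reward
    next_action = target_actor(next_state)
    return reward + gamma * target_critic(next_state, next_action)

def polyak_update(target_params, online_params, tau):
    """Move every target parameter a fraction tau toward its online value."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# Stand-in networks: a constant actor and a constant critic.
y = td_target(0.0, [0.0], 0.99, lambda s: [0.5], lambda s, a: 2.0)
new_target = polyak_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

With a small `tau`, the target parameters lag the online ones, which is what keeps $y$ from shifting abruptly between gradient steps.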
The critic then fits itself to this target by minimizing a squared temporal-difference error:
$$\mathcal{L}(\phi) = \bigl[y - Q_\phi(s_t,a_t)\bigr]^2.$$
This is not merely regression for its own sake. The critic is trying to approximate the action-value function induced by the current policy and data distribution, and that estimate is the backbone of policy improvement. If the critic is biased or overestimates values, the actor will be pulled toward actions that only look good under a flawed value model. In sparse-reward robot learning, that failure mode is especially damaging because there are few positive signals to correct the critic’s mistakes.
The actor update is correspondingly simple:
$$\mathcal{L}(\theta) = -Q_\phi\bigl(s,\pi_\theta(s)\bigr).$$
This says: choose actions that the critic predicts will have high value. One subtle but important point is that the actor is not trained to imitate the replay buffer directly. Instead, it is optimized through the critic, so the actor inherits whatever structure the critic has learned about long-horizon return. That distinction will matter later when demonstrations are used either as a source of proposals or as a bootstrap target.
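To see what "optimized through the critic" means mechanically, consider a scalar toy case. The sketch assumes a linear actor $\pi_\theta(s) = \theta s$ and a hand-written critic $Q(s,a) = -(a - 2s)^2$ whose derivative is known in closed form; both are hypothetical and chosen only so the chain rule can be written without an autodiff library.

```python
def actor_loss_grad(theta, state, dq_da):
    """Gradient of -Q(s, pi_theta(s)) w.r.t. theta for pi_theta(s) = theta * s.

    Chain rule: dL/dtheta = -(dQ/da) * (da/dtheta), and da/dtheta = s.
    """
    action = theta * state
    return -dq_da(state, action) * state

def toy_dq_da(state, action):
    # Hypothetical critic Q(s, a) = -(a - 2 s)**2, hence dQ/da = -2 (a - 2 s).
    return -2.0 * (action - 2.0 * state)

theta = 0.0
for _ in range(200):
    # Plain gradient descent on the actor loss with a fixed step size.
    theta -= 0.1 * actor_loss_grad(theta, state=1.0, dq_da=toy_dq_da)
```

The actor converges to $\theta = 2$, the action this critic rates highest, and no stored action from the replay buffer appears anywhere in the update. If the critic were wrong about where its maximum lies, the actor would follow it there just as faithfully.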
On the imitation side, IBRL introduces a separate policy $\mu_\psi(a\mid s)$. This policy is trained only on the demonstration dataset $\mathcal{D}$, and it is never updated with RL gradients. That separation is deliberate: the imitation model should preserve the demonstrator’s behavior as a clean prior, not be distorted by noisy reward optimization. In practice, the behavioral cloning objective is
$$\mathcal{L}(\psi) = -\mathbb{E}_{(s,a)\sim\mathcal{D}}\bigl[\log \mu_\psi(a\mid s)\bigr] \approx \mathbb{E}_{(s,a)\sim\mathcal{D}}\|\mu_\psi(s)-a\|_2^2.$$
For many continuous-control implementations, the Gaussian log-likelihood and the mean-squared error view are effectively two ways of expressing the same pressure: put probability mass, or prediction mass, on the demonstrated action. The approximation is useful because it highlights the geometry of cloning: $\mu_\psi$ is simply being pulled toward the expert action in action space.
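The mean-squared-error view can be made concrete with the smallest possible imitation policy. The sketch below assumes a scalar linear policy $\mu_w(s) = w s$ fitted by closed-form least squares; the function names and the three demonstration pairs are made up for illustration, and a real implementation would fit a neural network by gradient descent instead.

```python
def fit_linear_bc(demos):
    """Least-squares slope w minimizing sum of (w * s - a)**2 over pairs (s, a)."""
    num = sum(s * a for s, a in demos)
    den = sum(s * s for s, _ in demos)
    return num / den

def bc_mse(w, demos):
    """Mean squared error between the policy output w * s and the demo action."""
    return sum((w * s - a) ** 2 for s, a in demos) / len(demos)

# Hypothetical demonstrations generated by an expert that acts as a = 0.5 * s.
demos = [(1.0, 0.5), (2.0, 1.0), (-1.0, -0.5)]
w = fit_linear_bc(demos)
```

Here the fit recovers `w = 0.5` with zero error. Once fitted, the parameter would be held fixed for the rest of training, mirroring the statement above that the imitation policy never receives RL gradients.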
The reason to freeze this notation now is that IBRL repeatedly combines these pieces in different ways, and later variants only make sense if we keep track of which network is being used for value estimation, which is being used for policy improvement, and which is the pure imitation prior. In particular, the method will later exploit the fact that a learned demonstration policy can provide candidate actions even when the RL policy itself has not yet discovered anything useful.
The visual below is therefore not meant to introduce new ideas so much as compress the entire setup into one glance. The TD target, critic loss, actor loss, and BC loss form the core computational loop; the notation map on the side reminds us which symbols belong to RL, which belong to IL, and which are just slowly updated targets or frozen buffers. If this feels almost too elementary, that is precisely the point: the rest of the lecture can now build on a stable vocabulary instead of repeatedly re-explaining the machinery.

5. IBRL at a glance

What makes IBRL interesting is that it does not try to force demonstrations and reinforcement learning into a single blended policy from the start. Instead, it keeps the imitation policy $\mu_\psi$ as a separate, frozen source of competence and lets the RL system decide, state by state, whether to trust imitation or its own exploratory proposal. That separation matters in sparse-reward robotics, where the earliest stages of learning are dominated not by fine policy improvement but by the much harsher problem of finding any reward at all.
The core difficulty is easy to state but hard to solve: if the reward is sparse, then a randomly initialized policy $\pi_\theta$ is unlikely to stumble into rewarding states often enough to generate a useful gradient. In practice, this creates a dead zone where the actor is too weak to explore meaningfully and the critic is too weak to evaluate good long-horizon consequences. Demonstrations help, but naive ways of using them can also be fragile:
- Behavior cloning alone can imitate the demonstrator but does not improve beyond it.
- Replay oversampling can bias learning too strongly toward the demonstration distribution.
- BC regularization can keep the policy near the demonstrations, but may also suppress the very deviations needed for improvement.
IBRL’s response is elegant: instead of asking the RL policy to replace imitation immediately, it asks the critic to arbitrate between them.
At each state $s_t$, IBRL constructs exactly two candidate actions,
$$a_{\mathrm{IL}} \sim \mu_\psi(s_t), \qquad a_{\mathrm{RL}} = \pi_\theta(s_t) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\sigma^2).$$
This is a subtle but important design choice. The imitation proposal is a direct sample from the frozen demonstrator policy, while the RL proposal retains stochasticity through Gaussian noise. The noise is not just a technical detail; it is the mechanism that keeps the RL branch exploratory rather than collapsing into a deterministic local optimum too early.
The critic then acts as a gate:
$$a^* = \operatorname*{argmax}_{a \in \{a_{\mathrm{IL}},\, a_{\mathrm{RL}}\}} Q_\phi(s_t,a).$$
This means the system is never forced to commit to a weak RL action when imitation is clearly better, but it also never abandons the RL branch once the learned policy becomes stronger. In effect, the critic defines a winner-take-state competition between two proposals: the demonstrator supplies a safe fallback, and the RL policy supplies the possibility of improvement. The important intuition is that the actor does not need to be globally good to be useful; it only needs to win in the states where it has become better than imitation.
This same comparison is reused in the TD target, which is where IBRL becomes more than a clever action-selection trick. Bootstrapping normally propagates value estimates through the next action selected by the current policy, so if that next action is poor, the critic can inherit and amplify error. IBRL reduces that risk by letting the target also fall back on the imitation proposal. So demonstrations influence two distinct phases of learning:
- Exploration: choose a better action at interaction time.
- Bootstrapping: choose a better next action when forming TD targets.
That dual use is the heart of the method. It is not just that imitation seeds the replay buffer; it actively stabilizes both the data collection process and the value-learning process.
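The bootstrapping half can be sketched by replacing the single next action in the usual TD target with the better of the two candidates at the next state. The formula in the docstring is one reading of the description above, not a quotation from the paper; in particular, using the target actor and target critic here, and adding no noise to the RL candidate, are assumptions of this sketch.

```python
def ibrl_td_target(reward, next_state, gamma, il_policy, target_actor, target_critic):
    """y = r + gamma * max over a in {a_IL, a_RL} of Q_target(s', a).

    a_IL comes from the frozen imitation policy and a_RL from the target
    actor, both evaluated at the next state.
    """
    candidates = (il_policy(next_state), target_actor(next_state))
    best_q = max(target_critic(next_state, a) for a in candidates)
    return reward + gamma * best_q

def toy_target_critic(state, action):
    # Hypothetical critic that prefers actions close to 1.0.
    return -abs(action - 1.0)

# Early in training: the actor outputs a poor action, imitation a decent one.
y_ibrl = ibrl_td_target(0.0, 0.0, 0.9, lambda s: 0.8, lambda s: 0.0, toy_target_critic)
y_plain = 0.0 + 0.9 * toy_target_critic(0.0, 0.0)
```

In this toy case the plain target bootstraps from the actor's poor action and yields about -0.9, while the two-candidate target falls back on the imitation proposal and yields about -0.18. Because of the max, the two-candidate target can never be lower than the plain one computed with the same networks.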
This is also why IBRL avoids some common pitfalls of hybrid IL/RL systems. It does not duplicate demonstration data to inflate its importance, and it does not impose a strong behavior-cloning penalty that keeps the actor anchored to the initial expert distribution. Instead, the demonstrations remain present but not overbearing: they prefill the replay buffer $\mathcal{B}$, and then the learning system decides when they are useful. The result is a modular architecture in which IL and RL have clear roles rather than being entangled into a single objective that can be hard to tune.
One subtle failure mode is worth keeping in mind: this “choose the higher-$Q$” rule depends on the critic being at least somewhat trustworthy. If $Q_\phi$ is badly miscalibrated, then the gate can prefer the wrong proposal even when the underlying action is good. That is one reason why the paper later explores variants and architectural choices that improve robustness, including the value of keeping the imitation and RL components modular and the practical gains from actor dropout. But at this level, the key idea is already visible: use the critic to decide when imitation should rescue exploration and when RL should take over.
The visual below compresses this logic into a single flow. The two branches from $s_t$ make the “two proposals” idea concrete, while the shared critic node emphasizes that the same evaluation rule governs both acting and bootstrapping. The replay buffer annotation is there for an equally important reason: demonstrations are not absent, but they are not artificially inflated. That combination (frozen imitation, RL proposal, critic selection, and ordinary replay) is the whole IBRL story in miniature, and it sets up the next step: understanding how the actor proposal itself is chosen and why that choice matters.

6. Actor proposal: choosing between imitation and RL actions

Building on the two-candidate view from the previous section, the key question is no longer how to generate an action, but which candidate should actually be trusted. IBRL answers this with a simple but powerful rule: at every state $s$, it produces one action from the imitation policy and one from the RL policy, then lets the critic decide which one is better.
Formally, the two proposals are
$$a_{\mathrm{IL}} \sim \mu_\psi(s), \qquad a_{\mathrm{RL}} = \pi_\theta(s) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\sigma^2).$$
The imitation candidate $a_{\mathrm{IL}}$ is a deterministic or near-deterministic output from the demonstrator-trained policy $\mu_\psi$, while $a_{\mathrm{RL}}$ is the current actor’s suggestion with exploration noise added. That noise matters: without it, the RL branch can become too narrow too early, especially in continuous control where local improvement needs some stochasticity to escape brittle habits. But the crucial point is that neither action is executed by default.
Instead, IBRL evaluates both proposals with the target critic and takes
$$a^* = \operatorname*{argmax}_{a\in\{a_{\mathrm{IL}},a_{\mathrm{RL}}\}} Q_{\phi'}(s,a).$$
This is not a soft mixture or a learned gate in the usual sense. It is a winner-take-all selection rule driven by the critic’s current estimate of long-term value. In effect, the critic plays the role of a referee: it does not invent actions, but it adjudicates between a behavior that is known to be reasonable from demonstrations and a behavior that is potentially better according to RL.
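The selection rule can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: `toy_q` is a hypothetical stand-in for the target critic, the hard-coded candidates stand in for $\mu_\psi(s)$ and $\pi_\theta(s)$, and clipping the noisy action back into $[-1,1]^d$ is an assumption of the sketch.

```python
import random

def select_action(state, a_il, a_rl, q_target):
    """Return whichever candidate the critic scores higher, plus a source tag."""
    q_il = q_target(state, a_il)
    q_rl = q_target(state, a_rl)
    # Ties go to the imitation candidate; this tie-break is an arbitrary choice here.
    return (a_rl, "rl") if q_rl > q_il else (a_il, "il")

def toy_q(state, action):
    # Hypothetical critic: scores actions by closeness to the state vector.
    return -sum((a - s) ** 2 for a, s in zip(action, state))

def add_noise(action, sigma, rng):
    # Gaussian exploration noise, clipped back into the [-1, 1] action box (assumption).
    return [max(-1.0, min(1.0, a + rng.gauss(0.0, sigma))) for a in action]

rng = random.Random(0)
state = [0.2, -0.4]
a_il = [0.25, -0.35]                     # stand-in for a sample from mu_psi(state)
a_rl = add_noise([0.9, 0.9], 0.1, rng)   # stand-in for pi_theta(state) + eps
chosen, source = select_action(state, a_il, a_rl, toy_q)
print(source, chosen)
```

With this toy critic the imitation candidate wins because the RL stand-in is far from the region the critic rates highly; swapping in a better RL candidate flips the choice without any change to the rule.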
That design is especially important in sparse-reward robotics. When reward arrives only after a long sequence of precise movements, a pure RL policy can spend a huge amount of time generating actions that are technically exploratory but practically useless. The imitation branch provides a structured fallback: if the RL proposal is not yet credible, the policy can still follow a demonstrator-like move that keeps the agent in a productive region of the state space. This is a subtle but crucial distinction from ordinary exploration noise, which often perturbs the agent away from meaningful trajectories rather than toward them.
There is also a useful asymmetry here. The imitation action is not treated as sacred; it is simply one candidate among two. That means IBRL does not get trapped repeating demonstrations forever. If the RL proposal has learned to exploit a better local improvement than the demonstrator would choose, the critic can select it immediately. So the mechanism supports both safe exploration and improvement beyond the demonstrations:
if the RL branch is weak, the imitation branch protects against random drift;
if the RL branch becomes stronger, it can gradually dominate;
if both are poor, the critic still chooses the less bad option.
This is why the target critic $Q_{\phi'}$ is central. The selection rule only makes sense if the critic is sufficiently aligned with actual long-horizon return. If $Q_{\phi'}$ is inaccurate, the system can select an action that looks good under the critic but is not truly better in the environment. In other words, the actor proposal mechanism inherits the usual value-estimation failure modes of actor-critic methods, but it makes them especially consequential because the critic is now used not merely for learning, but for execution.
The visual below compresses that logic into a single decision pathway. The two large branches emphasize that IBRL always compares exactly one imitation candidate and one RL candidate; there is no uncontrolled action soup. The central comparison node labeled $Q_{\phi'}(s,a)$ is the conceptual hinge: it explains why the chosen action is not “the imitation action” or “the RL action,” but the action with the higher estimated return. The highlighted green output $a^*$ reinforces the main takeaway: imitation is not a replacement for RL, but a principled exploration fallback that the critic can invoke whenever the learned policy is not yet reliable enough.

7. Bootstrap proposal: using imitation in TD targets

Building on the actor proposal from the previous section, IBRL makes a second, subtler move: it does not only let demonstrations influence which action the policy executes, but also which action the critic bootstraps from. That distinction matters because in actor–critic methods, the critic is the engine that propagates sparse reward backward through time. If the bootstrap target is brittle, the whole learning signal becomes noisy or vanishingly small.
Recall the vanilla TD3 target. At time $t$, it evaluates the next state $s_{t+1}$ using the target actor plus smoothing noise:
$$y_{\text{TD3}} = r_t + \gamma Q_{\phi'}\bigl(s_{t+1}, \pi_{\theta'}(s_{t+1}) + \mathrm{clip}(\epsilon,-c,c)\bigr).$$
This is sensible when the current policy is already competent, but it can be fragile in sparse-reward robotics. If the learned policy is still wandering in uninformative regions of state space, then the target action may be poor, the critic target may be low, and the learning signal may never “discover” a better trajectory. In other words, a weak target policy can create a weak bootstrap loop.
IBRL breaks that loop by introducing two candidate next actions at $s_{t+1}$: one from imitation and one from RL. The imitation branch samples
$$a_{\mathrm{IL}} \sim \mu_{\psi}(s_{t+1}),$$
while the RL branch uses the usual target actor with clipped smoothing noise,
$$a_{\mathrm{RL}} = \pi_{\theta'}(s_{t+1}) + \mathrm{clip}(\epsilon,-c,c).$$
The target then becomes
$$y = r_t + \gamma \max_{a' \in \{a_{\mathrm{IL}},a_{\mathrm{RL}}\}} Q_{\phi'}(s_{t+1},a').$$
So instead of assuming that the next action will come from only the current RL policy, the critic is trained under a best-of-two continuation rule: whichever proposal appears better under the target critic is the one used to bootstrap.
This is a compact but important idea. In sparse-reward tasks, the imitation policy often knows how to continue along a plausible demonstration trajectory even when the learned policy does not. That means $\mu_\psi$ can supply a high-value continuation when $\pi_{\theta'}$ is still immature. Conversely, once RL surpasses imitation in some region, the max operator allows the RL branch to dominate. The mechanism is therefore not “always imitate” but rather “bootstrap from the better proposal at the next state.”
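To make the difference from the vanilla target concrete, here is a minimal sketch that computes both targets for one transition. The lookup-table critic, the state and action labels, and the numeric values are all hypothetical; the two next-state candidates are assumed to have been produced already (imitation sample and smoothed target-actor output).

```python
def td3_target(reward, next_state, a_rl_next, q_target, gamma=0.99):
    # Vanilla target: bootstrap only through the RL candidate.
    return reward + gamma * q_target(next_state, a_rl_next)

def ibrl_target(reward, next_state, a_il_next, a_rl_next, q_target, gamma=0.99):
    # IBRL target: bootstrap through whichever candidate the target critic prefers.
    q_il = q_target(next_state, a_il_next)
    q_rl = q_target(next_state, a_rl_next)
    return reward + gamma * max(q_il, q_rl)

# Hypothetical target-critic values early in training: the imitation
# continuation looks promising, the immature RL continuation does not.
Q_TABLE = {("s1", "a_il"): 0.8, ("s1", "a_rl"): 0.1}

def q_target(state, action):
    return Q_TABLE[(state, action)]

y_td3 = td3_target(0.0, "s1", "a_rl", q_target)
y_ibrl = ibrl_target(0.0, "s1", "a_il", "a_rl", q_target)
print(y_td3, y_ibrl)
```

In this toy case the reward is zero, so the only learning signal is the bootstrapped value; the IBRL target carries the imitation branch's higher estimate backward while the vanilla target does not.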
A useful way to think about it is as a form of critic-side proposal selection:
Imitation proposal helps when the learned policy is uncertain or stuck.
RL proposal helps when RL has already improved beyond the demos.
Max over proposals prevents the critic from being anchored to a single brittle continuation.
This also clarifies why the method is robust in the early phase of training. In standard TD bootstrapping, a bad next action can depress $Q$-targets and slow down value propagation. With IBRL, the imitation branch acts as a rescue path: it can preserve a meaningful target even before the RL actor has discovered rewarding behaviors on its own. That extra path is especially valuable in robotics, where reward signals are often delayed, sparse, and highly contingent on long action sequences.
There is an important assumption hiding here: the imitation proposal must be sufficiently competent to provide useful continuations, but it does not need to be optimal. The max operator only needs one of the two candidates to look better under the target critic. If both are poor, the target is still poor. So IBRL does not magically solve sparse reward; rather, it uses demonstrations to improve the odds that bootstrapping has at least one viable branch.
The visual below is a clean summary of that logic. The left side compresses the target construction into the three equations you should keep in mind: the imitation action, the noisy RL action, and the final target that takes the maximum over the two. The right side turns those symbols into a branching picture: one next state splits into an imitation path and an RL path, and both feed the target critic with a small max annotation at the merge point. That visual is useful because the essence of IBRL bootstrapping is not the algebra itself, but the choice structure it introduces.
Seen this way, the diagram is less a derivation than a reminder of the algorithmic bias IBRL adds to TD learning: the critic does not passively accept the RL policy’s next action as ground truth. Instead, it treats imitation as a second hypothesis about the future and bootstraps from whichever hypothesis currently looks better.

8. Why IBRL is better than oversampling or BC regularization

Once we have the hybrid action rule in place, the next question is not whether demonstrations help, but how they are inserted into the learning loop. That distinction matters a lot in sparse-reward robotics, because two methods can both “use demonstrations” while having almost opposite behavior in practice. One may simply recycle old trajectories; another may actively let the imitation policy compete with the learned policy at decision time. Those are very different mechanisms, and they fail for different reasons.
A useful way to frame the comparison is to separate data reuse from action proposal. Oversampling and behavior-cloning-style regularization both operate primarily on the training objective: they change which samples are seen more often, or they penalize the RL policy for moving too far away from the demonstrator. IBRL, by contrast, changes the set of candidate actions available to the agent. That sounds subtle, but it is exactly what makes the method more powerful in sparse reward settings.
For replay oversampling, the algorithm just revisits $\xi\in\mathcal{D}$ more frequently inside the replay buffer $\mathcal{B}$. This can stabilize training early on, because the agent sees good trajectories often enough to avoid drifting immediately into nonsense. But the limitation is structural: the demonstration is still only a dataset, not a policy. No matter how many times the same transitions are replayed, the agent does not learn to call $\mu_\psi(s)$ as a live source of actions when it is stuck. In other words, oversampling improves gradient statistics, not decision options.
Behavior-cloning-regularized fine-tuning is stronger, but it still treats imitation as an auxiliary constraint rather than a separate control channel. A typical objective looks like
$$J(\theta)=J_{\mathrm{RL}}(\theta)-\alpha\lambda L(\theta),$$
where $L(\theta)$ is a behavior-cloning loss, so the extra term keeps $\pi_\theta$ close to the demonstrator. This can be effective when the optimal policy lies near the demonstrations and the parameterization is well behaved. But it introduces two practical fragilities:
the balance between RL and imitation depends on $\alpha$ and $\lambda$, which are often hard to tune;
if the policy and imitation share an architecture, the RL updates can interfere with the imitation behavior, especially when reward is sparse or noisy.
So BC regularization is still fundamentally a single-policy compromise. It asks one network to represent both imitation and improvement, which can work, but it can also blur the clean structure that demonstrations are supposed to provide.
IBRL makes a different bet. Instead of folding demonstrations into the RL policy, it keeps $\mu_\psi$ separate and queries it directly as a candidate action $a_{\mathrm{IL}}$. The learned policy $\pi_\theta$ produces $a_{\mathrm{RL}}$, and the critic decides between them:
$$a^*=\operatorname*{argmax}_{a\in\{a_{\mathrm{IL}},a_{\mathrm{RL}}\}}Q_{\phi'}(s,a).$$
That is the core conceptual shift. Demonstrations are no longer just evidence in the training set; they become an active policy source during interaction and bootstrapping. This is why IBRL can recover from uncertainty more gracefully: when the RL policy is weak, the imitation action can still be evaluated and selected if the critic predicts it is better.
This design also explains two practical advantages that are easy to miss if one only looks at average returns. First, because $\mu_\psi$ is not updated by RL gradients, it avoids catastrophic forgetting. The demonstration behavior stays intact even when the RL learner starts exploring aggressively. Second, because $\pi_\theta$ and $\mu_\psi$ are modular, they can use different architectures or action parameterizations. That flexibility is important in robotics, where the imitation policy might encode a human teleoperation style while the RL policy learns a more task-optimal controller over time.
The failure modes line up neatly with this mechanism. Oversampling can tell the learner that demonstrations are important, but it cannot make the demonstrator available as a live option at decision time. BC regularization can keep the policy near demonstrations, but it can also overconstrain improvement or require delicate weight tuning. IBRL avoids both issues by making the demonstrator compete on equal footing with the RL action. The critic, not the optimizer, decides which source is better in the current state.
So the diagram below is best read as a summary of this shift in role. The left two columns show methods that treat $\mathcal{D}$ as training data or as a regularizer. The IBRL column instead shows a separate imitation pathway feeding into the action selection rule itself. The boxed annotation captures the heart of the method: the move is from reusing demonstrations to querying the imitation policy as a candidate action. That small change in phrasing is exactly the large change in behavior.

9. Soft IBRL and the tabular failure case

After the hard-selection rule becomes clear, the next question is whether we can soften it just enough to avoid pathological lock-in without undoing the whole bootstrapping idea. That is exactly the motivation for Soft IBRL. The core issue is not that imitation is bad, but that a deterministic $\operatorname*{argmax}$ over the two candidates $\{a_{\mathrm{IL}}, a_{\mathrm{RL}}\}$ can create a one-way gate: once the imitation action wins early, the RL action may never be executed, which means its value estimate never receives the evidence needed to catch up.
This is easiest to see in a tabular counterexample, where the two candidate values are effectively decoupled. Suppose changing $Q_\phi(s,a_{\mathrm{RL}})$ updates only that entry, while $Q_\phi(s,a_{\mathrm{IL}})$ stays fixed. If initially
$$Q_\phi(s,a_{\mathrm{RL}}) < Q_\phi(s,a_{\mathrm{IL}}),$$
then hard IBRL always chooses $a_{\mathrm{IL}}$. But then $a_{\mathrm{RL}}$ is never tried, so there is no new transition data to revise its value upward. The result is a kind of self-fulfilling suppression: the better action may exist, but the policy never visits it long enough to discover that fact.
This failure mode matters because it reveals a subtle assumption hiding inside hard selection: it assumes the estimate of the unselected action can still change enough to let that action eventually compete. In tabular learning that assumption is false. The selection rule and the value update are too separated; choosing $a_{\mathrm{IL}}$ repeatedly does not touch $Q_\phi(s,a_{\mathrm{RL}})$, so the action with the higher initial estimate can monopolize the policy forever, even if it is actually the worse one. In more realistic function-approximation settings, the coupling between actions may help, but it does not eliminate the basic risk of premature masking.
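The lock-in is easy to reproduce in a toy simulation. The sketch below is a hypothetical single-state problem, not an experiment from the paper: the true returns, initial estimates, and learning rate are made-up values chosen so that the RL action is actually better but starts with the lower estimate.

```python
def run_hard_selection(steps=1000, lr=0.1):
    true_return = {"il": 0.5, "rl": 1.0}   # hypothetical: the RL action is truly better
    q = {"il": 0.5, "rl": 0.0}             # but its initial estimate is lower
    counts = {"il": 0, "rl": 0}
    for _ in range(steps):
        chosen = max(q, key=q.get)         # hard argmax over the two candidates
        counts[chosen] += 1
        # Tabular update: only the executed action's entry is revised.
        q[chosen] += lr * (true_return[chosen] - q[chosen])
    return q, counts

q_hard, counts_hard = run_hard_selection()
print(q_hard, counts_hard)
```

Because the imitation entry is already accurate and the RL entry is never touched, the RL action is selected zero times no matter how long the loop runs.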
Soft IBRL fixes this by replacing the hard $\arg\max$ with a Boltzmann sample over the same two candidates:
$$a^* \sim p_Q(a), \qquad p_Q(a) \propto \exp\big(\beta Q_\phi(s,a)\big), \quad a\in\{a_{\mathrm{IL}},a_{\mathrm{RL}}\}.$$
Now the lower-valued action is not excluded; it is merely downweighted. When $\beta$ is finite, even a slightly worse candidate is still sampled occasionally, which gives the RL branch a chance to generate new experience and potentially overtake the imitation branch later. In other words, soft selection turns a brittle winner-take-all decision into a probabilistic preference.
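A small simulation illustrates the repair. This is a hypothetical single-state toy, not an experiment from the paper: two candidates, made-up true returns where the RL action is better, an initial estimate that favors imitation, and an arbitrary choice of $\beta$, learning rate, and seed.

```python
import math
import random

def run_soft_selection(steps=1000, lr=0.1, beta=5.0, seed=0):
    rng = random.Random(seed)
    true_return = {"il": 0.5, "rl": 1.0}   # hypothetical: the RL action is truly better
    q = {"il": 0.5, "rl": 0.0}             # but its initial estimate is lower
    counts = {"il": 0, "rl": 0}
    for _ in range(steps):
        # Two-action Boltzmann rule: p(rl) is a sigmoid of the scaled value gap.
        p_rl = 1.0 / (1.0 + math.exp(beta * (q["il"] - q["rl"])))
        chosen = "rl" if rng.random() < p_rl else "il"
        counts[chosen] += 1
        # Tabular update: only the executed action's entry is revised.
        q[chosen] += lr * (true_return[chosen] - q[chosen])
    return q, counts

q_soft, counts_soft = run_soft_selection()
print(q_soft, counts_soft)
```

The RL action starts with a small but nonzero selection probability, so it is eventually tried, its estimate rises toward the true return, and the preference shifts toward it.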
A useful way to interpret $\beta$ is as an inverse-temperature-like sharpness parameter. Large $\beta$ makes the distribution concentrate near the current maximizer, recovering behavior close to hard IBRL. Smaller $\beta$ spreads probability mass more evenly, increasing exploration but also increasing the chance of dithering. So the point is not to make the policy uniformly random; it is to prevent the system from becoming too confident too early.
A few practical caveats are worth keeping in mind:
Soft IBRL is mainly a stability aid for state-based or tabular-style settings.
It is not meant to replace the hard rule everywhere; rather, it reduces the risk that a locally preferred action becomes permanently invisible.
The benefit is strongest when selection and value updates are weakly coupled, which is exactly when hard argmax is most dangerous.
The same idea is applied at the bootstrap step for $s_{t+1}$: instead of always backing up through the current maximizer, the next action is drawn from the same soft distribution. That keeps the target from becoming too brittle, since the backup itself also has a small amount of stochasticity. This is important because a deterministic target can reinforce the same masked preference over and over, whereas a softened target leaves room for alternative actions to influence learning.
The visual below compresses that argument into a compact failure case. On the left, the tabular example makes the asymmetry concrete: the imitation action sits slightly above the RL action, so hard argmax always routes execution to the imitation branch while the RL branch stays blocked. On the right, the Boltzmann rule captures the repair in one line: the better action is still preferred, but the lower-valued one is never assigned exactly zero probability when $\beta$ is finite. Together, the two halves show why Soft IBRL is less about changing the objective and more about preventing permanent masking from a single early mistake.

10. Soft IBRL equations

Building on the tabular failure case, the key move in soft IBRL is to stop treating the imitation and RL proposals as a winner-take-all choice. Instead of asking which candidate has the larger $Q$-value and then committing immediately, we turn the comparison into a probabilistic draw. That small change matters because the hard $\arg\max$ rule is brittle: when the two proposals are close, tiny estimation noise can flip the decision; when one proposal is only slightly better early in training, a hard switch can prematurely lock the agent into a poor branch. Soft IBRL keeps both candidates alive, but biases the choice toward the one with higher value.
Formally, the two candidates are still the same: $a_{\mathrm{IL}}$ from the demonstration-driven proposal and $a_{\mathrm{RL}}$ from the learned policy. What changes is the selection rule,
$$p_Q(a) \propto \exp\big(\beta \, Q_\phi(s,a)\big), \qquad a \in \{a_{\mathrm{IL}}, a_{\mathrm{RL}}\}.$$
This is just a two-action Boltzmann distribution. The $Q$-values are no longer used as a hard comparator, but as logits that determine preference strength. The normalization over the two candidates is implicit. The larger the gap in $Q_\phi(s,a)$, the more mass goes to the higher-valued action. When the gap is small, the distribution stays mixed, which is exactly where hard argmax is least trustworthy.
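Writing the normalization out for two candidates makes the dependence on the value gap explicit. The following is a worked form of the same rule; the numeric values of $\beta$ and the gap are hypothetical and chosen only for illustration.

$$p_Q(a_{\mathrm{RL}}) = \frac{e^{\beta Q_\phi(s,a_{\mathrm{RL}})}}{e^{\beta Q_\phi(s,a_{\mathrm{IL}})} + e^{\beta Q_\phi(s,a_{\mathrm{RL}})}} = \frac{1}{1 + e^{-\beta \Delta}}, \qquad \Delta = Q_\phi(s,a_{\mathrm{RL}}) - Q_\phi(s,a_{\mathrm{IL}}).$$

For example, with $\beta = 10$ and $\Delta = -0.1$, the RL candidate is still chosen with probability $1/(1+e^{1}) \approx 0.27$, whereas a hard comparison would never choose it.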
The same probabilistic rule is used in two places. First, during online interaction, the agent samples
$$a^* \sim p_Q(a)$$
at the current state $s$. So the behavior policy is not a deterministic selector between imitation and RL proposals; it is a soft chooser that can still explore both branches. This is useful because the imitation proposal may be more reliable in some regions of the state space, while the RL proposal may become better after enough value learning. Soft selection lets the agent adapt without needing a crisp, possibly premature, commitment.
Second, the exact same distribution, now formed over the two candidates at the next state $s_{t+1}$, is used for the bootstrap target:
$$a' \sim p_Q(a), \qquad y = r_t + \gamma \, Q_{\phi'}(s_{t+1}, a').$$
That reuse is not cosmetic. In value-based RL, the target operator is where overconfidence often enters. If we hard-select the next action, then a small error in $Q$ can be amplified by always backing up through the wrong branch. Sampling from $p_Q$ makes the target less deterministic and less prone to overcommitting to a noisy estimate. Conceptually, the target becomes a sample from, and in expectation an average over, the same two-proposal competition that governs action selection.
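A minimal sketch of the sampled target and its expectation follows. It makes one simplifying assumption that differs from the equations above: the same pair of hypothetical next-state values is used both to form the sampling probabilities and to evaluate the chosen action, whereas the text uses $Q_\phi$ for the former and $Q_{\phi'}$ for the latter.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def soft_target(reward, q_il_next, q_rl_next, beta, rng, gamma=0.99):
    """Sample the bootstrap action from the two-candidate Boltzmann rule."""
    p_rl = sigmoid(beta * (q_rl_next - q_il_next))
    q_next = q_rl_next if rng.random() < p_rl else q_il_next
    return reward + gamma * q_next

def expected_soft_target(reward, q_il_next, q_rl_next, beta, gamma=0.99):
    """Average of the sampled target over the same two-candidate distribution."""
    p_rl = sigmoid(beta * (q_rl_next - q_il_next))
    return reward + gamma * (p_rl * q_rl_next + (1.0 - p_rl) * q_il_next)

rng = random.Random(0)
y_sample = soft_target(0.0, 0.8, 0.1, beta=5.0, rng=rng)
y_mean = expected_soft_target(0.0, 0.8, 0.1, beta=5.0)
print(y_sample, y_mean)
```

As `beta` grows, the expected target approaches the hard max target; at `beta = 0` it is the plain average of the two branches.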
The parameter $\beta \ge 0$, which acts as an inverse temperature, controls how sharp that competition is. In the limit of large $\beta$, the Boltzmann distribution concentrates on the better candidate and approaches a hard $\operatorname*{argmax}$. As $\beta$ shrinks, the two proposals are mixed more evenly, and at $\beta = 0$ the choice is exactly uniform. So $\beta$ is the knob that interpolates between stability through averaging and exploitation through decisiveness. The practical lesson is that soft IBRL is not “more stochastic for its own sake”; it is a controlled relaxation of the brittle two-way comparison.
There is also an important subtlety in the failure mode from the tabular setting. Softening the decision does not magically fix every issue if the $Q$-function itself is badly misspecified or if the action proposals are systematically misleading. What it does fix is the boundary behavior around near-ties and noisy estimates. In other words, it reduces the damage caused by winner-take-all updates when the agent has not yet earned the right to be so certain. That is why the soft variant is especially helpful in state-based experiments, where the representation is compact enough that modest stochasticity can smooth training without obscuring the signal. For pixel-based experiments, the paper uses the greedy version for simplicity, reflecting the engineering tradeoff between robustness and implementation complexity.
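The effect of $\beta$ is easy to tabulate. The sketch below sweeps a few values for a fixed pair of hypothetical critic estimates; the function name and the chosen values are illustrative, not taken from the paper.

```python
import math

def soft_probs(q_il, q_rl, beta):
    """Two-action Boltzmann probabilities (p_il, p_rl) with sharpness beta."""
    m = max(q_il, q_rl)                 # subtract the max for numerical stability
    w_il = math.exp(beta * (q_il - m))
    w_rl = math.exp(beta * (q_rl - m))
    z = w_il + w_rl
    return w_il / z, w_rl / z

q_il, q_rl = 0.5, 0.4                   # hypothetical estimates with a small gap
for beta in (0.0, 1.0, 10.0, 100.0):
    p_il, p_rl = soft_probs(q_il, q_rl, beta)
    print(f"beta={beta:>5}: p_il={p_il:.4f}, p_rl={p_rl:.4f}")
```

At `beta = 0` the two candidates are equally likely; as `beta` increases the higher-valued candidate takes almost all of the mass, but the other never reaches exactly zero for finite `beta`.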
A useful way to think about the overall mechanism is this:
Hard IBRL: choose one proposal, then bootstrap through one proposal.
Soft IBRL: sample a proposal from a value-shaped distribution, then bootstrap through the same distribution.
Effect of $\beta$: set the balance between exploration of both proposals and near-deterministic preference for the better one.
The visual below is meant to compress that logic into a single glance. The central Boltzmann equation captures the core replacement: one distribution over the two candidate actions instead of a hard comparison. The arrows then remind you that the same $p_Q$ governs both the online action $a^*$ and the bootstrap action $a'$, which is the key structural point of soft IBRL. The small $\beta$ scale at the bottom is there to reinforce the inverse-temperature interpretation: low $\beta$ means more mixing, high $\beta$ means behavior closer to greedy selection.
Seen this way, the diagram is not just an equation panel; it is a compact summary of the algorithmic philosophy. Soft IBRL keeps the imitation and RL proposals as first-class candidates, but uses the learned value function to blend them probabilistically rather than abruptly. That is exactly the kind of smoothing that helps a sparse-reward robot avoid the premature lock-in exposed in the tabular failure case.

11. Algorithm 1: IBRL with a TD3 backbone

Once the soft variant has clarified the principle—choose between an imitation proposal and an RL proposal, then train against the better one—the next question is how that idea looks inside a familiar deep-RL pipeline. The cleanest answer is that IBRL is not a new optimizer or a new critic architecture; it is TD3 with two carefully inserted proposal comparisons. Everything else is deliberately left intact so that the algorithm inherits TD3’s stability while exploiting demonstrations exactly where they help most: exploration and target construction.
The key design choice is to treat the imitation policy $\mu_\psi$ as a proposal generator, not as a policy that permanently overrides the learner. At each environment step, we form two candidate actions:
$$a_{\mathrm{IL}} \sim \mu_\psi(s_t), \qquad a_{\mathrm{RL}} = \pi_\theta(s_t) + \epsilon.$$
These are not merged by averaging, which would blur the signal and can create unsafe intermediate actions. Instead, IBRL evaluates both candidates with the target critics and selects the one with the larger conservative estimate:
$$a_t = \operatorname*{argmax}_{a\in\{a_{\mathrm{IL}},a_{\mathrm{RL}}\}} \min_{i\in\mathcal{K}} Q_{\phi'_i}(s_t,a).$$
Here $\mathcal{K}$ indexes the target critics over which the minimum is taken. This is the first core mechanism, often called the actor proposal. It matters because sparse-reward robotic tasks are usually dominated by long stretches of indistinguishable zero reward; in that regime, a purely learned policy can wander indefinitely, while a demonstrated policy can propose actions that stay near the support of successful behavior. But IBRL does not blindly imitate. If the learned actor has already found a better local move, the max over the two proposals lets it take over.
The same logic reappears in the Bellman target. For each replayed transition, TD3 would normally back up through a noisy target actor. IBRL instead asks the imitation and RL proposals to compete again at the next state:
$$y^{(j)} = r^{(j)}_t + \gamma \max_{a'\in\{a'_{\mathrm{IL}},a'_{\mathrm{RL}}\}} \min_{i\in\mathcal{K}} Q_{\phi'_i}(s^{(j)}_{t+1},a').$$
This is the second core mechanism, the bootstrap proposal. It is subtle but important: the target is not forced to trust the demonstration forever. The critic receives whichever next-action proposal currently looks better under the target ensemble. That makes demonstrations useful as a bootstrap prior rather than as a hard constraint. In practice, this helps the value function propagate sparse reward through regions that the RL policy has not yet mastered, while still allowing policy improvement to move beyond the demonstrations.
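Both comparisons share one ingredient: a conservative value formed by taking the minimum over critics. The sketch below shows that shared piece with two hypothetical lookup-table critics; it takes the minimum over every critic it is given, which is an assumption about how the index set is chosen, and all labels and values are made up.

```python
def conservative_q(critics, state, action):
    # Pessimistic estimate: the minimum over the supplied target critics.
    return min(q(state, action) for q in critics)

def select_action(critics, state, candidates):
    # Actor proposal: execute the candidate with the larger conservative estimate.
    return max(candidates, key=lambda a: conservative_q(critics, state, a))

def bootstrap_target(critics, reward, next_state, next_candidates, gamma=0.99):
    # Bootstrap proposal: back up through the better next-state candidate.
    best = max(conservative_q(critics, next_state, a) for a in next_candidates)
    return reward + gamma * best

# Two hypothetical target critics that disagree about the RL candidate.
VALUES_1 = {"a_il": 0.6, "a_rl": 0.9}
VALUES_2 = {"a_il": 0.5, "a_rl": 0.2}
critics = [lambda s, a: VALUES_1[a], lambda s, a: VALUES_2[a]]

chosen = select_action(critics, "s", ["a_il", "a_rl"])
y = bootstrap_target(critics, 0.0, "s_next", ["a_il", "a_rl"])
print(chosen, y)
```

One critic rates the RL candidate very highly, but the minimum discounts that optimism, so the imitation candidate is chosen and its conservative value is what gets bootstrapped.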
A helpful way to think about the whole method is as a three-way separation of roles:
$\mu_\psi$ provides a safe or competent exploratory candidate.
$\pi_\theta$ provides the learner’s improving candidate.
The critic ensemble decides which candidate is currently more valuable.
That separation is what makes the algorithm modular. The imitation policy does not need to share parameters with the RL actor, and the RL actor does not need to be “distilled” into imitation. The two are compared only at the action level. This is also why the method can be dropped into TD3 almost verbatim: the critic loss, deterministic policy gradient update, target networks, and EMA-style target tracking are all standard TD3 machinery.
There is an important failure mode hiding underneath this simplicity. If the critic becomes overconfident on bad extrapolations, then the $\max$ over proposals can select the wrong candidate, especially early in training when both proposals are poorly calibrated outside the data support. That is exactly why the method keeps TD3’s clipped double-Q style conservatism through $\min_i Q_{\phi_i}$, and why the target network is still needed. The min over critics reduces optimistic bias, while the proposal comparison ensures the algorithm only has to discriminate between two actionable choices rather than search the entire action space.
This also connects back to the soft IBRL discussion. In a tabular setting, the hard comparison can lock in whichever candidate has the higher initial estimate, and the Boltzmann variant is the repair for that case. Algorithm 1 is written with the hard comparison, which is more forgiving when function approximation couples the two value estimates; the soft rule can be substituted at both comparison points. Either way, the rule selects between proposals rather than averaging them, so the method can switch from imitation-led behavior to RL-led behavior once the learned policy becomes better. That switching behavior is precisely what sparse-reward robotics needs: demonstrations to get moving, then RL to surpass them.
The visual below is most useful if you read it as a compact proof of modularity. The top part isolates initialization and environment interaction, where the two proposals compete before execution. The lower part then mirrors standard TD3 updates, except that the target value also compares the same two proposals at the next state. In other words, the diagram summarizes the paper’s central claim: IBRL is TD3 with two proposal comparisons—one for acting, one for bootstrapping—and everything else remains the familiar TD3 backbone.

12. Implementation choices: what is fixed, what is modular

Building on the action-selection rule and the full training loop, the key implementation point is that IBRL is much more opinionated about the decision rule than about the surrounding machinery. That distinction matters because many imitation-plus-RL methods blur together three different questions: how actions are chosen, how imitation is trained, and which RL backbone is used. IBRL mostly fixes the first question and leaves much of the rest as a modular design choice.
At the center is the separation between the imitation policy $\mu_\psi(a\mid s)$ and the RL actor $\pi_\theta(a\mid s)$. The imitation policy is trained independently from the reinforcement-learning gradients, so it acts as a stable source of demonstrated behavior rather than a co-adapted network that is constantly pulled around by critic noise. That separation is not just an implementation convenience; it prevents the demonstration pathway from being corrupted by sparse-reward optimization dynamics, where early critic errors can otherwise distort the policy before the agent has learned anything useful.
This modularity also extends to architecture. The paper does not require $\mu_\psi$ and $\pi_\theta$ to share the same network depth, feature extractor, or output parameterization. In practice, that is important because the imitation policy may benefit from a richer supervised model, while the RL actor may need to remain lightweight and compatible with the critic-driven update rule. In other words, IBRL does not insist that the imitation model and the RL model be twins; it only insists that the decision mechanism can compare their proposed actions fairly.
The RL side remains deliberately familiar: a TD3/RED-Q-style setup with online critic(s) $Q_\phi(s,a)$, target critic(s) $Q_{\phi'}(s,a)$, and a target actor $\pi_{\theta'}$. This is a practical design choice, not an incidental one. The sparse-reward setting is already difficult; adding a novel critic architecture on top would make it harder to tell whether gains come from the IBRL idea itself or from an unrelated stabilization trick. By keeping the backbone recognizable, the paper makes the empirical story cleaner: the improvement is attributable to how demonstrations are used, not to a wholesale change in off-policy RL.
The same logic explains why many hyperparameters are kept in the “engineering” bucket rather than elevated to the algorithmic core. Adam with learning rate $10^{-4}$, batch size $256$, $\gamma=0.99$, target smoothing $\sigma=0.1$, policy noise clipping $c=0.3$, Polyak averaging $\rho=0.99$, delayed actor updates $U=2$, ensemble size $E$, and critic-update count $G$ are all part of a robust TD3/RED-Q implementation. They matter a lot in practice, but they do not define IBRL conceptually. The lesson is subtle: a strong method often depends on a stable stack of low-level defaults, yet those defaults should remain swappable so long as the decision rule and the demonstration bootstrapping logic are preserved.
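One way to keep these defaults swappable is to collect them in a single configuration object. The sketch below is a hypothetical layout: the field names are invented for illustration, the values are the ones listed above, and the ensemble size and critic-update count are left out because no values are given here. The Polyak update assumes the convention that the target keeps a fraction $\rho$ of its old value.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class BackboneConfig:
    learning_rate: float = 1e-4     # Adam step size
    batch_size: int = 256
    gamma: float = 0.99             # discount factor
    smoothing_sigma: float = 0.1    # std of target smoothing noise
    noise_clip: float = 0.3         # clipping bound c for that noise
    polyak_rho: float = 0.99        # target-network averaging coefficient
    actor_update_period: int = 2    # delayed actor updates U

def clipped_smoothing_noise(cfg, rng):
    # Gaussian noise clipped to [-c, c], as in the smoothed target action.
    eps = rng.gauss(0.0, cfg.smoothing_sigma)
    return max(-cfg.noise_clip, min(cfg.noise_clip, eps))

def polyak_update(target_params, online_params, rho):
    # Assumed convention: target <- rho * target + (1 - rho) * online.
    return [rho * t + (1.0 - rho) * o for t, o in zip(target_params, online_params)]

cfg = BackboneConfig()
rng = random.Random(0)
eps = clipped_smoothing_noise(cfg, rng)
new_target = polyak_update([0.0, 1.0], [1.0, 1.0], cfg.polyak_rho)
print(eps, new_target)
```

Because the object is frozen, a variant run would construct a new instance, for example `BackboneConfig(noise_clip=0.5)`, rather than mutating shared defaults.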
Two details deserve special attention because they change the behavior of the method without changing its core logic. First, actor dropout at 0.5 is an architectural regularizer that helps the RL actor avoid overfitting to brittle action choices, especially when demonstrations are only one part of the exploration story. Second, in soft IBRL, the hard argmax-style comparison is replaced by a Boltzmann sample over the two candidate actions,
$$p_Q(a)=\frac{\exp(\beta Q(s,a))}{\sum_{a'\in\{a_{\mathrm{IL}},a_{\mathrm{RL}}\}}\exp(\beta Q(s,a'))}.$$
This keeps the same comparison structure while injecting stochasticity controlled by β\betaβ. When β\betaβ is large, the policy behaves almost like the hard selector; when β\betaβ is small, it becomes more exploratory. The failure mode is instructive: in a tabular or very low-capacity setting, soft selection can dilute the clean bootstrapping signal and make the agent too willing to revisit mediocre choices, which is why the hard comparison is often easier to reason about.
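The Boltzmann rule over two candidates is short enough to write out directly. The sketch below is a plain-Python illustration, not the paper's implementation: the function names are hypothetical, and the Q-values are passed in as numbers rather than computed by a critic.

```python
import math
import random


def soft_select_probs(q_il, q_rl, beta):
    """Softmax over the two candidates' Q-values, scaled by beta."""
    m = max(q_il, q_rl)  # subtract the max for numerical stability
    w_il = math.exp(beta * (q_il - m))
    w_rl = math.exp(beta * (q_rl - m))
    z = w_il + w_rl
    return w_il / z, w_rl / z


def soft_select(a_il, a_rl, q_il, q_rl, beta, rng):
    """Sample one of the two candidate actions according to p_Q."""
    p_il, _ = soft_select_probs(q_il, q_rl, beta)
    return a_il if rng.random() < p_il else a_rl


rng = random.Random(0)
uniform = soft_select_probs(0.8, 0.3, beta=0.0)      # both candidates equally likely
near_hard = soft_select_probs(0.8, 0.3, beta=100.0)  # almost all mass on the better one
choice = soft_select("a_IL", "a_RL", 0.8, 0.3, beta=1.0, rng=rng)
```

Subtracting the larger Q-value before exponentiating leaves the probabilities unchanged and avoids overflow when $\beta$ is large.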
For pixel inputs, the implementation leans on image augmentation as an additional stabilizer, because the critic has to generalize over nuisance variation in observations. For state-based tasks, that role is played more by the critic ensemble and RED-Q-style updates, which reduce overestimation and smooth out the value comparison between $a_{\mathrm{IL}}$ and $a_{\mathrm{RL}}$. This is another example of the paper’s philosophy: it does not enforce a single monolithic recipe, but rather a family of stabilizers that serve the same goal under different observation modalities.
A useful way to read this design is as a separation between what defines the algorithm and what makes it train well. The algorithmic core is the comparison between imitation and RL proposals; the rest is a set of interchangeable components that preserve that comparison while improving stability and sample efficiency. Put differently, IBRL shares the decision rule, not the network design.
The visual summary below is meant to compress exactly that point. The left side collects the pieces that are effectively fixed by the paper’s formulation, while the right side highlights the parts that can vary across implementations without changing the conceptual identity of IBRL. The table format is especially helpful here because the main lesson is not a new computation, but a boundary: where the method ends and engineering begins.

13. Why the paper changes the network architecture

The architectural choice in IBRL is not an implementation footnote; it is part of the algorithmic argument. If the setting were dense-reward control, one might tolerate a single shared representation for everything and optimize end-to-end. But in sparse-reward robotic RL, the signal is too weak to let a single encoder simultaneously serve two very different purposes: matching demonstrations and discovering reward through exploration. Those objectives place qualitatively different pressures on the representation, so forcing them through one bottleneck can make both worse.
The core tension is easy to state. Behavior cloning wants a feature map that is expressive enough to imitate demonstrations faithfully, even if that means leaning on finer-grained visual cues or a deeper visual backbone. Online RL, in contrast, needs a representation that is stable under bootstrapped targets, easy to optimize from sparse successes, and not so heavyweight that the critic becomes brittle. In other words, imitation often benefits from capacity, while RL under sparse reward often benefits from optimization simplicity.
This is why the paper’s modularity matters. Instead of requiring a single shared encoder for both imitation and control, IBRL lets the components specialize. The imitation policy $\mu_\psi$ can use an encoder that is good at fitting demonstrations, while the online actor $\pi_\theta$ and critic $Q_\phi$ can use an architecture better suited to bootstrapping and sparse-reward learning. The important point is not just that the modules are separate; it is that the policy class for RL is no longer constrained by the feature extractor chosen for imitation.
A useful way to think about this is as a bias–variance tradeoff across objectives. A deeper encoder can reduce imitation error by providing richer features, but it may also increase optimization difficulty for online RL, especially when the critic’s targets are noisy and the reward is delayed. Conversely, a lightweight architecture can make RL more tractable, but it may underfit demonstrations if asked to serve as the only shared representation. IBRL avoids that compromise by allowing each path to choose its own inductive bias.
Concretely, the paper’s RL side uses a ViT-style visual pathway with a relatively compact transformer-like design. The critic can be written schematically as
$$Q_\phi(s,a) = \mathrm{MLP}\big(\mathrm{Fuse}(\mathrm{Flatten}(\mathrm{ViT}(s)), a)\big),$$
which emphasizes a few design decisions: image patches are embedded, token features are flattened, proprioception and action are fused late, and only then does the MLP predict value. This is a good fit for control because the critic only needs enough visual abstraction to estimate action values, not necessarily the same rich feature hierarchy that a demonstration policy might prefer.
The actor path is even simpler:
$$\pi_\theta(s) = \mathrm{MLP}\big(\mathrm{Flatten}(\mathrm{ViT}(s))\big).$$
That simplicity matters. The actor is trained through RL updates, so its job is to produce a policy that is compatible with sparse bootstrapped learning; it does not need to inherit the imitation encoder’s representational choices. This decoupling is especially valuable when the demonstration policy $\mu_\psi$ is stronger or deeper, because otherwise the RL side would be forced to optimize through a feature map chosen for a different objective.
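One way to see what "flatten the tokens, then fuse late" implies is to count input dimensions for the two MLP heads. The sketch below is only bookkeeping. It assumes non-overlapping square patches, one output token per patch with no extra class token, and a fusion step that is plain concatenation. All sizes are hypothetical and are not the paper's settings.

```python
def num_patches(height, width, patch):
    """Non-overlapping square patches; sizes must divide evenly."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return (height // patch) * (width // patch)


def flat_visual_dim(height, width, patch, embed_dim):
    """Length of Flatten(ViT(s)): one embed_dim vector per patch token."""
    return num_patches(height, width, patch) * embed_dim


def critic_input_dim(visual_dim, proprio_dim, action_dim):
    """Fuse is assumed here to be plain concatenation."""
    return visual_dim + proprio_dim + action_dim


# Hypothetical sizes, chosen only to make the arithmetic concrete.
visual = flat_visual_dim(96, 96, 8, 128)     # 144 tokens * 128 = 18432
critic_in = critic_input_dim(visual, 9, 7)   # 18432 + 9 + 7 = 18448
actor_in = visual                            # the actor formula takes no action input
```

The point of the count is that flattening keeps every token's features, so the MLP heads see spatial detail directly instead of a pooled summary.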
There is also a subtle failure mode here that is easy to miss. If imitation and RL share a backbone, the gradient from behavior cloning can dominate early training, pulling the representation toward what best explains the demos rather than what best supports exploration and value estimation. That can be helpful at first, but it can also trap the online learner in a suboptimal feature geometry: the policy becomes good at copying without becoming good at improving beyond the demonstrations. Decoupling the architectures reduces this interference.
Two takeaways are worth keeping in mind:
Stronger imitation features do not have to constrain the RL policy class.
Sparse-reward optimization is often helped by a simpler, more direct visual pathway than supervised imitation.
The visual below condenses exactly this design philosophy. The left branch isolates the behavior-cloning path with its own encoder, while the right branch shows the RL actor–critic stack built around a lighter ViT-style representation. The central separation is the message: IBRL does not ask one network to solve two incompatible problems; it lets each side use the architecture that makes its own learning signal most usable.

14. Simulation results and ablations

Now that the algorithm and architecture are fixed, the question becomes empirical: what actually buys the performance? This is where the simulation results matter, because they separate the qualitative story (“IBRL uses demonstrations in two places”) from the quantitative claim that those two places are both necessary. In sparse-reward robotic control, that distinction is crucial: a method can look plausible on paper while still failing because it explores the wrong regions of state space or learns from overly optimistic targets.
Across Meta-World and Robomimic, IBRL matches or exceeds strong demonstration-augmented baselines such as RLPD, RFT, and MoDem on the reported tasks. The interesting part is not just the average improvement; it is where the gains are largest. On harder tasks like Square, where the agent must coordinate a long sequence of precise interactions under a tight sample budget, IBRL reaches the best available simulation performance among the compared methods. That matters because sparse-reward benchmarks are exactly where “small” algorithmic differences in exploration or bootstrapping become visible as large differences in final success.
The right way to read this result is that IBRL does not win by one monolithic trick. It wins because the two-proposal mechanism improves two distinct failure points in off-policy RL with demonstrations:
What to try: the actor proposal helps the agent sample actions that are more likely to enter rewarding regions.
What to learn from: the bootstrap proposal improves the target action used in temporal-difference learning, so the value function is trained against a better next-step estimate.
Formally, the policy/value update is guided by selecting between the imitation action and the RL action through the critic,
$$a^* = \operatorname*{argmax}_{a\in\{a_{\mathrm{IL}},a_{\mathrm{RL}}\}} Q_\phi(s_t,a),$$
and then using the same discrete choice set in the bootstrap target,
$$y = r_t + \gamma\max_{a'\in\{a_{\mathrm{IL}},a_{\mathrm{RL}}\}} Q_{\phi'}(s_{t+1},a').$$
This is a small change in notation, but a large change in behavior. The actor proposal influences exploration pressure at the current state, while the bootstrap proposal influences the credit assignment target at the next state. If either one is weak, the other has to compensate, and in sparse-reward settings that compensation is rarely enough.
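The two rules above can be sketched in a few lines. This is a minimal illustration with a toy critic, not the paper's code. The function names, the tie-break in favor of the imitation action, and the `done` handling for terminal transitions are all assumptions, since the equations do not specify them.

```python
def actor_proposal(q, s, a_il, a_rl):
    """Return whichever candidate the critic scores higher at state s."""
    # Ties go to the imitation action (an arbitrary choice for this sketch).
    return a_il if q(s, a_il) >= q(s, a_rl) else a_rl


def bootstrap_target(r, gamma, q_target, s_next, a_il_next, a_rl_next, done=False):
    """TD target using the better of the two next-state candidates."""
    if done:
        return r  # assumed: no bootstrapping past a terminal transition
    best_next = max(q_target(s_next, a_il_next), q_target(s_next, a_rl_next))
    return r + gamma * best_next


def toy_q(s, a):
    """Toy critic for illustration only: prefers actions close to 0.5."""
    return 1.0 - abs(a - 0.5)


a_star = actor_proposal(toy_q, s=0.0, a_il=0.4, a_rl=-0.2)
y = bootstrap_target(0.0, 0.99, toy_q, s_next=0.0, a_il_next=0.4, a_rl_next=-0.2)
```

In this toy case the critic prefers the imitation action at both steps, so the agent executes it and the target bootstraps from its value. Swapping the two Q-values would hand both decisions to the RL actor without any change to the code.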
The ablations make this separation especially clear. When the actor proposal is removed, early learning slows down because the agent loses one of its easiest routes into productive behavior. It may still recover eventually, but it spends more updates wandering before it finds enough rewarding transitions. By contrast, removing the bootstrap proposal does not primarily hurt exploration; it hurts the quality of the learning signal itself. The critic then trains on poorer targets, so value estimates sharpen more slowly and convergence degrades even if the agent occasionally visits good states.
This is also why the actor-dropout result should be interpreted carefully. Dropout improves stability, which is useful in online RL because the actor can otherwise become too brittle or overconfident in a narrow action distribution. But stability is not the same as explanatory power. The ablation suggests dropout is a helpful regularizer, not the main source of the gain. In other words, it smooths training, but it does not replace the two-proposal mechanism.
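For readers who want the mechanics, dropout at rate 0.5 can be sketched as follows. This is the standard "inverted" form, in which surviving units are rescaled so the expected activation is unchanged. Where exactly the paper applies it inside the actor is not specified here, so the vector below is just a hypothetical hidden activation.

```python
import random


def dropout(x, p, training, rng):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


rng = random.Random(0)
hidden = [0.2, -1.0, 0.7, 0.5]                # hypothetical actor activations
train_out = dropout(hidden, 0.5, True, rng)   # random mask, survivors doubled
eval_out = dropout(hidden, 0.5, False, rng)   # unchanged at evaluation time
```

Because the mask changes on every forward pass during training, the actor cannot rely on any single hidden unit, which is the regularizing effect the ablation is probing.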
The architecture comparison reinforces the same lesson. The proposed ViT backbone performs better than the DrQ-style CNN and the BC-oriented ResNet-18 when the network is used for online optimization rather than pure imitation. That difference is not cosmetic: an architecture that is good for behavior cloning is not automatically the best substrate for value-based updates. RL needs representations that remain useful under bootstrapped targets, policy improvement, and distribution shift, whereas imitation-only backbones are often tuned for predicting demonstrated actions under a narrower data regime. The result is a reminder that in demo-augmented RL, the backbone is part of the optimization story, not just the perception story.
So the overall message is fairly crisp. IBRL is strongest when both proposals are active and the backbone is selected for RL, not just for imitation. The method’s gains come from better action selection and better target construction, and the ablations rule out the tempting but incomplete explanation that any one of these ingredients alone accounts for the full improvement.
The visual below compresses that argument into two complementary views. The left panel summarizes the benchmark outcomes: IBRL’s curves or bars sit at the top across tasks, with the hardest setting, Square, standing out as the clearest case where the method stays effective under the tightest budget. The right panel then explains why those results happen by showing the damage caused when either proposal is removed, and by contrasting the RL-friendly ViT backbone with architectures that are less suited to online improvement. Together, the panels turn the empirical claim into evidence for the mechanism: better exploration, better bootstrapping, and better architecture all point in the same direction.

15. Unifying summary: all ways IBRL uses the imitation policy

After seeing how the proposal rules and the TD3-style update loop work in isolation, the bigger picture is that IBRL does not use demonstrations in just one place. It turns the imitation policy $\mu_\psi$ into a reusable module that supports the entire learning pipeline: initialization, exploration, and target construction. That reuse is what makes the method more sample efficient than approaches that merely replay demonstrations or add a static imitation regularizer.
The first role is the most familiar one: behavior cloning pretraining. Before any RL interaction, the imitation policy is fit on the demonstration dataset $\mathcal{D}$ by maximum likelihood,
$$\mathcal{L}(\psi)= -\mathbb{E}_{(s,a)\sim\mathcal{D}}[\log \mu_\psi(a\mid s)].$$
This gives $\mu_\psi$ a meaningful action prior in regions of the state space where sparse rewards would otherwise provide no learning signal. In practice, this matters because early exploration in robotics is usually dominated by failure states: the agent does not just learn slowly, it often never stumbles onto reward at all. A demonstration-trained policy changes that starting point from random thrashing to a plausible trajectory manifold.
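As a concrete instance of the maximum-likelihood objective, the sketch below evaluates the behavior-cloning loss for a diagonal Gaussian policy with a fixed standard deviation. That policy form is an assumption made for illustration, and the text does not say which likelihood the paper uses. With a fixed standard deviation, the loss equals a scaled squared error plus a constant. The demonstrations and both candidate policies are hypothetical.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian with a shared, fixed std."""
    d = len(action)
    sq = sum((a - m) ** 2 for a, m in zip(action, mean))
    return -0.5 * sq / std**2 - d * math.log(std) - 0.5 * d * math.log(2 * math.pi)


def bc_loss(policy_mean, demos, std=1.0):
    """Monte Carlo estimate of -E[log mu(a|s)] over (state, action) pairs."""
    total = sum(gaussian_log_prob(a, policy_mean(s), std) for s, a in demos)
    return -total / len(demos)


# Hypothetical 1-D demonstrations generated by the rule a = 0.5 * s.
demos = [((1.0,), (0.5,)), ((2.0,), (1.0,)), ((-1.0,), (-0.5,))]
good = bc_loss(lambda s: (0.5 * s[0],), demos)   # matches the demonstrator
bad = bc_loss(lambda s: (0.0,), demos)           # ignores the state entirely
```

The policy that reproduces the demonstrated rule attains the lower loss, and the gap between the two values is exactly half the mean squared action error of the state-blind policy.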
The second role is the key IBRL idea: actor proposal. When the RL actor $\pi_\theta$ proposes an action, IBRL does not blindly execute it. Instead, it compares the RL action with the imitation action and chooses
$$a^*=\operatorname*{argmax}_{a\in\{a_{\mathrm{IL}},a_{\mathrm{RL}}\}} Q_{\phi'}(s,a).$$
This is a small change architecturally, but a large change behaviorally. The agent can still discover better-than-demonstration behavior, yet it is protected from the early-stage brittleness of an undertrained actor. The selection is statewise and Q-guided, so the method is not “always imitate” and not “always optimize RL”; it is “trust whichever action currently looks better under value estimation.” That is exactly what sparse-reward robotics needs, because the Q-function can act as the arbiter between a reasonable prior and a potentially better but noisier learned policy.
The third role is subtler and often more important for stability: bootstrap proposal. Instead of only using the better proposal at action time, IBRL also uses it inside the TD target,
$$y=r_t+\gamma\max_{a'\in\{a_{\mathrm{IL}},a_{\mathrm{RL}}\}}Q_{\phi'}(s_{t+1},a').$$
This means the critic is trained to evaluate the best available proposal in the next state, not merely the next action from the RL actor. Conceptually, this shrinks the gap between the value function and the behavior actually being executed. If the RL actor is weak, the bootstrap target can still be anchored by the imitation policy; if the RL actor improves, it can gradually take over. The important assumption here is that the two proposals are comparable enough for Q to rank them meaningfully; the failure mode is when both are poor or the Q estimates are badly miscalibrated, in which case the max operator can still propagate error.
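A small worked example, with numbers chosen purely for illustration, shows how much the target can differ early in training when the RL actor is still weak:

```latex
\begin{aligned}
&r_t = 0,\quad \gamma = 0.99,\quad
Q_{\phi'}(s_{t+1}, a_{\mathrm{IL}}) = 0.8,\quad
Q_{\phi'}(s_{t+1}, a_{\mathrm{RL}}) = 0.3,\\
&y_{\text{bootstrap proposal}} = 0 + 0.99 \cdot \max(0.8,\, 0.3) = 0.792,\\
&y_{\text{RL action only}} = 0 + 0.99 \cdot 0.3 = 0.297.
\end{aligned}
```

The same arithmetic also illustrates the failure mode: if the value $0.8$ is itself an overestimate, the max operator passes that error straight into the target.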
A useful way to think about this is that IBRL splits the imitation policy’s influence into two channels:
acting: choose safer, higher-value exploration steps;
learning: choose stronger bootstrap targets.
That separation is exactly why the method is more robust than a pure demonstration replay strategy. Replay can reuse good transitions, but it does not directly help the agent decide what to do in novel states. Likewise, regularization can keep the policy near demonstrations, but it does not explicitly help Bellman backups prefer the demonstration action when the RL actor is uncertain.
The soft IBRL variant relaxes the hard $\arg\max$ decision by sampling from a proposal distribution,
$$a^*\sim p_Q(a),\; a'\sim p_Q(a).$$
This smooths the selection rule and can reduce brittleness when Q estimates are noisy or when the two candidate actions are close in value. But the tabular setting reveals an important limitation: if the proposal distribution is too soft, then the method can lose the decisive advantage of the imitation policy and become more like an ordinary exploratory RL update with an extra noisy candidate. In other words, hardness is not merely a design choice; it is part of why IBRL can exploit demonstrations so effectively when reward is sparse.
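Because there are only two candidates, the Boltzmann distribution reduces to a logistic function of the Q-value gap, which makes the role of $\beta$ explicit. The numerical values below are illustrative only.

```latex
\begin{aligned}
p_Q(a_{\mathrm{IL}}) &= \frac{1}{1 + \exp(-\beta\,\Delta)},
\qquad \Delta = Q(s, a_{\mathrm{IL}}) - Q(s, a_{\mathrm{RL}}),\\
\beta \to 0 &:\quad p_Q(a_{\mathrm{IL}}) \to \tfrac{1}{2}
\quad \text{(the Q-values are ignored)},\\
\beta \to \infty &:\quad p_Q(a_{\mathrm{IL}}) \to 1 \ \text{if } \Delta > 0,
\quad 0 \ \text{if } \Delta < 0
\quad \text{(hard selection)},\\
\Delta = 0.5 &:\quad \beta = 1 \Rightarrow p_Q(a_{\mathrm{IL}}) \approx 0.622,
\qquad \beta = 10 \Rightarrow p_Q(a_{\mathrm{IL}}) \approx 0.993.
\end{aligned}
```

So a proposal distribution that is "too soft" means one where $\beta\,\Delta$ is small, and the better candidate is then picked only slightly more often than chance.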
This is also where the modular IL/RL architecture and actor dropout matter. If the imitation and RL policies are forced to share too much representation, the stronger supervised signal can dominate or interfere with the RL signal, and the system may inherit the wrong inductive bias. By keeping the modules decoupled, IBRL preserves the roles: the imitation policy remains a stable proposal source, while the RL actor can specialize in surpassing the demonstrator. Actor dropout then makes the system more robust by preventing overreliance on either branch, encouraging the critic and selection logic to function even when one proposal is temporarily poor.
So the unifying story is simple but powerful: $\mu_\psi$ is not just pretraining data converted into weights; it is a reusable action prior that participates in both exploration and bootstrapping. That is why the method can begin conservatively, yet still improve beyond demonstrations once the critic becomes informative. The visual below compresses this whole logic into one table: each row marks a different place where the imitation policy enters the pipeline, and the highlighted rows emphasize the two core contributions, actor proposal and bootstrap proposal. The takeaway box then distills the mechanism into the central claim of the paper: separate imitation plus Q-based selection gives the agent a better way to explore and a better way to learn from what it sees.