Modern BC architectures are typically deep models with tens of millions of parameters that use action chunking or diffusion-based approaches, which can make it challenging to apply RL methods to directly optimize the policy.
Our ResFiT approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via efficient off-policy RL, requiring only sparse binary reward signals and enabling real-world training on high-DoF systems.
ResFiT is a two-phase approach that uses online RL to improve BC policies: it takes a pre-trained BC policy as input and learns lightweight per-step residual corrections on top of it via efficient off-policy RL.
Because the BC policy is treated as a frozen black-box base, its complex architecture never needs to be modified, yet the combined policy can be trained in the real world on high-DoF systems using only sparse binary reward signals.
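As a rough illustration of this composition, the sketch below adds a small, bounded residual correction to the frozen base policy's action at each step. The network sizes, the residual scale alpha, and the base_policy interface are illustrative assumptions rather than the exact ResFiT implementation (which, among other things, operates on top of an action-chunked base).

```python
import torch
import torch.nn as nn

class ResidualActor(nn.Module):
    """Small MLP that outputs a bounded correction to the base action."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256, alpha: float = 0.1):
        super().__init__()
        self.alpha = alpha  # caps the residual magnitude relative to the action range
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor, base_action: torch.Tensor) -> torch.Tensor:
        # The residual is conditioned on both the observation and the base action.
        return self.alpha * self.net(torch.cat([obs, base_action], dim=-1))

def act(obs: torch.Tensor, base_policy, residual_actor: ResidualActor) -> torch.Tensor:
    """Compose the frozen, black-box base action with the learned residual."""
    with torch.no_grad():
        a_base = base_policy(obs)            # pre-trained BC policy, never modified
        a_res = residual_actor(obs, a_base)  # lightweight per-step correction
    return torch.clamp(a_base + a_res, -1.0, 1.0)
```

Keeping the correction small (via the tanh output and the alpha scale) means the combined policy initially behaves like the BC base, which helps make real-world training from sparse rewards tractable.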
Our residual off-policy approach demonstrates substantial improvements in sample efficiency and final performance compared to standard off-policy algorithms across a variety of challenging environments.
We compare our method to several strong baselines and ablations. First, we examine how off-policy residual RL on top of an action-chunked base policy compares to directly performing off-policy RL to learn a single-action policy from the same demonstrations. For this comparison, we use an optimized version of RLPD [1] (a state-of-the-art off-policy RL algorithm), called "Tuned RLPD," that incorporates the same design decisions as our method but without the base policy. We also compare against IBRL [2], which uses a pre-trained BC policy to propose actions and bootstrap target values during RL training. Additionally, we include "Filtered BC," an online BC fine-tuning baseline that starts from the same base policy but, rather than using residual RL, iteratively adds successful rollouts back into the dataset for continued behavioral cloning. Finally, we ablate key design choices, including layer normalization and the incorporation of demonstrations during online RL.
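To make the Filtered BC baseline concrete, here is a minimal sketch of its loop; collect_rollout and bc_update are hypothetical placeholders (passed in as callables) for the environment rollout and the supervised fine-tuning step, not the baseline's actual implementation.

```python
def filtered_bc(base_policy, env, demos, collect_rollout, bc_update,
                num_rounds: int = 10, rollouts_per_round: int = 20):
    """Sketch of Filtered BC: keep only successful rollouts and keep cloning."""
    dataset = list(demos)          # start from the original demonstrations
    policy = base_policy           # same pre-trained BC policy used by ResFiT
    for _ in range(num_rounds):
        for _ in range(rollouts_per_round):
            trajectory, success = collect_rollout(env, policy)  # sparse binary outcome
            if success:
                dataset.append(trajectory)   # the "filter": keep successes only
        policy = bc_update(policy, dataset)  # continued behavioral cloning
    return policy
```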
Performance comparison across simulation tasks showing ResFiT's improvements over baseline methods including Tuned RLPD, IBRL, and Filtered BC.
On the simulated BoxClean task, we compare our off-policy residual RL approach to performing residual RL with the on-policy PPO algorithm, as done in the ResiP algorithm [3].
Our off-policy approach converges in roughly 200k environment steps, whereas the on-policy variant requires roughly 40M, a 200× improvement in sample efficiency. This gap highlights the need for off-policy approaches when performing RL directly in the real world.
Off-policy TD3 converges in 200k steps vs. on-policy PPO requiring 40M steps—a 200× improvement in sample efficiency.
We apply the ResFiT method to real-world tasks and demonstrate, to our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands.
We demonstrate ResFiT on a real-world bimanual dexterous manipulation platform: the wheeled Vega humanoid from Dexmate. The robot features two 7-DoF arms, two 6-DoF OyMotion dexterous hands, and two image streams from a head-mounted ZED camera, with a 29-dimensional action space using absolute joint position control.
Tasks: We evaluate on two challenging manipulation tasks. In WoollyBallPnP, the right hand picks up a ball from a random table location and places it into a randomly positioned tote. In PackageHandover, the right hand picks up a deformable package, hands it to the left hand, and places it into a tote on the left side of the workspace.
Evaluation: To ensure fair comparison, we use blind A/B testing with matched initial conditions. For each round, we sample random scene configurations, randomly assign policies to labels A and B, execute both policies from identical states, and reveal identities only after completion. This mitigates evaluator bias and environmental confounds common in real-world robot evaluations.
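A minimal sketch of one such A/B round is given below; sample_scene, reset_to, and run_episode are hypothetical helpers standing in for the robot-specific evaluation plumbing.

```python
import random

def blind_ab_round(policy_x, policy_y, sample_scene, reset_to, run_episode):
    """One blind A/B round with matched initial conditions for both policies."""
    scene = sample_scene()                        # shared random scene configuration
    assignment = {"A": policy_x, "B": policy_y}
    if random.random() < 0.5:                     # hide which policy got which label
        assignment = {"A": policy_y, "B": policy_x}
    outcomes = {}
    for label, policy in assignment.items():
        reset_to(scene)                           # identical initial state for A and B
        outcomes[label] = run_episode(policy)     # binary success under sparse reward
    # Identities are revealed to the evaluator only after both rollouts finish.
    return outcomes, assignment
```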
Success rates on real-world tasks, comparing the ACT behavior cloning baseline against the ResFiT residual policy.
PackageHandover Comparisons
Side-by-side comparison of the base policy (left) vs. the residual policy (right) on the PackageHandover task. The residual policy demonstrates improved success rates and more robust handling across different scenarios.
WoollyBallPnP Comparisons
Side-by-side comparison of the base policy (left) vs. the residual policy (right) on the WoollyBallPnP task. These examples showcase scenarios where the base policy fails but the residual policy successfully completes the task, demonstrating the robustness improvements achieved through residual learning.
We find that in long-horizon tasks with sparse rewards, the Update-to-Data (UTD) ratio and the use of n-step returns are crucial.
UTD ratio ablation showing optimal performance at UTD=4, with diminishing returns at higher values.
For UTD ratios, we observe that for tasks with horizon lengths of 150-250 steps (such as BoxCleanup), UTD values greater than 1 provide clear benefits, but the gains plateau at moderate values. Learning is noticeably slower with a UTD of ½, while increasing to very high UTD values of 8 or more yields diminishing returns; a UTD of 4 already captures most of the benefit while remaining stable.
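Concretely, the UTD ratio is the number of gradient updates performed per collected environment transition. The sketch below shows a generic off-policy loop with a configurable integer UTD ratio (fractional ratios like ½ correspond to updating less than once per step); the agent and buffer interfaces are assumptions for illustration, not our exact training code.

```python
def train(agent, env, buffer, total_env_steps: int = 200_000, utd_ratio: int = 4):
    """Generic off-policy loop: `utd_ratio` gradient updates per environment step."""
    obs = env.reset()
    for _ in range(total_env_steps):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs
        for _ in range(utd_ratio):        # UTD > 1: reuse each transition more aggressively
            agent.update(buffer.sample())
```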
N-step returns ablation demonstrating the importance of multi-step returns for sparse reward tasks, with optimal performance around n=5.
For n-step returns, we see the importance of using values larger than 1 for sparse-reward tasks. However, since larger n-step values also increase the bias of the off-policy target, excessively large values can hurt performance, requiring a careful balance between faster propagation of the sparse reward signal and controlling this bias.
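For reference, the sketch below computes an n-step bootstrapped target of the form y_t = r_t + γ r_{t+1} + ... + γ^{n-1} r_{t+n-1} + γ^n (1 - done) Q_target(s_{t+n}, a_{t+n}); the variable names are ours, not the paper's. Intuitively, a larger n lets the sparse terminal reward reach earlier states in fewer updates, while introducing more bias into the off-policy target.

```python
import numpy as np

def n_step_target(rewards, bootstrap_value, terminated, gamma: float = 0.99):
    """n-step return target: discounted sum of the next n rewards plus a
    discounted bootstrap from the target critic at step t+n (zero if terminal)."""
    rewards = np.asarray(rewards, dtype=np.float64)   # r_t, ..., r_{t+n-1}
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(discounts @ rewards
                 + (gamma ** n) * (1.0 - float(terminated)) * bootstrap_value)

# Example: a sparse reward of 1 arrives only at the last of n = 5 steps.
print(n_step_target([0, 0, 0, 0, 1], bootstrap_value=0.8, terminated=False))
```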
Side-by-side comparison of the BC policy (left) vs. Tuned RLPD (center) vs. ResFiT (right) on simulated tasks. Tuned RLPD produces much faster but qualitatively very different behavior from the BC policy, and it fails on the coffee task. ResFiT improves performance across all tasks.
Per-task results (success rate / average number of steps) for the BC policy, Tuned RLPD, and ResFiT, respectively:
Task 1: 77% / 322, 0% / N/A, 98% / 141
Task 2: 76% / 231, 98% / 39, 97% / 97
Task 3: 89% / 292, 93% / 27, 99% / 166
Task 4: 76% / 174, 99% / 39, 91% / 121
Task 5: 78% / 160, 97% / 41, 99% / 110
There's a lot of excellent work related to ours in the space of manipulation, reinforcement learning, and BC models. Here are some notable examples:
Recent work has explored combining diffusion models with reinforcement learning:
Learning corrective residual components has seen widespread success in robotics:
There's been an increasing amount of theoretical analysis of imitation learning, with recent works focusing on the properties of noise injection and corrective actions:
These works aim to enhance the robustness and sample efficiency of imitation learning algorithms.
Looking ahead, figuring out the right way to relax the frozen base constraint without sacrificing stability could provide further improvement in performance and robustness. If we can distill the more precise, reliable, and fast behavior from the combined policy back into the base policy, that would provide more room for the residual model to improve further.
This would be particularly powerful in the multitask setting, where one can distill task-specific residual improvements into an increasingly capable generalist. Our method is base-model agnostic and could potentially scale to fine-tuning large multi-task behavior models while fully preserving their pre-training capabilities.
@misc{ankile2025residualoffpolicyrlfinetuning,
title={Residual Off-Policy RL for Finetuning Behavior Cloning Policies},
author={Lars Ankile and Zhenyu Jiang and Rocky Duan and Guanya Shi and Pieter Abbeel and Anusha Nagabandi},
year={2025},
eprint={2509.19301},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2509.19301},
}