Residual Off-Policy RL for Finetuning Behavior Cloning Policies

1Amazon FAR (Frontier AI & Robotics)   2Stanford University   3Carnegie Mellon University   4UC Berkeley
*Work done while interning at Amazon FAR   †Work done while at Amazon FAR

We present ResFiT, a residual RL method that trains directly in the real world on our 29 degree-of-freedom (DoF) wheeled humanoid platform with two five-fingered hands, demonstrating successful real-world RL training on a humanoid robot with dexterous hands.

Abstract

Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from increasing data.
In comparison, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency, safety concerns, and the difficulty of learning from sparse rewards for long-horizon tasks, especially for high-degree-of-freedom (DoF) systems.
We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via efficient off-policy RL.
We show that our method requires only sparse binary reward signals and can effectively learn manipulation tasks on high-DoF systems in both simulation and the real world. In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands.
Our results show state-of-the-art performance in various vision-based tasks, establishing a practical pathway to deploy RL in the real world.

The key insight

Modern BC architectures are typically deep models with tens of millions of parameters that rely on action chunking or diffusion-based action heads, which makes it challenging to optimize the policy directly with RL.

Our ResFiT approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via efficient off-policy RL, requiring only sparse binary reward signals and enabling real-world training on high-DoF systems.

Pipeline diagram



Introducing ResFiT: Residual Fine-tuning with Off-Policy RL

ResFiT is a two-phase approach using online RL to improve BC policies. It takes as input a pre-trained BC policy and learns lightweight per-step residual corrections via efficient off-policy RL.

The method leverages BC policies as black-box bases, avoiding the need to modify the complex BC architecture while enabling real-world training on high-DoF systems with only sparse binary reward signals.
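
To make the per-step composition concrete, below is a minimal sketch of how such a residual setup can be wired up, assuming a frozen, black-box base policy that emits action chunks and a small residual network conditioned on the current observation and the base action. The names (`ResidualPolicy`, `base_policy`, `chunk_buffer`) and the network layout are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Small MLP that outputs a bounded per-step correction.

    Illustrative sketch: conditions on the observation and the base
    action, and scales a tanh output so corrections stay small
    relative to the base action (the 0.1 scale is an assumed value).
    """

    def __init__(self, obs_dim: int, act_dim: int, scale: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.LayerNorm(256), nn.ReLU(),
            nn.Linear(256, 256), nn.LayerNorm(256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),
        )
        self.scale = scale

    def forward(self, obs: torch.Tensor, base_action: torch.Tensor) -> torch.Tensor:
        return self.scale * self.net(torch.cat([obs, base_action], dim=-1))


@torch.no_grad()
def act(base_policy, residual, obs, chunk_buffer):
    """Compose base and residual actions for one control step.

    The frozen BC policy is queried as a black box; when its current
    action chunk is exhausted, a new chunk is requested. The residual
    adds a per-step correction on top of the current base action.
    """
    if len(chunk_buffer) == 0:
        chunk_buffer.extend(base_policy(obs))  # new chunk of k base actions
    base_action = chunk_buffer.pop(0)
    return base_action + residual(obs, base_action)
```

During training, only the residual network receives gradients; the base policy is simply queried for actions, which is what allows it to remain a black box.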

Task overview



Residual learning leads to significant performance improvements

Our residual off-policy approach demonstrates substantial improvements in sample efficiency and final performance compared to standard off-policy algorithms across a variety of challenging environments.

We compare our method to several strong baselines and ablations. First, we examine how off-policy residual RL on top of an action-chunked base policy compares to directly performing off-policy RL to learn a single-action policy from the same demonstrations. For this comparison, we use an optimized version of RLPD [1] (state-of-the-art off-policy RL) called "Tuned RLPD" that incorporates the same design decisions as our method but without the base policy. We also compare against IBRL [2], which uses a pre-trained BC policy to propose actions and bootstrap target values during RL training. Additionally, we include "Filtered BC," an online BC fine-tuning baseline that starts with the same base policy but iteratively adds successful rollouts back into the dataset for continued behavioral cloning rather than using residual RL (see the sketch below). Finally, we ablate key design choices, including layer normalization and the incorporation of demonstrations during online RL.
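
For reference, here is a rough sketch of how the Filtered BC baseline described above could be implemented; `collect_rollout`, `is_success`, and `train_bc` are hypothetical helpers standing in for the environment interface and the BC training loop.

```python
def filtered_bc(policy, env, demos, num_rounds=10, rollouts_per_round=50):
    """Online BC fine-tuning by self-imitation of successful rollouts.

    Sketch of the baseline described above: no residual policy and no
    RL update, only supervised cloning on demos plus filtered rollouts.
    """
    dataset = list(demos)
    for _ in range(num_rounds):
        new_trajs = [collect_rollout(env, policy) for _ in range(rollouts_per_round)]
        dataset += [t for t in new_trajs if is_success(t)]  # keep successes only
        policy = train_bc(policy, dataset)                  # continue cloning
    return policy
```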

Success rate comparison

Performance comparison across simulation tasks showing ResFiT's improvements over baseline methods including Tuned RLPD, IBRL, and Filtered BC.




Off-Policy vs On-Policy Residual RL

On the simulated BoxClean task, we compare our off-policy residual RL approach to performing residual RL with the on-policy PPO algorithm, as done in the ResiP algorithm [3].

Our approach converges in 200k steps, versus 40M steps for on-policy PPO, a 200× boost in sample efficiency. This improvement highlights the need for off-policy approaches when performing RL directly in the real world.

Off-policy vs On-policy comparison on BoxClean task

Off-policy TD3 converges in 200k steps vs. on-policy PPO requiring 40M steps—a 200× improvement in sample efficiency.




Real-World RL Fine-Tuning using ResFiT

We apply the ResFiT method to real-world tasks and demonstrate, to our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands.




Real-World Task Results

We demonstrate ResFiT on a real-world bimanual dexterous manipulation platform: the wheeled Vega humanoid from Dexmate. The robot features two 7-DoF arms, two 6-DoF OyMotion dexterous hands, and two image streams from a head-mounted ZED camera, with a 29-dimensional action space using absolute joint-position control.

Tasks: We evaluate on two challenging manipulation tasks. In WoollyBallPnP, the right hand picks up a ball from a random table location and places it into a randomly positioned tote. In PackageHandover, the right hand picks up a deformable package, hands it to the left hand, and places it into a tote on the left side of the workspace.

Evaluation: To ensure fair comparison, we use blind A/B testing with matched initial conditions. For each round, we sample random scene configurations, randomly assign policies to labels A and B, execute both policies from identical states, and reveal identities only after completion. This mitigates evaluator bias and environmental confounds common in real-world robot evaluations.
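
A minimal sketch of this blind A/B protocol is shown below, with hypothetical `sample_scene`, `reset_to`, and `run_episode` helpers; the essential ingredients are the hidden random label assignment and that both policies start each trial from the same sampled scene configuration.

```python
import random

def blind_ab_round(policy_1, policy_2, env, num_trials=10):
    """Run one blind A/B evaluation round with matched initial conditions."""
    assignment = {"A": policy_1, "B": policy_2}
    if random.random() < 0.5:                      # hide which policy is which
        assignment = {"A": policy_2, "B": policy_1}
    results = {"A": [], "B": []}
    for _ in range(num_trials):
        scene = sample_scene()                     # shared random configuration
        for label in ("A", "B"):
            reset_to(env, scene)                   # identical initial state
            results[label].append(run_episode(env, assignment[label]))
    return results, assignment                     # reveal identities only afterwards
```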

Real-world task success rates comparison

Success rates on real-world tasks comparing BC with ACT baseline vs. ResFiT residual policy.




PackageHandover Comparisons

Side-by-side comparison of base policy (left) vs. residual policy (right) performance on package handover tasks. The residual policy demonstrates improved success rates and more robust handling across different scenarios.

Base Policy

Residual Policy




WoollyBallPnP Comparisons

Side-by-side comparison of base policy (left) vs. residual policy (right) performance on woolly ball manipulation tasks. These examples showcase scenarios where the base policy fails but the residual policy successfully completes the task, demonstrating the robustness improvements achieved through residual learning.

Base Policy

Residual Policy




What Matters for Performance?

We find that in long-horizon tasks with sparse rewards, the Update-to-Data (UTD) ratio and the use of n-step returns are crucial.

UTD ratio ablation study

UTD ratio ablation showing optimal performance at UTD=4, with diminishing returns at higher values.

For the UTD ratio, we observe that for tasks with horizon lengths of 150-250 steps (such as BoxClean), values greater than 1 provide clear benefits, but the gains plateau at moderate values. Learning is noticeably slower at a UTD of 1/2, while very high UTDs of 8 or more yield diminishing returns; a UTD of 4 already captures most of the benefit while remaining stable.
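
In implementation terms, the UTD ratio is simply the number of gradient updates performed per environment step; a generic sketch of such an off-policy interaction loop (with hypothetical `agent`, `buffer`, and Gym-style `env` interfaces) is shown below. A UTD of 1/2 would correspond to updating only every other environment step.

```python
def online_rl_loop(env, agent, buffer, total_env_steps, utd_ratio=4):
    """Generic off-policy loop with `utd_ratio` gradient updates per env step."""
    obs = env.reset()
    for _ in range(total_env_steps):
        action = agent.act(obs)
        next_obs, reward, done, _ = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs
        for _ in range(utd_ratio):             # UTD ratio: updates per env step
            agent.update(buffer.sample())
    return agent
```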

N-step returns ablation study

N-step returns ablation demonstrating the importance of multi-step returns for sparse reward tasks, with optimal performance around n=5.

For n-step returns, we see the importance of using values larger than 1 for sparse-reward tasks. However, because larger n-step values also increase bias, excessively large values can hurt performance, requiring a careful balance between faster reward propagation and controlling bias.
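
For reference, here is a sketch of the n-step TD target that this ablation varies, under the usual assumptions of a replay buffer storing length-n transition windows and a target network providing the bootstrap value.

```python
def n_step_target(rewards, terminals, bootstrap_value, gamma=0.99, n_step=5):
    """Compute an n-step TD target from a length-n transition window.

    rewards, terminals: length-n sequences from the replay buffer.
    bootstrap_value: target-network Q-value at the state n steps ahead.
    Larger n propagates a sparse terminal reward backward faster, but
    the uncorrected off-policy return becomes increasingly biased.
    """
    target, discount = 0.0, 1.0
    for reward, terminal in zip(rewards, terminals):
        target += discount * reward
        if terminal:                   # stop accumulating at episode end
            return target
        discount *= gamma
    return target + discount * bootstrap_value
```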




Comparison with Baseline in Simulation

Side-by-side comparison of BC policy (left) vs. Tuned RLPD (center) vs. ResFiT (right) performance on simulated tasks. Tuned RLPD produces much faster but qualitatively different behavior than the BC policy, and it fails entirely on the coffee task. ResFiT improves performance across all tasks.

Per-task results, reported as success rate / average number of steps:

| Task   | BC Policy | Tuned RLPD | ResFiT    |
|--------|-----------|------------|-----------|
| Task 1 | 77% / 322 | 0% / N/A   | 98% / 141 |
| Task 2 | 76% / 231 | 98% / 39   | 97% / 97  |
| Task 3 | 89% / 292 | 93% / 27   | 99% / 166 |
| Task 4 | 76% / 174 | 99% / 39   | 91% / 121 |
| Task 5 | 78% / 160 | 97% / 41   | 99% / 110 |


Related Links

There's a lot of excellent work related to ours in the space of manipulation, reinforcement learning, and BC models. Here are some notable examples:

Policy Learning and Fine-tuning

  • Policy-Agnostic Reinforcement Learning introduces a universal method for fine-tuning various policy classes, including diffusion and transformer-based policies, by decoupling policy improvement from specific policy architectures to enhance sample efficiency in both offline and online RL settings.
  • Policy Decorator presents a model-agnostic approach that refines large behavior-cloning policy models by learning a bounded residual policy with online reinforcement learning, improving the base model without modifying its architecture.
  • EXPO: Expressive Policy Fine-tuning focuses on refining expressive pre-trained policies with online RL, efficiently adapting them to improve performance in new tasks or environments.
  • Q-chunking presents a simple yet effective recipe for improving reinforcement learning algorithms for long-horizon, sparse-reward tasks by applying action chunking to temporal difference-based RL methods, enabling agents to leverage temporally consistent behaviors from offline data for more effective online exploration.
  • Horizon Reduction Makes RL Scalable studies the scalability of offline reinforcement learning algorithms and introduces SHARSA, a minimal yet scalable method that effectively reduces the horizon to unlock the scalability of offline RL on challenging tasks.

Diffusion Models and Reinforcement Learning

Recent work has explored combining diffusion models with reinforcement learning:

  • DPPO introduces Diffusion Policy Policy Optimization, an algorithmic framework for fine-tuning diffusion-based policies using policy gradient methods.
  • FPO presents Flow Policy Optimization, a policy gradient algorithm that brings flow matching into the policy gradient framework for training flow-based generative policies from scratch.
  • Black et al. and Fan et al. studied how to cast diffusion de-noising as a Markov Decision Process, enabling preference-aligned image generation with policy gradient RL.
  • IDQL uses a Q-function to select the best among multiple diffusion model outputs.
  • Goo et al. explored advantage weighted regression for diffusion models.
  • Decision Diffuser and related works change the objective into a supervised learning problem with return conditioning.
  • Wang et al. explored augmenting the de-noising training objective with a Q-function maximization objective.

Residual Learning in Robotics

Learning corrective residual components has seen widespread success in robotics:

  • ResiP enhances behavior cloning policies by training residual policies with reinforcement learning for high-precision assembly tasks.
  • Works like Silver et al., Davchev et al., and others have explored learning residual policies that correct for errors made by a nominal behavior policy.
  • Ajay et al. and Kloss et al. combined learned components to correct for inaccuracies in analytical models for physical dynamics.
  • Schoettler et al. applied residual policies to insertion tasks.
  • TRANSIC by Jiang et al. applied residual policy learning to the FurnitureBench task suite, using the residual component to model online human-provided corrections.

Theoretical Analysis of Imitation Learning

There's been an increasing amount of theoretical analysis of imitation learning, with recent works focusing on the properties of noise injection and corrective actions:

  • Provable Guarantees for Generative Behavior Cloning proposes a framework for generative behavior cloning, ensuring continuity through data augmentation and noise injection.
  • CCIL generates corrective data using local continuity in environment dynamics.
  • TaSIL penalizes deviations in higher-order Taylor series terms between learned and expert policies.

These works aim to enhance the robustness and sample efficiency of imitation learning algorithms.

Future Directions

Looking ahead, figuring out how to relax the frozen-base constraint without sacrificing training stability could further improve performance and robustness. If the more precise, reliable, and faster behavior of the combined policy can be distilled back into the base policy, the residual model would have more room to improve further.

This would be particularly powerful in the multitask setting, where one can distill task-specific residual improvements into an increasingly capable generalist. Our method is base-model agnostic and could potentially scale to fine-tuning large multi-task behavior models while fully preserving their pre-training capabilities.

BibTeX

@misc{ankile2025residualoffpolicyrlfinetuning,
  title={Residual Off-Policy RL for Finetuning Behavior Cloning Policies},
  author={Lars Ankile and Zhenyu Jiang and Rocky Duan and Guanya Shi and Pieter Abbeel and Anusha Nagabandi},
  year={2025},
  eprint={2509.19301},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.19301}
}