Robot Self-Improvement via Human-Video Dynamics Models

Abstract

A central question in robot learning is how to acquire skills from the kinds of data that humans learn from: passive observation, embodied practice, and the experience of failure. Human videos provide the first of these in abundance, and prior work has shown they can initialize useful policies. Far less clear is whether they can support the second and third: whether priors extracted from human videos can ground a robot's own attempts well enough to evaluate them, correct them, and improve from them. In this work, we show that human videos can be used to learn embodiment-agnostic action, dynamics, and value representations that transfer across robot embodiments, providing the predictive foundation for robots to autonomously improve from their own rollouts and failures. We introduce Dynamics-Guided Action Correction (DGAC), a training-free approach that leverages these adapted models to repair failed states: each failure becomes a query for which the learned models propose and rank corrective actions, turning failures into supervision for the next policy update. Across seven real-world manipulation tasks spanning both a mobile manipulator and a static manipulator arm, our approach improves success rates from 40% to 81% across multiple policy backbones, demonstrating cross-embodiment robot self-improvement from human-video priors.

Method

Left: We pretrain shared policy, dynamics, and value representations from human videos to support cross-embodiment robot self-improvement. The policy model predicts wrist actions represented by a 6-DoF pose and a hand-closure variable. The dynamics model forecasts action-conditioned world states represented by DINO-v3 visual features and 3D point trajectories. The value model learns an embodiment-agnostic progress representation that estimates a state's proximity to task success.

Right: Building on these pretrained models, we develop a self-improvement pipeline that learns from autonomous robot experience. Successful and failed rollouts are used to adapt the dynamics and value models, while Dynamics-Guided Action Correction (DGAC) converts recoverable failures into corrective supervision. The resulting trajectories are then used for policy improvement through advantage-conditioned policy extraction, enabling continual learning without human intervention.
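To make the advantage-conditioning idea concrete, the sketch below labels each step of a rollout by whether the value estimate rises across it. The page does not specify how advantages are computed, so the one-step value difference, the `threshold` parameter, and the function name `advantage_labels` are all assumptions made for illustration.

```python
from typing import List, Sequence


def advantage_labels(values: Sequence[float], threshold: float = 0.0) -> List[bool]:
    """Mark step t as positive when V(s_{t+1}) - V(s_t) exceeds the threshold.

    `values` holds one value estimate per state in a rollout, so a rollout
    with N states yields N - 1 labels (one per transition).
    """
    return [values[t + 1] - values[t] > threshold for t in range(len(values) - 1)]


# Hypothetical value estimates along a four-state rollout.
rollout_values = [0.1, 0.3, 0.2, 0.6]
labels = advantage_labels(rollout_values)
```

Under this assumed scheme, a policy would receive the label as an extra conditioning input during training and be conditioned on the positive label at deployment.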

Dynamics-Guided Action Correction (DGAC)

DGAC converts failed states into corrective supervision. It uses the learned dynamics and value models to rank candidate actions and identify the best correction for near-failure but recoverable states, without human intervention.

Given a failed state, DGAC samples candidate corrective actions (colorful trajectories), predicts their future states and values, and selects the highest-value proposal (green) as the corrective action, adding it to the repair dataset for policy supervision.
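The selection step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `select_correction`, the `dynamics` and `value` callables, and the 1-D toy state are hypothetical stand-ins for the learned models, and candidates are passed in directly instead of being sampled from a policy.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


def select_correction(
    failed_state: State,
    candidates: Sequence[Action],
    dynamics: Callable[[State, Action], State],
    value: Callable[[State], float],
) -> Tuple[Action, float]:
    """Score each candidate by the value of its predicted next state; return the best."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    scored = [(value(dynamics(failed_state, a)), a) for a in candidates]
    best_value, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_value


# Toy stand-ins (assumptions): additive 1-D dynamics and a value that
# increases as the predicted state approaches a goal at 1.0.
def toy_dynamics(state: State, action: Action) -> State:
    return tuple(s + a for s, a in zip(state, action))


def toy_value(state: State) -> float:
    return -abs(state[0] - 1.0)


failed_state: State = (0.3,)
candidates: List[Action] = [(-0.5,), (0.2,), (0.6,)]
best_action, best_value = select_correction(
    failed_state, candidates, toy_dynamics, toy_value
)

# The selected pair is what would be appended to the repair dataset.
repair_dataset: List[Tuple[State, Action]] = [(failed_state, best_action)]
```

In this toy setup the candidate that moves the state closest to the goal is selected; with learned models, the same ranking would be applied to predicted visual features and point trajectories instead of a scalar state.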

Self-Improvement Results

Benchmark Results

Average success rate (%)

Expert BC: 41.3
SWIM: 49.3
LPB: 60.0
AWR: 60.0
RECAP†: 61.3
RISE†: 76.0
Ours: 85.3

Ours: 85.3% average success rate, the highest among all compared methods.

† Evaluated without human-intervened corrections, for a fair comparison.

Across 5 real-world tasks, our framework achieves the highest average success rate, outperforming 6 representative baselines.

Policy-Agnostic Self-Improvement

Average success rate (%)

π0.5+SFT: 62.7
π0.5+RECAP†: 68.0
π0.5+DGAC: 88.0
Ours: 85.3

π0.5 with supervised fine-tuning (SFT) provides a strong starting point, owing to large-scale robot pretraining.
RECAP† without human corrections adds only 5.3 percentage points over π0.5+SFT.
Our DGAC module delivers a substantial gain of 25.3 percentage points over π0.5+SFT.
Human-video priors and failure-to-supervision learning (Ours) nearly match heavily robot-pretrained policies (85.3% vs. 88.0%).

Our framework also generalizes to a different policy backbone (π0.5).
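The gains quoted above are differences in average success rate, measured in percentage points relative to the SFT baseline. The short check below recomputes them from the four reported averages; the dictionary keys are ASCII labels chosen for this example.

```python
# Reported average success rates (%) from the comparison above.
results = {
    "pi0.5+SFT": 62.7,
    "pi0.5+RECAP": 68.0,
    "pi0.5+DGAC": 88.0,
    "Ours": 85.3,
}

baseline = results["pi0.5+SFT"]

# Gain over the SFT baseline, in percentage points, rounded to one decimal.
gains = {
    name: round(rate - baseline, 1)
    for name, rate in results.items()
    if name != "pi0.5+SFT"
}

# Gap between the robot-pretrained backbone with DGAC and the human-video-prior model.
gap_to_pretrained = round(results["pi0.5+DGAC"] - results["Ours"], 1)
```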

Qualitative Self-Improvement Results

Each pair shows Before Self-Improvement (left) vs. After Self-Improvement (right) using the proposed DGAC module. Repeated trials show that self-improvement converts initial failures into consistent task success across different embodiments. All videos are played at 1x speed.

Video pair: Open Ricecooker (left: Before Self-Improvement; right: After Self-Improvement)

We further show additional full-rollout results.