Human videos are not just for imitation

Robot Self-Improvement via Human-Video Dynamics Models

Preprint 2026
*Equal Contribution   Equal Advising  
¹ETH Zürich   ²Technical University of Munich   ³Microsoft   ⁴MCML

We leverage policy, dynamics, and value models pre-trained on human videos to enable autonomous failure correction, boosting success rates from 40% to 81% across seven real-world tasks on both mobile and stationary manipulators—without human intervention.

A central question in robot learning is how to acquire skills from the kinds of data that humans learn from: passive observation, embodied practice, and the experience of failure. Human videos provide the first of these in abundance, and prior work has shown they can initialize useful policies. Far less clear is whether they can support the second and third: whether priors extracted from human videos can ground a robot's own attempts well enough to evaluate them, correct them, and improve from them. In this work, we show that human videos can be used to learn embodiment-agnostic action, dynamics, and value representations that transfer across robot embodiments, providing the predictive foundation for robots to autonomously improve from their own rollouts and failures. We introduce Dynamics-Guided Action Correction (DGAC), a training-free approach that leverages these adapted models to repair failed states — each failure becomes a query for which the learned models propose and rank corrective actions, turning failures into supervision for the next policy update. Across seven real-world manipulation tasks spanning both a mobile manipulator and a static manipulator arm, our approach improves success rates from 40% to 81% across multiple policy backbones, demonstrating cross-embodiment robot self-improvement from human-video priors.

General Pipeline

Left: We pretrain shared policy, dynamics, and value representations from human videos to support cross-embodiment robot self-improvement. The policy model predicts wrist actions represented by a 6-DoF pose and a hand-closure variable. The dynamics model forecasts action-conditioned world states represented by DINO-v3 visual features and 3D point trajectories. The value model learns an embodiment-agnostic progress representation that estimates a state's proximity to task success.
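The action representation described above (a 6-DoF wrist pose plus a hand-closure variable) can be pictured as a small 7-dimensional record. The following is a minimal sketch, not the authors' implementation: the class name `WristAction`, the axis-angle rotation, the units, the [0, 1] closure range, and the `to_gripper_command` retargeting helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WristAction:
    """Hypothetical container for one embodiment-agnostic action."""
    translation: Tuple[float, float, float]  # wrist position change, metres (assumed unit)
    rotation: Tuple[float, float, float]     # axis-angle rotation, radians (assumed parameterization)
    closure: float                           # 0.0 = fully open, 1.0 = fully closed (assumed range)

    def __post_init__(self) -> None:
        # Validate shapes and range so malformed actions fail early.
        if len(self.translation) != 3 or len(self.rotation) != 3:
            raise ValueError("pose needs 3 translation and 3 rotation components")
        if not 0.0 <= self.closure <= 1.0:
            raise ValueError("closure must lie in [0, 1]")

    def as_vector(self) -> List[float]:
        """Flatten to a 7-D list: 6-DoF pose followed by hand closure."""
        return [*self.translation, *self.rotation, self.closure]


def to_gripper_command(action: WristAction, max_width_m: float = 0.08) -> float:
    """Map hand closure to a parallel-gripper opening width (hypothetical retargeting)."""
    return (1.0 - action.closure) * max_width_m
```

Because the record describes the wrist and hand rather than any joint configuration, the same action could in principle be retargeted to different robots; only a helper like `to_gripper_command` would need to change per embodiment.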

Right: Building on these pretrained models, we develop a self-improvement pipeline that learns from autonomous robot experience. Successful and failed rollouts are used to adapt the dynamics and value models, while Dynamics-Guided Action Correction (DGAC) converts recoverable failures into corrective supervision. The resulting trajectories are then used for policy improvement through advantage-conditioned policy extraction, enabling continual learning without human intervention.
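One way to picture the advantage-conditioned step is as a labeling pass over collected trajectories: each transition is tagged by whether it made progress according to the value model, and the policy is then trained conditioned on that tag. The sketch below is an illustration under stated assumptions, not the method as implemented: the one-step value difference used as the advantage estimate, the threshold, and the names `label_advantages` and `build_policy_dataset` are all hypothetical.

```python
from typing import Any, Callable, List, Sequence, Tuple


def label_advantages(values: Sequence[float], threshold: float = 0.0) -> List[bool]:
    """Flag transition t -> t+1 as advantageous when the value rises by more than threshold.

    Assumption: the advantage is approximated by the one-step value difference.
    """
    return [(values[t + 1] - values[t]) > threshold for t in range(len(values) - 1)]


def build_policy_dataset(
    trajectories: Sequence[Tuple[Sequence[Any], Sequence[Any]]],
    value_fn: Callable[[Any], float],
    threshold: float = 0.0,
) -> List[Tuple[Any, Any, bool]]:
    """Turn (states, actions) trajectories into (state, action, is_advantageous) triples.

    Each trajectory holds T+1 states and T actions; the boolean is the
    conditioning signal a policy would receive during training.
    """
    dataset = []
    for states, actions in trajectories:
        values = [value_fn(s) for s in states]
        flags = label_advantages(values, threshold)
        for state, action, good in zip(states, actions, flags):
            dataset.append((state, action, good))
    return dataset
```

At deployment, a policy trained this way would be conditioned on the "advantageous" flag so that it imitates only the behavior that the value model judged to make progress.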

Dynamics-Guided Action Correction (DGAC)

DGAC converts failed states into corrective supervision. It uses the learned dynamics and value models to rank candidate actions and identify the best correction for near-failure but recoverable states, without human intervention.

DGAC Pipeline

Given a failed state, DGAC samples candidate corrective actions (colored trajectories), predicts the future state each one leads to and the value of that state, and selects the highest-value proposal (green) as the corrective action. The selected action is added to the repair dataset for policy supervision.
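The sample, predict, score, and select loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions rather than the authors' code: candidates are drawn here by adding Gaussian noise to a nominal action (the source does not say how proposals are sampled), and `toy_dynamics`, `toy_value`, and `GOAL` are hypothetical stand-ins for the learned dynamics and value models.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


def sample_candidates(nominal: Action, num_candidates: int,
                      noise_std: float, rng: random.Random) -> List[Action]:
    """Draw candidate actions around a nominal action (assumed sampling scheme)."""
    return [tuple(a + rng.gauss(0.0, noise_std) for a in nominal)
            for _ in range(num_candidates)]


def dgac_select(failed_state: State, candidates: Sequence[Action],
                dynamics_fn: Callable[[State, Action], State],
                value_fn: Callable[[State], float]) -> Tuple[Action, float]:
    """Predict each candidate's next state, score it, and return the best (action, value)."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    scored = [(value_fn(dynamics_fn(failed_state, a)), a) for a in candidates]
    best_value, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_value


# Toy stand-ins for the learned models (hypothetical).
GOAL: State = (1.0, 0.0)


def toy_dynamics(state: State, action: Action) -> State:
    """Next state is the current state shifted by the action."""
    return tuple(s + a for s, a in zip(state, action))


def toy_value(state: State) -> float:
    """Higher is better: negative squared distance to the goal."""
    return -sum((s - g) ** 2 for s, g in zip(state, GOAL))


if __name__ == "__main__":
    rng = random.Random(0)
    failed_state: State = (0.0, 0.0)
    proposals = sample_candidates((0.5, 0.0), num_candidates=16, noise_std=0.3, rng=rng)
    action, value = dgac_select(failed_state, proposals, toy_dynamics, toy_value)
    repair_dataset = [(failed_state, action)]  # corrective supervision for the policy
    print(action, value)
```

Because selection only queries the dynamics and value models, no gradient step is needed at correction time, which is consistent with the training-free description of DGAC.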

Benchmark Results

Average success rate (%)

Ours: 85.3% average success rate, the highest among all compared methods.

For a fair comparison, no method receives human-intervened corrections.

Across 5 real-world tasks, our framework achieves the highest average success rate when compared against 6 representative baselines.

Policy-Agnostic Self-Improvement

Average success rate (%)
  • Supervised fine-tuning (SFT) provides a strong starting point through large-scale robot pretraining.
  • RECAP without human corrections yields only +5.3%.
  • Our DGAC module delivers a substantial +25.3% improvement for 𝜋0.5 + SFT.
  • Human-video priors and failure-to-supervision learning (Ours) nearly match heavily robot-pretrained policies (85.3% vs. 88.0%).
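When reading gains like those above, it helps to keep absolute differences (percentage points) apart from relative improvements. The short sketch below works through two figures reported on this page: the 85.3% vs. 88.0% comparison and the overall 40% to 81% improvement. The helper names are hypothetical, and the page does not state whether the "+5.3%" and "+25.3%" bullets are absolute or relative, so they are left out.

```python
def absolute_gain(before_pct: float, after_pct: float) -> float:
    """Difference in success rate, in percentage points."""
    return after_pct - before_pct


def relative_gain(before_pct: float, after_pct: float) -> float:
    """Improvement expressed as a percentage of the starting success rate."""
    return (after_pct - before_pct) / before_pct * 100.0


if __name__ == "__main__":
    # Gap between the robot-pretrained policy (88.0%) and Ours (85.3%).
    print(round(absolute_gain(85.3, 88.0), 1))  # 2.7 percentage points
    # Overall improvement reported in the abstract: 40% -> 81%.
    print(round(absolute_gain(40.0, 81.0), 1))  # 41.0 percentage points
    print(round(relative_gain(40.0, 81.0), 1))  # 102.5 percent relative
```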

Our framework also generalizes to a different policy backbone (𝜋0.5).

Qualitative Self-Improvement Results

Each pair shows Before Self-Improvement (left) vs. After Self-Improvement (right) using the proposed DGAC module. Repeated trials show that self-improvement converts initial failures into consistent task success across different embodiments. All videos are played at 1x speed.

[Video pair: Open Ricecooker, Before Self-Improvement and After Self-Improvement]

We further show additional full-rollout results.


This work was supported by the Technical University of Munich (TUM) and the State of Bavaria through the REACT project, the TUM Georg Nemetschek Institute via the SPAICR project, the Munich Center for Machine Learning (MCML), and ETH Zürich. We thank Helen Oleynikova for her support during the initial phase of the project.

BibTeX


      @article{chenzhang2026robot,
        title   = {Robot Self-Improvement via Human-Video Dynamics Models},
        author  = {Chen, Hanzhi and Zhang, Anran and Schaefer, Simon and
                  Chen, Kejia and Chen, Shi and Cremers, Daniel and
                  Mees, Oier and Leutenegger, Stefan},
        journal = {arXiv preprint arXiv:2606.21406},
        year    = {2026}
      }
    