
Basically all robot learning systems today (December 2025) are pure Behavior Cloning (BC, also called Imitation Learning) systems. Humans provide (near) optimal demonstrations of a task, and machine learning models try to imitate those actions. Formally, a policy $\pi(a_t \mid s_t)$ is trained in a supervised fashion — given the robot’s state $s_t$ (i.e. the camera images, robot joint angles, and maybe task description text), it predicts the demonstrated actions $a_t$ (often an action chunk, e.g. the next second of actions at ~50 Hz).
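To make the formulation concrete, here is a minimal sketch of the supervised objective, assuming a generic `policy` network and a dataset of (observation, action-chunk) pairs; the names, shapes, and the MSE head are illustrative, not any particular lab’s stack.

```python
import torch

# Illustrative shapes:
#   obs:     (B, obs_dim)     encoded camera images, joint angles, task text
#   actions: (B, H, act_dim)  the demonstrated action chunk, e.g. H = 50 for 1 s at 50 Hz
def bc_loss(policy: torch.nn.Module, obs: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Plain behavior cloning: regress the demonstrated action chunk."""
    pred = policy(obs)  # (B, H, act_dim)
    return torch.nn.functional.mse_loss(pred, actions)
```

Real stacks typically swap the MSE head for something that can represent multi-modal action distributions (e.g. diffusion or flow-matching action heads), but the supervised structure is the same.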
This doc aims to describe the anatomy of a modern BC stack, as well as its shortcomings and (incomplete / clunky) workarounds. Notably, it focuses on the relevant problem formulations and data sources, not seemingly less important details like model architecture. It then aims to explain what other approaches people are considering for the future, and the issues preventing them from being the conventional approach. Finally, it concludes with some predictions about the future of robot learning, and navigation advice for the “picks and shovels” salesmen in the Embodied AI race.
First and foremost, to do Behavior Cloning you need data to clone. This data comes from human demonstrations, collected from a variety of sources.
Humans directly teleoperate a full robot (follower) using a controller (leader). This can be done with a full copy of the robot setup (ALOHA1) or a smaller, lighter, scaled-down version (GELLO2).
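As a rough illustration of that loop (the device handles and method names below are hypothetical placeholders, not ALOHA’s or GELLO’s actual interfaces):

```python
import time

CONTROL_HZ = 50  # typical demonstration / control rate

def teleop_episode(leader, follower, logger, duration_s: float = 30.0):
    """Mirror the leader arm's joint angles onto the follower robot and log
    (observation, action) pairs for behavior cloning. A real stack would also
    stream camera images and handle grippers, safety limits, etc."""
    t_end = time.time() + duration_s
    while time.time() < t_end:
        q_leader = leader.read_joint_angles()      # the human moves the leader arm
        follower.command_joint_angles(q_leader)    # the follower mirrors it
        logger.record(obs=follower.get_observation(), action=q_leader)
        time.sleep(1.0 / CONTROL_HZ)
```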
Pros:
Cons:
Rather than full leader-follower teleoperation, humans hold devices (e.g. Universal Manipulation Interface3) in their hands and use these devices to perform the task. The end effectors match the robot’s, along with a cheap version of the robot’s onboard sensor suite to try to reconstruct the observations $s_t$. The devices perform SLAM to get the end effector pose in task space, such that IK can later be used to estimate the full joint state.
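A minimal sketch of that post-processing step, assuming a generic IK solver and SLAM output; this is the shape of the computation, not UMI’s actual pipeline:

```python
import numpy as np

def reconstruct_joint_trajectory(slam_poses: list, ik_solver) -> list:
    """Convert SLAM end-effector poses (4x4 transforms in task space) into robot
    joint angles via inverse kinematics, seeding each solve with the previous
    solution so the recovered trajectory stays smooth."""
    joint_trajectory = []
    q_prev = None
    for ee_pose in slam_poses:              # each ee_pose: np.ndarray of shape (4, 4)
        q = ik_solver.solve(target_pose=ee_pose, seed=q_prev)  # hypothetical IK API
        joint_trajectory.append(q)
        q_prev = q
    return joint_trajectory
```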
Pros:
Cons:
YouTube and other video sources have large scale data of humans performing all kinds of tasks. Similarly, many factories feature humans performing dexterous tasks, and these workers can be augmented with cameras to record their observations, providing an enormous source of data.
Pros:
Cons:
Behavior cloning sounds simple in principle — supervise $\pi(a_t \mid s_t)$ to predict the demonstrated $a_t$.
However, even with extremely clean demonstration data these policies still wander into out of distribution states. There are several reasons for this:
Tackling these challenges requires design choices, both for the model itself and for the data it’s trained on. Modeling choices are important — you need data-driven priors and model classes that can handle action multi-modality — but plenty of literature exists covering that (e.g. $\pi_0$6), and the data distribution the model is trained on seems to matter much more.
As discussed in (3) above, naively training these models on expert human demonstrations will result in the accumulation of errors in their predictions during inference, leading them to drift out-of-distribution into states they’ve never seen before. While the strong visual priors of a VLM can help the model generalize to novel states, there will still be scenarios where the model fails.

This is why it’s important to not just naively train on expert human data! In addition to these straightforward task demonstrations, it’s critical to train the model how to get out of these failure states — a “DAgger”7-style approach. There’s a bit of nuance to constructing this data — you want to train your model to leave these bad states, but you do not want to accidentally train it to enter these bad states, lest it imitate this data and intentionally visit these bad states. Doing this right means carefully curating your recovery data.
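One way to operationalize this curation, sketched under my own assumptions about how intervention episodes are annotated (not a description of any particular lab’s tooling): keep only the segment where the human rescued the robot, and drop the portion where the policy was driving itself into the bad state.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Episode:
    observations: list                 # per-timestep observations
    actions: list                      # per-timestep actions
    intervention_start: Optional[int]  # index where a human took over, if they did

def curate_recovery_data(episodes: list) -> list:
    """Return (obs, action) pairs from the recovery segments only. The steps
    *before* the intervention are the policy entering the bad state; cloning
    them would teach the model to visit those states on purpose."""
    pairs = []
    for ep in episodes:
        if ep.intervention_start is None:
            continue  # clean episode; handled elsewhere as ordinary demo data
        for t in range(ep.intervention_start, len(ep.actions)):
            pairs.append((ep.observations[t], ep.actions[t]))
    return pairs
```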
Building out this DAgger data is an iterative process, and an art at that. You train the model for the given task, observe its failure modes, concoct a new dataset to try to address those failure modes, retrain, and retry. This is a tedious process, requiring many hours of very smart and discerning human time to essentially play whack-a-mole with various issues. Along the way, you start to develop a touch and feel for the policy and its issues. Due to the need for rapid iteration, this is typically done as a post-training step atop a base pretrained policy, and hopefully that base policy has already seen quite a bit of task data such that it already mostly knows what it’s doing.
This frustration is compounded by the fact that the touch and feel you have developed from your task iteration can be completely wiped out by a new pretraining of the base policy, sometimes presenting a new (but hopefully much smaller) set of failure modes. This DAgger data can be included in a pretraining run, and alongside data scale often results in higher quality predictions and fewer failures. With sufficient effort on data iteration, policies can be made to be surprisingly robust.
As these policies get more robust, they also take more of your time to evaluate. If your policy typically fails every 15 seconds, you only need a few minutes of evals comparing training run A vs B to get signal on their performance. If your policy takes minutes to hours between failures, you need to spend many hours doing evals to get any relative signal. It’s tempting to look for offline metrics (e.g. the validation MSE featured in Generalist’s blogpost8), but empirically there is very poor correlation between these offline metrics and on-robot performance.
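As a rough back-of-the-envelope illustration (the Poisson-failure assumption and the specific numbers are mine, not from the post): if failures arrive roughly at random, you need on the order of tens of observed failures per policy before an A/B comparison means anything, so required eval time scales with the mean time between failures.

```python
def eval_hours_needed(mtbf_minutes: float, failures_for_signal: int = 20) -> float:
    """Rough estimate of on-robot eval hours needed to observe enough failures
    to compare two policies, assuming roughly Poisson-distributed failures."""
    return failures_for_signal * mtbf_minutes / 60.0

print(eval_hours_needed(0.25))  # fails every 15 s   -> ~0.08 h (about 5 minutes)
print(eval_hours_needed(30.0))  # fails every 30 min -> 10 h of evals
```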
DAgger addresses robustness issues, and avoiding catastrophic failures can speed up your average time to complete a task, but it does nothing to improve your speed in the best-case scenario. Given a dataset, you can discard all but the fastest demonstrations (losing enormous data scale and likely hurting robustness), or condition on speed (see: Eric Jang’s “Just Ask For Generalization”9), but neither of these allows for faster-than-human-demonstration performance.
Another trick is to simply execute the policy’s actions faster than realtime (e.g. execute 50 Hz control at 70 Hz), but this stresses your low-level control stack and leads to incorrect behavior when interacting with world physics (e.g. waiting for a garment to settle flat on a table after being flicked in the air).
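A minimal sketch of that trick, assuming a predicted chunk recorded at 50 Hz that we replay at a higher rate (the robot interface is a hypothetical placeholder):

```python
import time

def execute_chunk(robot, action_chunk, playback_hz: float = 70.0):
    """Replay a 50 Hz action chunk at 70 Hz, i.e. ~1.4x faster than the
    demonstrations it was cloned from. Time-dependent physics (a garment
    settling, liquid pouring) will not speed up to match, which is where
    this breaks down."""
    for action in action_chunk:
        robot.send_action(action)       # hypothetical low-level command
        time.sleep(1.0 / playback_hz)   # shorter dt than the 1/50 s it was recorded at
```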
The 2025 BC stack kind of sucks. It is not just bottlenecked on data scale to get generalization, but also on the speed of the data collectors providing the demonstrations and on the hustle (and taste) of the data sommelier doing DAgger to address any failures.
Ideally, we want robot systems that self-improve:
Reinforcement Learning seems to fit this bill. RL has been wildly successful in the LLM space, and it’s tempting to imagine we can drag and drop the same techniques into robotics. Unfortunately, this has yet to pan out, despite several different approaches.

LLMs differ from robotics in two important ways:
Because of these two factors, online, on-policy RL becomes feasible. Either directly, or after a little bit of supervised fine-tuning on a few expert demonstrations, the policy can start to achieve a non-zero success rate from a given state $s_0$. This allows the LLM to simply be rolled out hundreds or thousands of times from $s_0$ as a form of exploration, receive (sparse) rewards from the environment on how it performed, and directly update its policy.
Importantly, this process avoids having to hallucinate a counterfactual. By rolling out many different trajectories from $s_0$, it avoids having to hallucinate “what if”s and instead directly receives environment feedback on its already strong guesses.
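For concreteness, here is a minimal sketch of that loop in the style of group-relative policy-gradient methods used for LLMs; the callables are placeholders for the real policy and environment machinery, not a specific algorithm’s implementation:

```python
import numpy as np

def rl_step(rollout_fn, update_fn, s0, num_rollouts: int = 64):
    """One on-policy RL step, LLM-style.

    rollout_fn(s0) -> (trajectory, reward): resets the environment exactly to s0
        (cheap for LLM tasks), rolls out the current policy once, and returns a
        sparse reward for the attempt.
    update_fn(trajectory, weight): applies a policy-gradient-style update that
        upweights the trajectory by `weight`.
    """
    trajectories, rewards = [], []
    for _ in range(num_rollouts):
        traj, reward = rollout_fn(s0)
        trajectories.append(traj)
        rewards.append(reward)

    # Group-relative advantages: rollouts better than the group average get
    # upweighted, worse ones downweighted. No counterfactual is hallucinated.
    rewards = np.array(rewards, dtype=np.float32)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    for traj, adv in zip(trajectories, advantages):
        update_fn(traj, weight=float(adv))
```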
Robotics has none of these luxuries in the real world. Given the state $s_0$ of a messy kitchen at the beginning of a “clean the kitchen” task, we do not have the ability to easily and perfectly replicate the clutter in the kitchen hundreds of times, nor do we have strong enough base models that we can reliably fully clean the kitchen with some nonzero success rate.
Thus, we either need to leverage simulation, where we can reliably reconstruct $s_0$ arbitrarily many times (and suffer the sim-to-real gap), or we need to be able to hallucinate good-quality answers to counterfactuals given only a single real rollout from a real state $s_0$.
NB: I am not a sim expert.
In LLMs, there is no sim-to-real gap — the environments the model interacts with during training are the exact same environments it will see at inference. However, in robotics, our simulators are a facsimile of the real world, and often a poor one at that. Simulators have naive physics models, have to make numerical estimates to handle multiple colliding bodies, must select contact models with different tradeoffs, are poor models of non-rigid objects, and have large visual gaps between sim and real.
For these reasons, policies trained entirely in simulation perform very poorly when transferred to the real world. Domain randomization, i.e. significantly varying the parameters of the simulator, helps, as does having a highly structured visual input representation (e.g. scan dots), but outside of locomotion (e.g. RMA10) this has seen limited success on robots.
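A minimal sketch of what domain randomization looks like in practice, assuming a generic simulator handle with settable physics and rendering parameters (the parameter names and ranges are illustrative):

```python
import random

def randomize_sim(sim):
    """Resample physics and visual parameters at every episode so the policy
    cannot overfit to any single (inevitably wrong) simulator configuration."""
    sim.set_friction(random.uniform(0.4, 1.2))           # contact friction coefficient
    sim.set_object_mass_scale(random.uniform(0.7, 1.3))  # perturb object masses
    sim.set_motor_latency_ms(random.uniform(0.0, 40.0))  # actuation delay
    sim.set_camera_pose_noise(std_deg=2.0, std_cm=1.0)   # extrinsics jitter
    sim.set_lighting_seed(random.randint(0, 10**6))      # visual appearance
```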
There is ongoing work in “world models”, which are effectively learned simulators. One major reason for hope is that, unlike a policy, which needs to know the optimal action given a state, a world model need only simulate the dynamics given a state and an action. In domains with structure (such as the real world, where physics provides composable rules of interaction), any state-action transition data, whether from an optimal or a random policy, should seemingly aid in learning general dynamics, hopefully giving us a shot at building a good, general-purpose world model. That said, as of today, I am unaware of any work that comes close to modeling well the sort of environment-interaction dynamics we care about for dexterous manipulation.
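To make the distinction concrete, here is a minimal sketch of a world-model training step; note that it only needs (state, action, next state) triples, with no assumption that the actions were any good (the architecture and names are illustrative):

```python
import torch

def world_model_loss(dynamics_model: torch.nn.Module,
                     state: torch.Tensor,       # (B, state_dim), e.g. a visual/latent embedding
                     action: torch.Tensor,      # (B, act_dim)
                     next_state: torch.Tensor) -> torch.Tensor:
    """Fit p(s_{t+1} | s_t, a_t). The transitions can come from *any* policy,
    expert or random, because we are learning dynamics rather than behavior."""
    pred_next = dynamics_model(torch.cat([state, action], dim=-1))
    return torch.nn.functional.mse_loss(pred_next, next_state)
```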
Using real-world data avoids any sim-to-real gap, the same reason we were motivated to do BC to begin with. However, learning to improve directly from your own policy rollouts has a number of hurdles.
The goal of an RL improvement loop is to upweight relatively good actions and downweight relatively bad ones. To know if an action was relatively good or not, we need to answer counterfactuals; as we discussed in the LLM section, we don’t have the luxury of simply running the policy over and over from the same state, trying a bunch of semi-reasonable actions to estimate the relative performance of action $a$ vs action $a'$. Instead, we need some sort of system to hallucinate this: either a Q function $Q(s_t, a_t)$ that directly estimates the discounted reward, or some knowledge of the transition dynamics $p(s_{t+1} \mid s_t, a_t)$ and then the value $V(s_{t+1})$ of the nearby next state.
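Written out, the counterfactual quantity both options are trying to estimate is the advantage of an action over the policy’s typical behavior (standard definitions, using the notation above, where $r_t$ is the reward and $\gamma$ the discount factor):

$$
A(s_t, a_t) \;=\; Q(s_t, a_t) - V(s_t) \;=\; \mathbb{E}\!\left[\, r_t + \gamma\, V(s_{t+1}) \,\right] - V(s_t)
$$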
Notably, both $Q$ and $V$ are a sort of world model by a different name; rather than predicting some future state in its entirety, as you might imagine a learned simulator doing, they instead bake in a bunch of long-horizon information about how, under good decision making through future interactions with the world, you will ultimately get to the goal.
As you might imagine, this too is quite challenging, and learning good Q or V functions is an open area of research. Very recently, Physical Intelligence released $\pi^{*}_{0.6}$11, an approach that performs a variant of advantage weighted regression (BC, but rather than weighting every transition equally, weighting each by its exponentiated advantage $\exp(A(s_t, a_t))$), where they show minor improvements beyond that of just doing naive BC on the same data. However, in many of the tasks, the policy also required human DAgger data, and it’s clearly not a silver bullet for real-world RL. There is much more work to be done in building good, reliable Q and V functions such that they work well out of distribution, without grossly over- or under-estimating their true values.
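As a sketch of the general advantage-weighted-regression recipe (the generic idea, not the specific method in the $\pi^{*}_{0.6}$ paper; the temperature $\beta$, the weight clipping, and the network interfaces are illustrative):

```python
import torch

def awr_loss(policy: torch.nn.Module,
             obs: torch.Tensor,         # (B, obs_dim)
             actions: torch.Tensor,     # (B, H, act_dim) collected action chunks
             advantages: torch.Tensor,  # (B,) estimates from learned Q/V functions
             beta: float = 1.0,
             weight_clip: float = 20.0) -> torch.Tensor:
    """Behavior cloning where each transition is weighted by exp(A / beta):
    transitions judged better than the policy's average get cloned harder,
    worse ones are softly downweighted."""
    pred = policy(obs)                                             # (B, H, act_dim)
    per_sample_bc = ((pred - actions) ** 2).mean(dim=(1, 2))       # (B,) plain BC error
    weights = torch.exp(advantages / beta).clamp(max=weight_clip)  # keep weights bounded
    return (weights * per_sample_bc).mean()
```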
Here’s a bunch of predictions about the future of robot learning:
As part of understanding where the field is going, many people have asked me for advice about building “picks and shovels” startups to profit from the Embodied AGI race. I think:
I think the only solid foundation for the future is this: human demonstrations will continue to matter. If you build out a hardware-plus-software stack for demonstration collection (GELLO- or UMI-style) that reduces the pain points described above, and you can show it produces good policies by training some, you will be an attractive business partner, if not an outright acquisition target.
Zhao, T. Z., Kumar, V., Levine, S., & Finn, C. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. Robotics: Science and Systems (RSS).↩︎
Wu, P., Shentu, Y., Yi, Z., Lin, X., & Abbeel, P. (2023). GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).↩︎
Chi, C., Xu, Z., Pan, C., Cousineau, E., Burchfiel, B., Feng, S., Tedrake, R., & Song, S. (2024). Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. Robotics: Science and Systems (RSS).↩︎
Generalist AI Team. (2025). GEN-0: Embodied Foundation Models That Scale with Physical Interaction. Generalist AI Blog. Available at: https://generalistai.com/blog/nov-04-2025-GEN-0↩︎
Sunday Team. (2025). ACT-1: A Robot Foundation Model Trained on Zero Robot Data. Sunday AI Journal. Available at: https://www.sunday.ai/journal/no-robot-data↩︎
Black, K., et al. (2024). $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence.↩︎
Ross, S., Gordon, G., & Bagnell, J. A. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS.↩︎
Generalist AI Team. (2025). GEN-0: Embodied Foundation Models That Scale with Physical Interaction. Generalist AI Blog. Available at: https://generalistai.com/blog/nov-04-2025-GEN-0↩︎
Jang, E. (2021). Just Ask for Generalization. [Blog Post]. Available at: evjang.com/2021/10/23/generalization.html↩︎
Kumar, A., Fu, Z., Pathak, D., & Malik, J. (2021). RMA: Rapid Motor Adaptation for Legged Robots. Robotics: Science and Systems (RSS).↩︎
Amin, A., et al. (2025). $\pi^{*}_{0.6}$: a VLA that Learns from Experience. Physical Intelligence.↩︎