Marta Iglesias Julios

A super simple robotics baseline


Modern imitation learning methods for robotics use generative models to predict chunks of trajectories conditioned on the current state of the world and the robot. Such generative models are essential for dealing with the stochasticity (e.g. multimodal action distributions) that is abundant in robotics demonstration data. They are noticeably more performant than older methods, but come with both implementation and run-time costs. Generative models require more complex training and inference procedures than a simple L1 or L2 loss applied directly to the output. As a result, the models can be difficult to debug and costly to deploy at runtime. Diffusion- and flow-based generative models in particular often require hundreds to thousands of inference steps at runtime (or special compression methods) to achieve good performance. Many SOTA models are also costly to train, often requiring hundreds of thousands of epochs.

In our own robotics work, we have found it preferable to use a simple method that we call atoms of behavior. It is, at heart, an adaptation of a technique commonly used to train image segmentation models. Instead of outputting a single image mask, models such as SAM output three image masks along with a confidence score for each mask. More than one mask is needed because a given prompt might reasonably correspond to more than one segmentation (e.g. in response to a prompt pixel on a shoe, you might want to segment out the shoe, the leg, or the whole person). This allows for a simple form of multi-modality in the model's output.

We apply the same technique to robotic trajectories. Given the current state, we forecast n potential trajectories via distinct output heads, and we also predict a confidence score for each trajectory. At train time, we use a winner-take-all loss in which we update only the head whose trajectory best matches the observed trajectory. We also use the index of the matching trajectory as the classification target for the confidence model. At test time, we again forecast n trajectories and pick the one with the highest confidence for execution.
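To make the idea concrete, here is a minimal sketch of the model and loss. The post does not specify a framework, so this assumes PyTorch, and the names (`MultiHeadPolicy`, `winner_take_all_loss`) and dimensions are illustrative rather than the actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadPolicy(nn.Module):
    """Sketch: forecast n candidate trajectories plus a confidence logit per head."""
    def __init__(self, state_dim, horizon, action_dim, n_heads=8, hidden=512):
        super().__init__()
        self.n_heads, self.horizon, self.action_dim = n_heads, horizon, action_dim
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One output head per candidate trajectory, plus confidence logits.
        self.traj_head = nn.Linear(hidden, n_heads * horizon * action_dim)
        self.conf_head = nn.Linear(hidden, n_heads)

    def forward(self, state):
        h = self.backbone(state)
        trajs = self.traj_head(h).view(-1, self.n_heads, self.horizon, self.action_dim)
        return trajs, self.conf_head(h)

def winner_take_all_loss(trajs, conf_logits, target):
    # Per-head L2 error against the demonstrated trajectory chunk.
    errs = ((trajs - target.unsqueeze(1)) ** 2).mean(dim=(2, 3))  # (batch, n_heads)
    winner = errs.argmin(dim=1)
    # Only the best-matching head receives a regression gradient.
    traj_loss = errs.gather(1, winner.unsqueeze(1)).mean()
    # The winning index supervises the confidence model as a classification target.
    conf_loss = F.cross_entropy(conf_logits, winner)
    return traj_loss + conf_loss
```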

Schematic of the atoms-of-behavior approach to action prediction: world and robot states are jointly passed through a simple neural net to forecast n potential trajectories. The trajectory with the highest confidence is then chosen for execution.
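At execution time, selection reduces to an argmax over the confidence logits. A hypothetical sketch, continuing from the model above:

```python
import torch

@torch.no_grad()
def select_action_chunk(model, state):
    # Forecast all candidate trajectories, then execute the most confident one.
    trajs, conf_logits = model(state)  # (B, n_heads, horizon, action_dim), (B, n_heads)
    best = conf_logits.argmax(dim=1)   # (B,)
    idx = best.view(-1, 1, 1, 1).expand(-1, 1, trajs.size(2), trajs.size(3))
    return trajs.gather(1, idx).squeeze(1)  # (B, horizon, action_dim)
```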

The whole method requires a single inference pass at runtime and a simple L2 loss on the target for training. This simplifies our operations considerably. As one example, diffusion policy on PushT needs 200k epochs to train and 500 inference steps through a convolutional UNet to reach 70% success (0.97 avg max reward), whereas our method needs only 2k epochs, uses just a simple MLP to forecast trajectories in a single inference step, and achieves 59% success (0.93 avg max reward). We see similar relative results in many other environments, such as Maniskill InsertPeg (65% success for ours vs. 78% for diffusion policy).
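For a sense of that simplicity, a complete training loop over the sketch above fits in a few lines. The epoch count echoes the comparison in this post; the `dataloader` yielding (state, trajectory-chunk) pairs is hypothetical:

```python
# Minimal training loop for the sketch above (names and dimensions illustrative).
model = MultiHeadPolicy(state_dim=20, horizon=16, action_dim=2, n_heads=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(2_000):  # vs. the ~200k epochs reported for diffusion policy on PushT
    for state, target_traj in dataloader:  # (B, state_dim), (B, horizon, action_dim)
        trajs, conf_logits = model(state)
        loss = winner_take_all_loss(trajs, conf_logits, target_traj)
        opt.zero_grad()
        loss.backward()
        opt.step()
```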

The whole pipeline is extremely simple and fast to train and run. Compared to diffusion policy, we also need severalfold fewer design decisions and hyperparameters to get things working. Although the performance is slightly below that of diffusion models, the simplicity of the code, the speed of training, and the small number of hyperparameters pay big dividends, giving our engineers more time to work on the key drivers of success in our application pipelines: data collection, hardware iteration, and low operational latency.
