Learning to Imitate
A key aspect of human learning is imitation: the capability to mimic and learn behavior from a teacher or an expert. This is an important ability for acquiring new skills, such as walking, biking, or speaking a new language. Although current Artificial Intelligence (AI) systems are capable of complex decision-making, such as mastering Go, playing complex strategy games like StarCraft, or manipulating a Rubik’s cube, these systems often require over 100 million interactions with an environment to train — the equivalent of more than 100 years of human experience — to reach human-level performance. In contrast, a human can acquire new skills in a relatively short amount of time by observing an expert. How can we give our artificial agents a similarly fast learning ability?
Another challenge with current AI systems is that they require explicit programming or hand-designed reward functions in order to make correct decisions. These methodologies are frequently brittle or otherwise imperfect, and can lead to systems that struggle in complex situations — self-driving cars blockading roads, or robots failing to coordinate with humans. Developing new methods that can instead interactively learn from human or expert data could be a key stepping stone towards sample-efficient agents and human-like behavior.
In this post, I’ll discuss several techniques being developed in a field called “Imitation Learning” (IL) to solve these sorts of problems, and present a recent method from our lab, called Inverse Q-Learning — which was used to create the best AI agent for playing Minecraft using only a few expert demos^{1}. You can check out the project page here, and the code for the underlying method here.
So, what is imitation learning, and what has it been used for?
In imitation learning (IL), an agent is given access to samples of expert behavior (e.g. videos of humans playing online games or cars driving on the road) and it tries to learn a policy that mimics this behavior. This objective is in contrast to reinforcement learning (RL), where the goal is to learn a policy that maximizes a specified reward function. A major advantage of imitation learning is that it does not require careful hand-design of a reward function because it relies solely on expert behavior data, making it easier to scale to real-world tasks where one is able to gather expert behavior (like video games or driving). This approach of enabling the development of AI systems by data-driven learning, rather than specification through code or heuristic rewards, is consistent with the key principles behind Software 2.0.
Imitation learning has been a key component in developing AI methods for decades, with early approaches dating back to the 1990s and early 2000s, including work by Andrew Ng and Stuart Russell, and the creation of the first self-driving car systems. Recently, imitation learning has become an important topic with increasing real-world utility, with papers using the technique for driving autonomous cars, enabling robotic locomotion, playing video games, manipulating objects, and even robotic surgery.
Imitation Learning as Supervision
Early approaches to imitation learning sought to learn a policy — a machine learning model that maps environment observations to the (optimal) actions taken by the expert — using supervised learning. This method, called Behavioral Cloning (BC), has a key drawback: it offers weak or no guarantees that the model will generalize to unseen environment observations. When the agent ends up in a situation unlike any in the expert trajectories, BC is prone to failure. For example, in the figure above, the car agent doesn’t know what to do once it strays from the expert trajectory, and it crashes. To avoid such mistakes, BC requires expert data covering all possible trajectories in the environment, making it a heavily data-inefficient approach.
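To make this concrete, here is a minimal sketch of Behavioral Cloning as plain supervised learning on (state, action) pairs. The states, actions, and the memorizing "model" are all hypothetical toy stand-ins (a real system would use e.g. a neural network over images); the point is that the learned policy has no answer for states the expert never visited.

```python
from collections import Counter, defaultdict

# Hypothetical expert dataset of (state, action) pairs.
expert_data = [("straight_road", "forward"), ("straight_road", "forward"),
               ("left_curve", "steer_left"), ("right_curve", "steer_right")]

def behavioral_cloning(data):
    """Simplest possible BC 'model': for each seen state, predict the
    action the expert took most often in that state."""
    by_state = defaultdict(Counter)
    for state, action in data:
        by_state[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in by_state.items()}

policy = behavioral_cloning(expert_data)
print(policy["left_curve"])    # steer_left
print("off_road" in policy)    # False: BC has no answer for unseen states
```

The last line is exactly the failure mode above: once the agent drifts off the expert's trajectories, the cloned policy is undefined.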
A simple fix, Dataset Aggregation (DAgger), was proposed to interactively collect additional expert data to recover from mistakes, and was used to create the first autonomous drone that could navigate forests. Nevertheless, this requires a human in the loop, and such interactive access to an expert is usually infeasible. Instead, we want to emulate the trial-and-error process that humans use to fix mistakes. In the example above, if the car agent can interact with the environment and learn “if I do this, then I crash,” it can correct itself to avoid that behavior.
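The DAgger loop can be sketched in a few lines. Everything here is a toy stand-in: the "environment", the "expert", and the memorizing "trainer" are hypothetical, but the structure — roll out the *learner's* policy, have the expert relabel the states it visits, aggregate, retrain — is the algorithm's core.

```python
def expert_label(s):
    # Toy expert: states are integers; the correct action is s % 2.
    return s % 2

def train(dataset):
    # Toy 'training': memorize the expert labels collected so far.
    return dict(dataset)

def env_rollout(policy, horizon=5):
    # Toy rollout: the learner keeps drifting into states its current
    # dataset never covered (the situation DAgger is designed to fix).
    return list(range(len(policy), len(policy) + horizon))

def dagger(n_iters=3):
    dataset = []
    policy = train(dataset)
    for _ in range(n_iters):
        states = env_rollout(policy)                       # learner's own states
        dataset += [(s, expert_label(s)) for s in states]  # expert relabels them
        policy = train(dataset)                            # retrain on aggregate
    return policy

policy = dagger()
print(len(policy))  # 15 states labeled over 3 iterations
```

Note that `expert_label` is queried inside the loop — this is the human-in-the-loop requirement that makes DAgger hard to apply in practice.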
This insight led to the formulation of imitation learning as the problem of learning a reward function from expert data, such that a policy optimizing that reward through environment interaction matches the expert — thus inverting the reinforcement learning problem. This approach is termed Inverse Reinforcement Learning (IRL).
How do recent Imitation Approaches using IRL work?
In 2016, Ho and Ermon posed Inverse Reinforcement Learning as a minimax game between two AI models, with clear parallels to GANs — a class of generative models. In this formulation, the agent’s policy model (the “generator”) produces actions by interacting with an environment, using RL to attain the highest rewards from a reward model, while the reward model (the “discriminator”) attempts to distinguish the agent policy’s behavior from expert behavior. As in GANs, the discriminator acts as a reward model that indicates how expert-like an action is.
Thus, if the policy does something that is not expert-like, it gets a low reward from the discriminator and learns to correct this behavior. This minimax game has a unique equilibrium called the saddle-point solution (due to the geometric saddle shape of the optimization landscape). At the equilibrium, the discriminator learns a reward such that the policy’s behavior under it is indistinguishable from the expert’s. With this adversarial learning of a policy and a discriminator, it is possible to reach expert performance using few demonstrations. Techniques inspired by this formulation are referred to as Adversarial Imitation (see the figure below for an illustration of the method).
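A minimal sketch of how the discriminator becomes a reward, in the style of GAIL (Ho and Ermon's method): given the discriminator's output `d = D(s, a)` — here just a stand-in number rather than a learned network — one common choice of imitation reward is `-log(1 - d)`, which is high when the pair looks expert-like and low otherwise.

```python
import math

def imitation_reward(d):
    """GAIL-style reward derived from a discriminator output d = D(s, a),
    the (hypothetical, stand-in) probability that the (state, action)
    pair came from the expert rather than the policy."""
    return -math.log(1.0 - d + 1e-8)  # small epsilon for numerical safety

# Expert-like behavior (D near 1) earns much more reward than
# policy-like behavior (D near 0), driving the policy toward the expert.
print(imitation_reward(0.95) > imitation_reward(0.05))  # True
```

The RL step then maximizes this reward, while the discriminator is trained to keep telling the two apart — the minimax game described above.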
Unfortunately, because adversarial imitation is based on GANs, it suffers from the same limitations, such as mode collapse and training instability, so training requires careful hyperparameter tuning and tricks like gradient penalization. Furthermore, the reinforcement learning step complicates training, because the generator here cannot be trained through simple gradient descent. This amalgamation of GANs and RL makes for a very brittle combination, which does not work well in complex image-based environments like Atari. Because of these challenges, Behavioral Cloning remains the most prevalent method of imitation.
Learning Q-functions for Imitation
Recently, these issues have led to a new non-adversarial approach to imitation: learning Q-functions to recover expert behavior.
In RL, Q-functions measure the expected sum of future rewards an agent can obtain starting from the current state and choosing a particular action. By learning a Q-function as a neural network that takes the current state and a candidate action as input, one can predict the overall expected future reward obtained by the agent. Because the prediction covers the overall reward, as opposed to only the reward for taking that one step, determining the optimal policy is as simple as sequentially taking the action with the highest predicted Q-value in the current state. This optimal policy can be represented as the argmax of the Q-function over all possible actions in a given state. Thus, the Q-function is a very useful quantity, providing a connection between the reward function and the optimal behavior policy in an environment.
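The argmax relationship is easy to see on a tabular toy example (the states, actions, and Q-values below are made up for illustration; in deep RL the table is replaced by a network):

```python
# Hypothetical tabular Q-function on a 2-state, 2-action toy problem.
Q = {("s0", "left"): 1.0, ("s0", "right"): 3.0,
     ("s1", "left"): 2.5, ("s1", "right"): 0.5}

def greedy_policy(state, actions=("left", "right")):
    """The optimal policy is the argmax of the Q-function over actions."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_policy("s0"))  # right
print(greedy_policy("s1"))  # left
```

Once the Q-function is known, the policy comes for free — no separate policy optimization is required.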
In IL, a simple, stable, and data-efficient approach has remained out of reach because of the above-mentioned issues with previous approaches. Additionally, the instability of adversarial methods makes the Inverse RL formulation hard to solve. A non-adversarial approach to IL could resolve many of the challenges the field faces. Can IL learn something from the remarkable success of Q-functions in RL at determining the optimal behavior policy from a reward function?
Inverse Q-Learning (IQ-Learn)
To determine reward functions, what if we directly learn a Q-function from expert behavior data? This is exactly the idea behind our recently proposed algorithm, Inverse Q-Learning (IQ-Learn). Our key insight is that the Q-function can represent not only the optimal behavior policy but also the reward function, as the mapping from single-step rewards to Q-functions is bijective for a given policy^{2}. This lets us avoid the difficult minimax game over policy and reward functions seen in the Adversarial Imitation formulation, by expressing both through a single variable: the Q-function. Plugging this change of variables into the original Inverse RL objective leads to a much simpler minimization problem over just the single Q-function, which we refer to as the Inverse Q-learning problem. This problem shares a one-to-one correspondence with the minimax game of adversarial IL, in that each potential Q-function can be mapped to a pair of discriminator and generator networks. This means that we maintain the generality and unique-equilibrium properties of IRL while obtaining a simple non-adversarial algorithm that may be used for imitation.
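The reward-to-Q mapping can be inverted via the (soft) Bellman equation: given a Q-function, the implied single-step reward is `r(s, a) = Q(s, a) - gamma * V(s')`. Here is a small numerical sketch under a soft value function (the discount and Q-values are illustrative, not from the paper):

```python
import math

GAMMA = 0.9  # illustrative discount factor

def soft_value(q_row):
    """Soft value of a state: V(s) = log sum_a exp Q(s, a)."""
    return math.log(sum(math.exp(q) for q in q_row.values()))

def recovered_reward(q_sa, q_next_row):
    """Inverse soft Bellman operator: r(s, a) = Q(s, a) - gamma * V(s').
    q_sa is Q at the current (state, action); q_next_row maps each
    action to its Q-value at the next state."""
    return q_sa - GAMMA * soft_value(q_next_row)

# Toy transition: Q(s, a) = 2.0, next state has Q = 0 for both actions.
r = recovered_reward(2.0, {"left": 0.0, "right": 0.0})
```

Because this mapping is one-to-one for a fixed policy, optimizing over Q-functions is equivalent to optimizing over rewards — which is what lets IQ-Learn collapse the minimax game into a single minimization.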
Below is a visualization to intuitively understand our approach: existing IRL methods solve an involved minimax game over the policy (\(\pi\)) and rewards (\(r\)), finding a policy that matches expert behavior at the unique saddle-point solution (\(\pi^*\), \(r^*\)) by utilizing RL (shown on the left). IQ-Learn proposes a simple transformation from rewards to Q-functions to instead solve this problem over the policy (\(\pi\)) and the Q-function (\(Q\)), finding the corresponding solution (\(\pi^*\), \(Q^*\)) (shown on the right). Crucially, if we know the Q-function, then we explicitly know its optimal policy: simply choose the (softmax) action that maximizes the Q-function in the given state^{3}. Thus, IQ-Learn removes the need for RL to find the policy!
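The softmax policy is a closed-form function of the Q-values, which is the whole trick — a small sketch (with made-up Q-values):

```python
import math

def soft_policy(q_values):
    """pi(a|s) proportional to exp Q(s, a): the soft-optimal policy is an
    explicit function of the Q-values, so no inner RL loop is needed."""
    z = sum(math.exp(q) for q in q_values)
    return [math.exp(q) / z for q in q_values]

probs = soft_policy([2.0, 0.0])  # two actions; first has higher Q-value
```

The higher-Q action gets most of the probability mass, while the softmax keeps some stochasticity for exploration (see footnote 3).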
Now, instead of optimizing over the space of all possible rewards and policies, we only need to optimize along a manifold in this space corresponding to the choice of a Q-function and its optimal policy (the red line). The new objective along the manifold, \(\mathcal{J}^*\), is concave and depends only on the Q-function, allowing the use of simple gradient-descent methods to find the unique optimum.
During learning in discrete action spaces, IQ-Learn optimizes the objective \(\mathcal{J}^*\), taking gradient steps on the manifold with respect to the Q-function (the green lines) and converging to the globally optimal saddle point. For continuous action spaces, calculating the exact gradients is often intractable, so IQ-Learn additionally learns a policy network, updating the Q-function (the green lines) and the policy (the blue lines) separately to remain close to the manifold. You can read the technical proofs and implementation details in our paper.
This approach is quite simple and needs only a modified update rule to train a Q-network using expert demonstrations and, optionally, environment interactions. The IQ-Learn update is a form of contrastive learning, where expert behavior is assigned a high reward and policy behavior a low reward, with rewards parametrized via Q-functions. It can be implemented in fewer than 15 lines on top of existing Q-learning algorithms for discrete action spaces, and on top of soft actor-critic (SAC) methods for continuous action spaces.
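To give a flavor of the update, here is a heavily simplified sketch of a discrete-action IQ-Learn-style loss: it pushes up the implied reward `Q(s, a) - gamma * V(s')` on expert transitions while pushing down the value of initial states. This omits pieces of the actual method (e.g. the concave regularizer on rewards and target networks), so treat it as an illustration, not the paper's algorithm; the batch format is also a made-up convenience.

```python
import math

GAMMA = 0.99  # illustrative discount factor

def soft_value(q_values):
    """V(s) = log sum_a exp Q(s, a) over a discrete action set."""
    return math.log(sum(math.exp(q) for q in q_values))

def iq_loss(expert_batch, init_state_values):
    """Simplified IQ-Learn-style objective (negated, for minimization).
    expert_batch: list of (Q(s, a_expert), [Q(s', a') for all a']) pairs;
    init_state_values: soft values V(s0) of sampled initial states."""
    # Contrastive 'up' term: implied reward on expert transitions.
    reward_term = sum(q_sa - GAMMA * soft_value(q_next)
                      for q_sa, q_next in expert_batch) / len(expert_batch)
    # 'Down' term: overall value of the policy, via initial states.
    value_term = (1 - GAMMA) * sum(init_state_values) / len(init_state_values)
    return -(reward_term - value_term)

loss = iq_loss([(1.0, [0.0, 0.0])], [0.0])
```

Gradient descent on this single scalar, with Q parametrized by a network, replaces both the discriminator update and the RL step of adversarial imitation.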
IQ-Learn has a number of advantages:
- It optimizes a single training objective using gradient descent and learns a single model for the Q-function.
- It performs well with very sparse data — even a single expert demonstration.
- It is simple to implement and works in both settings: with access to an environment (online IL) or without (offline IL).
- It scales to complex image-based environments and has proven theoretical convergence to a unique global optimum.
- Lastly, it can be used to recover rewards, adding interpretability to the policy’s behavior.
Despite the simplicity of the approach, we were surprised to find that it substantially outperformed a number of existing approaches on popular imitation learning benchmarks such as OpenAI Gym, MuJoCo, and Atari, including approaches that were much more complex or domain-specific. On all these benchmarks, IQ-Learn was the only method to successfully reach expert performance relying on just a few expert demonstrations (fewer than 10). IQ-Learn with a simple LSTM policy also works surprisingly well in the complex open-world setting of Minecraft and is able to learn from videos of human players to solve various tasks like building a house, creating a waterfall, caging animals, and finding caves^{4}.
Beyond simple imitation, we also tried imitating experts when only partial expert data is available, or when the expert’s environment or goals differ from the agent’s — settings more akin to the real world. We showed that IQ-Learn can be used for imitation without expert actions, relying solely on expert observations, which enables learning from videos. Moreover, IQ-Learn was surprisingly robust to distribution shifts in the expert’s behavior and goals in an environment, showing strong generalization to new, unseen settings and an ability to act as a meta-learner.
See videos of IQ-Learn below (trained with image observations):
Performance Comparisons (from the paper):
The generality of our method — it can be combined with any existing Q-learning or actor-critic implementation — makes IQ-Learn applicable to a wide range of domains and learning objectives in imitation and reinforcement learning beyond those explored in the paper.
We hope that IQ-Learn’s simple approach to learning policies by imitating a few experts will bring us one step closer to developing sample-efficient general AI agents that can learn a variety of behaviors from humans in real-world settings.
I would like to thank Stefano Ermon, Mo Tiwari, and Kuno Kim for valuable suggestions and proofreading. Thanks also to Jacob Schreiber, Megha Srivastava, and Sidd Karamcheti for their helpful and extensive comments. Finally, thanks to Skanda Vaidyanath, Susan R. Qi, and Brad Porter for their feedback.
The last part of this post was based on the following research paper:
- IQ-Learn: Inverse soft-Q Learning for Imitation. D. Garg, S. Chakraborty, C. Cundy, J. Song, S. Ermon. In NeurIPS, 2021 (Spotlight). (pdf, code)

 IQ-Learn won first place in the NeurIPS ‘21 MineRL challenge to create an AI that plays Minecraft, trained using imitation on only 20-40 expert demos (< 5 hrs of demo data). Recently, OpenAI released VPT, which is currently the best at playing Minecraft but, in comparison, utilizes 70,000 hours of demo data. ↩

 As the Q-function estimates the cumulative reward attained by an agent, the idea is to subtract the next-step Q-function from the current-state Q-function to get the single-step reward. This relationship is captured by the Bellman equation. ↩

 We prefer choosing the softmax action over the argmax action, as it introduces stochasticity, leading to improved exploration and better generalization, a technique often referred to as soft Q-learning. ↩

 Our trained AI agent is adept at finding caves and creating waterfalls. It is also good at building pens, but often fails to fully enclose the animals. It can build messy, incomplete house structures, though it often abandons current builds and starts new ones. (It uses a small LSTM to encode history; better mechanisms like self-attention could help improve its memory.) ↩