A Step Towards Conscious Planning

A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning

Mingde “Harry” Zhao, Zhen Liu, Sitao Luan, Shuyuan Zhang, Doina Precup, Yoshua Bengio

(arXiv / GitHub)

We introduce into reinforcement learning inductive biases inspired by higher-order cognitive functions in humans. These architectural constraints enable the planner to direct attention dynamically to the relevant parts of the state at each step of imagined future trajectories. We present an end-to-end learning agent that carries out such partial planning in latent space, powered by a set-based representation and a bottleneck mechanism that forces the number of attended entities (the partial state) to be small at each planning step. The planning agent learns to attend to the relevant parts of the environment state, i.e., those that change in each transition or that influence the reward and termination. In experiments, we observe that the bottleneck helps the agent learn useful representations and yields better out-of-distribution generalization across different families of tasks.

Whether we are planning a path home from the office or from a hotel to an airport in an unfamiliar city, we typically focus our attention on a small subset of relevant variables, e.g., the change in our position or the presence of obstacles along the planned path. It is plausible that this ability contributes to the ease with which humans handle novel situations.

This observation motivates our interest in inductive biases for this kind of conscious planning, which involves attending to the right elements of the state space at different steps of the plan. An interesting hypothesis is that this ability may be due to a style of computation associated with the conscious processing of information. Conscious attention focuses on a few necessary elements of the environment, aided by an abstract internal representation of the world. This pattern, also known as consciousness in the first sense (C1), has been theorized to be the source of humans' exceptional ability to generalize and adapt to new situations, and to learn new skills or concepts efficiently from very few examples. A central characteristic of conscious processing is that it involves a bottleneck, which forces one to handle dependencies between only a few characteristics of the environment at a time. Although this focus on a small subset of the available information may seem limiting, it may in fact facilitate out-of-distribution (OOD) and systematic generalization, e.g., to other settings where the ignored variables differ.
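To make the bottleneck idea concrete, here is a minimal sketch, not the authors' implementation, of a hard top-k selection over a set of entity vectors, written in PyTorch. All names are illustrative; the actual agent uses a differentiable attention mechanism rather than the hard selection shown here.

```python
# Minimal sketch of a top-k bottleneck: from a set of N entity vectors,
# keep only the k entities with the highest learned relevance scores.
# NOTE: hard top-k selection is non-differentiable w.r.t. which entities
# are chosen; the paper's mechanism uses soft attention instead.
import torch
import torch.nn as nn

class TopKBottleneck(nn.Module):
    def __init__(self, dim: int, k: int):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)  # learned relevance score per entity

    def forward(self, entities: torch.Tensor) -> torch.Tensor:
        # entities: (batch, N, dim), a set of entity vectors
        scores = self.score(entities).squeeze(-1)        # (batch, N)
        topk = scores.topk(self.k, dim=-1).indices       # (batch, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, entities.size(-1))
        return entities.gather(1, idx)                   # (batch, k, dim)

# Usage: from a set of 16 entities, plan over only the 4 most relevant.
bottleneck = TopKBottleneck(dim=32, k=4)
partial_state = bottleneck(torch.randn(8, 16, 32))  # -> (8, 4, 32)
```

Keeping k much smaller than N is what forces the model to handle dependencies between only a few entities at a time.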

Figure: Experiment showing the promising OOD generalization ability of the bottleneck.

In this paper, we propose an end-to-end architecture that encodes some of these ideas into reinforcement learning agents. Reinforcement learning (RL) is a natural approach for combining learning to act from interaction with a complex environment and planning to achieve new goals. However, most of RL's big successes have been obtained by deep, model-free agents. While Model-Based RL (MBRL) has generated significant research, its empirical performance has typically lagged behind, with the notable exception of MuZero. We also note that most MBRL agents plan in observation space, again with the exceptions of the Predictron and MuZero.

Figure: A bird's-eye view of the overall design.

Our proposal is to draw inspiration from human conscious planning to build an architecture that learns a latent space useful for planning and in which attention can be focused on a small set of variables at any time. This builds on the idea of partial planning with modern deep RL architectures. More specifically, we build and train an end-to-end latent-space MBRL agent that does not require reconstructing the observations and uses tree-search-based model predictive control as its planning algorithm. Our model uses a set-based representation to construct a latent state from its observations, and a selective attention bottleneck to plan over dynamically selected aspects of that state (sketched below). Our experiments show that these architectural constraints improve both sample efficiency and OOD generalization compared to more conventional MBRL methods.
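The sketch below puts these pieces together under stated assumptions: it relies on hypothetical components `encoder` (observation to a set of entity vectors), `bottleneck` (set to k attended entities, e.g., the module sketched earlier), `model` (a learned latent transition over the partial state), and `value` (a learned state-value estimate). The real agent performs a deeper tree search and is trained end to end; for brevity this shows only one-step-lookahead model predictive control.

```python
# Hedged sketch of latent-space planning with an attention bottleneck.
# All component names are assumptions for illustration, not the paper's API.
import torch

@torch.no_grad()
def plan(obs, actions, encoder, bottleneck, model, value, gamma=0.99):
    """Pick the action whose imagined one-step outcome scores best."""
    state = encoder(obs)         # latent set of entities; no reconstruction
    partial = bottleneck(state)  # attended subset: the partial state
    best_action, best_score = None, float("-inf")
    for a in actions:
        # Imagine one transition of the partial state under action a.
        next_partial, reward = model(partial, a)
        score = reward + gamma * value(next_partial)
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```

Because the transition model only has to predict the few attended entities, planning can ignore the parts of the state that are irrelevant to the current decision, which is precisely what the bottleneck is meant to enforce.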