Research Directions on Deep Reinforcement Learning

Learning an optimal policy efficiently in complex environments (e.g. with continuous state and action spaces, sparse reward functions, non-stationarity, or high stochasticity) is one of the biggest open problems in reinforcement learning (RL). It raises many challenges: how to properly explore an environment, when to explore (exploration-exploitation dilemma), how to estimate the quality of a state (bias-variance dilemma), how to take past experience into account (off-policy learning), etc.
Deep learning pushed the boundaries on some of these issues, but at the cost of requiring a large amount of interaction data.

Data Efficiency

Deep learning requires a lot of data to learn its many parameters efficiently. In many domains, however, generating the transitions that deep reinforcement learning trains on is difficult and costly. This series of work improved the data efficiency of deep RL algorithms (i.e. extracting as much information as possible from the gathered transitions) by:

  • proposing and analyzing another type of update for actor-critic architectures [Zimmer and Weng, 2019a,b], [Zimmer et al., 2016a,b,c].
  • exploiting the symmetries present in robotics as a data augmentation technique for multi-goal RL [Lin et al., 2019, 2020].
  • auto-tuning several hyperparameters online instead of relying on a costly offline hyperparameter optimization [Huang et al., 2020, 2021].
  • learning policies represented by first-order logic formulas [Zimmer et al., 2021].
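
To make the symmetry-based augmentation item above more concrete, here is a minimal sketch, not the method of the cited papers. It assumes a hypothetical reaching task in which the state, action, and goal are 3D vectors and the reward depends only on the distance to the goal, so that mirroring every vector across the x-z plane yields another valid transition with the same reward. The names `Transition`, `reflect_y`, and `augment_with_reflection` are illustrative only.

```python
from typing import List, NamedTuple, Tuple

Vector = Tuple[float, float, float]


class Transition(NamedTuple):
    """One multi-goal RL transition (hypothetical layout, for illustration)."""
    state: Vector       # e.g. end-effector position
    action: Vector      # e.g. commanded displacement
    goal: Vector        # target position
    next_state: Vector
    reward: float


def reflect_y(v: Vector) -> Vector:
    """Mirror a 3D vector across the x-z plane by negating its y component."""
    x, y, z = v
    return (x, -y, z)


def augment_with_reflection(batch: List[Transition]) -> List[Transition]:
    """Return the batch followed by its mirrored copy.

    Assumption: dynamics and reward are invariant under the reflection,
    so each mirrored transition is as valid as the original one.
    """
    mirrored = [
        Transition(
            state=reflect_y(t.state),
            action=reflect_y(t.action),
            goal=reflect_y(t.goal),
            next_state=reflect_y(t.next_state),
            reward=t.reward,  # unchanged: distance to goal is preserved
        )
        for t in batch
    ]
    return batch + mirrored
```

Under these assumptions, each gathered transition yields two training samples at no extra interaction cost. Whether a given reflection or rotation is actually a symmetry has to be checked for each task.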

Transfer Learning

Learning to solve a task may take a lot of time. Instead of learning from scratch in each newly encountered domain, previously gathered knowledge can be used to bootstrap learning. We proposed to use this transfer learning approach in three settings: when the target domain requires too much knowledge to be tackled directly [Zimmer and Doncieux, 2017], when a teacher agent shares its knowledge by advising a student agent at propitious moments [Zimmer et al., 2014], and during curriculum learning, where the sensorimotor space of the agent grows while it is learning a policy [Zimmer et al., 2018].
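
As an illustration of advising "at propitious moments", the sketch below uses a simple fixed rule: the teacher spends a limited advice budget only in states where its own action values differ strongly, i.e. where a wrong action is costly. This heuristic is a deliberately simpler stand-in, not the approach of the cited work. The class `BudgetedTeacher`, the tabular `q_table`, and the `threshold` parameter are all assumptions made for this example.

```python
from typing import Dict, Hashable, List, Optional


def state_importance(q_values: List[float]) -> float:
    """Gap between the best and worst action value in a state."""
    return max(q_values) - min(q_values)


class BudgetedTeacher:
    """Teacher that advises only in important states, within a budget."""

    def __init__(self, q_table: Dict[Hashable, List[float]],
                 budget: int, threshold: float) -> None:
        self.q_table = q_table      # teacher's action values per state
        self.budget = budget        # number of pieces of advice left
        self.threshold = threshold  # minimum importance to give advice

    def advise(self, state: Hashable) -> Optional[int]:
        """Return the teacher's greedy action index, or None for no advice."""
        if self.budget <= 0:
            return None
        q_values = self.q_table[state]
        if state_importance(q_values) < self.threshold:
            return None
        self.budget -= 1
        return max(range(len(q_values)), key=q_values.__getitem__)
```

A student would call `advise(state)` at each step and fall back to its own policy when the result is `None`. Replacing the fixed threshold with a learned decision is what turns the teacher into a learning agent itself.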

Meta Learning

While an agent is learning to solve a task, a meta-agent observes the learning process. In addition to acting on this process, the meta-agent can learn the effect of its meta-decisions and thereby improve how the agent learns. We applied this idea to a teacher agent giving advice to a student agent: the teacher learns how to teach better [Zimmer et al., 2014]. We also worked on a neural network architecture in which a first network learned a classification task while a second one learned to bet on whether the first network's prediction was correct, in order to improve the classification score [Zimmer et al., 2012]. Finally, we proposed a meta-architecture that triggers the growth of the sensorimotor space based on intrinsic motivation [Zimmer et al., 2018].


Fairness

As the operations of autonomous systems generally affect several users simultaneously, it is crucial that their design accounts for fairness considerations. Hence, we investigated the problem of learning a policy that treats its users equitably [Siddique et al., 2020]. We extended this approach to the multi-agent case, both in centralized learning with decentralized execution scenarios and in fully decentralized scenarios [Zimmer et al., 2021].
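
One way to make "treating users equitably" measurable is to score a vector of per-user utilities with a welfare function that rewards both efficiency and balance. The sketch below uses a generalized Gini welfare function as an example of such a score. The text above does not specify which fairness criterion is used, so this choice, the function name `generalized_gini_welfare`, and the weights are assumptions for illustration.

```python
from typing import Sequence


def generalized_gini_welfare(utilities: Sequence[float],
                             weights: Sequence[float]) -> float:
    """Weighted sum of utilities sorted in increasing order.

    With non-increasing weights, the worst-off users receive the largest
    weights, so among utility vectors with the same total, the more
    balanced one scores higher.
    """
    if len(utilities) != len(weights):
        raise ValueError("utilities and weights must have the same length")
    if any(a < b for a, b in zip(weights, weights[1:])):
        raise ValueError("weights must be non-increasing")
    ordered = sorted(utilities)  # worst-off user first
    return sum(w * u for w, u in zip(weights, ordered))
```

For example, with weights `[1.0, 0.5]`, the balanced utilities `[5.0, 5.0]` score 7.5 while `[1.0, 9.0]` score 5.5, even though both sum to 10. The score is also unchanged when users are permuted, so no user is favored by identity.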