Research Directions on Deep Reinforcement Learning

Data Efficiency

As deep learning requires large amounts of data to fit its many parameters, this constraint carries over to deep RL. In some domains, however, generating such data is difficult and costly. This series of work improved the data efficiency of deep RL algorithms by:

  • proposing and analyzing another type of update for actor-critic architectures [Zimmer and Weng, 2019a,b], [Zimmer et al., 2016a,b,c].
    Novelty: Instead of updating the policy with the deterministic policy gradient, the policy is first updated by maximizing the critic in a non-parametric way; the parametric policy then mimics this non-parametric solution. When performing the non-parametric maximization, instead of relying on the Q function, which can be difficult to learn, we propose to rely on the V function, which is easier to learn. Fewer data points are then used to perform the update, but those data are more reliable.

    [Zimmer and Weng, 2019a]

  • exploiting the symmetries present in robotics as a data augmentation technique for multi-goal RL [Lin et al., 2019, 2020].
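The first item's update rule can be illustrated with a minimal sketch. Everything here is schematic: the `critic` callable, the `policy_fit` regression step, the action range, and the uniform candidate sampling are assumptions for illustration, and a Q-style critic stands in for the V-function variant the cited papers advocate.

```python
import numpy as np

def nonparametric_actor_update(states, critic, policy_fit,
                               act_dim=1, n_candidates=64, seed=0):
    """Maximize the critic non-parametrically, then let the policy mimic it.

    For each state: sample candidate actions, keep the one the critic
    scores highest, then fit the policy by supervised regression on
    those per-state targets instead of following a policy gradient.
    """
    rng = np.random.default_rng(seed)
    targets = []
    for s in states:
        candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, act_dim))
        scores = np.array([critic(s, a) for a in candidates])
        targets.append(candidates[np.argmax(scores)])  # non-parametric argmax
    policy_fit(states, np.asarray(targets))            # policy mimics the solution
```

With a toy critic whose optimum action equals the state, the regression targets handed to `policy_fit` land close to the states, which is the behavior the update aims for.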
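The symmetry-based augmentation of the second item can be sketched as follows — a minimal illustration, assuming that state, action, and goal are Cartesian vectors sharing the same axis convention, and that the robot is symmetric across one plane; the function name and the single-axis reflection are illustrative, not the published method.

```python
import numpy as np

def reflect_transition(state, action, goal, axis=1):
    """Mirror one multi-goal transition across a symmetry plane of the
    robot by negating a single Cartesian axis, yielding a second valid
    transition for the replay buffer at no extra interaction cost."""
    def flip(v):
        sign = np.ones(v.shape[-1])
        sign[axis] = -1.0
        return v * sign
    return flip(state), flip(action), flip(goal)
```

Each stored transition thus doubles the multi-goal replay data for free, which is where the data-efficiency gain comes from.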

Transfer Learning

Learning to solve a task may take a lot of time. Instead of learning from scratch in each newly encountered domain, previously gathered knowledge can be used as a bootstrap. We applied this transfer learning approach when the target domain required too much knowledge to be tackled directly [Zimmer and Doncieux, 2017], when a teacher agent shares its knowledge by advising a student agent at propitious moments [Zimmer et al., 2014], and during curriculum learning, where the sensorimotor space of the agent grows while it is learning a policy [Zimmer et al., 2018].

Novelty: We define a process that lets a robot build its own representation for a reinforcement learning algorithm. The principle is to first use direct policy search in the sensorimotor space, i.e. with no predefined discrete sets of states or actions, and then to extract discrete actions from the corresponding learning traces and identify the state dimensions relevant for estimating the value function. Once this is done, the robot can apply reinforcement learning (1) to be more robust in new domains and, if required, (2) to learn faster than with direct policy search. This approach takes the best of both worlds: first learning in a continuous space, which avoids the need for a specific representation at the price of a long learning process and poor generalization, and then learning with an adapted representation, which is faster and more robust.

[Zimmer and Doncieux, 2017]
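The extraction step above can be sketched in a simplified form. This is only a stand-in under stated assumptions: correlation with the return as the relevance score, and plain k-means over observed actions as the discretization, are illustrative choices, not the procedure of the cited paper; the function name and trace layout are also assumptions.

```python
import numpy as np

def extract_representation(traces, n_actions=4, corr_threshold=0.2):
    """From continuous learning traces, extract (1) the state dimensions
    relevant to the value function and (2) a discrete action set.

    Relevance is scored by correlation with the return; discrete actions
    are k-means centroids over the continuous actions seen in the traces.
    """
    S = np.concatenate([s for s, _, _ in traces])   # states  [N, d]
    A = np.concatenate([a for _, a, _ in traces])   # actions [N, k]
    G = np.concatenate([g for _, _, g in traces])   # returns [N]

    # Keep state dimensions whose |correlation| with the return is high.
    corr = np.array([abs(np.corrcoef(S[:, i], G)[0, 1]) for i in range(S.shape[1])])
    relevant_dims = np.where(corr > corr_threshold)[0]

    # Discretize actions with a plain k-means (quantile-spread init).
    order = np.argsort(A[:, 0])
    centroids = A[order[np.linspace(0, len(A) - 1, n_actions).astype(int)]].copy()
    for _ in range(20):
        labels = np.argmin(((A[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(n_actions):
            if (labels == j).any():
                centroids[j] = A[labels == j].mean(axis=0)
    return relevant_dims, centroids
```

A discrete RL algorithm can then run on the reduced state dimensions with the centroid actions, which is the second phase described above.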

Meta Learning

While an agent is learning to solve a task, a meta-agent observes the learning process. Beyond acting on this process, the meta-agent can also learn the effect of its meta-decisions so as to improve the learning process itself. We applied this idea to a teacher agent giving advice to a student agent: the teacher learns how to teach better [Zimmer et al., 2014]. We also worked on a neural network architecture where a first network learns a classification task while a second one learns to bet on whether the first network's prediction is correct, improving the overall classification score [Zimmer et al., 2012]. Finally, we proposed a meta-architecture that triggers the growth of the sensorimotor space based on intrinsic motivation [Zimmer et al., 2018].
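The teacher's meta-decision of when to advise can be sketched as a toy bandit. This is deliberately simplified: treating "advise vs. stay silent" as a two-armed bandit rewarded by the student's observed progress, with a fixed advice budget, is an illustrative assumption, not the learned advising strategy of [Zimmer et al., 2014].

```python
import random

class AdvisingTeacher:
    """Toy meta-agent that learns *when* advising helps the student.

    "Advise" (1) vs. "stay silent" (0) is treated as a two-armed bandit
    whose reward is the student's observed learning progress, so the
    teacher learns the effect of its own meta-decisions.
    """

    def __init__(self, budget, epsilon=0.1):
        self.budget = budget              # advice is a limited resource
        self.epsilon = epsilon            # exploration rate
        self.value = {0: 0.0, 1: 0.0}     # estimated effect of each choice
        self.count = {0: 0, 1: 0}

    def decide(self, _student_observation):
        if self.budget <= 0:
            return 0                      # no advice budget left
        if random.random() < self.epsilon:
            return random.choice([0, 1])  # explore
        return max(self.value, key=self.value.get)

    def update(self, decision, student_progress):
        if decision == 1:
            self.budget -= 1
        self.count[decision] += 1
        # Incremental mean of observed progress per meta-decision.
        self.value[decision] += (student_progress - self.value[decision]) / self.count[decision]
```

Run against a simulated student whose progress is higher when advised, the teacher's value estimate for advising overtakes the one for staying silent while the budget is consumed.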

Novelty: Competing approaches postulate that the dimensionality of the sensorimotor space remains the same across tasks. The originality here is to let this dimensionality evolve. The main results show that performing this transfer allows faster learning toward a better-quality solution in two different environments with two different algorithms.

[Zimmer et al., 2018]