Research Directions on Deep Reinforcement Learning
Deep learning requires large amounts of data to fit its many parameters, and this constraint carries over to deep RL. In some domains, however, such data are difficult and costly to generate. This line of work improved the data efficiency of deep RL algorithms by:
- proposing and analyzing an alternative type of update for actor-critic architectures [Zimmer and Weng, 2019a,b], [Zimmer et al., 2016a,b,c].
Novelty: instead of updating the policy with the deterministic policy gradient, the critic is maximized in a non-parametric way and the policy then mimics this non-parametric solution. For this maximization, instead of relying on the Q function, which can be difficult to learn, we propose to rely on the V function, which is easier to learn. Fewer data can then be used to perform the update, but those data are more reliable.
- exploiting the symmetries present in robotics as a data-augmentation technique for multi-goal RL [Lin et al., 2019, 2020].
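The non-parametric update in the first item can be sketched as follows. This is a minimal illustration, not the published algorithm: `sample_actions` and `score` are hypothetical stand-ins for the current stochastic policy and the learned critic, and the subsequent regression of the parametric policy toward the selected actions is only described in the comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def nonparametric_targets(states, sample_actions, score, n_candidates=256):
    """For each state, sample candidate actions and keep the best-scoring one.

    `sample_actions(state, n)` and `score(state, action)` are hypothetical
    callables standing in for the behavior policy (plus exploration noise)
    and the critic's evaluation of an action in that state.
    """
    targets = []
    for s in states:
        candidates = sample_actions(s, n_candidates)
        values = np.array([score(s, a) for a in candidates])
        targets.append(candidates[np.argmax(values)])  # non-parametric argmax
    return np.array(targets)

# The parametric policy would then mimic these targets by regression,
# e.g. minimizing the mean squared error between policy(s) and the
# selected action for each state s in the batch.
```

With a toy quadratic critic the selected actions converge toward the true maximizer as the number of candidates grows, which is the sense in which the non-parametric step maximizes the critic.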
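The symmetry-based augmentation in the second item can be illustrated with a simple reflection. This sketch assumes, hypothetically, that states, actions, and goals are Cartesian vectors and that the dynamics and reward are invariant under mirroring one axis, as with a left/right-symmetric robot; the function names are illustrative, not from the cited papers.

```python
import numpy as np

def reflect_transition(state, action, goal, axis=0):
    """Mirror one transition across a coordinate axis.

    Valid only when the task is invariant under this reflection
    (hypothetical assumption for this sketch).
    """
    s, a, g = state.copy(), action.copy(), goal.copy()
    s[axis] *= -1.0
    a[axis] *= -1.0
    g[axis] *= -1.0
    return s, a, g

def augment(buffer, axis=0):
    """Double a replay buffer of (state, action, goal) transitions
    by appending the mirrored copy of each one."""
    return buffer + [reflect_transition(s, a, g, axis) for (s, a, g) in buffer]
```

Each real transition thus yields a second, synthetic one at no extra interaction cost, which is the source of the data-efficiency gain.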
Learning to solve a task may take a long time. Instead of learning from scratch in each newly encountered domain, previously gathered knowledge can serve as a bootstrap. We applied this transfer-learning approach when the target domain required too much knowledge to be tackled directly [Zimmer and Doncieux, 2017], when a teacher agent shares its knowledge by advising a student agent at propitious moments [Zimmer et al., 2014], and during curriculum learning, where the sensorimotor space of the agent grows while it learns a policy [Zimmer et al., 2018].
While an agent learns to solve a task, a meta-agent observes the learning process. Beyond acting on that process, the meta-agent can also learn the effect of its meta-decisions in order to improve the underlying learning. We applied this idea to a teacher agent advising a student agent: the teacher learns how to teach better [Zimmer et al., 2014]. We also worked on an architecture in which a first neural network learns a classification task while a second learns to bet on whether the first network's prediction is correct, improving the classification score [Zimmer et al., 2012]. Finally, we proposed a meta-architecture that triggers the growth of the sensorimotor space based on intrinsic motivation [Zimmer et al., 2018].