Matthieu Zimmer
Reinforcement learning allows an agent to learn a behavior that has never been explicitly defined by humans. The agent discovers its environment and the consequences of its actions through interaction: it learns from its own experience, without pre-established knowledge of the goals or effects of its actions. This thesis tackles how deep learning can help reinforcement learning handle continuous spaces and environments with many degrees of freedom, in order to solve problems closer to reality. Indeed, neural networks scale well and have good representational power: they make it possible to approximate functions over continuous spaces and they support a developmental approach, because they require little a priori knowledge about the domain. We seek to reduce the amount of interaction the agent needs to achieve acceptable behavior. To do so, we propose the Neural Fitted Actor-Critic framework, which defines several data-efficient actor-critic algorithms. We then examine how the agent can fully exploit the transitions generated by previous behaviors by integrating off-policy data into the proposed framework. Finally, we study how the agent can learn faster by taking advantage of the development of its own body, in particular by gradually increasing the dimensionality of its sensorimotor space.
Reinforcement learning; Actor-critic; Neural networks; Continuous environment; Developmental approach; Deep learning
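To make the actor-critic principle mentioned in the abstract more concrete, the sketch below shows a minimal, generic actor-critic loop with neural function approximation. It is only an illustration under simplifying assumptions: it uses PyTorch, an invented one-dimensional toy environment, and the plain one-step TD actor-critic update; it is not the Neural Fitted Actor-Critic algorithm developed in the thesis.

```python
# Minimal generic actor-critic sketch with neural function approximation.
# NOT the Neural Fitted Actor-Critic algorithm from the thesis; it only
# illustrates the principle (a critic estimating V(s), an actor updated
# from the TD error) on a hypothetical 1-D toy task.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
        self.log_std = nn.Parameter(torch.zeros(1))
    def forward(self, s):
        # Gaussian policy over a continuous 1-D action.
        return torch.distributions.Normal(self.net(s), self.log_std.exp())

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
    def forward(self, s):
        # State-value estimate V(s).
        return self.net(s)

def step(state, action):
    """Toy continuous environment: drive the state toward the origin."""
    next_state = (state + 0.1 * action).clamp(-1.0, 1.0)
    reward = -next_state.abs()
    return next_state, reward

actor, critic = Actor(), Critic()
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(200):
    state = torch.rand(1, 1) * 2 - 1          # random start in [-1, 1]
    for t in range(50):
        dist = actor(state)
        action = dist.sample()
        next_state, reward = step(state, action)

        # One-step TD error: the critic's prediction error.
        with torch.no_grad():
            td_target = reward + gamma * critic(next_state)
        td_error = td_target - critic(state)

        # Critic minimizes the squared TD error.
        opt_c.zero_grad()
        (td_error ** 2).mean().backward()
        opt_c.step()

        # Actor follows the policy gradient weighted by the TD error.
        opt_a.zero_grad()
        (-dist.log_prob(action) * td_error.detach()).mean().backward()
        opt_a.step()

        state = next_state
```

In this generic loop the critic reduces its one-step prediction error while the actor increases the log-probability of actions associated with a positive TD error; data-efficient variants such as those studied in the thesis differ in how, and how often, these two updates are performed and in which transitions they reuse.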