Developmental reinforcement learning

Authors

Matthieu Zimmer

Abstract

Reinforcement learning allows an agent to learn a behavior that has never been previously defined by humans. The agent discovers the environment and the consequences of its actions through interaction: it learns from its own experience, without pre-established knowledge of the goals or effects of its actions. This thesis studies how deep learning can help reinforcement learning handle continuous spaces and environments with many degrees of freedom, in order to solve problems closer to reality. Neural networks scale well and have strong representational power: they make it possible to approximate functions over continuous spaces and support a developmental approach, because they require little a priori knowledge of the domain. We seek to reduce the number of interactions the agent needs to reach an acceptable behavior. To do so, we propose the Neural Fitted Actor-Critic framework, which defines several data-efficient actor-critic algorithms. We then examine how the agent can fully exploit the transitions generated by previous behaviors by integrating off-policy data into the proposed framework. Finally, we study how the agent can learn faster by taking advantage of the development of its body, in particular by gradually increasing the dimensionality of its sensorimotor space.
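
For readers less familiar with the actor-critic setup the abstract refers to, the sketch below shows a generic one-step actor-critic update with small neural networks on a continuous state and action space. It is only an illustration under assumed choices (PyTorch, a hypothetical 1-D toy reward in `toy_step`, arbitrary network sizes, a one-step TD error used as the advantage); it is not the Neural Fitted Actor-Critic algorithms developed in the thesis, which are batch, data-efficient methods.

```python
# Minimal one-step actor-critic sketch on a continuous toy problem (PyTorch).
# Illustrative assumptions only: the environment, network sizes and update rule
# are hypothetical and do NOT reproduce the thesis's NFAC algorithms.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 3, 1
GAMMA = 0.99

# Actor: maps a continuous state to the mean of a Gaussian policy.
actor = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.Tanh(), nn.Linear(32, ACTION_DIM))
log_std = torch.zeros(ACTION_DIM, requires_grad=True)

# Critic: estimates the state value V(s).
critic = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.Tanh(), nn.Linear(32, 1))

actor_opt = torch.optim.Adam(list(actor.parameters()) + [log_std], lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)


def toy_step(state, action):
    """Hypothetical dynamics: reward for keeping the first coordinate near 0."""
    next_state = (state + 0.1 * action.expand_as(state)).detach()
    reward = -next_state[0].abs().detach()
    return next_state, reward


state = torch.randn(STATE_DIM)
for _ in range(200):  # a handful of on-policy updates
    mean = actor(state)
    dist = torch.distributions.Normal(mean, log_std.exp())
    action = dist.sample()
    next_state, reward = toy_step(state, action)

    # One-step TD error: used as the critic's regression error
    # and as the advantage estimate for the actor.
    with torch.no_grad():
        target = reward + GAMMA * critic(next_state)
    td_error = target - critic(state)

    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Policy-gradient step: increase the log-probability of actions
    # whose outcome was better than the critic predicted.
    actor_loss = -(dist.log_prob(action).sum() * td_error.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    state = next_state
```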

Keywords

Reinforcement learning ; Actor-critic ; Neural networks ; Continuous environment ; Developmental approach ; Deep learning
