Introduction to Deep Reinforcement Learning
Yen-Chen Wu, 2015/12/11

Outline
- Reinforcement Learning
- Markov Decision Process
- How to Solve MDPs: DP, MC, TD, Q-learning (DQN)
- Paper Review

Reinforcement Learning

Branches of Machine Learning
(figure slide: where reinforcement learning sits among the branches of machine learning)

What makes RL different?
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non-i.i.d. data)
- The agent's actions affect the subsequent data it receives

Goal: Maximize Cumulative Reward
- Actions may have long-term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward

Agent & Environment
- The agent observes the environment and acts on it (e.g., game actions: Defense, Attack, Jump)
- Key distinctions: full vs. partial observability; learning vs. planning; exploration vs. exploitation; prediction vs. control

Markov Decision Process
- Markov Processes
- Markov Reward Processes
- Markov Decision Processes

Markov Process
- A memoryless random process: a tuple (S, P) in which the next state depends only on the current state (the Markov property)

Markov Reward Process
- A Markov process with values: a tuple (S, P, R, γ) that adds a reward function and a discount factor

Markov Decision Process (MDP)
- S: finite set of states (observations)
- A: finite set of actions
- P: transition probability
- R: immediate reward
- γ: discount factor
- Goal: choose a policy π that maximizes the expected return $\mathbb{E}\left[\sum_{t \ge 0} \gamma^{t} r_{t}\right]$

How to Solve MDPs
- Dynamic Programming (DP)
- Monte-Carlo (MC)
- Temporal-Difference (TD)
- Q-Learning

Model-Based: Dynamic Programming
- Evaluate the current policy, then update (improve) it, and repeat

Model-Free
- The transition probability and reward function are unknown
- MC vs. TD: MC updates from complete episode returns, while TD bootstraps from the current value estimate after each step

Model-Free: Q-learning
- The optimal action-value function satisfies the Bellman equation: $Q^{*}(s,a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^{*}(s',a') \mid s,a\right]$
- Basic idea: iterative tabular updates, which lack generalization
- In practice: use a function approximator. Linear? Use a DNN! (a minimal tabular sketch follows these notes)

Deep Q-Network (DQN)
(video demo slide)

Deep Q-Network
- Computes Q-values for all actions in one forward pass
- Input: 84x84x4 (four stacked frames)
- Convolves 32 filters of 8x8 with stride 4
- Convolves 64 filters of 4x4 with stride 2
- Convolves 64 filters of 3x3 with stride 1
- Fully connected layer with 512 nodes
- Output: one node per action (a PyTorch sketch follows these notes)

Update DQN
- Loss function: $L_i(\theta_i) = \mathbb{E}_{(s,a,r,s')}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta_i)\right)^{2}\right]$
- Gradient (up to a constant factor): $\nabla_{\theta_i} L_i = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta_i)\right) \nabla_{\theta_i} Q(s,a;\theta_i)\right]$

Two Techniques (replay-buffer and training-step sketches follow these notes)
- Experience replay: a pooled memory of past experience
  - Data efficiency (each transition can be reused across updates)
  - Avoids correlation between consecutive samples (reduces variance between batches)
  - Off-policy learning, which suits Q-learning
  - Randomly sampled mini-batches; prioritized sweeping is also possible (akin to active learning)
- Separate target network: more stable than fully online learning

DEMO

Paper Review

Paper list
- Massively Parallel Methods for Deep Reinforcement Learning
- Continuous Control with Deep Reinforcement Learning
- Deep Reinforcement Learning with Double Q-learning
- Policy Distillation
- Dueling Network Architectures for Deep Reinforcement Learning
- Multiagent Cooperation and Competition with Deep Reinforcement Learning

Massively Parallel Methods for Deep Reinforcement Learning
- Arun Nair et al., arXiv:1507.04296

Continuous Control with Deep Reinforcement Learning
- Timothy P. Lillicrap et al., arXiv:1509.02971, https://goo.gl/J4PIAz
- DDPG (Deep Deterministic Policy Gradient), presented in the slides as DDAC (Deep Deterministic Actor-Critic)

Deep Reinforcement Learning with Double Q-learning
- Double Q-learning: decouple action selection from action evaluation to reduce overestimation of Q-values (a sketch follows these notes)

Policy Distillation
- Trains a student network on the soft targets produced by a trained teacher network

Dueling Network Architectures for Deep Reinforcement Learning
- Splits the Q-network into a state-value stream and an advantage stream (a sketch follows these notes)

Multiagent Cooperation and Competition with Deep Reinforcement Learning
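Code sketches. The sketches below are illustrative additions, not part of the original slides. First, a minimal tabular Q-learning loop showing the iterative update the slides describe, Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') − Q(s,a)); the 1-D chain environment, its reward, and all hyperparameter values are invented for illustration:

```python
# Minimal tabular Q-learning on a hypothetical 1-D chain MDP.
# Only the update rule comes from the slides' Bellman-equation discussion;
# the environment and hyperparameters are invented.
import random

N_STATES = 5            # states 0..4; state 4 is the terminal goal
ACTIONS = (0, 1)        # 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(s, a):
    """Toy deterministic dynamics: reward 1 only when the goal is reached."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward, s2 == N_STATES - 1

for episode in range(1000):
    s = random.randrange(N_STATES - 1)      # random non-terminal start
    for t in range(100):                    # cap episode length
        # epsilon-greedy: explore sometimes, otherwise act greedily
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[s][act])
        s2, r, done = step(s, a)
        # Q-learning update: bootstrap from the greedy value of s'
        target = r + (0.0 if done else GAMMA * max(Q[s2]))
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2
        if done:
            break

print(Q)   # "right" should dominate in every state after training
```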
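Next, the DQN architecture slide sketched in PyTorch. The filter counts, kernel sizes, strides, 512-node fully connected layer, and per-action outputs are taken from the slide; the slides name no framework, so the layer API and the ReLU placements are assumptions:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Conv net from the slide: 84x84x4 input -> one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 32 filters of 8x8, stride 4
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # 64 filters of 4x4, stride 2
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # 64 filters of 3x3, stride 1
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),  # fully connected, 512 nodes
            nn.ReLU(),
            nn.Linear(512, n_actions),   # one output node per action
        )

    def forward(self, x):
        # x: batch of stacked frames, shape (B, 4, 84, 84), values in [0, 1]
        return self.head(self.features(x))

q = DQN(n_actions=4)
print(q(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])
```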
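A minimal replay memory for the experience-replay technique, assuming the uniform random mini-batch sampling described on the slide (the class name and capacity are invented for illustration; prioritized sampling would replace `random.sample`):

```python
import random
from collections import deque

class ReplayBuffer:
    """Pooled memory of (s, a, r, s', done) transitions; uniform random
    sampling breaks the correlation between consecutive samples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop out

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        # unzip into tuples of states, actions, rewards, next states, dones
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)
```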
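A sketch of one DQN update combining the loss above with the separate-target-network technique. The stand-in two-layer Q-network (used so the sketch runs on its own), the learning rate, and the sync interval are assumptions, not values from the slides; for Atari inputs, swap in the convolutional DQN sketched earlier:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in Q-network so the sketch is self-contained.
online = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
target = copy.deepcopy(online)                  # separate target network
optimizer = torch.optim.RMSprop(online.parameters(), lr=2.5e-4)
GAMMA, SYNC_EVERY = 0.99, 1000

def train_step(s, a, r, s2, done):
    """One DQN update on a mini-batch from the replay memory.
    Shapes: s, s2 -> (B, 8) float; a -> (B,) long; r, done -> (B,) float."""
    with torch.no_grad():
        # TD target from the frozen network: y = r + gamma * max_a' Q_target(s', a')
        y = r + GAMMA * (1.0 - done) * target(s2).max(dim=1).values
    q_sa = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)                  # L = E[(y - Q(s,a;theta))^2]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every SYNC_EVERY gradient steps, copy the online weights into the
# target network, which is what makes learning more stable than fully
# online updates:
#   target.load_state_dict(online.state_dict())
```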
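For the Double Q-learning paper, a sketch of how the target changes relative to plain DQN, under the same tensor conventions as the training-step sketch above:

```python
import torch

def double_q_target(online, target, r, s2, done, gamma=0.99):
    """Double DQN target: select the next action with the online network,
    evaluate it with the target network; this tempers the upward bias of
    the max_a' Q term in the standard DQN target."""
    with torch.no_grad():
        a_star = online(s2).argmax(dim=1, keepdim=True)    # action selection
        q_next = target(s2).gather(1, a_star).squeeze(1)   # action evaluation
        return r + gamma * (1.0 - done) * q_next
```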
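For the dueling-network paper, a sketch of the dueling head; the slide gives only the title, so the structure here follows the paper's value/advantage decomposition, and the 512-unit stream width is an assumption:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling head: a state-value stream V(s) and an advantage stream
    A(s,a), recombined as Q(s,a) = V(s) + A(s,a) - mean_a A(s,a);
    subtracting the mean keeps the decomposition identifiable."""
    def __init__(self, in_features, n_actions):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(in_features, 512), nn.ReLU(), nn.Linear(512, 1))
        self.advantage = nn.Sequential(
            nn.Linear(in_features, 512), nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, x):
        v = self.value(x)                        # (B, 1)
        adv = self.advantage(x)                  # (B, n_actions)
        return v + adv - adv.mean(dim=1, keepdim=True)

# Replaces the fully connected head of the DQN sketched earlier:
head = DuelingHead(in_features=64 * 7 * 7, n_actions=4)
print(head(torch.zeros(1, 64 * 7 * 7)).shape)    # torch.Size([1, 4])
```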