MIT Deep Learning for Self-Driving Cars, Lecture 2 (76 pages)
Lex Fridman (fridman@mit.edu) | Website: cars.mit.edu | January 2017
Course 6.S094: Deep Learning for Self-Driving Cars
Learning to Move: Deep Reinforcement Learning for Motion Planning

Administrative
- Website: cars.mit.edu
- Contact email: deepcars@mit.edu
- Required: create an account on the website and follow the tutorial for each of the 2 projects.
- Recommended: ask questions. Win the competition!

Schedule

DeepTraffic: Solving Traffic with Deep Reinforcement Learning

Types of Machine Learning [References: 81]
- Supervised learning
- Unsupervised learning
- Semi-supervised learning
- Reinforcement learning
- Standard supervised learning pipeline.

Perceptron: Weighing the Evidence [References: 78]
- Evidence in, decisions out.

Perceptron: Implement a NAND Gate [References: 79]
- Universality: NAND gates are functionally complete, meaning we can build any logical function out of them.
- With weights (-2, -2) and bias 3:
  (-2)*0 + (-2)*0 + 3 =  3  -> output 1
  (-2)*0 + (-2)*1 + 3 =  1  -> output 1
  (-2)*1 + (-2)*0 + 3 =  1  -> output 1
  (-2)*1 + (-2)*1 + 3 = -1  -> output 0

Perceptron and NAND Gate [References: 80]
- Both circuits can represent arbitrary logical functions.
- But "perceptron circuits" can learn.
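To make the NAND arithmetic above concrete, here is a minimal Python sketch of a perceptron with the weights (-2, -2) and bias 3 from the slide. The function name and the threshold-at-zero convention are illustrative assumptions, not part of the deck.

```python
# Perceptron with weights (-2, -2) and bias 3, as in the NAND-gate slide.
# Output is 1 when the weighted sum plus bias is positive, else 0.
def perceptron_nand(x1, x2, w=(-2, -2), b=3):
    s = w[0] * x1 + w[1] * x2 + b
    return 1 if s > 0 else 0

# Verify the four rows of the NAND truth table shown on the slide.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron_nand(x1, x2))
# Expected: (0,0)->1, (0,1)->1, (1,0)->1, (1,1)->0
```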
The Process of Learning: Small Change in Weights, Small Change in Output [References: 80]
- This requires "smoothness" of the activation function (sigmoid neuron rather than perceptron).
- Smoothness of the activation function means the change in output is approximately a linear function of the changes in the weights and bias.
- Learning is the process of gradually adjusting the weights to achieve any gradual change in the output.

Combining Neurons into Layers
- Feed-forward neural network
- Recurrent neural network: has state memory, but is harder to train.

Task: Classify an Image of a Number [References: 63, 80]
- Input: a 28x28 image; the network maps the pixels to a digit class.
- Ground truth for "6" is a target vector; a "loss" function measures how far the network output is from that target.

Philosophical Motivation for Reinforcement Learning
- Takeaway from supervised learning: neural networks are great at memorization and not (yet) great at reasoning.
- Hope for reinforcement learning: brute-force propagation of outcomes to knowledge about states and actions. This is a kind of brute-force "reasoning".

Agent and Environment [References: 80]
- At each step the agent: executes an action, receives an observation (new state), receives a reward.
- The environment: receives the action, emits an observation (new state), emits a reward.

Reinforcement Learning [References: 85]
- Reinforcement learning is a general-purpose framework for decision-making.
- An agent operates in an environment (e.g., Atari Breakout).
- An agent has the capacity to act, and each action influences the agent's future state.
- Success is measured by a reward signal.
- Goal: select actions to maximize future reward.

Markov Decision Process [References: 84]
- A trajectory of states, actions, and rewards: s0, a0, r1, s1, a1, r2, ..., s(n-1), a(n-1), rn, sn, ending in a terminal state.

Major Components of an RL Agent
An RL agent may include one or more of these components:
- Policy: the agent's behavior function.
- Value function: how good each state and/or action is.
- Model: the agent's representation of the environment.

Robot in a Room
- Grid world with a +1 terminal cell, a -1 terminal cell, and a START cell.
- Actions: UP, DOWN, LEFT, RIGHT. Choosing UP moves UP 80% of the time, LEFT 10%, RIGHT 10%.
- Reward: +1 at (4,3), -1 at (4,2), and -0.04 for each step.
- What is the strategy to achieve maximum reward? What if the actions were deterministic?

Is this a solution?
- Only if the actions were deterministic; not in this case (actions are stochastic).
- A solution/policy is a mapping from each state to an action.

Optimal Policy
- The optimal policy depends on the per-step reward; the slides show it for step rewards of -2, -0.1, -0.04, -0.01, and +0.01.

Value Function [References: 84]
- Future reward: R = r1 + r2 + ... + rn, and from step t: R_t = r_t + r_(t+1) + ... + r_n.
- Discounted future reward (the environment is stochastic):
  R_t = r_t + γ r_(t+1) + γ² r_(t+2) + ... = r_t + γ (r_(t+1) + γ (r_(t+2) + ...)) = r_t + γ R_(t+1).
- A good strategy for an agent is to always choose an action that maximizes the (discounted) future reward.
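As a sanity check on the recursion R_t = r_t + γ R_(t+1), here is a minimal Python sketch (not from the deck) that computes the discounted return of a reward sequence both by the direct sum and by the backward recursion; the example reward sequence is an illustrative assumption.

```python
# Discounted return: R_t = r_t + γ r_(t+1) + γ² r_(t+2) + ... = r_t + γ R_(t+1)
def discounted_return_direct(rewards, gamma):
    # Direct sum: R_0 = r_0 + γ r_1 + γ² r_2 + ...
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def discounted_returns_recursive(rewards, gamma):
    # Work backwards: the return at the last step is just its reward.
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # R_t = r_t + γ R_(t+1)
        returns[t] = running
    return returns

rewards = [-0.04, -0.04, -0.04, 1.0]   # e.g., three -0.04 steps, then the +1 terminal cell
gamma = 0.9
print(discounted_return_direct(rewards, gamma))      # R_0 via the direct sum
print(discounted_returns_recursive(rewards, gamma))  # [R_0, R_1, R_2, R_3]; R_0 matches the direct sum
```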
Q-Learning
- State value function V(s): expected return when starting in s and following policy π.
- State-action value function Q(s, a): expected return when starting in s, performing a, and then following π.
- Useful for finding the optimal policy; can be estimated from experience (Monte Carlo).
- Pick the best action using Q(s, a).
- Q-learning is off-policy: use any policy to estimate Q. The learned Q directly approximates Q* (the Bellman optimality equation), independent of the policy being followed. Only requirement: keep updating each (s, a) pair.

Q-Learning Update (for a transition s, a, r, s')
  Q(s, a) <- Q(s, a) + α [ r + γ max_{a'} Q(s', a') - Q(s, a) ]
  where s is the old state, s' the new state, r the reward, α the learning rate, and γ the discount factor.

Exploration vs. Exploitation
- A key ingredient of reinforcement learning.
- A deterministic/greedy policy won't explore all actions; we don't know anything about the environment at the beginning, and we need to try all actions to find the optimal one.
- Maintain exploration by using soft policies instead: π(s, a) > 0 for all (s, a).
- ε-greedy policy: with probability 1-ε perform the optimal/greedy action; with probability ε perform a random action.
- This keeps exploring the environment; slowly move towards a greedy policy by letting ε -> 0.

Q-Learning: Value Iteration [References: 84]
- Repeatedly apply the update above (same roles for new state, old state, reward, learning rate, and discount factor) until the Q-table converges.
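Below is a minimal tabular Q-learning sketch in Python that puts the pieces above together: the ε-greedy action choice and the update Q(s,a) <- Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)]. The environment interface (env.reset(), env.step()) and the hyperparameter values are illustrative assumptions, not part of the deck.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    Assumes a Gym-style environment: env.reset() -> state,
    env.step(action) -> (next_state, reward, done). States must be hashable.
    """
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0

    def greedy(state):
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon, else exploit
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # Q-learning update: off-policy, uses the max over next actions
            td_target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])
            s = s_next
    return Q
```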
Q-Learning: Representation Matters [References: 83, 84]
- In practice, value iteration over a table is impractical: it handles only very limited state/action spaces and cannot generalize to unobserved states.
- Think about the Breakout game. State: the screen pixels (resized), 4 consecutive images, grayscale with 256 gray levels. Enumerating every such state would require an astronomically large number of rows in the Q-table!

Philosophical Motivation for Deep Reinforcement Learning
- Takeaway from supervised learning: neural networks are great at memorization and not (yet) great at reasoning.
- Hope for reinforcement learning: brute-force propagation of outcomes to knowledge about states and actions, a kind of brute-force "reasoning".
- Hope for deep learning + reinforcement learning: general-purpose artificial intelligence through efficient, generalizable learning of the optimal thing to do, given a formalized set of actions and states (possibly huge).

Deep Q-Learning [References: 83]
- Use a function (with parameters) to approximate the Q-function: linear, or non-linear (a Q-network).

Deep Q-Network: Atari [References: 83]
- Mnih et al., "Playing Atari with Deep Reinforcement Learning", 2013.

Deep Q-Network Training [References: 83]
- Bellman equation: Q(s, a) = r + γ max_{a'} Q(s', a').
- Loss function (squared error): L = 1/2 [ r + γ max_{a'} Q(s', a') - Q(s, a) ]².

Deep Q-Network Training [References: 83]
Given a transition (s, a, r, s'), the Q-table update rule in the previous algorithm must be replaced with the following:
1. Do a feedforward pass for the current state s to get predicted Q-values for all actions.
2. Do a feedforward pass for the next state s' and calculate the maximum over all network outputs, max_{a'} Q(s', a').
3. Set the Q-value target for action a to r + γ max_{a'} Q(s', a') (using the max calculated in step 2). For all other actions, set the Q-value target to the value returned from step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.

Atari Breakout [References: 83]
- A few tricks are needed, most importantly: experience replay.
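To make the target-construction steps and the experience-replay trick concrete, here is a minimal NumPy sketch (not the lecture's code): a small replay buffer plus a function that builds per-sample Q-value targets as in steps 1-3 above. The q_net callable, array shapes, and hyperparameters are illustrative assumptions.

```python
import random
import numpy as np

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions for experience replay."""
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.data = []

    def add(self, transition):
        if len(self.data) >= self.capacity:
            self.data.pop(0)          # drop the oldest transition
        self.data.append(transition)

    def sample(self, batch_size):
        return random.sample(self.data, batch_size)

def build_q_targets(q_net, batch, gamma=0.99):
    """Build regression targets for a DQN-style update.

    q_net(states) is assumed to return an array of shape (batch, n_actions).
    Only the taken action's target is changed; all other outputs keep their
    predicted values, so their error (and gradient) is zero.
    """
    states = np.array([t[0] for t in batch])
    next_states = np.array([t[3] for t in batch])
    q_pred = q_net(states)            # step 1: Q-values for the current states
    q_next = q_net(next_states)       # step 2: Q-values for the next states
    targets = q_pred.copy()
    for i, (s, a, r, s_next, done) in enumerate(batch):
        # step 3: target for the taken action is r + γ max_a' Q(s', a')
        targets[i, a] = r if done else r + gamma * np.max(q_next[i])
    return targets                    # step 4: fit q_net to these targets by backpropagation
```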
Deep Q-Learning Algorithm

Atari Breakout [References: 85]
- Agent behavior after 10, 120, and 240 minutes of training.

DQN Results in Atari [References: 83]

Gorila (General Reinforcement Learning Architecture)
- 10x faster than Nature DQN on 38 out of 49 Atari games.
- Applied to recommender systems within Google.
- Nair et al., "Massively Parallel Methods for Deep Reinforcement Learning", 2015.

The Game of Traffic
- Open question (again): is driving closer to chess or to everyday conversation?

DeepTraffic: Solving Traffic with Deep Reinforcement Learning
- Goal: achieve the highest average speed over a long period of time.
- Requirement for students: follow the tutorial to achieve a speed of 65 mph.

The Road, The Car, The Speed
- State representation.

Simulation Speed / Display Options / Safety System / Driving/Learning / Learning Input

Evaluation
- Scoring: average speed.
- Method: collect the average speed over ten runs of about 30 (simulated) minutes of game each; the result is the median speed of the 10 runs.
- Done server-side after you submit (no cheating possible; we also look at the code).
- You can try it locally to get an estimate; it uses exactly the same evaluation procedure/code, but there is some influence of randomness. Our number is what counts in the end!

Evaluation (Locally)

Coding / Changing the Net Layout
- Watch out: changing the layout kills the trained state!

Training
- Done on a separate thread (Web Workers, yay!) with its own simulation, resets, state, etc.
- A lot faster (1000+ fps).
- The net state gets shipped to the main simulation from time to time, so you get to see the improvements/learning live.

Loading/Saving