Deep Learning

Ian Goodfellow
Yoshua Bengio
Aaron Courville


Contents

Website
Acknowledgments
Notation

1 Introduction
  1.1 Who Should Read This Book?
  1.2 Historical Trends in Deep Learning

I Applied Math and Machine Learning Basics

2 Linear Algebra
  2.1 Scalars, Vectors, Matrices and Tensors
  2.2 Multiplying Matrices and Vectors
  2.3 Identity and Inverse Matrices
  2.4 Linear Dependence and Span
  2.5 Norms
  2.6 Special Kinds of Matrices and Vectors
  2.7 Eigendecomposition
  2.8 Singular Value Decomposition
  2.9 The Moore-Penrose Pseudoinverse
  2.10 The Trace Operator
  2.11 The Determinant
  2.12 Example: Principal Components Analysis

3 Probability and Information Theory
  3.1 Why Probability?
  3.2 Random Variables
  3.3 Probability Distributions
  3.4 Marginal Probability
  3.5 Conditional Probability
  3.6 The Chain Rule of Conditional Probabilities
  3.7 Independence and Conditional Independence
  3.8 Expectation, Variance and Covariance
  3.9 Common Probability Distributions
  3.10 Useful Properties of Common Functions
  3.11 Bayes' Rule
  3.12 Technical Details of Continuous Variables
  3.13 Information Theory
  3.14 Structured Probabilistic Models

4 Numerical Computation
  4.1 Overflow and Underflow
  4.2 Poor Conditioning
  4.3 Gradient-Based Optimization
  4.4 Constrained Optimization
  4.5 Example: Linear Least Squares

5 Machine Learning Basics
  5.1 Learning Algorithms
  5.2 Capacity, Overfitting and Underfitting
  5.3 Hyperparameters and Validation Sets
  5.4 Estimators, Bias and Variance
  5.5 Maximum Likelihood Estimation
  5.6 Bayesian Statistics
  5.7 Supervised Learning Algorithms
  5.8 Unsupervised Learning Algorithms
  5.9 Stochastic Gradient Descent
  5.10 Building a Machine Learning Algorithm
  5.11 Challenges Motivating Deep Learning

II Deep Networks: Modern Practices

6 Deep Feedforward Networks
  6.1 Example: Learning XOR
  6.2 Gradient-Based Learning
  6.3 Hidden Units
  6.4 Architecture Design
  6.5 Back-Propagation and Other Differentiation Algorithms
  6.6 Historical Notes

7 Regularization for Deep Learning
  7.1 Parameter Norm Penalties
  7.2 Norm Penalties as Constrained Optimization
  7.3 Regularization and Under-Constrained Problems
  7.4 Dataset Augmentation
  7.5 Noise Robustness
  7.6 Semi-Supervised Learning
  7.7 Multi-Task Learning
  7.8 Early Stopping
  7.9 Parameter Tying and Parameter Sharing
  7.10 Sparse Representations
  7.11 Bagging and Other Ensemble Methods
  7.12 Dropout
  7.13 Adversarial Training
  7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

8 Optimization for Training Deep Models
  8.1 How Learning Differs from Pure Optimization
  8.2 Challenges in Neural Network Optimization
  8.3 Basic Algorithms
  8.4 Parameter Initialization Strategies
  8.5 Algorithms with Adaptive Learning Rates
  8.6 Approximate Second-Order Methods
  8.7 Optimization Strategies and Meta-Algorithms

9 Convolutional Networks
  9.1 The Convolution Operation
  9.2 Motivation
  9.3 Pooling
  9.4 Convolution and Pooling as an Infinitely Strong Prior
  9.5 Variants of the Basic Convolution Function
  9.6 Structured Outputs
  9.7 Data Types
  9.8 Efficient Convolution Algorithms
  9.9 Random or Unsupervised Features
  9.10 The Neuroscientific Basis for Convolutional Networks
  9.11 Convolutional Networks and the History of Deep Learning

10 Sequence Modeling: Recurrent and Recursive Nets
  10.1 Unfolding Computational Graphs
  10.2 Recurrent Neural Networks
  10.3 Bidirectional RNNs
  10.4 Encoder-Decoder Sequence-to-Sequence Architectures
  10.5 Deep Recurrent Networks
  10.6 Recursive Neural Networks
  10.7 The Challenge of Long-Term Dependencies
  10.8 Echo State Networks
  10.9 Leaky Units and Other Strategies for Multiple Time Scales
  10.10 The Long Short-Term Memory and Other Gated RNNs
  10.11 Optimization for Long-Term Dependencies
  10.12 Explicit Memory

11 Practical Methodology
  11.1 Performance Metrics
  11.2 Default Baseline Models
  11.3 Determining Whether to Gather More Data
  11.4 Selecting Hyperparameters
  11.5 Debugging Strategies
  11.6 Example: Multi-Digit Number Recognition

12 Applications
  12.1 Large Scale Deep Learning
  12.2 Computer Vision
  12.3 Speech Recognition
  12.4 Natural Language Processing
  12.5 Other Applications

III Deep Learning Research

13 Linear Factor Models
  13.1 Probabilistic PCA and Factor Analysis
  13.2 Independent Component Analysis (ICA)
  13.3 Slow Feature Analysis
  13.4 Sparse Coding
  13.5 Manifold Interpretation of PCA

14 Autoencoders
  14.1 Undercomplete Autoencoders
  14.2 Regularized Autoencoders
  14.3 Representational Power, Layer Size and Depth
  14.4 Stochastic Encoders and Decoders
  14.5 Denoising Autoencoders
  14.6 Learning Manifolds with Autoencoders
  14.7 Contractive Autoencoders
  14.8 Predictive Sparse Decomposition
  14.9 Applications of Autoencoders

15 Representation Learning
  15.1 Greedy Layer-Wise Unsupervised Pretraining
  15.2 Transfer Learning and Domain Adaptation
  15.3 Semi-Supervised Disentangling of Causal Factors
  15.4 Distributed Representation
  15.5 Exponential Gains from Depth
  15.6 Providing Clues to Discover Underlying Causes

16 Structured Probabilistic Models for Deep Learning
  16.1 The Challenge of Unstructured Modeling
  16.2 Using Graphs to Describe Model Structure
  16.3 Sampling from Graphical Models
  16.4 Advantages of Structured Modeling
  16.5 Learning about Dependencies
  16.6 Inference and Approximate Inference
  16.7 The Deep Learning Approach to Structured Probabilistic Models

17 Monte Carlo Methods
  17.1 Sampling and Monte Carlo Methods
  17.2 Importance Sampling
  17.3 Markov Chain Monte Carlo Methods
  17.4 Gibbs Sampling
  17.5 The Challenge of Mixing between Separated Modes

18 Confronting the Partition Function
  18.1 The Log-Likelihood Gradient
  18.2 Stochastic Maximum Likelihood and Contrastive Divergence
  18.3 Pseudolikelihood
  18.4 Score Matching and Ratio Matching
  18.5 Denoising Score Matching
  18.6 Noise-Contrastive Estimation
  18.7 Estimating the Partition Function

19 Approximate Inference
  19.1 Inference as Optimization
  19.2 Expectation Maximization
  19.3 MAP Inference and Sparse Coding
  19.4 Variational Inference and Learning
  19.5 Learned Approximate Inference

20 Deep Generative Models
  20.1 Boltzmann Machines
  20.2 Restricted Boltzmann Machines
  20.3 Deep Belief Networks
  20.4 Deep Boltzmann Machines
  20.5 Boltzmann Machines for Real-Valued Data
  20.6 Convolutional Boltzmann Machines
  20.7 Boltzmann Machines for Structured or Sequential Outputs
  20.8 Other Boltzmann Machines
  20.9 Back-Propagation through Random Operations
  20.10 Directed Generative Nets
  20.11 Drawing Samples from Autoencoders
  20.12 Generative Stochastic Networks
  20.13 Other Generation Schemes
  20.14 Evaluating Generative Models
  20.15 Conclusion

Bibliography
Index


Website

www.deeplearningbook.org

This book is accompanied by the above website. The website provides a variety of supplementary material, including exercises, lecture slides, corrections of mistakes, and other resources that should be useful to both readers and instructors.
Acknowledgments

This book would not have been possible without the contributions of many people.

We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Guillaume Alain, Kyunghyun Cho, Çağlar Gülçehre, David Krueger, Hugo Larochelle, Razvan Pascanu and Thomas Rohée.

We would like to thank the people who offered feedback on the content of the book itself. Some offered feedback on many chapters: Martín Abadi, Guillaume Alain, Ion Androutsopoulos, Fred Bertsch, Olexa Bilaniuk, Ufuk Can Biçici, Matko Bošnjak, John Boersma, Greg Brockman, Pierre Luc Carrier, Sarath Chandar, Pawel Chilinski, Mark Daoust, Oleg Dashevskii, Laurent Dinh, Stephan Dreseitl, Jim Fan, Miao Fan, Meire Fortunato, Frédéric Francis, Nando de Freitas, Çağlar Gülçehre, Jurgen Van Gael, Javier Alonso García, Jonathan Hunt, Gopi Jeyaram, Chingiz Kabytayev, Lukasz Kaiser, Varun Kanade, Akiel Khan, John King, Diederik P. Kingma, Yann LeCun, Rudolf Mathey, Matías Mattamala, Abhinav Maurya, Kevin Murphy, Oleg Mürk, Roman Novak, Augustus Q. Odena, Simon Pavlik, Karl Pichotta, Kari Pulli, Tapani Raiko, Anurag Ranjan, Johannes Roith, Halis Sak, César Salgado, Grigory Sapunov, Mike Schuster, Julian Serban, Nir Shabat, Ken Shirriff, Scott Stanley, David Sussillo, Ilya Sutskever, Carles Gelada Sáez, Graham Taylor, Valentin Tolmer, An Tran, Shubhendu Trivedi, Alexey Umnov, Vincent Vanhoucke, Marco Visentini-Scarzanella, David Warde-Farley, Dustin Webb, Kelvin Xu, Wei Xue, Li Yao, Zygmunt Zając and Ozan Çağlayan.

We would also like to thank those who provided us with useful feedback on individual chapters:

- Chapter 1, Introduction: Yusuf Akgul, Sebastien Bratieres, Samira Ebrahimi, Charlie Gorichanaz, Brendan Loudermilk, Eric Morris, Cosmin Pârvulescu and Alfredo Solano.
- Chapter 2, Linear Algebra: Amjad Almahairi, Nikola Banić, Kevin Bennett, Philippe Castonguay, Oscar Chang, Eric Fosler-Lussier, Sergey Oreshkov, István Petrás, Dennis Prangle, Thomas Rohée, Colby Toland, Massimiliano Tomassoli, Alessandro Vitale and Bob Welland.
- Chapter 3, Probability and Information Theory: John Philip Anderson, Kai Arulkumaran, Vincent Dumoulin, Rui Fa, Stephan Gouws, Artem Oboturov, Antti Rasmus, Andre Simpelo, Alexey Surkov and Volker Tresp.
- Chapter 4, Numerical Computation: Tran Lam An, Ian Fischer and Hu Yuhuang.
- Chapter 5, Machine Learning Basics: Dzmitry Bahdanau, Nikhil Garg, Makoto Otsuka, Bob Pepin, Philip Popien, Emmanuel Rayner, Kee-Bong Song, Zheng Sun and Andy Wu.
- Chapter 6, Deep Feedforward Networks: Uriel Berdugo, Fabrizio Bottarel, Elizabeth Burl, Ishan Durugkar, Jeff Hlywa, Jong Wook Kim, David Krueger and Aditya Kumar Praharaj.
- Chapter 7, Regularization for Deep Learning: Inkyu Lee, Sunil Mohan and Joshua Salisbury.
- Chapter 8, Optimization for Training Deep Models: Marcel Ackermann, Rowel Atienza, Andrew Brock, Tegan Maharaj, James Martens and Klaus Strobl.
- Chapter 9, Convolutional Networks: Martín Arjovsky, Eugene Brevdo, Eric Jensen, Asifullah Khan, Mehdi Mirza, Alex Paino, Eddie Pierce, Marjorie Sayer, Ryan Stout and Wentao Wu.
- Chapter 10, Sequence Modeling: Recurrent and Recursive Nets: Gökçen Eraslan, Steven Hickson, Razvan Pascanu, Lorenzo von Ritter, Rui Rodrigues, Mihaela Rosca, Dmitriy Serdyuk, Dongyu Shi and Kaiyu Yang.
- Chapter 11, Practical Methodology: Daniel Beckstein.
- Chapter 12, Applications: George Dahl and Ribana Roscher.
- Chapter 15, Representation Learning: Kunal Ghosh.
- Chapter 16, Structured Probabilistic Models for Deep Learning: Minh Lê and Anton Varfolom.
- Chapter 18, Confronting the Partition Function: Sam Bowman.
- Chapter 20, Deep Generative Models: Nicolas Chapados, Daniel Galvez, Wenming Ma, Fady
Medhat, Shakir Mohamed and Grégoire Montavon.
- Bibliography: Leslie N. Smith.

We also want to thank those who allowed us to reproduce images, figures or data from their publications. We indicate their contributions in the figure captions throughout the text.

We would like to thank Ian's wife Daniela Flori Goodfellow for patiently supporting Ian during the writing of the book, as well as for help with proofreading.

We would like to thank the Google Brain team for providing an intellectual environment where Ian could devote a tremendous amount of time to writing this book and receive feedback and guidance from colleagues. We would especially like to thank Ian's former manager, Greg Corrado, and his current manager, Samy Bengio, for their support of this project. Finally, we would like to thank Geoffrey Hinton for encouragement when writing was difficult.


Notation

This section provides a concise reference describing the notation used throughout this book. If you are unfamiliar with any of the corresponding mathematical concepts, this notation reference may seem intimidating. However, do not despair: we describe most of these ideas in chapters 2-4.

Numbers and Arrays

  $a$ : A scalar (integer or real)
  $\boldsymbol{a}$ : A vector
  $\boldsymbol{A}$ : A matrix
  $\mathsf{A}$ : A tensor
  $\boldsymbol{I}_n$ : Identity matrix with $n$ rows and $n$ columns
  $\boldsymbol{I}$ : Identity matrix with dimensionality implied by context
  $\boldsymbol{e}^{(i)}$ : Standard basis vector $[0, \dots, 0, 1, 0, \dots, 0]$ with a 1 at position $i$
  $\mathrm{diag}(\boldsymbol{a})$ : A square, diagonal matrix with diagonal entries given by $\boldsymbol{a}$
  $\mathrm{a}$ : A scalar random variable
  $\mathbf{a}$ : A vector-valued random variable
  $\mathbf{A}$ : A matrix-valued random variable

Sets and Graphs

  $\mathbb{A}$ : A set
  $\mathbb{R}$ : The set of real numbers
  $\{0, 1\}$ : The set containing 0 and 1
  $\{0, 1, \dots, n\}$ : The set of all integers between $0$ and $n$
  $[a, b]$ : The real interval including $a$ and $b$
  $(a, b]$ : The real interval excluding $a$ but including $b$
  $\mathbb{A} \setminus \mathbb{B}$ : Set subtraction, i.e., the set containing the elements of $\mathbb{A}$ that are not in $\mathbb{B}$
  $\mathcal{G}$ : A graph
  $Pa_{\mathcal{G}}(\mathrm{x}_i)$ : The parents of $\mathrm{x}_i$ in $\mathcal{G}$

Indexing

  $a_i$ : Element $i$ of vector $\boldsymbol{a}$, with indexing starting at 1
  $a_{-i}$ : All elements of vector $\boldsymbol{a}$ except for element $i$
  $A_{i,j}$ : Element $i, j$ of matrix $\boldsymbol{A}$
  $\boldsymbol{A}_{i,:}$ : Row $i$ of matrix $\boldsymbol{A}$
  $\boldsymbol{A}_{:,i}$ : Column $i$ of matrix $\boldsymbol{A}$
  $\mathsf{A}_{i,j,k}$ : Element $(i, j, k)$ of a 3-D tensor $\mathsf{A}$
  $\mathsf{A}_{:,:,i}$ : 2-D slice of a 3-D tensor
  $\mathrm{a}_i$ : Element $i$ of the random vector $\mathbf{a}$

Linear Algebra Operations

  $\boldsymbol{A}^\top$ : Transpose of matrix $\boldsymbol{A}$
  $\boldsymbol{A}^+$ : Moore-Penrose pseudoinverse of $\boldsymbol{A}$
  $\boldsymbol{A} \odot \boldsymbol{B}$ : Element-wise (Hadamard) product of $\boldsymbol{A}$ and $\boldsymbol{B}$
  $\det(\boldsymbol{A})$ : Determinant of $\boldsymbol{A}$

Calculus

  $\frac{dy}{dx}$ : Derivative of $y$ with respect to $x$
  $\frac{\partial y}{\partial x}$ : Partial derivative of $y$ with respect to $x$
  $\nabla_{\boldsymbol{x}} y$ : Gradient of $y$ with respect to $\boldsymbol{x}$
  $\nabla_{\boldsymbol{X}} y$ : Matrix derivatives of $y$ with respect to $\boldsymbol{X}$
  $\nabla_{\mathsf{X}} y$ : Tensor containing derivatives of $y$ with respect to $\mathsf{X}$
  $\frac{\partial f}{\partial \boldsymbol{x}}$ : Jacobian matrix $\boldsymbol{J} \in \mathbb{R}^{m \times n}$ of $f : \mathbb{R}^n \to \mathbb{R}^m$
  $\nabla_{\boldsymbol{x}}^2 f(\boldsymbol{x})$ or $\boldsymbol{H}(f)(\boldsymbol{x})$ : The Hessian matrix of $f$ at input point $\boldsymbol{x}$
  $\int f(\boldsymbol{x}) \, d\boldsymbol{x}$ : Definite integral over the entire domain of $\boldsymbol{x}$
  $\int_{\mathbb{S}} f(\boldsymbol{x}) \, d\boldsymbol{x}$ : Definite integral with respect to $\boldsymbol{x}$ over the set $\mathbb{S}$

Probability and Information Theory

  $\mathrm{a} \perp \mathrm{b}$ : The random variables a and b are independent
  $\mathrm{a} \perp \mathrm{b} \mid \mathrm{c}$ : They are conditionally independent given c
  $P(\mathrm{a})$ : A probability distribution over a discrete variable
  $p(\mathrm{a})$ : A probability distribution over a continuous variable, or over a variable whose type has not been specified
  $\mathrm{a} \sim P$ : Random variable a has distribution $P$
  $\mathbb{E}_{\mathrm{x} \sim P}[f(x)]$ or $\mathbb{E} f(x)$ : Expectation of $f(x)$ with respect to $P(\mathrm{x})$
  $\mathrm{Var}(f(x))$ : Variance of $f(x)$ under $P(\mathrm{x})$
  $\mathrm{Cov}(f(x), g(x))$ : Covariance of $f(x)$ and $g(x)$ under $P(\mathrm{x})$
  $H(\mathrm{x})$ : Shannon entropy of the random variable x
  $D_{\mathrm{KL}}(P \| Q)$ : Kullback-Leibler divergence of P and Q
  $\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ : Gaussian distribution over $\boldsymbol{x}$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$

Functions

  $f : \mathbb{A} \to \mathbb{B}$ : The function $f$ with domain $\mathbb{A}$ and range $\mathbb{B}$
  $f \circ g$ : Composition of the functions $f$ and $g$
  $f(\boldsymbol{x}; \boldsymbol{\theta})$ : A function of $\boldsymbol{x}$ parametrized by $\boldsymbol{\theta}$. (Sometimes we just write $f(\boldsymbol{x})$ and ignore the argument $\boldsymbol{\theta}$ to lighten notation.)
  $\log x$ : Natural logarithm of $x$
  $\sigma(x)$ : Logistic sigmoid, $\frac{1}{1 + \exp(-x)}$
  $\zeta(x)$ : Softplus, $\log(1 + \exp(x))$
  $\|\boldsymbol{x}\|_p$ : $L^p$ norm of $\boldsymbol{x}$
  $\|\boldsymbol{x}\|$ : $L^2$ norm of $\boldsymbol{x}$
  $x^+$ : Positive part of $x$, i.e., $\max(0, x)$
  $\boldsymbol{1}_{\mathrm{condition}}$ : 1 if the condition is true, 0 otherwise
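As a small worked illustration of the norm notation above (the numeric values here are our own example, not from the original text; the general definition is given in section 2.5), the $L^p$ norm is

$$\|\boldsymbol{x}\|_p = \left( \sum_i |x_i|^p \right)^{\frac{1}{p}},$$

so for $\boldsymbol{x} = (3, -4)^\top$ we have $\|\boldsymbol{x}\|_1 = |3| + |-4| = 7$, $\|\boldsymbol{x}\|_2 = \sqrt{3^2 + (-4)^2} = 5$ and $\|\boldsymbol{x}\|_\infty = \max_i |x_i| = 4$.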
Sometimes we use a function $f$ whose argument is a scalar, but apply it to a vector, matrix, or tensor: $f(\boldsymbol{x})$, $f(\boldsymbol{X})$, or $f(\mathsf{X})$. This means to apply $f$ to the array element-wise. For example, if $\mathsf{C} = \sigma(\mathsf{X})$, then $\mathsf{C}_{i,j,k} = \sigma(\mathsf{X}_{i,j,k})$ for all valid values of $i$, $j$ and $k$.

Datasets and Distributions

  $p_{\mathrm{data}}$ : The data generating distribution
  $\hat{p}_{\mathrm{data}}$ : The empirical distribution defined by the training set
  $\mathbb{X}$ : A set of training examples
  $\boldsymbol{x}^{(i)}$ : The $i$-th example (input) from a dataset
  $y^{(i)}$ or $\boldsymbol{y}^{(i)}$ : The target associated with $\boldsymbol{x}^{(i)}$ for supervised learning
  $\boldsymbol{X}$ : The $m \times n$ matrix with input example $\boldsymbol{x}^{(i)}$ in row $\boldsymbol{X}_{i,:}$


Chapter 1
Introduction

Inventors have long dreamed of creating machines that think. This desire dates back to at least the time of ancient Greece. The mythical figures Pygmalion