- Python Deep Learning Cookbook
- Indra den Bakker
Getting started with activation functions
If we only use linear activation functions, a neural network would represent a large collection of linear combinations. However, the power of neural networks lies in their ability to model complex nonlinear behavior. We briefly introduced the non-linear activation functions sigmoid and ReLU in the previous recipes, and there are many more popular nonlinear activation functions, such as ELU, Leaky ReLU, TanH, and Maxout.
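The following is a minimal NumPy sketch, not taken from the book's recipes, of the nonlinearities mentioned above; the alpha values are illustrative defaults:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # small slope alpha keeps a nonzero output for negative inputs
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # smooth, saturating for negative inputs; linear for positive inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))      # squashes values into (0, 1)
print(np.tanh(x))      # squashes values into (-1, 1)
print(relu(x))         # zero for negative inputs
print(leaky_relu(x))   # small negative slope instead of zero
print(elu(x))          # smooth saturation for negative inputs
```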
There is no general rule as to which activation function works best for the hidden units. Deep learning is a relatively young field, and most results are obtained by trial and error rather than mathematical proof. For regression tasks, we use a single output unit with a linear activation function. For classification tasks with n mutually exclusive classes, we use n output nodes and a softmax activation function, which forces the network to output class probabilities between 0 and 1 that sum to 1. For binary classification, we can also use a single output node with a sigmoid activation function to output the probability of the positive class.
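As a quick illustration (a sketch with made-up scores, not part of the original recipe), softmax turns arbitrary scores into class probabilities that sum to 1, while a single sigmoid output suffices for binary classification:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

logits = np.array([2.0, 1.0, 0.1])   # raw scores for 3 mutually exclusive classes
probs = softmax(logits)
print(probs, probs.sum())            # probabilities in (0, 1) that sum to 1

binary_logit = 0.8                   # single output node for binary classification
print(sigmoid(binary_logit))         # probability of the positive class
```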
Choosing the correct activation function for the hidden units can be crucial. In the backward pass, the weight updates depend on the derivative of the activation function. In deep neural networks, the gradients in the first couple of layers can shrink towards zero (the vanishing gradients problem) or grow exponentially large (the exploding gradients problem). This happens especially when the activation function's derivative only takes on small values (for example, the sigmoid, whose derivative never exceeds 0.25) or can take on values larger than 1.
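A small numerical sketch (illustrative numbers, not from the book) shows why sigmoid hidden units can starve the early layers of gradient: backpropagation multiplies one activation derivative per layer, and the sigmoid derivative is at most 0.25, so the product collapses quickly with depth:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

print(sigmoid_derivative(0.0))   # 0.25, the maximum value of the sigmoid derivative

# Crude illustration: backpropagating through n layers multiplies roughly
# n such derivatives together, so the gradient scale shrinks towards zero.
for n_layers in [5, 10, 20]:
    print(n_layers, 0.25 ** n_layers)
```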
Activation functions such as the ReLU help to prevent such cases. The ReLU has a derivative of 1 when its input is positive and 0 otherwise. When using a ReLU activation function, the network becomes sparse, with a relatively small number of active connections, and the error signal that is propagated back through the network stays more useful. In some cases, the ReLU causes too many of the neurons to die (the dying ReLU problem); if that happens, you should try a variant such as Leaky ReLU. In the next recipe, we will compare the results of a sigmoid and a ReLU activation function when classifying handwritten digits with a deep FNN.
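The sketch below (illustrative, and not the book's next recipe) contrasts the gradients of ReLU and Leaky ReLU on random pre-activations: ReLU zeroes the gradient for all negative inputs, which creates sparsity but can leave neurons dead, while Leaky ReLU keeps a small assumed slope of 0.01:

```python
import numpy as np

def relu_grad(x):
    # derivative of ReLU: 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    # Leaky ReLU keeps a small gradient for negative inputs,
    # which helps avoid "dead" neurons
    return np.where(x > 0, 1.0, alpha)

np.random.seed(0)
pre_activations = np.random.randn(1000)

print((relu_grad(pre_activations) == 0).mean())  # fraction of units with zero gradient
print(leaky_relu_grad(pre_activations).min())    # never exactly zero
```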