
Different types of activation function

We now know that multi-layer networks can classify linearly inseparable classes. But to do this, they need to satisfy one more condition. If the neurons don't have activation functions, their output would be the weighted sum of the inputs, Σᵢ wᵢxᵢ, which is a linear function. Then the entire neural network, that is, a composition of neurons, becomes a composition of linear functions, which is also a linear function. This means that even if we add hidden layers, the network will still be equivalent to a simple linear regression model, with all its limitations. To turn the network into a non-linear function, we'll use non-linear activation functions for the neurons. Usually, all neurons in the same layer have the same activation function, but different layers may have different activation functions. The most common activation functions are as follows (a short NumPy sketch of them follows the list):

  • f(a) = a: This function lets the activation value go through unchanged and is called the identity function.
  • f(a) = 1 if a ≥ 0, else 0: This function activates the neuron if the activation is above a certain value, and it's called the threshold activation function.
  • f(a) = 1 / (1 + e^-a): This function is one of the most commonly used, as its output is bounded between 0 and 1, and it can be interpreted stochastically as the probability of the neuron activating. It's commonly called the logistic function, or the logistic sigmoid.
  • f(a) = (1 - e^-a) / (1 + e^-a): This activation function is called the bipolar sigmoid, and it's simply a logistic sigmoid rescaled and translated to have a range of (-1, 1).
  • f(a) = (e^a - e^-a) / (e^a + e^-a): This activation function is called the hyperbolic tangent (or tanh).
  • f(a) = max(0, a): This activation function is probably the closest to its biological counterpart. It's a mix of the identity and the threshold function, and it's called the rectifier, or ReLU, as in Rectified Linear Unit. There are variations on the ReLU, such as Noisy ReLU, Leaky ReLU, and ELU (Exponential Linear Unit).

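The following is a minimal NumPy sketch of these functions; the function names, and the use of a zero threshold in the step function, are our own choices for illustration rather than anything prescribed by a particular library:

```python
import numpy as np

def identity(a):
    """Identity: passes the activation value through unchanged."""
    return a

def threshold(a, theta=0.0):
    """Step function: 1 where the activation exceeds the threshold theta, else 0."""
    return (a > theta).astype(float)

def logistic(a):
    """Logistic sigmoid: output bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def bipolar_sigmoid(a):
    """Logistic sigmoid rescaled and translated to the range (-1, 1)."""
    return (1.0 - np.exp(-a)) / (1.0 + np.exp(-a))

def tanh(a):
    """Hyperbolic tangent: also bounded in (-1, 1)."""
    return np.tanh(a)

def relu(a):
    """Rectified Linear Unit: max(0, a)."""
    return np.maximum(0.0, a)

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (identity, threshold, logistic, bipolar_sigmoid, tanh, relu):
    print(f"{f.__name__:>15}: {f(a)}")
```

Evaluating them on the same activations makes the differences in range easy to see: the logistic stays in (0, 1), the bipolar sigmoid and tanh in (-1, 1), and the ReLU simply clips negative activations to zero.
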
The identity activation function, or the threshold function, was widely used at the inception of neural networks with implementations such as the perceptron or the Adaline (adaptive linear neuron), but subsequently lost traction in favor of the logistic sigmoid, the hyperbolic tangent, or the ReLU and its variations. The latter three activation functions differ in the following ways:

  • Their range is different.
  • Their derivatives behave differently during training.

The range of the logistic function is (0, 1), which is one reason why this is the preferred function for stochastic networks, in other words, networks with neurons that may activate based on a probability function. The hyperbolic tangent is very similar to the logistic function, but its range is (-1, 1). In contrast, the ReLU has a range of [0, ∞).

But let's look at the derivative (or the gradient) of each of the three functions, which is important for the training of the network. This is similar to how, in the linear regression example that we introduced in Chapter 1, Machine Learning – an Introduction, we were trying to minimize the cost function by following it along the direction opposite to its derivative.

For a logistic function f, the derivative is f * (1-f), while if f is the hyperbolic tangent, its derivative is (1+f) * (1-f).

We can quickly calculate the derivative of the logistic sigmoid by noticing that the derivative with respect to the activation a of the σ(a) = 1 / (1 + e^-a) function is given by the following:

dσ/da = e^-a / (1 + e^-a)^2 = (1 / (1 + e^-a)) * (e^-a / (1 + e^-a)) = σ(a) * (1 - σ(a))
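
As a quick numerical sanity check (our own illustration, not code from this chapter), we can compare both closed-form derivatives against central finite differences:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-4.0, 4.0, 9)
eps = 1e-6

# Closed-form derivatives: f * (1 - f) for the logistic, (1 + f) * (1 - f) for tanh
d_logistic = logistic(a) * (1.0 - logistic(a))
d_tanh = (1.0 + np.tanh(a)) * (1.0 - np.tanh(a))

# Central finite differences of the functions themselves
fd_logistic = (logistic(a + eps) - logistic(a - eps)) / (2.0 * eps)
fd_tanh = (np.tanh(a + eps) - np.tanh(a - eps)) / (2.0 * eps)

# Both differences should be close to zero, confirming the identities
print(np.max(np.abs(d_logistic - fd_logistic)))
print(np.max(np.abs(d_tanh - fd_tanh)))
```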

If f is the ReLU, the derivative is much simpler, that is, 1 for a > 0 and 0 for a < 0. Later in the book, we'll see that deep networks exhibit the vanishing gradients problem, and the advantage of the ReLU is that its derivative is constant (equal to 1) for any positive a and does not tend to zero as a becomes large.
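
The sketch below (again, our own illustration) makes this concrete by comparing the two gradients as the activation grows:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

for a in (1.0, 5.0, 10.0, 20.0):
    sig = logistic(a)
    sigmoid_grad = sig * (1.0 - sig)   # shrinks towards 0 as a grows
    relu_grad = 1.0 if a > 0 else 0.0  # stays at 1 for any positive a
    print(f"a = {a:5.1f}   sigmoid gradient = {sigmoid_grad:.2e}   ReLU gradient = {relu_grad:.0f}")
```

For a = 20, the sigmoid gradient is already on the order of 1e-9, which is exactly the kind of behavior that leads to vanishing gradients when many such factors are multiplied together across layers.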
