
Optimizers

Optimizers define how a neural network learns: they adjust the values of the network's parameters during training so that the loss function reaches its lowest value.

Gradient descent is an optimization algorithm for finding the minimum of a function, in our case the minimum value of the cost function. This is useful to us because minimizing the cost function is exactly what we want. To find a local minimum, we repeatedly take steps proportional to the negative of the gradient.
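To make the update rule concrete, here is a minimal Python sketch of a single gradient descent step. The learning rate value and the list-based parameter representation are illustrative assumptions, not code from any particular library:

```python
# A minimal sketch of one gradient descent step; the learning rate and the
# list-based parameter representation are assumptions made for illustration.
def gradient_descent_step(weights, gradients, learning_rate=0.01):
    """Move each weight a small step in the direction of the negative gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]
```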

Let's go through a very simple example in one dimension, shown in the following plot:

Fig 2.17: Gradient descent

On the y-axis, we have the cost (the output of the cost function), and on the x-axis, we have the particular weight we are trying to choose (initialized randomly). The weight that minimizes the cost function sits at the bottom of the parabola, and that is the value we want to drive the cost down to. Finding the minimum is really simple in one dimension, but in our case, we have many more parameters and can't do this visually. Instead, we rely on linear algebra and a deep learning library to find the parameters that minimize the cost function.
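As a quick illustration of the one-dimensional case in the plot, the following sketch minimizes an assumed quadratic cost by repeatedly stepping against its gradient; the cost function, starting weight, and learning rate are all made up for the example:

```python
# A one-dimensional sketch: the quadratic cost, its gradient, the starting weight,
# and the learning rate are assumptions chosen purely for illustration.
def cost(w):
    return (w - 3) ** 2        # parabola whose minimum is at w = 3

def gradient(w):
    return 2 * (w - 3)         # derivative of the cost with respect to w

w = 10.0                       # an arbitrary (random) starting weight
learning_rate = 0.1
for _ in range(50):
    w -= learning_rate * gradient(w)   # step proportional to the negative gradient

print(round(w, 4))             # close to 3, the weight at the bottom of the parabola
```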

Now, let's see how we can efficiently adjust the parameters, or weights, across our entire network toward their optimal values. This is where we need backpropagation.

Backpropagation is used to calculate the error contribution of each neuron after a batch of data is processed. It relies heavily on the chain rule to propagate these errors back through the network. Backpropagation works by calculating the error at the output and then updating the weights backward through the network's layers. It requires a known desired output for each input value.
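The following sketch walks the chain rule through a single sigmoid neuron with a squared-error loss. The network size, the choice of loss, and the numeric values are assumptions made purely to show how the output error is propagated back to a weight:

```python
import numpy as np

# A sketch of backpropagation for one sigmoid neuron with a squared-error loss;
# the network shape, loss, and values are assumptions made for illustration.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y_true = 0.5, 1.0           # one input and its known desired output
w, b = 0.2, 0.1                # initial weight and bias
learning_rate = 0.1

# Forward pass
z = w * x + b
y_pred = sigmoid(z)
loss = 0.5 * (y_pred - y_true) ** 2

# Backward pass: chain rule from the output error back to the weight
dloss_dy = y_pred - y_true          # dL/dy
dy_dz = y_pred * (1 - y_pred)       # derivative of the sigmoid, dy/dz
dz_dw = x                           # dz/dw
grad_w = dloss_dy * dy_dz * dz_dw   # dL/dw via the chain rule

w -= learning_rate * grad_w    # gradient descent update using the backpropagated error
```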

One of the problems with gradient descent is that the weights are only updated after the entire dataset has been seen, so each gradient computation is expensive and reaching the minimum of the loss is slow. One solution is to update the parameters more frequently, as in another optimizer called stochastic gradient descent. Stochastic gradient descent updates the weights after seeing each data point instead of the whole dataset. It may be noisy, however, as it is influenced by every single sample. Due to this, we use mini-batch gradient descent, which updates the parameters after a small batch of samples. You can read more about optimizers in the An Overview of Gradient Descent Optimization Algorithms paper (https://arxiv.org/pdf/1609.04747.pdf).

Another way of decreasing the noise of stochastic gradient descent is to use the Adam optimizer. Adam is one of the most popular optimizers; it is an adaptive learning rate method that computes individual learning rates for different parameters. You can check out the paper on the Adam optimizer: Adam: A Method for Stochastic Optimization (https://arxiv.org/abs/1412.6980).
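To see how the update schedule changes between full-batch, stochastic, and mini-batch gradient descent, here is a hedged NumPy sketch on a toy linear-regression problem; the data, model, learning rate, and batch size are illustrative assumptions. Setting the batch size to 1 recovers stochastic gradient descent, while setting it to the dataset size recovers standard (full-batch) gradient descent:

```python
import numpy as np

# A toy linear-regression problem; the data, model, learning rate, and batch size
# are illustrative assumptions, not code from the book or any library.
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
y = 3.0 * X + rng.normal(scale=0.1, size=1000)

w, learning_rate, batch_size = 0.0, 0.05, 32   # batch_size=1 -> stochastic gradient descent
                                               # batch_size=len(X) -> full-batch gradient descent
for epoch in range(10):
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        y_pred = w * X[batch]
        grad = 2.0 * np.mean((y_pred - y[batch]) * X[batch])  # gradient of the mean squared error
        w -= learning_rate * grad                             # update after every mini-batch

print(round(w, 3))   # converges toward the true slope of 3.0
```

Adam follows the same loop structure but maintains per-parameter running estimates of the gradient's first and second moments to scale each update, which is why it is described as an adaptive learning rate method.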

In the next section, we will learn about hyperparameters, which help tweak neural networks so that they can learn features more effectively.
