
Optimizers

Optimizers define how a neural network learns. They adjust the values of the parameters during training so that the loss function reaches its lowest value.

Gradient descent is an optimization algorithm for finding the minimum of a function; in our case, the minimum value of the cost function. This is useful to us because we want to minimize the cost function. To find a local minimum, we take steps proportional to the negative of the gradient.
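As a rough sketch, a single gradient descent step can be written as follows; the names and the learning rate value here are illustrative, not taken from any particular library:

learning_rate = 0.01

def gradient_descent_step(weight, gradient):
    # Step proportional to the negative of the gradient.
    return weight - learning_rate * gradient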

Let's go through a very simple example in one dimension, shown in the following plot:

Fig 2.17: Gradient descent

On the y-axis, we have the cost (the output of the cost function), and on the x-axis, we have the particular weight we are trying to choose (we start from a random weight). The weight that minimizes the cost function sits, as we can see, at the bottom of the parabola, and our goal is to drive the cost down to that minimum value. Finding the minimum is really simple in one dimension, but in our case, we have many more parameters and we can't do this visually. Instead, we use linear algebra and a deep learning library to find the best parameters for minimizing the cost function.
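To make the one-dimensional picture concrete, here is a small illustrative run on the parabola-shaped cost (w - 3)^2; the target value of 3, the starting point, and the learning rate are assumptions made for this sketch only:

import numpy as np

def cost(w):
    # A parabola with its minimum at w = 3.
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

w = np.random.randn()        # start from a random weight
learning_rate = 0.1

for step in range(100):
    # Move the weight a small step against the gradient.
    w -= learning_rate * grad(w)

print(w, cost(w))            # w converges towards 3, the bottom of the parabola

After enough steps, the weight settles at the bottom of the parabola, which is exactly the point the plot above illustrates.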

Now, let's see how we can efficiently adjust the weights across our entire network toward their optimal values. This is where we need backpropagation.

Backpropagation is used to calculate the error contribution from each neuron after a batch of data is processed. It relies heavily on the chain rule to go back through the network and calculate these errors. Backpropagation works by calculating the error at the output and then propagating it back through the network layers, updating the weights along the way. It requires a known desired output for each input value.
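As an illustration of the chain rule at work, here is a hypothetical single-neuron example with a sigmoid activation and a squared-error loss; a real network repeats these local derivative products layer by layer, from the output back towards the input:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, target = 0.5, 1.0          # one input value and its known desired output
w, b = 0.1, 0.0               # weight and bias to be learned
learning_rate = 0.5

for step in range(200):
    # Forward pass: compute the prediction and the error at the output.
    z = w * x + b
    y = sigmoid(z)
    loss = (y - target) ** 2

    # Backward pass: chain rule, dloss/dw = dloss/dy * dy/dz * dz/dw.
    dloss_dy = 2.0 * (y - target)
    dy_dz = y * (1.0 - y)
    dloss_dw = dloss_dy * dy_dz * x
    dloss_db = dloss_dy * dy_dz * 1.0

    # Update the parameters with a gradient descent step.
    w -= learning_rate * dloss_dw
    b -= learning_rate * dloss_db

print(y)                      # the prediction moves towards the target of 1.0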

One of the problems with gradient descent is that the weights are only updated after seeing the entire dataset, so each update is expensive on a large dataset and reaching the minimum of the loss is really difficult. One solution is to update the parameters more frequently, as in the case of another optimizer called stochastic gradient descent. Stochastic gradient descent updates the weights after seeing each data point instead of the whole dataset. It can be noisy, however, as every single sample influences the update. Due to this, we use mini-batch gradient descent, which updates the parameters after only a few samples. You can read more about optimizers in the paper An Overview of Gradient Descent Optimization Algorithms (https://arxiv.org/pdf/1609.04747.pdf).

Another way of decreasing the noise of stochastic gradient descent is to use the Adam optimizer. Adam is one of the most popular optimizers; it is an adaptive learning rate method that computes individual learning rates for different parameters. You can check out the paper Adam: A Method for Stochastic Optimization (https://arxiv.org/abs/1412.6980).
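The sketch below illustrates the mini-batch update schedule on a toy linear model; the generated data, the batch size of 32, and the learning rate are assumptions made for this example, and in practice a library-provided optimizer such as Adam would replace the manual update line:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.1, size=1000)   # true weight is 2.0

w = 0.0
learning_rate = 0.1
batch_size = 32                                  # a few samples per update

for epoch in range(5):
    indices = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]
        error = w * x[batch] - y[batch]
        # Gradient of the mean squared error with respect to w on this batch.
        gradient = 2.0 * np.mean(error * x[batch])
        w -= learning_rate * gradient            # update after every mini-batch

print(w)                                         # approaches the true weight of 2.0

Because the weight moves after every batch of 32 samples rather than once per pass over all 1,000 points, it receives many more, slightly noisier, updates per epoch, which is exactly the trade-off between batch, stochastic, and mini-batch gradient descent described above.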

In the next section, we will learn about hyperparameters, which help tweak neural networks so that they can learn features more effectively.
