
Optimizers

Optimizers define how a neural network learns. They adjust the values of the network's parameters during training so that the loss function reaches its lowest value.

Gradient descent is an optimization algorithm for finding the minimum of a function, which in our case is the minimum value of the cost function. This is exactly what we need, since we want to minimize the cost function. To find a local minimum, we take steps proportional to the negative of the gradient.
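To make this concrete, here is a minimal sketch of gradient descent in one dimension. The quadratic cost, the starting weight, and the learning rate are illustrative assumptions, not values from the text:

```python
# Minimal gradient descent on a one-dimensional quadratic cost.
# cost(w) = (w - 3)**2 has its minimum at w = 3; its gradient is 2 * (w - 3).

def cost(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)

w = 0.0                # arbitrary starting weight
learning_rate = 0.1    # step size

for step in range(50):
    w = w - learning_rate * gradient(w)   # step proportional to the negative gradient

print(w, cost(w))      # w converges towards 3, where the cost is lowest
```

Each iteration moves the weight a little further downhill; a larger learning rate takes bigger steps but can overshoot the minimum.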

Let's go through a very simple example in one dimension, shown in the following plot:

Fig 2.17: Gradient descent

On the y-axis, we have the cost (the result of the cost function), and on the x-axis, we have the particular weight we are trying to choose (we start from a random weight). The weight that minimizes the cost function sits at the bottom of the parabola, and our task is to drive the cost down to that minimum value. Finding the minimum is simple in one dimension, but in our case we have many more parameters and we can't do this visually. Instead, we use linear algebra and a deep learning library to find the parameters that minimize the cost function.
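As a sketch of the same idea with a deep learning library, the following uses TensorFlow's automatic differentiation to find the best weight for the one-dimensional cost above; TensorFlow itself, the cost function, and the learning rate are assumptions made for this example and may differ from the library used elsewhere in the book:

```python
import tensorflow as tf

# The same one-dimensional cost, minimized with a library's automatic
# differentiation instead of a hand-written gradient.
w = tf.Variable(0.0)                                   # start from weight 0
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

for step in range(50):
    with tf.GradientTape() as tape:                    # record operations on w
        cost = (w - 3.0) ** 2
    grad = tape.gradient(cost, w)                      # dcost/dw
    optimizer.apply_gradients([(grad, w)])             # step against the gradient

print(w.numpy())   # approaches 3.0, the weight with the lowest cost
```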

Now, let's see how we can quickly find the optimal parameters, or weights, across our entire network. This is where we need backpropagation.

Backpropagation is used to calculate the error contribution of each neuron after a batch of data is processed. It relies heavily on the chain rule to go back through the network and calculate these errors. Backpropagation works by calculating the error at the output and then updating the weights backward through the network layers. It requires a known desired output for each input value.
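The following is a minimal sketch of one backpropagation step for a tiny two-layer network with sigmoid hidden units and a squared-error loss; the layer sizes, the random data, and the learning rate are all assumptions made for illustration:

```python
import numpy as np

# Tiny illustrative network: one hidden layer, sigmoid activations, squared error.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # 4 samples, 3 input features
y = rng.normal(size=(4, 1))          # known desired outputs

W1 = rng.normal(size=(3, 5))         # input -> hidden weights
W2 = rng.normal(size=(5, 1))         # hidden -> output weights

# Forward pass
h = sigmoid(x @ W1)                  # hidden activations
y_hat = h @ W2                       # network output
loss = 0.5 * np.mean((y_hat - y) ** 2)

# Backward pass: apply the chain rule from the output back through each layer
d_y_hat = (y_hat - y) / len(x)       # dLoss/dy_hat
d_W2 = h.T @ d_y_hat                 # dLoss/dW2
d_h = d_y_hat @ W2.T                 # error propagated back to the hidden layer
d_W1 = x.T @ (d_h * h * (1 - h))     # dLoss/dW1 (sigmoid derivative is h * (1 - h))

# Gradient step on every weight in the network
learning_rate = 0.1
W1 -= learning_rate * d_W1
W2 -= learning_rate * d_W2
```

The key step is propagating the output error back through `W2` to obtain the hidden-layer error, which is exactly where the chain rule enters.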

One of the problems with gradient descent is that the weights are only updated after the entire dataset has been processed, so each update is slow and reaching the minimum of the loss is difficult. One solution is to update the parameters more frequently, as in the case of another optimizer called stochastic gradient descent. Stochastic gradient descent updates the weights after seeing each data point instead of the whole dataset. It may be noisy, however, as it is influenced by every single sample. Due to this, we use mini-batch gradient descent, which updates the parameters after a small batch of samples. You can read more about optimizers in the An Overview of Gradient Descent Optimization Algorithms paper ( https://arxiv.org/pdf/1609.04747.pdf ).

Another way of decreasing the noise of stochastic gradient descent is to use the Adam optimizer. Adam is one of the more popular optimizers; it is an adaptive learning rate method that computes individual learning rates for different parameters. You can check out this paper on the Adam optimizer: Adam: A Method for Stochastic Optimization ( https://arxiv.org/abs/1412.6980 ).
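As a sketch of how these optimizers are selected in practice, the following uses Keras; the small model, the random data, the batch size, and the learning rates are illustrative assumptions rather than recommendations from the text:

```python
import numpy as np
import tensorflow as tf

# A small throwaway model and random data, only to show how optimizers are chosen.
x_train = np.random.rand(256, 10).astype('float32')
y_train = np.random.rand(256, 1).astype('float32')

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1),
])

# Mini-batch stochastic gradient descent: batch_size sets how many samples are
# seen before each weight update (1 = pure SGD, len(x_train) = full-batch).
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss='mse')
model.fit(x_train, y_train, batch_size=32, epochs=5, verbose=0)

# Adam: an adaptive learning rate method with per-parameter learning rates.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='mse')
model.fit(x_train, y_train, batch_size=32, epochs=5, verbose=0)
```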

In the next section, we will learn about hyperparameters, which help tweak neural networks so that they can learn features more effectively.
