
Optimizers

Optimizers define how a neural network learns. They adjust the values of the parameters during training so that the loss function is driven to its lowest value.

Gradient descent is an optimization algorithm for finding the minimum of a function, in our case the minimum value of the cost function. This is useful to us because we want to minimize the cost function. To find a local minimum, we take steps proportional to the negative of the gradient.
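As a rough sketch of this update rule (a toy example, not the book's code, using a made-up quadratic cost so the gradient can be written by hand):

```python
import numpy as np

def cost(w):
    # A toy quadratic cost function, lowest at w = 3
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the toy cost with respect to w
    return 2.0 * (w - 3.0)

w = np.random.randn()      # start from a random weight
learning_rate = 0.1

for step in range(100):
    # Take a step proportional to the negative of the gradient
    w = w - learning_rate * gradient(w)

print(w, cost(w))          # w ends up close to 3, the cost close to 0
```

Each step moves the weight downhill along the negative gradient until the cost stops decreasing noticeably.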

Let's go through a very simple example in one dimension, shown in the following plot:

Fig 2.17: Gradient descent

On the y-axis, we have the cost (the result of the cost function), and on the x-axis, we have the particular weight we are trying to choose (we started from a random weight). The weight that minimizes the cost function sits at the bottom of the parabola, so that is the parameter value we want to reach. Finding this minimum is really simple in one dimension, but in our case, we have many more parameters and we can't do this visually. Instead, we use linear algebra and a deep learning library to find the parameters that minimize the cost function.
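To give an idea of what the library does for us when there are many parameters, here is a minimal sketch assuming PyTorch (the library, the toy cost, and the tensor shapes are assumptions for illustration, not this book's code): automatic differentiation computes the gradient for every weight at once, and the optimizer applies the same downhill step to all of them.

```python
import torch

# A parameter vector with several weights instead of a single scalar
w = torch.randn(10, requires_grad=True)
target = torch.zeros(10)                 # hypothetical target values
optimizer = torch.optim.SGD([w], lr=0.1)

for step in range(100):
    cost = ((w - target) ** 2).mean()    # toy cost over all parameters
    optimizer.zero_grad()
    cost.backward()                      # gradients for every weight at once
    optimizer.step()                     # one gradient descent step
```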

Now, let's see how we can quickly adjust the optimal parameters or weights across our entire network. This is where we need backpropagation.

Backpropagation is used to calculate the error contribution of each neuron after a batch of data is processed. It relies heavily on the chain rule to go back through the network and calculate these errors. Backpropagation works by calculating the error at the output and then propagating it backward through the network layers to update the weights. It requires a known desired output for each input value.
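The following is a hand-rolled sketch of one backpropagation step for a toy two-layer network, written in NumPy purely for illustration (the layer sizes, sigmoid activation, and squared-error loss are assumptions, not this book's network):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.5, -0.2]])        # one input sample
y = np.array([[1.0]])              # known desired output
W1 = np.random.randn(2, 3)         # input -> hidden weights
W2 = np.random.randn(3, 1)         # hidden -> output weights
lr = 0.1

# Forward pass
h = sigmoid(x @ W1)                # hidden activations
y_hat = h @ W2                     # network output
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: the chain rule carries the error from the output
# back to each layer's weights
d_yhat = y_hat - y                 # dL/dy_hat
dW2 = h.T @ d_yhat                 # dL/dW2
d_h = d_yhat @ W2.T                # error propagated to the hidden layer
d_z1 = d_h * h * (1 - h)           # through the sigmoid derivative
dW1 = x.T @ d_z1                   # dL/dW1

# Gradient descent update using the backpropagated gradients
W2 -= lr * dW2
W1 -= lr * dW1
```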

One of the problems with gradient descent is that the weights are only updated after the entire dataset has been seen, so each update is costly and reaching the minimum of the loss can take a long time. One solution is to update the parameters more frequently, as in another optimizer called stochastic gradient descent. Stochastic gradient descent updates the weights after seeing each data point instead of the whole dataset. Its updates can be noisy, however, because they are influenced by every single sample. Due to this, we often use mini-batch gradient descent, which updates the parameters after only a few samples. You can read more about optimizers in the paper An Overview of Gradient Descent Optimization Algorithms (https://arxiv.org/pdf/1609.04747.pdf). Another way of decreasing the noise of stochastic gradient descent is to use the Adam optimizer. Adam is one of the more popular optimizers; it is an adaptive learning rate method that computes individual learning rates for different parameters. You can check out the paper on the Adam optimizer: Adam: A Method for Stochastic Optimization (https://arxiv.org/abs/1412.6980).
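As a minimal sketch of the Adam update described in the cited paper (the quadratic cost and parameter shapes here are made up for illustration; the hyperparameter values are the defaults suggested in the paper):

```python
import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)          # gradient of the toy cost (w - 3)^2

w = np.random.randn(5)              # a few parameters
m = np.zeros_like(w)                # first-moment estimate (mean of gradients)
v = np.zeros_like(w)                # second-moment estimate (mean of squared gradients)
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g          # moving average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2     # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Each parameter gets its own effective step size through v_hat
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
```

Note how the division by the square root of the second moment gives every parameter its own effective learning rate, which is what makes Adam an adaptive method.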

In the next section, we will learn about hyperparameters, which help tweak neural networks so that they can learn features more effectively.
