
The chain rule

One of the fundamental principles behind backpropagation is the chain rule, which is a more general form of the delta rule that we saw for the perceptron.

The chain rule uses the properties of derivatives to calculate the derivative of a composition of functions. By putting neurons in series, we are effectively creating a composition of functions; therefore, we can apply the chain rule formula:
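In its basic form, for two differentiable functions f and g, the chain rule states the following:

```latex
\frac{d}{dx} f\big(g(x)\big) = \frac{df}{dg} \cdot \frac{dg}{dx}
```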

In this particular case, we want to find the weights that minimize our error function. To do that, we differentiate our error function with respect to the weights, and we follow the direction of the descending gradient. So, if we consider the neuron j, we will see that its input comes from the previous part of the network, which we can denote with networkj. The output of the neuron will be denoted with oj; therefore, applying the chain rule, we will obtain the following formula:
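Writing E for the error function and w_ij for the weight connecting neuron i to neuron j (standard symbols; the text itself only names network_j and o_j), the decomposition reads:

```latex
\frac{\partial E}{\partial w_{ij}} =
\frac{\partial E}{\partial o_j} \cdot
\frac{\partial o_j}{\partial \text{network}_j} \cdot
\frac{\partial \text{network}_j}{\partial w_{ij}}
```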

Let's focus on every single element of this equation. The first factor is exactly what we had before with the perceptron; therefore, we get the following formula:
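Assuming the usual squared-error loss with target t_j, as in the perceptron case, the first factor is:

```latex
\frac{\partial E}{\partial o_j} = o_j - t_j
```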

This is because, in this case, oj is one of the outputs of the last layer of the network, which we can denote with L. If we denote a generic neuron of that layer with l, and sum the error over the whole layer, we will have the following formula:
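Under the same squared-error assumption, summing over the output layer L, only the term with l = j depends on o_j, so everything else vanishes when we differentiate:

```latex
E = \frac{1}{2} \sum_{l \in L} (t_l - o_l)^2
\quad\Longrightarrow\quad
\frac{\partial E}{\partial o_j} = -(t_j - o_j) = o_j - t_j
```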

That's where the delta rule that we used previously comes from.

When it's not the output neuron that we are differentiating, the formula is more complex, as we need to consider every neuron that receives input from neuron j, since each of them might be connected to a different part of the network. In that case, we have the following formula:
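Here L now denotes the set of neurons that receive input from neuron j; summing their contributions and noting that network_l depends on o_j through the weight w_jl gives:

```latex
\frac{\partial E}{\partial o_j} =
\sum_{l \in L} \frac{\partial E}{\partial \text{network}_l} \cdot
\frac{\partial \text{network}_l}{\partial o_j} =
\sum_{l \in L} \frac{\partial E}{\partial o_l} \cdot
\frac{\partial o_l}{\partial \text{network}_l} \cdot w_{jl}
```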

Then, we need to differentiate the output of the neuron with respect to its input, networkj. In this case, the activation function is a sigmoid; therefore, the derivative is pretty easy to calculate:
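With the sigmoid φ(x) = 1/(1 + e^(−x)) and o_j = φ(network_j), the second factor is:

```latex
\frac{\partial o_j}{\partial \text{network}_j} =
\frac{\partial}{\partial \text{network}_j}\,\varphi(\text{network}_j) =
\varphi(\text{network}_j)\big(1 - \varphi(\text{network}_j)\big) =
o_j\,(1 - o_j)
```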

The derivative of the input of neuron j (networkj) with respect to the weight wij that connects neuron i with our neuron j is simply oi, the output of neuron i. This is because networkj is a weighted sum, and in that sum only one term depends on wij; therefore, everything else becomes 0:
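Writing network_j as the weighted sum over the previous layer (indexed here with k), only the term with k = i survives the differentiation:

```latex
\frac{\partial \text{network}_j}{\partial w_{ij}} =
\frac{\partial}{\partial w_{ij}} \sum_{k} w_{kj}\, o_k = o_i
```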

Now, we can see the general case of the delta rule:
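Putting the three factors together and grouping the first two into a single term δ_j, the general form is:

```latex
\frac{\partial E}{\partial w_{ij}} = \delta_j\, o_i
```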

Here, we denote the following formula:
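Combining the results above, δ_j takes one of two forms, depending on whether neuron j is in the output layer or in a hidden layer:

```latex
\delta_j = \frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial \text{network}_j} =
\begin{cases}
(o_j - t_j)\, o_j (1 - o_j) & \text{if } j \text{ is an output neuron} \\[4pt]
\left( \displaystyle\sum_{l \in L} \delta_l\, w_{jl} \right) o_j (1 - o_j) & \text{if } j \text{ is a hidden neuron}
\end{cases}
```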

Now, the gradient descent technique wants to move our weights one step in the direction of the descending gradient. The size of this step is something it's up to us to define, depending on how fast we want the algorithm to converge and how close we want to get to the local minimum. If we take too large a step, it's unlikely that we will find the minimum, and if we take too small a step, it will take too much time to find it:
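Denoting the step size (the learning rate) with η, the weight update is:

```latex
\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}} = -\eta\, \delta_j\, o_i
```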

We mentioned that with gradient descent, we are not guaranteed to find the global minimum, and this is because of the non-convexity of the error functions in neural networks. How well we explore the error surface will depend on hyperparameters such as the step size (the learning rate), but also on how well we created the dataset.
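The whole procedure derived above — a forward pass through sigmoid neurons, the delta computation, and one gradient descent step — can be sketched in a few lines of Python. The network here (one input, one hidden neuron, one output neuron) and all the values are purely illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative values: input, two weights, target, and learning rate.
x = 0.5
w1, w2 = 0.4, -0.6   # input -> hidden, hidden -> output
t = 1.0
eta = 0.1

# Forward pass: network_j is the weighted input, o_j the sigmoid output.
net_h = w1 * x
o_h = sigmoid(net_h)
net_o = w2 * o_h
o_o = sigmoid(net_o)

# Backward pass using the delta rule.
# Output neuron: delta = (o - t) * o * (1 - o)
delta_o = (o_o - t) * o_o * (1 - o_o)
# Hidden neuron: delta = (sum of downstream delta * weight) * o * (1 - o)
delta_h = (delta_o * w2) * o_h * (1 - o_h)

# Gradient descent step: w <- w - eta * delta_j * o_i
w2 -= eta * delta_o * o_h
w1 -= eta * delta_h * x
```

Running the forward pass again after the update shows the squared error has decreased slightly, which is exactly what one small step down the gradient should achieve.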

Unfortunately, at the moment, there is no formula that guarantees a good way to explore the error function. It's a process that still requires a bit of craftsmanship, and because of that, some theoretical purists look at deep learning as an inferior technique, preferring the more complete statistical formulations. But if we choose to look at the other side of the matter, this can be seen as a great opportunity for researchers to advance the field. The growth of deep learning in practical applications is what has driven the success of the field, demonstrating that the current limitations are not major drawbacks.
