
Backpropagation

So far, we have learned how to update the weights of one-layer networks with gradient descent. We started by comparing the output of the network (that is, the output of the output layer) with the target value, and then we updated the weights accordingly. But in a multi-layer network, we can only apply this technique directly to the weights that connect the final hidden layer to the output layer, because we don't have any target values for the outputs of the hidden layers. What we'll do instead is take the error we can measure at the output layer and estimate what the error would be in the layer before it, then in the layer before that. We'll propagate that error back from the last layer to the first layer; hence the name backpropagation. Backpropagation is one of the most difficult algorithms to understand, but all you need is some knowledge of basic differential calculus and the chain rule.

Let's first introduce some notation: 

  1. We'll define w_ij as the weight between the i-th neuron of layer l and the j-th neuron of layer l+1.
  2. In other words, we use subscripts i and j, where the element with subscript i belongs to the layer preceding the layer containing the element with subscript j.
  3. In a multi-layer network, l and l+1 can be any two consecutive layers, including input, hidden, and output layers.
  4. Note that the letter y is used to denote both input and output values. y_i is the input to the next layer l+1, and it's also the output of the activation function of layer l:
In this example, layer 1 represents the input, layer 2 the output, and w_ij connects the y_i activation in layer 1 to the inputs of the j-th neuron of layer 2.
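To make this notation concrete, the following is a minimal numpy sketch of the feedforward step between two such layers. The layer sizes, the sigmoid activation, and the variable names are illustrative assumptions rather than anything fixed by the text:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Illustrative sizes: layer l has 3 neurons, layer l+1 has 2 neurons.
y_l = np.array([0.1, 0.5, -0.3])    # y_i: outputs of layer l, inputs to layer l+1
W = np.array([[0.2, -0.1],
              [0.4,  0.3],
              [-0.6, 0.5]])         # W[i, j] plays the role of w_ij

a = y_l @ W                         # a_j: activation values of layer l+1
y_next = sigmoid(a)                 # y_j: outputs of layer l+1
print(y_next)
```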
  5. We'll denote the cost function (error) with J, the activation value x⋅w with a, and the output of the activation function (sigmoid, ReLU, and so on) with y.
  6. To recap the chain rule, for F(x) = f(g(x)) we have F'(x) = f'(g(x))g'(x). In our case, a_j is a function of the weights w_*j, y_j is a function of a_j, and J is a function of y_j. Armed with this great knowledge and using the preceding notation, we can write the following for the last layer of our neural network (using partial derivatives):

\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} \frac{\partial a_j}{\partial w_{ij}}

  7. Since we know that ∂a_j/∂w_ij = y_i (because a_j = Σ_i y_i w_ij), we have the following:

\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} y_i

If y is the logistic sigmoid, we'll get the same result that we have already calculated at the end of the Logistic regression section. We also know the cost function and we can calculate all the partial derivatives.
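To see the chain rule at work numerically, here is a small sanity check of the last-layer formula. It assumes a single sigmoid output neuron and a mean squared error cost, both chosen purely for illustration, and compares the analytic gradient with a finite-difference estimate:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

y_i = np.array([0.5, -0.2, 0.1])   # outputs of the previous layer (inputs here)
w = np.array([0.4, 0.7, -0.3])     # weights w_ij into the single output neuron
t = 1.0                            # target value

def cost(w):
    a_j = y_i @ w                  # activation value a = x.w
    y_j = sigmoid(a_j)             # output of the activation function
    return 0.5 * (y_j - t) ** 2    # illustrative MSE cost J

# Analytic gradient via the chain rule: dJ/dw_ij = dJ/dy_j * dy_j/da_j * y_i
a_j = y_i @ w
y_j = sigmoid(a_j)
dJ_dy = y_j - t                    # derivative of the MSE cost with respect to y_j
dy_da = y_j * (1 - y_j)            # derivative of the sigmoid with respect to a_j
grad_analytic = dJ_dy * dy_da * y_i

# Numerical gradient by central finite differences
eps = 1e-6
grad_numeric = np.array([
    (cost(w + eps * np.eye(3)[k]) - cost(w - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])

print(grad_analytic)
print(grad_numeric)                # the two should agree closely
```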

  8. For the previous (hidden) layers, the same formula holds:

\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} \frac{\partial a_j}{\partial w_{ij}}
Even though we have several layers, we always concentrate on pairs of successive layers and, perhaps abusing the notation somewhat, we always have a "first" (or input) layer, and a "second" (or output) layer, as in the preceding diagram.

We know that ∂a_j/∂w_ij = y_i, and we also know that ∂y_j/∂a_j is the derivative of the activation function, which we can calculate (a short sketch of two such derivatives follows the list below). Then, all we need to do is calculate the derivative ∂J/∂y_j. Let's note that this is the derivative of the error with respect to the output of the activation function in the "second" layer. We can now calculate all the derivatives, starting from the last layer and moving backward, because the following applies:

  • We can calculate this derivative for the last layer.
  • We have a formula that allows us to calculate the derivative for one layer, assuming we can calculate the derivative for the next.
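For example, here is what ∂y/∂a looks like for two common activation functions; this is just a sketch of the point above, using the well-known derivatives of the sigmoid and ReLU:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def sigmoid_prime(a):
    y = sigmoid(a)
    return y * (1 - y)                      # dy/da for the logistic sigmoid

def relu_prime(a):
    return (np.asarray(a) > 0).astype(float)  # dy/da for ReLU: 1 where a > 0, 0 elsewhere
```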
  9. In the following equation, y_i is the output of the first layer (and the input to the second), while y_j is the output of the second layer. Applying the chain rule, we have the following:

\frac{\partial J}{\partial y_i} = \sum_j \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} \frac{\partial a_j}{\partial y_i} = \sum_j \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} w_{ij}

The sum over j reflects the fact that, in the feedforward part, the output y_i is fed to all the neurons in the second layer; therefore, they all contribute to the error at y_i when it is propagated backward.
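In code, this sum over j is just a matrix-vector product. The following sketch uses made-up values and assumes the weights are stored as a matrix W with W[i, j] = w_ij:

```python
import numpy as np

# Illustrative setup: 3 neurons in the "first" layer, 2 in the "second".
W = np.array([[0.1, -0.4],
              [0.3,  0.2],
              [-0.5, 0.7]])          # W[i, j] corresponds to w_ij
dJ_dy_j = np.array([0.2, -0.1])      # dJ/dy_j, already known for the "second" layer
dy_da_j = np.array([0.25, 0.19])     # dy_j/da_j, derivative of the activation

# The sum over j becomes a matrix-vector product:
# dJ/dy_i = sum_j (dJ/dy_j * dy_j/da_j * w_ij)
dJ_dy_i = W @ (dJ_dy_j * dy_da_j)    # one entry per neuron i of the "first" layer
print(dJ_dy_i)
```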
  10. Once again, we can calculate both ∂y_j/∂a_j and ∂a_j/∂y_i = w_ij; once we know ∂J/∂y_j, we can calculate ∂J/∂y_i. Since we can calculate ∂J/∂y_j for the last layer, we can move backward and calculate ∂J/∂y_i for any layer, and therefore ∂J/∂w_ij for any layer.
  11. To summarize, if we have a sequence of layers where the following applies:

y_i → y_j → y_k

We then have these two fundamental equations:

\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} y_i

\frac{\partial J}{\partial y_i} = \sum_j \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} w_{ij}

By using these two equations, we can calculate the derivatives of the cost with respect to each layer. If we set δ_j = (∂J/∂y_j)(∂y_j/∂a_j) = ∂J/∂a_j, then δ_j represents the variation in cost with respect to the activation value, and we can think of δ_j as the error at neuron y_j.

  12. We can rewrite these equations as follows:

\frac{\partial J}{\partial y_i} = \sum_j \delta_j w_{ij}

This implies that δ_i = (∂y_i/∂a_i) Σ_j δ_j w_ij. These two equations give an alternate view of backpropagation, expressed as the variation in cost with respect to the activation value.

  13. They provide a formula to calculate this variation for any layer, once we know the variation for the following layer:

\delta_i = \frac{\partial y_i}{\partial a_i} \sum_j \delta_j w_{ij}

  14. We can combine these equations to show the following:

\frac{\partial J}{\partial w_{ij}} = \delta_j y_i

  15. The update rule for the weights of each layer is given by the following equation, where λ is the learning rate:

w_{ij} \rightarrow w_{ij} - \lambda \delta_j y_i
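Putting the two fundamental equations and the update rule together, here is a minimal end-to-end sketch of backpropagation for a network with one hidden layer. The XOR data, the sigmoid activations, the mean squared error cost, the layer sizes, and the learning rate are all illustrative choices, not something prescribed by the derivation above:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Toy dataset: XOR, chosen only for illustration
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))   # weights between the input and the hidden layer
W2 = rng.normal(size=(4, 1))   # weights between the hidden and the output layer
lam = 0.5                      # learning rate (the lambda in the update rule)

for epoch in range(10000):
    # Forward pass: a = x.w, y = sigmoid(a)
    a1 = X @ W1
    y1 = sigmoid(a1)           # hidden layer outputs
    a2 = y1 @ W2
    y2 = sigmoid(a2)           # network outputs

    # Backward pass for J = 0.5 * sum((y2 - T)**2)
    # delta at the output layer: dJ/dy * dy/da
    delta2 = (y2 - T) * y2 * (1 - y2)
    # delta at the hidden layer: (sum over the next layer of delta * w) * dy/da
    delta1 = (delta2 @ W2.T) * y1 * (1 - y1)

    # Update rule w_ij -> w_ij - lambda * delta_j * y_i
    # (the matrix products also sum the per-sample gradients)
    W2 -= lam * y1.T @ delta2
    W1 -= lam * X.T @ delta1

print(np.round(sigmoid(sigmoid(X @ W1) @ W2), 2))  # should approach [0, 1, 1, 0]
```

Each delta line is a direct transcription of the two fundamental equations, with the sum over the following layer expressed as a matrix product.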