Python Deep Learning
Ivan Vasilev, Daniel Slater, Gianmario Spacagna, Peter Roelants, Valentino Zocca
Backpropagation
So far, we have learned how to update the weights of one-layer networks with gradient descent. We started by comparing the output of the network (that is, the output of the output layer) with the target value, and then we updated the weights accordingly. But in a multi-layer network, we can only apply this technique to the weights that connect the final hidden layer to the output layer. That's because we don't have any target values for the outputs of the hidden layers. What we'll do instead is calculate the error in the final hidden layer and estimate what it would be in the previous layer. We'll propagate that error back from the last layer to the first; hence the name backpropagation. Backpropagation is one of the most difficult algorithms to understand, but all you need is some knowledge of basic differential calculus and the chain rule.
Let's first introduce some notation:
- We'll define $w_{ij}$ as the weight between the $i$-th neuron of layer $l$ and the $j$-th neuron of layer $l+1$.
- In other words, we use subscripts $i$ and $j$, where the element with subscript $i$ belongs to the layer preceding the layer containing the element with subscript $j$.
- In a multi-layer network, $l$ and $l+1$ can be any two consecutive layers, including input, hidden, and output layers.
- Note that the letter $y$ is used to denote both input and output values: $y_i$ is the output of the activation function of layer $l$, and it also serves as the input to the next layer, $l+1$.
- We'll denote the cost function (error) with $J$, the activation value $\mathbf{x} \cdot \mathbf{w}$ with $a$, and the activation function (sigmoid, ReLU, and so on) output with $y$.
- To recap the chain rule: for $F(x) = f(g(x))$, we have $F'(x) = f'(g(x))\,g'(x)$. In our case, $a_j$ is a function of the weights $w_{*j}$, $y_j$ is a function of $a_j$, and $J$ is a function of $y_j$. Armed with this great knowledge and using the preceding notation, we can write the following for the last layer of our neural network (using partial derivatives):
$$\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_j}\frac{\partial y_j}{\partial a_j}\frac{\partial a_j}{\partial w_{ij}}$$
- Since we know that $\frac{\partial a_j}{\partial w_{ij}} = y_i$ (because $a_j = \sum_i y_i w_{ij}$), we have the following:
$$\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_j}\frac{\partial y_j}{\partial a_j}\, y_i$$
If $y$ is the logistic sigmoid, we'll get the same result that we already calculated at the end of the Logistic regression section. We also know the cost function, so we can calculate all the partial derivatives.
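To make this concrete, here is a minimal NumPy sketch of the last-layer gradient, assuming a single sigmoid output neuron and a cost of the form $J = \frac{1}{2}(y_j - t)^2$; the input values, weights, and target are illustrative, not taken from the text.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Illustrative values: three neurons feeding a single output neuron
y_i = np.array([0.5, 0.1, 0.9])    # outputs of the previous layer
w_ij = np.array([0.4, -0.2, 0.1])  # weights into the output neuron
t = 1.0                            # target value

a_j = y_i @ w_ij                   # activation value a_j = sum_i y_i * w_ij
y_j = sigmoid(a_j)                 # output of the activation function

dJ_dyj = y_j - t                   # dJ/dy_j for J = 0.5 * (y_j - t) ** 2
dyj_daj = y_j * (1 - y_j)          # derivative of the logistic sigmoid
dJ_dwij = dJ_dyj * dyj_daj * y_i   # dJ/dw_ij = dJ/dy_j * dy_j/da_j * y_i

print(dJ_dwij)                     # one gradient component per incoming weight
```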
- For the previous (hidden) layers, the same formula holds:
$$\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_j}\frac{\partial y_j}{\partial a_j}\frac{\partial a_j}{\partial w_{ij}}$$
We know that $\frac{\partial a_j}{\partial w_{ij}} = y_i$, and we also know that $\frac{\partial y_j}{\partial a_j}$ is the derivative of the activation function, which we can calculate. Then, all we need to do is calculate the derivative $\frac{\partial J}{\partial y_j}$. Let's note that this is the derivative of the error with respect to the activation function in the "second" layer. We can now calculate all the derivatives, starting from the last layer and moving backward, because the following applies:
- We can calculate this derivative for the last layer.
- We have a formula that allows us to calculate the derivative for one layer, assuming we can calculate the derivative for the next.
- In the following equation, $y_i$ is the output of the first layer (and the input to the second), while $y_j$ is the output of the second layer. Applying the chain rule, we have the following:
$$\frac{\partial J}{\partial y_i} = \sum_{j}\frac{\partial J}{\partial y_j}\frac{\partial y_j}{\partial a_j}\frac{\partial a_j}{\partial y_i} = \sum_{j}\frac{\partial J}{\partial y_j}\frac{\partial y_j}{\partial a_j}\, w_{ij}$$
- Once again, we can calculate both $\frac{\partial y_j}{\partial a_j}$ and $\frac{\partial a_j}{\partial y_i} = w_{ij}$; once we know $\frac{\partial J}{\partial y_j}$, we can calculate $\frac{\partial J}{\partial y_i}$. Since we can calculate $\frac{\partial J}{\partial y_j}$ for the last layer, we can move backward and calculate $\frac{\partial J}{\partial y_i}$ for any layer, and therefore $\frac{\partial J}{\partial w_{ij}}$ for any layer.
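As an illustration of this backward step, the following sketch propagates $\frac{\partial J}{\partial y_j}$ from a layer of two sigmoid neurons back to the three outputs $y_i$ of the preceding layer, assuming the upstream derivatives are already known; all shapes and values are made up for the example.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Illustrative shapes: three neurons in the first layer, two in the second
y_i = np.array([0.5, 0.1, 0.9])          # outputs of the first layer
W = np.array([[0.4, -0.3],
              [0.2,  0.6],
              [-0.1, 0.5]])              # W[i, j] holds the weight w_ij

a_j = y_i @ W                            # activations of the second layer
y_j = sigmoid(a_j)                       # outputs of the second layer

dJ_dyj = np.array([0.2, -0.4])           # assumed known from the layer above
dyj_daj = y_j * (1 - y_j)                # sigmoid derivative dy_j/da_j

# dJ/dy_i = sum_j dJ/dy_j * dy_j/da_j * w_ij
dJ_dyi = W @ (dJ_dyj * dyj_daj)

print(dJ_dyi)                            # one derivative per neuron of the first layer
```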
- To summarize, if we have a sequence of layers where the following applies:
$$y_i \rightarrow y_j \rightarrow y_k$$
We then have these two fundamental equations:
$$\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_j}\frac{\partial y_j}{\partial a_j}\, y_i$$
$$\frac{\partial J}{\partial y_i} = \sum_{j}\frac{\partial J}{\partial y_j}\frac{\partial y_j}{\partial a_j}\, w_{ij}$$
By using these two equations, we can calculate the derivatives of the cost with respect to each layer. If we set $\delta_j = \frac{\partial J}{\partial y_j}\frac{\partial y_j}{\partial a_j}$, then $\delta_j$ represents the variation in cost with respect to the activation value, and we can think of $\delta_j$ as the error at the neuron $y_j$.
- We can rewrite these equations as follows:
$$\frac{\partial J}{\partial w_{ij}} = \delta_j\, y_i$$
This implies that $\delta_j = \frac{\partial J}{\partial a_j}$. These two equations give an alternate view of backpropagation, as there is a variation in cost with respect to the activation value.
- They provide a formula to calculate this variation for any layer once we know the variation for the following layer:
$$\delta_i = \frac{\partial y_i}{\partial a_i}\sum_{j}\delta_j\, w_{ij}$$
- We can combine these equations to show the following:
$$\frac{\partial J}{\partial w_{ij}} = y_i\,\frac{\partial y_j}{\partial a_j}\sum_{k}\delta_k\, w_{jk}$$
- The update rule for the weights of each layer is given by the following equation, where $\lambda$ is the learning rate:
$$w_{ij} \rightarrow w_{ij} - \lambda\,\delta_j\, y_i$$
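Putting everything together, here is a short, self-contained sketch of one backpropagation step for a network with a single sigmoid hidden layer and a sigmoid output, using the delta form of the equations and the update rule above; the layer sizes, input, and learning rate are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

rng = np.random.default_rng(0)

# A tiny network: 3 inputs -> 4 hidden neurons -> 1 output, all sigmoid
W1 = rng.normal(scale=0.1, size=(3, 4))   # W1[i, j] = w_ij between input and hidden layer
W2 = rng.normal(scale=0.1, size=(4, 1))   # W2[j, k] = w_jk between hidden and output layer

x = np.array([0.5, 0.1, 0.9])             # network input (the y_i of the first layer)
target = np.array([1.0])
lr = 0.1                                  # learning rate (lambda)

# Forward pass
a1 = x @ W1                               # hidden activation values
y1 = sigmoid(a1)                          # hidden outputs
a2 = y1 @ W2                              # output activation values
y2 = sigmoid(a2)                          # network output

# Backward pass: delta_j = dJ/dy_j * dy_j/da_j, with J = 0.5 * sum((y2 - target) ** 2)
delta2 = (y2 - target) * y2 * (1 - y2)    # error at the output layer
delta1 = (W2 @ delta2) * y1 * (1 - y1)    # error at the hidden layer: dy/da * sum_k delta_k * w_jk

# Weight updates: w_ij -> w_ij - lr * delta_j * y_i
W2 -= lr * np.outer(y1, delta2)
W1 -= lr * np.outer(x, delta1)

print("cost before update:", 0.5 * np.sum((y2 - target) ** 2))
```

Repeating this step over many training examples is the same gradient descent procedure we used for one-layer networks, now applied to every layer of the network.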