
Backpropagation

So far, we have learned how to update the weights of one-layer networks with gradient descent. We started by comparing the output of the network (that is, the output of the output layer) with the target value, and then we updated the weights accordingly. But in a multi-layer network, we can only apply this technique directly to the weights that connect the final hidden layer to the output layer, because we don't have any target values for the outputs of the hidden layers. What we'll do instead is take the error we can measure at the output layer and estimate what the error would be in the layer before it, then in the layer before that. We'll propagate that error back from the last layer to the first layer; hence the name backpropagation. Backpropagation is one of the most difficult algorithms to understand, but all you need is some knowledge of basic differential calculus and the chain rule.

Let's first introduce some notation: 

  1. We'll define w_ij as the weight between the i-th neuron of layer l and the j-th neuron of layer l+1.
  2. In other words, we use subscripts i and j, where the element with subscript i belongs to the layer preceding the layer containing the element with subscript j.
  3. In a multi-layer network, l and l+1 can be any two consecutive layers, including input, hidden, and output layers.
  4. Note that the letter y is used to denote both input and output values. y_i is the input to the next layer l+1, and it's also the output of the activation function of layer l:
In this example, layer 1 represents the input, layer 2 the output, and w_ij connects the y_i activation in layer 1 to the inputs of the j-th neuron of layer 2.
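To make this notation concrete, the following is a minimal numpy sketch of the feedforward step between two such layers. The layer sizes, the sigmoid activation, and the variable names are illustrative assumptions rather than anything fixed by the text:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Illustrative sizes: layer l has 3 neurons, layer l+1 has 2 neurons.
y_l = np.array([0.1, 0.5, -0.3])    # y_i: outputs of layer l, inputs to layer l+1
W = np.array([[0.2, -0.1],
              [0.4,  0.3],
              [-0.6, 0.5]])         # W[i, j] plays the role of w_ij

a = y_l @ W                         # a_j: activation values of layer l+1
y_next = sigmoid(a)                 # y_j: outputs of layer l+1
print(y_next)
```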
  5. We'll denote the cost function (error) with J, the activation value x⋅w with a, and the output of the activation function (sigmoid, ReLU, and so on) with y.
  6. To recap the chain rule, for F(x) = f(g(x)) we have F'(x) = f'(g(x))g'(x). In our case, a_j is a function of the weights w_*j, y_j is a function of a_j, and J is a function of y_j. Armed with this great knowledge and using the preceding notation, we can write the following for the last layer of our neural network (using partial derivatives):

\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} \frac{\partial a_j}{\partial w_{ij}}

  7. Since we know that ∂a_j/∂w_ij = y_i (because a_j = Σ_i y_i w_ij), we have the following:

\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} y_i

If y is the logistic sigmoid, we'll get the same result that we have already calculated at the end of the Logistic regression section. We also know the cost function and we can calculate all the partial derivatives.
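To see the chain rule at work numerically, here is a small sanity check of the last-layer formula. It assumes a single sigmoid output neuron and a mean squared error cost, both chosen purely for illustration, and compares the analytic gradient with a finite-difference estimate:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

y_i = np.array([0.5, -0.2, 0.1])   # outputs of the previous layer (inputs here)
w = np.array([0.4, 0.7, -0.3])     # weights w_ij into the single output neuron
t = 1.0                            # target value

def cost(w):
    a_j = y_i @ w                  # activation value a = x.w
    y_j = sigmoid(a_j)             # output of the activation function
    return 0.5 * (y_j - t) ** 2    # illustrative MSE cost J

# Analytic gradient via the chain rule: dJ/dw_ij = dJ/dy_j * dy_j/da_j * y_i
a_j = y_i @ w
y_j = sigmoid(a_j)
dJ_dy = y_j - t                    # derivative of the MSE cost with respect to y_j
dy_da = y_j * (1 - y_j)            # derivative of the sigmoid with respect to a_j
grad_analytic = dJ_dy * dy_da * y_i

# Numerical gradient by central finite differences
eps = 1e-6
grad_numeric = np.array([
    (cost(w + eps * np.eye(3)[k]) - cost(w - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])

print(grad_analytic)
print(grad_numeric)                # the two should agree closely
```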

  8. For the previous (hidden) layers, the same formula holds:

\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} \frac{\partial a_j}{\partial w_{ij}}
Even though we have several layers, we always concentrate on pairs of successive layers and, perhaps abusing the notation somewhat, we always have a "first" (or input) layer, and a "second" (or output) layer, as in the preceding diagram.

We know that ∂a_j/∂w_ij = y_i, and we also know that ∂y_j/∂a_j is the derivative of the activation function, which we can calculate (a short sketch of two such derivatives follows the list below). Then, all we need to do is calculate the derivative ∂J/∂y_j. Let's note that this is the derivative of the error with respect to the output of the activation function in the "second" layer. We can now calculate all the derivatives, starting from the last layer and moving backward, because the following applies:

  • We can calculate this derivative for the last layer.
  • We have a formula that allows us to calculate the derivative for one layer, assuming we can calculate the derivative for the next.
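For example, here is what ∂y/∂a looks like for two common activation functions; this is just a sketch of the point above, using the well-known derivatives of the sigmoid and ReLU:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def sigmoid_prime(a):
    y = sigmoid(a)
    return y * (1 - y)                      # dy/da for the logistic sigmoid

def relu_prime(a):
    return (np.asarray(a) > 0).astype(float)  # dy/da for ReLU: 1 where a > 0, 0 elsewhere
```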
  9. In the following equation, y_i is the output of the first layer (and the input to the second), while y_j is the output of the second layer. Applying the chain rule, we have the following:

\frac{\partial J}{\partial y_i} = \sum_j \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} \frac{\partial a_j}{\partial y_i} = \sum_j \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} w_{ij}

The sum over j reflects the fact that, in the feedforward part, the output y_i is fed to all the neurons in the second layer; therefore, they all contribute to the error at y_i when it is propagated backward.
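In code, this sum over j is just a matrix-vector product. The following sketch uses made-up values and assumes the weights are stored as a matrix W with W[i, j] = w_ij:

```python
import numpy as np

# Illustrative setup: 3 neurons in the "first" layer, 2 in the "second".
W = np.array([[0.1, -0.4],
              [0.3,  0.2],
              [-0.5, 0.7]])          # W[i, j] corresponds to w_ij
dJ_dy_j = np.array([0.2, -0.1])      # dJ/dy_j, already known for the "second" layer
dy_da_j = np.array([0.25, 0.19])     # dy_j/da_j, derivative of the activation

# The sum over j becomes a matrix-vector product:
# dJ/dy_i = sum_j (dJ/dy_j * dy_j/da_j * w_ij)
dJ_dy_i = W @ (dJ_dy_j * dy_da_j)    # one entry per neuron i of the "first" layer
print(dJ_dy_i)
```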
  10. Once again, we can calculate both ∂y_j/∂a_j and ∂a_j/∂y_i = w_ij; once we know ∂J/∂y_j, we can calculate ∂J/∂y_i. Since we can calculate ∂J/∂y_j for the last layer, we can move backward and calculate ∂J/∂y_i for any layer, and therefore ∂J/∂w_ij for any layer.
  11. To summarize, if we have a sequence of layers where the following applies:

y_i → y_j → y_k

We then have these two fundamental equations:

\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} y_i

\frac{\partial J}{\partial y_i} = \sum_j \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial a_j} w_{ij}

By using these two equations, we can calculate the derivatives of the cost with respect to each layer. If we set δ_j = (∂J/∂y_j)(∂y_j/∂a_j) = ∂J/∂a_j, then δ_j represents the variation in cost with respect to the activation value, and we can think of δ_j as the error at neuron y_j.

  12. We can rewrite these equations as follows:

\frac{\partial J}{\partial y_i} = \sum_j \delta_j w_{ij}

This implies that δ_i = (∂y_i/∂a_i) Σ_j δ_j w_ij. These two equations give an alternate view of backpropagation, expressed as the variation in cost with respect to the activation value.

  13. They provide a formula to calculate this variation for any layer, once we know the variation for the following layer:

\delta_i = \frac{\partial y_i}{\partial a_i} \sum_j \delta_j w_{ij}

  14. We can combine these equations to show the following:

\frac{\partial J}{\partial w_{ij}} = \delta_j y_i

  15. The update rule for the weights of each layer is given by the following equation, where λ is the learning rate:

w_{ij} \rightarrow w_{ij} - \lambda \delta_j y_i
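Putting the two fundamental equations and the update rule together, here is a minimal end-to-end sketch of backpropagation for a network with one hidden layer. The XOR data, the sigmoid activations, the mean squared error cost, the layer sizes, and the learning rate are all illustrative choices, not something prescribed by the derivation above:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Toy dataset: XOR, chosen only for illustration
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))   # weights between the input and the hidden layer
W2 = rng.normal(size=(4, 1))   # weights between the hidden and the output layer
lam = 0.5                      # learning rate (the lambda in the update rule)

for epoch in range(10000):
    # Forward pass: a = x.w, y = sigmoid(a)
    a1 = X @ W1
    y1 = sigmoid(a1)           # hidden layer outputs
    a2 = y1 @ W2
    y2 = sigmoid(a2)           # network outputs

    # Backward pass for J = 0.5 * sum((y2 - T)**2)
    # delta at the output layer: dJ/dy * dy/da
    delta2 = (y2 - T) * y2 * (1 - y2)
    # delta at the hidden layer: (sum over the next layer of delta * w) * dy/da
    delta1 = (delta2 @ W2.T) * y1 * (1 - y1)

    # Update rule w_ij -> w_ij - lambda * delta_j * y_i
    # (the matrix products also sum the per-sample gradients)
    W2 -= lam * y1.T @ delta2
    W1 -= lam * X.T @ delta1

print(np.round(sigmoid(sigmoid(X @ W1) @ W2), 2))  # should approach [0, 1, 1, 0]
```

Each delta line is a direct transcription of the two fundamental equations, with the sum over the following layer expressed as a matrix product.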