官术网_书友最值得收藏!

The backpropagation method of training neural networks

In the backpropagation method, neural networks are trained through the gradient descent technique, where the combined weights vector, W, is updated iteratively, as follows:

Here, η is the learning rate, W(t+1) and W(t) are the weight vectors at iterations (t+1)  and (t), respectively, and ?C(W(t)) is the gradient of the cost function or the error function, with respect to the weight vector, W, at iteration (t). The previous algorithm for an individual weight or bias generalized by w ∈ W can be represented as follows:

  

As you can gather from the previous expressions, the heart of the gradient descent method of learning relies on computing the gradient of the cost function or the error function, with respect to each weight.

From the chain rule of differentiation, we know that if we have y = f(x), z = f(y), then the following is true:

This expression can be generalized to any number of variables. Now, let's take a look at a very simple neural network, as illustrated in the following diagram, in order to understand the backpropagation algorithm:

Figure 1.10: A network illustrating backpropagation

Let the input to the network be a two-dimensional vector, x = [x1 x2]T, and the corresponding output label and prediction be  and , respectively. Also, let's assume that all of the activation units in the neural network are sigmoids. Let the generalized weight connecting any unit i in layer (l-1) to unit j in layer l be denoted by , while the bias in any unit i in layer l should be denoted by . Let's derive the gradient for one data point; the total gradient can be computed as the sum of all of the data points used in training (or in a mini-batch). If the output is continuous, then the loss function, C, can be chosen as the square of the error in prediction:

The weights and biases of the network, cumulatively represented by the set W, can be determined by minimizing the cost function with respect to the W vector, which is as follows:

To perform the minimization of the cost function iteratively through gradient descent, we need to compute the gradient of the cost function with respect to each weight, w ∈ W, as follows:

Now that we have everything that we need, let's compute the gradient of the cost function, C, with respect to the weight, . Using the chain rule of differentiation, we get the following:

Now let's look at the following formula:

As you can see in the previous expression, the derivative is nothing but the error in prediction. Generally, the output unit activation function is linear in the case of regression problems, and hence the following expression applies:

  

So, if we were to compute the gradient of the cost function with respect to the total input at the output unit, it would be . This is still equal to the error in prediction of the output.

The total input at the output unit, as a function of the incoming weights and activations, can be expressed as follows:

This means that,  and the derivative of the cost function with respect to the weight, , contributing to the input of the output layer is given via the following:

As you can see, the error is backpropagated in computing the gradient of the cost function, with respect to the weights in the layers preceding the final output layer. This becomes more obvious when we compute the gradient of the cost function with respect to the generalized weight, . Let's take the weight corresponding to j=1 and k=2; that is, . The gradient of the cost function, C, with respect to this weight can be expressed as follows: 

Now, , which means that, .

So, once we have figured out the gradient of the cost function with respect to the total input to a neuron as , the gradient of any weight, w, contributing to the total input, s, can be obtained by simply multiplying the activation, z, associated with the weight. 

Now, the gradient of the cost function with respect to the total input, , can be derived by chain rule again, as follows:

Since all of the units of the neural network (except for the output unit) are sigmoid activation functions, the following is the case:

     

Combining (1), (2), and (3), we get the following:

In the preceding derived gradient expressions, you can see that the error in prediction, , is backpropagated by combining it with the relevant activations and weights (as per the chain rule of differentiation) for computing the gradients of the weights at each layer, hence, the name backpropagation in AI nomenclature.

主站蜘蛛池模板: 海林市| 陵川县| 虎林市| 全椒县| 杂多县| 泰来县| 嵊泗县| 永修县| 商水县| 绥阳县| 彰武县| 翁牛特旗| 绥宁县| 水城县| 郸城县| 嘉禾县| 郸城县| 政和县| 汉川市| 营山县| 乐安县| 社旗县| 江油市| 新乡市| 侯马市| 长岭县| 台东县| 洛川县| 资中县| 吴川市| 乌兰浩特市| 张家港市| 栾川县| 平遥县| 万盛区| 赞皇县| 高州市| 邯郸市| 大冶市| 阿克苏市| 丰城市|