
  • Python Deep Learning
  • Ivan Vasilev, Daniel Slater, Gianmario Spacagna, Peter Roelants, Valentino Zocca
  • 902 words
  • 2021-07-02 14:31:05

Linear regression

We have already introduced linear regression in Chapter 1, Machine Learning – an Introduction. To recap, using vector notation, the output of a linear regression algorithm is a single value, y, which is equal to the dot product of the input vector x and the weight vector w. As we now know, linear regression is a special case of a neural network; that is, it's a single neuron with the identity activation function. In this section, we'll learn how to train linear regression with gradient descent and, in the following sections, we'll extend it to training more complex models. You can see how gradient descent works in the following code block:
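The original listing is not preserved in this copy, so here is a minimal NumPy sketch of batch gradient descent for linear regression; the function name `gradient_descent`, the hyperparameter values, and the toy data are illustrative, not the book's exact code:

```python
import numpy as np

def gradient_descent(x, t, learning_rate=0.05, epochs=500):
    """Fit linear regression y = x @ w with batch gradient descent on the MSE loss."""
    n = len(x)
    w = np.zeros(x.shape[1])            # initialize the weights
    for _ in range(epochs):
        y = x @ w                       # predictions for all samples
        grad = 2 / n * x.T @ (y - t)    # dJ/dw for the MSE loss
        w -= learning_rate * grad       # step against the gradient
    return w

# toy data: the target is exactly t = 2 * x (single input, no bias)
x = np.array([[1.0], [2.0], [3.0], [4.0]])
t = np.array([2.0, 4.0, 6.0, 8.0])
w = gradient_descent(x, t)
```

On this toy problem the weight converges toward 2.0, the slope that generated the targets.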

At first, this might look scary, but fear not! Behind the scenes, it's very simple and straightforward mathematics (I know that sounds even scarier!). But let's not lose sight of our goal, which is to adjust the weights, w, in a way that will help the algorithm to predict the target values. To do this, first we need to know how the output yi differs from the target value ti for each sample of the training dataset (we use superscript notation to mark the i-th sample). We'll use the mean-squared error (MSE) loss function, which is equal to the mean value of the squared differences yi - ti over all samples (the total number of samples in the training set is n). We'll denote the MSE with J for ease of use and to underscore that we could use other loss functions as well. Each yi is a function of w, and therefore, J is also a function of w. As we mentioned previously, the loss function J represents a hypersurface of dimension equal to the dimension of w (we are implicitly also considering the bias). To illustrate this, imagine that we have only one input value, x, and a single weight, w. We can see how the MSE changes with respect to w in the following diagram:
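In symbols, the MSE loss just described can be written as (with superscripts marking the sample index, as in the text):

```latex
J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{i} - t^{i} \right)^{2},
\qquad y^{i} = \mathbf{x}^{i} \cdot \mathbf{w}
```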

MSE diagram

Our goal is to minimize J, which means finding the w for which J is at its global minimum. To do this, we need to know whether J increases or decreases when we modify w or, in other words, the first derivative (or gradient) of J with respect to w:

  1. In the general case, where we have multiple inputs and weights, we can calculate the partial derivative with respect to each weight wj using the following formula:

\[ \frac{\partial J}{\partial w_j} = \frac{\partial}{\partial w_j} \frac{1}{n} \sum_{i=1}^{n} \left( y^{i} - t^{i} \right)^{2} \]

  2. To move toward the minimum, we need to move in the direction opposite to \( \partial J / \partial w_j \) for each wj.
  3. Let's calculate the derivative:

If \( y^{i} = \sum_{j} x_{j}^{i} w_{j} \), then \( \frac{\partial y^{i}}{\partial w_j} = x_{j}^{i} \) and, therefore,

\[ \frac{\partial J}{\partial w_j} = \frac{2}{n} \sum_{i=1}^{n} \left( y^{i} - t^{i} \right) x_{j}^{i} \]

The notation can sometimes be confusing, especially the first time you encounter it. The input is given by the vectors xi, where the superscript indicates the i-th example. Since x and w are vectors, the subscript indicates the j-th coordinate of the vector. yi then represents the output of the neural network given the input xi, while ti represents the target, that is, the desired value corresponding to the input xi.
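The partial derivative ∂J/∂wj = (2/n) Σi (yi - ti) xji can be verified numerically with finite differences. A minimal NumPy sketch with made-up toy data (all names and values here are illustrative):

```python
import numpy as np

# toy data: three samples, two input features
x = np.array([[1.0, 2.0], [3.0, 1.0], [0.5, 4.0]])
t = np.array([1.0, 2.0, 3.0])
w = np.array([0.3, -0.2])
n = len(x)

def loss(w):
    """MSE loss J(w) for linear regression y = x @ w."""
    return np.mean((x @ w - t) ** 2)

# analytical gradient: dJ/dw_j = 2/n * sum_i (y_i - t_i) * x_ij
grad_analytic = 2 / n * x.T @ (x @ w - t)

# numerical gradient via central finite differences
eps = 1e-6
grad_numeric = np.zeros_like(w)
for j in range(len(w)):
    dw = np.zeros_like(w)
    dw[j] = eps
    grad_numeric[j] = (loss(w + dw) - loss(w - dw)) / (2 * eps)
```

The two gradients agree up to floating-point error, which confirms the derivation.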
  4. Now that we have calculated the partial derivatives, we'll update the weights with the following update rule:

\[ w_j \rightarrow w_j - \eta \frac{\partial J}{\partial w_j} \]

We can see that η is the learning rate. The learning rate determines the size of the weight adjustment at each update step.

  5. We can write the update rule in matrix form as follows:

\[ \mathbf{w} \rightarrow \mathbf{w} - \eta \nabla J \]

Here, ∇, also called nabla, represents the vector of partial derivatives.

\( \nabla J = \left( \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_m} \right) \) is a vector of partial derivatives. Instead of writing the update rule for w separately for each of its components, wj, we can use the matrix form, where ∇ indicates the partial derivative with respect to each wj.

You may have noticed that in order to update the weights, we accumulate the error across all training samples. In reality, datasets can be very large, and iterating over all of them for just one weight update would make training impractically slow. One solution to this problem is the stochastic (or online) gradient descent (SGD) algorithm, which works in the same way as regular gradient descent but updates the weights after every training sample. However, SGD is prone to noise in the data: if a sample is an outlier, we risk increasing the error instead of decreasing it. A good compromise between the two is mini-batch gradient descent, which accumulates the error over every n samples (a mini-batch) and performs one weight update per mini-batch. In practice, you'll almost always use mini-batch gradient descent.
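The mini-batch variant can be sketched in a few lines of NumPy; the function name `minibatch_sgd`, the batch size, and the toy data are illustrative choices, not the book's code:

```python
import numpy as np

def minibatch_sgd(x, t, learning_rate=0.05, epochs=200, batch_size=2):
    """Linear regression trained with mini-batch gradient descent on the MSE loss."""
    rng = np.random.default_rng(0)
    w = np.zeros(x.shape[1])
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)                      # shuffle the samples each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            xb, tb = x[batch], t[batch]
            grad = 2 / len(batch) * xb.T @ (xb @ w - tb)  # gradient over this mini-batch only
            w -= learning_rate * grad                     # one update per mini-batch
    return w

# toy data: the target is exactly t = 2 * x (single input, no bias)
x = np.array([[1.0], [2.0], [3.0], [4.0]])
t = np.array([2.0, 4.0, 6.0, 8.0])
w = minibatch_sgd(x, t)
```

Note that the weights are updated once per mini-batch rather than once per pass over the full dataset, which is what makes the method practical for large datasets.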

Before we move on to the next section, we should mention that, besides the global minimum, the loss function might have multiple local minima, and minimizing its value is not as trivial as in this example.
