
Scoring regressors using mean squared error, explained variance, and R squared

When it comes to regression models, the metrics shown earlier no longer work. After all, we are now predicting continuous output values, not distinct classification labels. Fortunately, scikit-learn provides some other useful scoring functions (a quick usage sketch follows this list):

  • mean_squared_error: The most commonly used error metric for regression problems is to measure the squared error between the predicted and the true target value for every data point in the training set, averaged across all the data points.
  • explained_variance_score: A more sophisticated metric is to measure to what degree a model can explain the variation or dispersion of the test data. Often, the amount of explained variance is measured using the correlation coefficient.
  • r2_score: The R2 score (pronounced R squared) is closely related to the explained variance score, but uses an unbiased variance estimation. It is also known as the coefficient of determination.
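
All three functions live in scikit-learn's metrics module and share the same call signature: the ground truth values come first, the predictions second. As a minimal sketch, using two small made-up arrays of true and predicted values:

In [ ]: from sklearn import metrics
... # two small made-up arrays: true values first, predicted values second
... metrics.mean_squared_error([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
... metrics.explained_variance_score([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
... metrics.r2_score([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])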

Let's create another mock-up dataset. Say we are observing data that looks like a sine function of the x values. We start by generating 100 equally spaced x values between 0 and 10:

In [19]: x = np.linspace(0, 10, 100)

However, real data is always noisy. To honor this fact, we want the target values y_true to be noisy, too. We achieve this by adding noise to the sine function:

In [20]: y_true = np.sin(x) + np.random.rand(x.size) - 0.5

Here, we use NumPy's rand function to add noise in the range [0, 1), but then center the noise around 0 by subtracting 0.5. Hence, we effectively jitter every data point either up or down by a maximum of 0.5.
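
If you want to convince yourself of the noise range, a quick, purely illustrative check is to draw a large batch of shifted noise and inspect its extremes; they always fall inside the half-open interval [-0.5, 0.5):

In [ ]: noise = np.random.rand(100000) - 0.5
... # rand samples from [0, 1), so the shifted values lie in [-0.5, 0.5)
... noise.min(), noise.max()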

Let's assume our model was smart enough to figure out the sin(x) relationship. Hence, the predicted y values are given as follows:

In [21]: y_pred = np.sin(x)

What does this data look like? We can use Matplotlib to visualize it:

In [22]: import matplotlib.pyplot as plt
... plt.style.use('ggplot')
... %matplotlib inline
In [23]: plt.plot(x, y_pred, linewidth=4, label='model')
... plt.plot(x, y_true, 'o', label='data')
... plt.xlabel('x')
... plt.ylabel('y')
... plt.legend(loc='lower left')
Out[23]: <matplotlib.legend.Legend at 0x265fbeb9f98>

This will produce the following line plot:

Predicted y-values and ground truth data

The most straightforward metric to determine how good our model predictions are is the mean squared error. For each data point, we look at the difference between the predicted and the actual y value, and then square it. We then compute the average of this squared error over all the data points:

In [24]: mse = np.mean((y_true - y_pred) ** 2)
... mse
Out[24]: 0.085318394808423778

For our convenience, scikit-learn provides its own implementation of the mean squared error:

In [25]: metrics.mean_squared_error(y_true, y_pred)
Out[25]: 0.085318394808423778

Another common metric is to measure the scatter or variation in the data: if every data point were equal to the mean of all the data points, there would be no scatter or variation in the data, and we could predict all future data points with a single value. This would be the world's most boring machine learning problem. Instead, we find that the data points often follow some unknown, hidden relationship that we would like to uncover. In the previous example, this is the y = sin(x) relationship, which causes the data to be scattered.

We can measure how much of that scatter in the data (or variance) we can explain. We do this by calculating the variance of the difference between the predicted and the actual labels (the residuals); this is all the variance our predictions could not explain. If we normalize this value by the total variance in the data, we get what is known as the fraction of variance unexplained:

In [26]: fvu = np.var(y_true - y_pred) / np.var(y_true)
... fvu
Out[26]: 0.16397032626629501

Because this metric is a fraction, its values must lie between 0 and 1. We can subtract this fraction from 1 to get the fraction of variance explained:

In [27]: fve = 1.0 - fvu
... fve
Out[27]: 0.83602967373370496

Let's verify our math with scikit-learn:

In [28]: metrics.explained_variance_score(y_true, y_pred)
Out[28]: 0.83602967373370496

Spot on! Finally, we can calculate what is known as the coefficient of determination, or R2 (pronounced R squared). R2 is closely related to the fraction of variance explained, and compares the mean squared error calculated earlier to the actual variance in the data:

In [29]: r2 = 1.0 - mse / np.var(y_true)
... r2
Out[29]: 0.8358169419264746

The same value can be obtained with scikit-learn:

In [30]: metrics.r2_score(y_true, y_pred)
Out[30]: 0.8358169419264746
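
The two scores are not interchangeable, though: because the explained variance score only looks at the variance of the residuals, it is blind to a constant bias in the predictions, whereas the R2 score penalizes it. A small sketch makes the difference visible (the 0.5 offset is just an arbitrary value chosen for illustration):

In [ ]: y_biased = y_pred + 0.5
... # the residual variance is unchanged, so the explained variance score stays the same
... metrics.explained_variance_score(y_true, y_biased)
... # the squared error grows with the constant offset, so the R2 score drops
... metrics.r2_score(y_true, y_biased)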

The better our predictions fit the data, in comparison to taking the simple average, the closer the value of the R2 score will be to 1. The R2 score can take on negative values, as model predictions can be arbitrarily bad. A constant model that always predicts the expected value of y, independent of the input x, would get an R2 score of 0:

In [31]: metrics.r2_score(y_true, np.mean(y_true) * np.ones_like(y_true))
Out[31]: 0.0
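
Any model that does worse than this constant baseline receives a negative R2 score. As a quick, made-up illustration, we can score a deliberately bad model that predicts -sin(x), the mirror image of the true relationship; its squared error far exceeds the variance of the data, so the score comes out negative:

In [ ]: # an anti-correlated, deliberately bad model
... metrics.r2_score(y_true, -np.sin(x))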