- Machine Learning for OpenCV
- Michael Beyeler
Scoring regressors using mean squared error, explained variance, and R squared
When it comes to regression models, the metrics shown earlier no longer work. After all, we are now predicting continuous output values, not discrete classification labels. Fortunately, scikit-learn provides some other useful scoring functions:
- mean_squared_error: The most commonly used error metric for regression problems, which measures the squared difference between the predicted and the true target value for every data point, averaged across all the data points.
- explained_variance_score: A more sophisticated metric is to measure to what degree a model can explain the variation or dispersion of the test data. Often, the amount of explained variance is measured using the correlation coefficient.
- r2_score: The R2 score (pronounced R squared) is closely related to the explained variance score, but uses an unbiased variance estimation. It is also known as the coefficient of determination.
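The code in this section assumes that NumPy and scikit-learn's metrics module have already been imported earlier in the chapter. If you are following along in a fresh session, a minimal setup might look like this (seeding the random number generator is optional, but makes the noisy data below reproducible):

import numpy as np
from sklearn import metrics
np.random.seed(42)  # optional: fixes the random noise so results can be reproduced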
Let's create another mock-up dataset. Say we are observing data that follows a sine wave as a function of the x values. We start by generating 100 equally spaced x values between 0 and 10:
In [19]: x = np.linspace(0, 10, 100)
However, real data is always noisy. To account for this, we want the target values y_true to be noisy, too. We achieve this by adding noise to the sine function:
In [20]: y_true = np.sin(x) + np.random.rand(x.size) - 0.5
Here, we use NumPy's rand function to draw noise uniformly from the range [0, 1), then center the noise around 0 by subtracting 0.5. Hence, we effectively jitter every data point either up or down by at most 0.5.
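As a quick sanity check, you can verify these bounds directly (a small illustrative snippet, not part of the original example):

noise = np.random.rand(x.size) - 0.5  # uniform noise, shifted into [-0.5, 0.5)
noise.min(), noise.max()              # both values fall within [-0.5, 0.5)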
Let's assume our model was smart enough to figure out the sin(x) relationship. Hence, the predicted y values are given as follows:
In [21]: y_pred = np.sin(x)
What does this data look like? We can use Matplotlib to visualize it:
In [22]: import matplotlib.pyplot as plt
... plt.style.use('ggplot')
... %matplotlib inline
In [23]: plt.plot(x, y_pred, linewidth=4, label='model')
... plt.plot(x, y_true, 'o', label='data')
... plt.xlabel('x')
... plt.ylabel('y')
... plt.legend(loc='lower left')
Out[23]: <matplotlib.legend.Legend at 0x265fbeb9f98>
This produces a line plot showing the noisy data points scattered around the smooth sin(x) model curve.
The most straightforward metric to determine how good our model predictions are is the mean squared error. For each data point, we look at the difference between the predicted and the actual y value, and then square it. We then compute the average of this squared error over all the data points:
In [24]: mse = np.mean((y_true - y_pred) ** 2)
... mse
Out[24]: 0.085318394808423778
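In formula form, for n data points with true values y_i and predictions ŷ_i, this computation is:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$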
For our convenience, scikit-learn provides its own implementation of the mean squared error:
In [25]: metrics.mean_squared_error(y_true, y_pred)
Out[25]: 0.085318394808423778
Another common metric is to measure the scatter or variation in the data: if every data point were equal to the mean of all the data points, we would have no scatter or variation in the data, and we could predict all future data points with a single value. This would be the world's most boring machine learning problem. Instead, we find that the data points often follow some unknown, hidden relationship that we would like to uncover. In the previous example, this is the y = sin(x) relationship, which causes the data to be scattered.
We can measure how much of that scatter in the data (or variance) we can explain. We do this by calculating the variance that still exists between the predicted and the actual target values; this is all the variance our predictions could not explain. If we normalize this value by the total variance in the data, we get what is known as the fraction of variance unexplained:
In [26]: fvu = np.var(y_true - y_pred) / np.var(y_true)
... fvu
Out[26]: 0.16397032626629501
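In terms of the variance operator, this is:

$$\mathrm{FVU} = \frac{\operatorname{Var}(y_{\text{true}} - y_{\text{pred}})}{\operatorname{Var}(y_{\text{true}})}$$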
For any model that fits the data at least as well as simply predicting the mean, this fraction lies between 0 and 1 (predictions worse than the mean can push it above 1). We can subtract this fraction from 1 to get the fraction of variance explained:
In [27]: fve = 1.0 - fvu
... fve
Out[27]: 0.83602967373370496
Let's verify our math with scikit-learn:
In [28]: metrics.explained_variance_score(y_true, y_pred)
Out[28]: 0.83602967373370496
Spot on! Finally, we can calculate what is known as the coefficient of determination, or R2 (pronounced R squared). R2 is closely related to the fraction of variance explained, and compares the mean squared error calculated earlier to the actual variance in the data:
In [29]: r2 = 1.0 - mse / np.var(y_true)
... r2
Out[29]: 0.8358169419264746
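Written out, this computation is:

$$R^2 = 1 - \frac{\mathrm{MSE}}{\operatorname{Var}(y_{\text{true}})}$$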
The same value can be obtained with scikit-learn:
In [30]: metrics.r2_score(y_true, y_pred)
Out[30]: 0.8358169419264746
The better our predictions fit the data, in comparison to simply taking the average, the closer the value of the R2 score will be to 1. The R2 score can take on negative values, as model predictions can be arbitrarily bad. A constant model that always predicts the expected value of y, independent of the input x, would get an R2 score of 0:
In [31]: metrics.r2_score(y_true, np.mean(y_true) * np.ones_like(y_true))
Out[31]: 0.0
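To see the negative case, consider scoring a deliberately bad model (an illustrative sketch using the arrays defined above, not from the original example). Flipping the sign of the predictions makes the residuals far larger than the data's own variance; the exact score depends on the random noise, but it will be well below zero:

# hypothetical worst-case model: predict -sin(x) instead of sin(x)
metrics.r2_score(y_true, -y_pred)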