Evaluating the model

We have used a learning algorithm to estimate a model's parameters from training data. How can we assess whether our model is a good representation of the real relationship? Let's assume that you have found another page in your pizza journal. We will use this page's entries as a test set to measure the performance of our model. We have added a fourth column; it contains the prices predicted by our model.
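The predicted-price column can be regenerated with a short script. The following is a minimal sketch, assuming the training and test values are the ones used in the scikit-learn example later in this section; the column labels and print formatting are illustrative choices, not taken from the original table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training and test values copied from the scikit-learn example in this section
X_train = np.array([6, 8, 10, 14, 18]).reshape(-1, 1)
y_train = [7, 9, 13, 17.5, 18]
X_test = np.array([8, 9, 11, 16, 12]).reshape(-1, 1)
y_test = [11, 8.5, 15, 18, 11]

# Fit on the training page, then predict a price for each test instance
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Column labels are illustrative assumptions
print('instance  feature  observed  predicted')
for i, (x, y, p) in enumerate(zip(X_test.ravel(), y_test, predictions), start=1):
    print('%8d  %7d  %8.2f  %9.4f' % (i, x, y, p))
```

Each row pairs an observed test price with the price the fitted line predicts for the same instance; the gaps between the last two columns are the residuals that the evaluation measures.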

Several measures can be used to assess our model's predictive capability. We will evaluate our pizza price predictor using a measure called R-squared. Also known as the coefficient of determination, R-squared measures how well the regression line fits the data. There are several methods for calculating R-squared. In the case of simple linear regression, R-squared is equal to the square of the Pearson product-moment correlation coefficient (PPMCC), or Pearson's r. Using this method, R-squared must lie between zero and one, inclusive. This is intuitive; if R-squared describes the proportion of variance in the response variable that is explained by the model, it cannot be greater than one or less than zero. Other methods, including the method used by scikit-learn, do not calculate R-squared as the square of Pearson's r. Using these methods, R-squared can be negative if the model performs extremely poorly, that is, worse than always predicting the mean of the observed values.

It is important to note the limitations of performance metrics. R-squared in particular is sensitive to outliers, and when it is computed on the training data it can spuriously increase as features are added to the model.
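The difference between the two ways of calculating R-squared can be seen on a small example. The sketch below is illustrative only: the "bad" predictions are hypothetical values constructed to be perfectly anti-correlated with the observations, and the variable names are assumptions. It compares the squared Pearson's r from `np.corrcoef` with the score returned by `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Observed prices (the test values used later in this section)
y_true = np.array([11, 8.5, 15, 18, 11])

# Hypothetical, deliberately poor predictions: they fall as the observations rise
y_bad = 30 - y_true

# Square of Pearson's r: ignores the direction and scale of the errors,
# so it always lies between zero and one
pearson_r = np.corrcoef(y_true, y_bad)[0, 1]
pearson_r_squared = pearson_r ** 2

# scikit-learn's definition: 1 - (residual sum of squares / total sum of squares)
sklearn_r_squared = r2_score(y_true, y_bad)

print('squared Pearson r: %.4f' % pearson_r_squared)
print('r2_score:          %.4f' % sklearn_r_squared)
```

The squared correlation is close to one because the predictions are a perfect linear function of the observations, while `r2_score` is negative because the predictions are far from the observed values.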

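We will follow the method used by scikit-learn to calculate R-squared for our pizza price predictor. First we must measure the total sum of squares, $SS_{tot}$. Here $y_i$ is the observed value of the response variable for the $i$th test instance, and $\bar{y}$ is the mean of the observed values of the response variable.

$$SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = (11 - 12.7)^2 + (8.5 - 12.7)^2 + \cdots + (11 - 12.7)^2 = 56.8$$

Next we must find the residual sum of squares (RSS), $SS_{res}$, where $f(x_i)$ is the price that the model predicts for the $i$th test instance. Recall that this is also our cost function.

$$SS_{res} = \sum_{i=1}^{n} (y_i - f(x_i))^2 \approx 19.198$$

Finally, we can find R-squared using the following:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{19.198}{56.8} \approx 0.662$$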
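The three steps above (total sum of squares, residual sum of squares, and one minus their ratio) can be sketched with NumPy. This is a minimal sketch, assuming the training and test values from the scikit-learn example in this section; the names `ss_tot` and `ss_res` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training and test values copied from the scikit-learn example in this section
X_train = np.array([6, 8, 10, 14, 18]).reshape(-1, 1)
y_train = np.array([7, 9, 13, 17.5, 18])
X_test = np.array([8, 9, 11, 16, 12]).reshape(-1, 1)
y_test = np.array([11, 8.5, 15, 18, 11])

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Total sum of squares: squared deviations of the observations from their mean
ss_tot = np.sum((y_test - y_test.mean()) ** 2)

# Residual sum of squares: squared differences between observations and predictions
ss_res = np.sum((y_test - y_pred) ** 2)

r_squared = 1 - ss_res / ss_tot
print('SS_tot = %.4f, SS_res = %.4f, R-squared = %.4f' % (ss_tot, ss_res, r_squared))
```

Computing the score by hand this way makes it clear that a model whose residual sum of squares exceeds the total sum of squares receives a negative score.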

The R-squared score of 0.662 indicates that a large proportion of the variance in the test instances' prices is explained by the model. Now let's confirm our calculation using scikit-learn. The score method of LinearRegression returns the model's R-squared value, as seen in the following example:

# In[1]: 
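import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: one explanatory variable (as a column vector) and the observed prices
X_train = np.array([6, 8, 10, 14, 18]).reshape(-1, 1)
y_train = [7, 9, 13, 17.5, 18]

# Test data from the second journal page
X_test = np.array([8, 9, 11, 16, 12]).reshape(-1, 1)
y_test = [11, 8.5, 15, 18, 11]

model = LinearRegression()
model.fit(X_train, y_train)
# score() returns R-squared computed on the data passed to it
r_squared = model.score(X_test, y_test)
# Round to four decimal places for display
print('%.4f' % r_squared)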

# Out[1]:
0.6620