
Linear regression with scikit-learn and higher dimensionality

scikit-learn offers the class LinearRegression, which works with n-dimensional spaces. For this purpose, we're going to use the Boston dataset:

from sklearn.datasets import load_boston

>>> boston = load_boston()

>>> boston.data.shape
(506L, 13L)
>>> boston.target.shape
(506L,)

It has 506 samples with 13 input features and one output. In the following figure, there's a collection of the plots of the first 12 features:

When working with datasets, it's useful to have a tabular view to manipulate data. pandas is a perfect framework for this task, and even though it's beyond the scope of this book, I suggest you create a data frame with the command pandas.DataFrame(boston.data, columns=boston.feature_names) and use Jupyter to visualize it. For further information, refer to Heydt M., Learning pandas - Python Data Discovery and Analysis Made Easy, Packt.

There are different scales and outliers (which can be removed using the methods studied in the previous chapters), so it's better to ask the model to normalize the data before processing it. Moreover, for testing purposes, we split the original dataset into training (90%) and test (10%) sets:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

>>> X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, test_size=0.1)

>>> lr = LinearRegression(normalize=True)
>>> lr.fit(X_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
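
Note that in scikit-learn 1.2 and later, both load_boston and the normalize parameter of LinearRegression have been removed. The following is a minimal sketch of an equivalent workflow for recent releases. It assumes the bundled diabetes dataset as a stand-in (it is not the dataset used in this section) and a StandardScaler pipeline in place of normalize=True; the random_state value is also an arbitrary choice made here for reproducibility.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): bundled with scikit-learn, 442 samples, 10 features
X, y = load_diabetes(return_X_y=True)

# Hold out 10% of the samples for testing, as in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Scale the features inside the pipeline, then fit ordinary least squares
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# score() returns the coefficient of determination (R2) on the held-out data
test_r2 = model.score(X_test, y_test)
print(X.shape, test_r2)
```

Keeping the scaler inside the pipeline means that its statistics are computed on the training data only and then reused on the test data.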

When the original dataset isn't large enough, splitting it into training and test sets may reduce the number of samples that can be used for fitting the model. k-fold cross-validation can help solve this problem with a different strategy. The whole dataset is split into k folds, always using k-1 folds for training and the remaining one to validate the model. k iterations are performed, each time using a different validation fold. In the following figure, there's an example with 3 folds/iterations:

In this way, the final score can be determined as the average of all values, and every sample is selected for training k-1 times.
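
The splitting strategy can be sketched in a few lines of plain Python. The helper k_fold_indices below is a hypothetical name introduced only for illustration (it is not a scikit-learn function), and it uses contiguous, unshuffled folds as a simplifying assumption:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    # Spread the remainder so that fold sizes differ by at most one sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        # Every sample outside the validation fold is used for training
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# 9 samples and 3 folds, matching the 3-iteration example in the text
folds = list(k_fold_indices(9, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

With 3 folds, each sample appears in exactly one validation fold and in the training set of the other 2 iterations.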

To check the quality of a regression, scikit-learn provides the method score(X, y), which evaluates the model on test data:

>>> lr.score(X_test, Y_test)
0.77371996006718879

So the score is about 0.77 (for regressors, score() returns the coefficient of determination R2 discussed below, not a classification accuracy), which is an acceptable result considering the non-linearity of the original dataset, but it can also be influenced by the particular split produced by train_test_split (as in our case). For k-fold cross-validation, we can instead use the function cross_val_score(), which works with all scikit-learn estimators (both classifiers and regressors). The scoring parameter is very important because it determines which metric will be adopted for tests. As LinearRegression works with ordinary least squares, we preferred the negative mean squared error, which is a cumulative measure that must be evaluated according to the actual values (it's not relative):

from sklearn.model_selection import cross_val_score

>>> scores = cross_val_score(lr, boston.data, boston.target, cv=7, scoring='neg_mean_squared_error')
>>> scores
array([ -11.32601065, -10.96365388, -32.12770594, -33.62294354,
-10.55957139, -146.42926647, -12.98538412])

>>> scores.mean()
-36.859219426420601
>>> scores.std()
45.704973900600457
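
To make the sign convention concrete, here is a minimal sketch of how a negative mean squared error can be computed by hand. The function name and the sample values are made up for illustration and are not taken from the Boston dataset:

```python
def neg_mean_squared_error(y_true, y_pred):
    """Return the mean of the squared errors with the sign flipped."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    # The sign is flipped so that a larger (less negative) value means a better fit
    return -sum(squared_errors) / len(squared_errors)


# Hypothetical targets and predictions
y_true = [24.0, 21.5, 30.0, 18.0]
y_pred = [22.0, 21.5, 33.0, 19.0]

score = neg_mean_squared_error(y_true, y_pred)
print(score)  # -3.5
```

Because the errors are squared, the value is expressed in squared units of the target, which is why it has to be read against the actual range of the output values.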

A very important metric used in regressions is the coefficient of determination, or R2, which is also the value returned by score(). It measures the proportion of the variance in the target that is explained by the model. We define the residuals as the following quantity:
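
In the usual notation (an assumption, since the symbols are not fixed by the text), with $y_i$ the observed target of the i-th sample and $\hat{y}_i$ the corresponding prediction of the model, the residual can be written as:

```latex
r_i = y_i - \hat{y}_i
```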

In other words, it is the difference between the sample and the prediction. So the R2 is defined as follows:
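
Assuming the standard definition, with $y_i$ the observed targets, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the observed targets over the n samples:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

The numerator is the sum of the squared residuals, while the denominator is the total variation of the targets around their mean.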

For our purposes, R2 values close to 1 mean an almost perfect regression, while values close to 0 (or negative) imply a bad model. Using this metric is quite easy with cross-validation:

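>>> # X and Y are the feature matrix and the target vector; one R2 value is returned per fold, so we average them
>>> cross_val_score(lr, X, Y, cv=10, scoring='r2').mean()
0.75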
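
The behavior of R2 near 1, at 0, and below 0 can be checked with a small hand-written sketch. The function r2_score_manual is a hypothetical helper that follows the standard definition (one minus the ratio between the residual sum of squares and the total sum of squares), and the values are made up for illustration:

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Sum of squared residuals (sample minus prediction)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total variation of the targets around their mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score_manual(y_true, [1.0, 2.0, 3.0, 4.0])   # exact predictions
baseline = r2_score_manual(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
reversed_fit = r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean

print(perfect, baseline, reversed_fit)  # 1.0 0.0 -3.0
```

A model that always predicts the mean of the targets scores 0, so a negative value means the regression does worse than that trivial baseline.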