Linear regression with scikit-learn and higher dimensionality

scikit-learn offers the class LinearRegression, which works with n-dimensional spaces. For this purpose, we're going to use the Boston dataset:

from sklearn.datasets import load_boston

>>> boston = load_boston()

>>> boston.data.shape
(506, 13)
>>> boston.target.shape
(506,)

It has 506 samples with 13 input features and one output. The following figure shows a collection of plots of the first 12 features:

When working with datasets, it's useful to have a tabular view to manipulate data. pandas is a perfect framework for this task, and even though it's beyond the scope of this book, I suggest you create a data frame with the command pandas.DataFrame(boston.data, columns=boston.feature_names) and use Jupyter to visualize it. For further information, refer to Heydt M., Learning pandas - Python Data Discovery and Analysis Made Easy, Packt.
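
As a minimal sketch of the tabular view suggested above: load_boston was removed in scikit-learn 1.2, so this example assumes the bundled diabetes dataset as a stand-in. The same call pattern applies to any dataset object exposing data and feature_names; the column name target is arbitrary.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Stand-in dataset (assumption): any Bunch with .data and .feature_names works
dataset = load_diabetes()

# Wrap the NumPy array in a DataFrame so that columns carry the feature names
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# Add the target as an extra column (the name 'target' is arbitrary)
df['target'] = dataset.target

# Per-column summary statistics (count, mean, std, min, quartiles, max)
summary = df.describe()
print(df.head())
print(summary)
```

In a Jupyter notebook, evaluating df or summary as the last expression of a cell renders the table directly.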

The features have different scales and contain outliers (which can be removed using the methods studied in the previous chapters), so it's better to ask the model to normalize the data before processing it. Moreover, for testing purposes, we split the original dataset into training (90%) and test (10%) sets:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

>>> X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, test_size=0.1)

>>> lr = LinearRegression(normalize=True)
>>> lr.fit(X_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
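
In scikit-learn 1.2 and later, the normalize parameter of LinearRegression is no longer available. A rough equivalent is sketched below, assuming a StandardScaler inside a pipeline (a related but not identical scaling) and synthetic data from make_regression as a stand-in for the Boston dataset. The sample and feature counts only mirror the shapes shown above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 506 samples, 13 features, noisy linear target
X, Y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# 90% training / 10% test; random_state makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.1, random_state=0)

# The scaler is fitted on the training data only, then applied before the regression
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, Y_train)

# One prediction per test sample
predictions = model.predict(X_test)
print(X_train.shape, X_test.shape, predictions.shape)
```

Keeping the scaler inside the pipeline means the test samples never influence the scaling statistics.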

When the original dataset isn't large enough, splitting it into training and test sets may reduce the number of samples that can be used for fitting the model. k-fold cross-validation can help in solving this problem with a different strategy. The whole dataset is split into k folds, always using k-1 folds for training and the remaining one to validate the model. k iterations are performed, each time using a different validation fold. In the following figure, there's an example with 3 folds/iterations:

In this way, the final score can be determined as the average of all the values, and every sample is selected for training k-1 times.
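
The splitting scheme described above can be sketched in plain Python. The helper name k_fold_indices is hypothetical, and the sketch assumes contiguous, unshuffled folds; scikit-learn's KFold class implements the same idea with more options.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        # Every sample outside the current validation fold is used for training
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# Example with 9 samples and 3 folds
folds = list(k_fold_indices(9, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each index appears in exactly one validation fold and in the training part of the other k-1 iterations.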

To check the quality of a regression, scikit-learn provides the method score(X, y), which evaluates the model on test data:

>>> lr.score(X_test, Y_test)
0.77371996006718879

So the score is about 0.77. Note that, for a regressor, score() returns the coefficient of determination (discussed later in this section), not a classification accuracy. It's an acceptable result considering the non-linearity of the original dataset, but it can also be influenced by the particular subdivision made by train_test_split (as in our case). For k-fold cross-validation, we can instead use the function cross_val_score(), which works with all scikit-learn estimators. The scoring parameter is very important because it determines which metric is adopted for the tests. As LinearRegression works with ordinary least squares, we prefer the negative mean squared error (scikit-learn negates the error so that higher scores are always better). It is an absolute measure that must be interpreted against the scale of the actual target values (it's not relative):

from sklearn.model_selection import cross_val_score

>>> scores = cross_val_score(lr, boston.data, boston.target, cv=7, scoring='neg_mean_squared_error')
>>> scores
array([ -11.32601065,  -10.96365388,  -32.12770594,  -33.62294354,
        -10.55957139, -146.42926647,  -12.98538412])

>>> scores.mean()
-36.859219426420601
>>> scores.std()
45.704973900600457
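
Because these values are negated mean squared errors, it can help to convert them into root mean squared errors, which are expressed in the same units as the target. The sketch below reuses the seven fold scores printed above; the variable names are illustrative.

```python
import math

# Fold scores copied from the cross_val_score() output above (negated MSE)
neg_mse_scores = [-11.32601065, -10.96365388, -32.12770594, -33.62294354,
                  -10.55957139, -146.42926647, -12.98538412]

# Flip the sign to recover the MSE, then take the square root of each fold
rmse_per_fold = [math.sqrt(-s) for s in neg_mse_scores]

# Average MSE over the folds (the sign-flipped value of scores.mean())
mean_mse = -sum(neg_mse_scores) / len(neg_mse_scores)

print("RMSE per fold:", [round(r, 2) for r in rmse_per_fold])
print("Mean MSE:", round(mean_mse, 2))
print("Worst fold RMSE:", round(max(rmse_per_fold), 2))
```

The sixth fold has a much larger error than the others, which is what drives the high standard deviation of the scores.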

Another very important metric used in regressions is called the coefficient of determination or R². It measures the proportion of the variance of the target that is explained by the model. We define the residuals as the following quantity:

$$r_i = y_i - \hat{y}_i$$

In other words, a residual is the difference between the sample and the prediction. So R² is defined as follows:

$$R^2 = 1 - \frac{\sum_i r_i^2}{\sum_i (y_i - \bar{y})^2}$$

Here, $\bar{y}$ is the mean of the samples. For our purposes, R² values close to 1 mean an almost perfect regression, while values close to 0 (or negative) imply a bad model. Using this metric is quite easy with cross-validation:

>>> cross_val_score(lr, X, Y, cv=10, scoring='r2').mean()
0.75
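
As a worked example, the coefficient of determination can be computed by hand. This is a minimal sketch assuming the standard formulation (one minus the ratio between the residual sum of squares and the total sum of squares); the function name r2 and the sample values are hypothetical.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: squared differences between samples and predictions
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations of the samples from their mean
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Predictions close to the samples give a value near 1
good = r2(y_true, y_pred)
# Predicting the mean for every sample gives 0
baseline = r2(y_true, [sum(y_true) / len(y_true)] * len(y_true))
# A model worse than the mean predictor gives a negative value
bad = r2(y_true, [7.0, 8.0, -1.0, -3.0])
print(good, baseline, bad)
```

The three cases match the interpretation given above: values near 1 for a good fit, 0 for a model no better than the mean, and negative values for a model that is worse.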