
Linear regression with scikit-learn and higher dimensionality

scikit-learn offers the class LinearRegression, which works with n-dimensional spaces. For this purpose, we're going to use the Boston dataset:

from sklearn.datasets import load_boston

>>> boston = load_boston()

>>> boston.data.shape
(506L, 13L)
>>> boston.target.shape
(506L,)

It has 506 samples with 13 input features and one output. In the following figure, there's a collection of plots of the first 12 features:

When working with datasets, it's useful to have a tabular view to manipulate data. pandas is well suited to this task, and even though it's beyond the scope of this book, I suggest you create a data frame with the command pandas.DataFrame(boston.data, columns=boston.feature_names) and use Jupyter to visualize it. For further information, refer to Heydt M., Learning pandas - Python Data Discovery and Analysis Made Easy, Packt.
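As a minimal sketch of the tabular view suggested above, the snippet below builds a data frame from a small made-up array. The values are placeholders, and the three column names are only borrowed from the Boston feature list for illustration; with the real dataset, one would pass boston.data and boston.feature_names instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for boston.data: 4 samples, 3 features (made-up values)
data = np.array([
    [0.1, 18.0, 2.3],
    [0.2, 0.0, 7.1],
    [0.3, 0.0, 7.1],
    [0.4, 12.5, 2.2],
])
# Hypothetical stand-in for boston.feature_names
feature_names = ['CRIM', 'ZN', 'INDUS']

df = pd.DataFrame(data, columns=feature_names)

# describe() summarizes each column (count, mean, std, min, quartiles, max),
# which makes differences in scale between features easy to spot
summary = df.describe()
print(df.head())
print(summary.loc[['mean', 'std']])
```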

There are different scales and outliers (which can be removed using the methods studied in the previous chapters), so it's better to ask the model to normalize the data before processing it. Moreover, for testing purposes, we split the original dataset into training (90%) and test (10%) sets:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

>>> X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, test_size=0.1)

>>> lr = LinearRegression(normalize=True)
>>> lr.fit(X_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
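Recent scikit-learn releases no longer provide load_boston or the normalize argument of LinearRegression, so the session above may not run unchanged on a current installation. The following sketch shows one possible equivalent workflow. It assumes a synthetic dataset generated with make_regression (a stand-in, not the Boston data) and replaces normalize=True with a StandardScaler step in a pipeline, which is a different scaling from the one the old argument applied.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the Boston data (506 x 13)
X, Y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=1000)

# 90% training, 10% test; random_state fixes the split for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.1, random_state=1000)

# The scaler is fitted on the training data only, then reused for the test data
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, Y_train)

print(X_train.shape, X_test.shape)
print(model.score(X_test, Y_test))
```

Placing the scaler inside the pipeline keeps the test samples out of the statistics used for scaling, which matters even more when the same model is passed to a cross-validation routine.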

When the original dataset isn't large enough, splitting it into training and test sets may reduce the number of samples that can be used for fitting the model. k-fold cross-validation addresses this problem with a different strategy. The whole dataset is split into k folds; at each iteration, k-1 folds are used for training and the remaining one to validate the model. k iterations are performed, each time with a different validation fold. In the following figure, there's an example with 3 folds/iterations:

In this way, the final score can be determined as the average of all the values, and every sample is selected for training k-1 times.
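The splitting scheme described above can be sketched in a few lines of plain Python. The helper k_fold_indices is a hypothetical name used only for this illustration (scikit-learn provides its own KFold class for real use), and it assumes contiguous, unshuffled folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Distribute the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Example with 9 samples and 3 folds
folds = list(k_fold_indices(9, 3))
for i, (train, validation) in enumerate(folds):
    print('Iteration', i + 1, 'train:', train, 'validation:', validation)

# Count how many times each sample lands in a training set: always k - 1
train_counts = [sum(idx in train for train, _ in folds) for idx in range(9)]
print(train_counts)
```

Every sample appears in exactly one validation fold and in the training set of the other k-1 iterations, which is the property stated in the text.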

To check the accuracy of a regression, scikit-learn provides the method score(X, y), which evaluates the model on test data and, for regressors, returns the coefficient of determination R2 (discussed later in this section):

>>> lr.score(X_test, Y_test)
0.77371996006718879

So the score on the test set is about 0.77, which is an acceptable result considering the non-linearity of the original dataset, but it can also be influenced by the subdivision made by train_test_split (as in our case). For k-fold cross-validation, we can instead use the function cross_val_score(), which works with all scikit-learn estimators, not only classifiers. The scoring parameter is very important because it determines which metric will be adopted for the tests. As LinearRegression works with ordinary least squares, we preferred the negative mean squared error, which is a cumulative measure that must be evaluated according to the actual values (it's not relative):

from sklearn.model_selection import cross_val_score

>>> scores = cross_val_score(lr, boston.data, boston.target, cv=7, scoring='neg_mean_squared_error')
>>> scores
array([ -11.32601065, -10.96365388, -32.12770594, -33.62294354,
        -10.55957139, -146.42926647, -12.98538412])

>>> scores.mean()
-36.859219426420601
>>> scores.std()
45.704973900600457
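Because the score is a negated mean squared error, its magnitude is expressed in squared target units. One quick way to read the values, sketched here with the fold scores printed above copied into a plain list, is to flip the sign and take the square root, which gives a per-fold root mean squared error in the same units as the target.

```python
import math

# Fold scores copied from the session above (negative mean squared errors)
scores = [-11.32601065, -10.96365388, -32.12770594, -33.62294354,
          -10.55957139, -146.42926647, -12.98538412]

# Negate to recover the MSE, then take the square root to get the RMSE
rmse_per_fold = [math.sqrt(-s) for s in scores]
mean_mse = -sum(scores) / len(scores)

print([round(r, 2) for r in rmse_per_fold])
print(round(mean_mse, 2))
```

Read this way, six folds have an error between roughly 3 and 6, while the sixth fold is at about 12, which is what drives the large standard deviation of the scores.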

Another very important metric used in regressions is called the coefficient of determination or R2. It measures the proportion of the variance of the target that is explained by the model's predictions. We define the residuals as the following quantity:

In other words, it is the difference between the sample and the prediction. So the R2 is defined as follows:
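In the usual notation, and assuming that y_i denotes the i-th observed target, ỹ_i the corresponding prediction, and ȳ the mean of the observed targets over n samples (these symbol names are chosen here for illustration), the residual and the coefficient of determination can be written as:

```latex
r_i = y_i - \tilde{y}_i
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} r_i^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```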

For our purposes, R2 values close to 1 mean an almost perfect regression, while values close to 0 (or negative) imply a bad model. Using this metric is quite easy with cross-validation:

>>> cross_val_score(lr, X, Y, cv=10, scoring='r2').mean()
0.75
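To make the definition concrete, the following sketch computes R2 from the residuals in plain Python on a tiny made-up example. The function name r2_from_residuals and all the numbers are hypothetical, and the code assumes the usual definition (one minus the ratio between the residual sum of squares and the total sum of squares).

```python
def r2_from_residuals(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residuals: difference between each sample and its prediction
    residuals = [y - p for y, p in zip(y_true, y_pred)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]       # made-up targets
good_pred = [3.1, 4.9, 7.2, 8.8]    # close to the targets
mean_pred = [6.0, 6.0, 6.0, 6.0]    # always predicts the mean
bad_pred = [9.0, 7.0, 5.0, 3.0]     # reversed trend

print(r2_from_residuals(y_true, good_pred))
print(r2_from_residuals(y_true, mean_pred))
print(r2_from_residuals(y_true, bad_pred))
```

Predictions close to the targets give a value near 1, always predicting the mean gives 0, and a model that does worse than the mean gives a negative value.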