
Evaluating the model

To evaluate an algorithm, we need to judge its performance on data that was not used to train the model. For this reason, it's common to split the data into a training set and a test set. The training set is used to train the model, which means that it's used to find the parameters of our algorithm. For example, training a decision tree determines the variables and threshold values that define the splits at each branch of the tree. The test set must remain totally hidden from the training. That means that all operations such as feature engineering or feature scaling must be fitted on the training set only and then applied to the test set, as in the following example.

Usually, the training set will be 70-80% of the dataset, while the test set will be the rest:

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn import datasets

# import some data
iris = datasets.load_iris()

# hold out 30% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# fit the scaler on the training set only, then apply it to both sets
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)

clf = LinearRegression().fit(X_train_transformed, y_train)

predictions = clf.predict(X_test_transformed)

print('Predictions: ', predictions)
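
Printing the predictions does not yet tell us how good they are. As a minimal sketch, assuming we keep the same data and model, the scaler and the regressor can be wrapped in a scikit-learn pipeline so that the scaling is always fitted on the training set only, and the held-out test set can then be scored. The choice of mean squared error and R^2 as metrics is an assumption made for illustration, not something the text prescribes.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# The pipeline fits the scaler on the training data only, then reuses
# the learned mean and standard deviation when predicting on test data.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Score the model on data it has never seen
test_mse = mean_squared_error(y_test, model.predict(X_test))
test_r2 = model.score(X_test, y_test)  # score() returns R^2 for regressors
print('Test MSE:', test_mse)
print('Test R^2:', test_r2)
```

Comparing these numbers with the same metrics computed on the training set gives a first indication of overfitting.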

The most common way to evaluate a supervised learning algorithm offline is cross-validation. This technique consists of dividing the dataset into training and test parts multiple times, each time using one part for training and the other for testing.

This allows us not only to check for overfitting but also to evaluate the variance in our loss across the different splits.
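
A minimal sketch of cross-validation with scikit-learn follows, assuming the same iris data and linear model as before. The number of folds (5), the shuffling, and the use of mean squared error as the loss are assumptions chosen for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# The scaler is part of the pipeline, so it is refitted on the
# training part of every fold and never sees that fold's test part.
model = make_pipeline(StandardScaler(), LinearRegression())

# The iris samples are ordered by class, so shuffle before splitting
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, iris.data, iris.target, cv=cv,
                         scoring='neg_mean_squared_error')

# scikit-learn returns the negated MSE so that higher is always better
mse_per_fold = -scores
print('MSE per fold:', mse_per_fold)
print('Mean:', mse_per_fold.mean(), 'Std:', mse_per_fold.std())
```

The mean estimates the expected loss on unseen data, while the standard deviation shows how much that loss varies from one split to another.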

For problems where it's not possible to randomly divide the data, such as in a time series, scikit-learn has other splitting methods, such as the TimeSeriesSplit class.
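
As a small sketch of how TimeSeriesSplit behaves, the following uses a toy array of 12 ordered samples (the data and the number of splits are assumptions for illustration). Each split trains on an initial segment of the series and tests on the samples that come right after it, so the model is never evaluated on data older than what it was trained on.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series: 12 observations in chronological order
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    # Training indices always precede the test indices
    print('train:', train_idx, 'test:', test_idx)
```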

In Keras, it's possible to hold out a fraction of the training data for validation directly during fit; here, 20% of the samples are set aside:

hist = model.fit(x, y, validation_split=0.2)

If the data does not fit in memory, it's also possible to use train_on_batch and test_on_batch, which train and evaluate the model on one batch of samples at a time.

For image data, Keras can also use the folder structure to define the training and validation sets and to assign the labels. To accomplish this, use the flow_from_directory method of ImageDataGenerator, which loads the images and infers each label from the name of its subfolder. We will need to have the following directory structure:

data/
    train/
        category1/
            001.jpg
            002.jpg
            ...
        category2/
            003.jpg
            004.jpg
            ...
    validation/
        category1/
            0011.jpg
            0022.jpg
            ...
        category2/
            0033.jpg
            0044.jpg
            ...

Use the following method, shown here with example argument values:

flow_from_directory(directory, target_size=(96, 96), color_mode='rgb', classes=None, class_mode='categorical', batch_size=128, shuffle=True, seed=11, save_to_dir=None, save_prefix='output', save_format='jpg', follow_links=False, subset=None, interpolation='nearest')