
Running the algorithm

The previous results are quite good, based on our testing set of data. However, what happens if we simply got lucky and chose an easy testing set? Alternatively, what if the testing set was particularly troublesome? We could end up discarding a good model due to poor results caused by such an unlucky split of our data.

The cross-fold validation framework is a way to address the problem of choosing a single testing set, and it is a standard best-practice methodology in data mining. The process works by performing many experiments with different training and testing splits, while using each sample in a testing set only once. The procedure is as follows:

  1. Split the entire dataset into several sections called folds.
  2. For each fold in the data, execute the following steps:
    1. Set that fold aside as the current testing set
    2. Train the algorithm on the remaining folds
    3. Evaluate on the current testing set
  3. Report on all the evaluation scores, including the average score.

In this process, each sample is used in the testing set only once, reducing (but not eliminating) the likelihood of choosing lucky testing sets.
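To make the procedure concrete, here is a minimal sketch of those steps written out by hand, assuming that estimator, X, and y are the classifier and dataset created earlier in the chapter (in practice we will use the helper function described next rather than this loop):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

kf = KFold(n_splits=5)                      # 1. split the dataset into folds
fold_scores = []
for train_index, test_index in kf.split(X):
    # 2.1 set the current fold aside as the testing set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # 2.2 train the algorithm on the remaining folds
    estimator.fit(X_train, y_train)
    # 2.3 evaluate on the current testing set
    fold_scores.append(accuracy_score(y_test, estimator.predict(X_test)))

# 3. report all the evaluation scores, including the average
print(fold_scores)
print("Average accuracy: {0:.1f}%".format(np.mean(fold_scores) * 100))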

Throughout this book, the code examples build upon each other within a chapter. Each chapter's code should be entered into the same Jupyter Notebook unless otherwise specified in-text.

The scikit-learn library contains a few cross-fold validation methods, including a helper function that performs the preceding procedure. We can import it now in our Jupyter Notebook:

from sklearn.model_selection import cross_val_score

By default, cross_val_score uses a specific methodology called Stratified K-Fold to create folds that have approximately the same proportion of classes in each fold, again reducing the likelihood of choosing poor folds. Stratified K-Fold is a great default, so we won't mess with it right now.
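If we did want to control the folding explicitly, one option (shown here only as a sketch, and not needed for the rest of this chapter) is to construct a StratifiedKFold object ourselves and pass it in through the cv parameter:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Build the folds explicitly; each fold keeps roughly the same class proportions
skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(estimator, X, y, scoring='accuracy', cv=skf)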

Next, we use this new function to evaluate our model using cross-fold validation:

# estimator, X, and y were defined earlier in this chapter's notebook
scores = cross_val_score(estimator, X, y, scoring='accuracy')
average_accuracy = np.mean(scores) * 100
print("The average accuracy is {0:.1f}%".format(average_accuracy))

Our new code returns a slightly more modest result of 82.3 percent, but it is still quite good considering we have not yet tried setting better parameters. In the next section, we will see how to change the parameters to achieve a better outcome.

Some variation in results is quite natural when performing data mining and attempting to repeat experiments. This is due to variations in how the folds are created and to the randomness inherent in some classification algorithms. We can deliberately choose to replicate an experiment exactly by setting the random state (which we will do in later chapters). In practice, it's a good idea to rerun experiments multiple times to get a sense of the average result and the spread of the results (the mean and standard deviation) across all experiments, as sketched below.
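As a rough sketch of that practice, again assuming estimator, X, and y from earlier in the chapter, we could rerun the evaluation with a different random state for the fold creation each time and summarize the results:

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

all_scores = []
for random_state in range(10):
    # shuffle=True lets random_state change how the folds are created each run
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    scores = cross_val_score(estimator, X, y, scoring='accuracy', cv=skf)
    all_scores.append(np.mean(scores))

print("Mean accuracy: {0:.1f}%".format(np.mean(all_scores) * 100))
print("Standard deviation: {0:.1f}%".format(np.std(all_scores) * 100))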
