- Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
- Tarek Amr
Getting a more reliable score
The Iris dataset is a small set of just 150 samples. When we randomly split it into training and test sets, we ended up with 45 instances in the test set. With such a small number, we may have variations in the distribution of our targets. For example, when I randomly split the data, I got 13 samples from class 0 and 16 samples from each of the two other classes in my test set. Knowing that class 0 is easier to predict than the other two classes in this particular dataset, we can tell that had I been luckier and had more class 0 samples in the test set, I would have had a higher score. Furthermore, decision trees are very sensitive to data changes, and you may get a very different tree with every slight change in your training data.
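You can see this variation for yourself. The following is a minimal, self-contained sketch that rebuilds the df DataFrame used throughout this chapter; running two unseeded splits will typically give you different class counts in the test set:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Two unseeded splits; the test-set class counts usually differ
for i in range(2):
    _, df_test = train_test_split(df, test_size=0.3)
    print('Split', i, ':', df_test['target'].value_counts().to_dict())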
What can we do to get a more reliable score?
A statistician would say: let's run the whole process of data splitting, training, and predicting more than once, and look at the distribution of the accuracy scores we get each time. The following code does exactly that for 100 iterations:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# A list to store the score from each iteration
accuracy_scores = []
After importing the required modules and defining an accuracy_scores list to store the scores we are going to get with each iteration, it is time to write a for loop that freshly splits the data and recalculates the classifier's accuracy at each iteration:
for _ in range(100):
    # At each iteration, we freshly split our data
    df_train, df_test = train_test_split(df, test_size=0.3)
    x_train = df_train[iris.feature_names]
    x_test = df_test[iris.feature_names]
    y_train = df_train['target']
    y_test = df_test['target']
    # We then create a new classifier
    clf = DecisionTreeClassifier()
    # And use it for training and prediction
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    # Finally, we append the score to our list
    accuracy_scores.append(round(accuracy_score(y_test, y_pred), 3))

# Better to convert accuracy_scores from a list into a series;
# a pandas Series provides the statistical methods we use later
accuracy_scores = pd.Series(accuracy_scores)
The following snippet lets us plot the accuracy's distribution using a box plot:
accuracy_scores.plot(
    title='Distribution of classifier accuracy',
    kind='box',
)

print(
    'Average Score: {:.3} [5th percentile: {:.3} & 95th percentile: {:.3}]'.format(
        accuracy_scores.mean(),
        accuracy_scores.quantile(.05),
        accuracy_scores.quantile(.95),
    )
)
This will give us the following graphical analysis of the accuracy. Your results might vary slightly due to the random splits of the training and test sets and the random initial settings of the decision trees. Almost all scikit-learn modules support a pseudo-random number generator that can be initialized via a random_state hyperparameter, which can be used to enforce code reproducibility. Nevertheless, I deliberately ignored it this time to show how the model's results may vary from one run to another, and to show the importance of estimating the distribution of your models' errors via iterations:
[Figure: box plot showing the distribution of classifier accuracy across the 100 iterations]
Box plots are good at showing distributions. Rather than having a single number, we now have an estimate of the best- and worst-case scenarios of our classifier's performance.
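If you do want reproducible runs instead, here is a minimal sketch of pinning the random_state hyperparameter mentioned above (the value 42 is an arbitrary choice of seed):
# Pinning random_state makes both the split and the tree deterministic,
# so repeated runs produce identical scores
df_train, df_test = train_test_split(df, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(random_state=42)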
ShuffleSplit
Generating different train and test splits is called cross-validation. This helps us get a more reliable estimate of our model's accuracy. What we did in the previous section is one of many cross-validation strategies, called repeated random sub-sampling validation, or Monte Carlo cross-validation.
scikit-learn's ShuffleSplit class provides us with the functionality to perform Monte Carlo cross-validation. Rather than us splitting the data ourselves, ShuffleSplit gives us lists of indices to use for splitting our data. In the following code, we are going to use the DataFrame's loc indexer and the indices we get from ShuffleSplit to randomly split the dataset into 100 training and test pairs:
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
accuracy_scores = []
# Create a shuffle split instance
rs = ShuffleSplit(n_splits=100, test_size=0.3)
# We now get 100 pairs of indices
for train_index, test_index in rs.split(df):
    x_train = df.loc[train_index, iris.feature_names]
    x_test = df.loc[test_index, iris.feature_names]
    y_train = df.loc[train_index, 'target']
    y_test = df.loc[test_index, 'target']
    clf = DecisionTreeClassifier()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    accuracy_scores.append(round(accuracy_score(y_test, y_pred), 3))

accuracy_scores = pd.Series(accuracy_scores)
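To see what ShuffleSplit actually yields, here is a quick sketch; the random_state value is an arbitrary assumption to make the peek repeatable:
# Peek at the first pair of index arrays ShuffleSplit generates
rs_demo = ShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_index, test_index = next(rs_demo.split(df))
print(len(train_index), len(test_index))  # 105 and 45 for the 150-sample Iris set
print(test_index[:5])  # the first five test-set row positions
Note that these are positional indices; using them with loc works in our code only because df has a default RangeIndex running from 0 to 149.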
Alternatively, we can simplify the preceding code even further by using scikit-learn's cross_validate functionality. This time, we are not even splitting the data into training and test sets ourselves. We will give cross_validate the x and y values for the entire set, and then give it our ShuffleSplit instance for it to use internally to split the data. We also give it the classifier and specify what kind of scoring metric to use. When done, it will give us back a list with the calculated test set scores:
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
clf = DecisionTreeClassifier()
rs = ShuffleSplit(n_splits=100, test_size=0.3)
x = df[iris.feature_names]
y = df['target']
cv_results = cross_validate(
    clf, x, y, cv=rs, scoring='accuracy'
)
accuracy_scores = pd.Series(cv_results['test_score'])
We can now plot the resulting series of accuracy scores to get the same box plot as earlier. Cross-validation is recommended when dealing with a small dataset, since a group of accuracy scores gives us a better understanding of the classifier's performance than a single score calculated after a single trial.
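For completeness, here is a minimal sketch of inspecting these scores; describe() is one convenient way a pandas Series can summarize the distribution numerically:
# Same box plot as before, plus a quick numerical summary
accuracy_scores.plot(title='Distribution of classifier accuracy', kind='box')
print(accuracy_scores.describe())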