
Getting a more reliable score

The Iris dataset is a small set of just 150 samples. When we randomly split it into training and test sets, we ended up with 45 instances in the test set. With such a small number, we may see variations in the distribution of our targets. For example, when I randomly split the data, I got 13 samples from class 0 and 16 samples from each of the other two classes in my test set. Knowing that class 0 is easier to predict than the other two classes in this particular dataset, we can tell that if I had been luckier and had more samples of class 0 in the test set, I would have gotten a higher score. Furthermore, decision trees are very sensitive to data changes, and you may get a very different tree with every slight change in your training data.
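To see this variation for yourself, you can check the class counts of a single random split. This is just a quick sketch that reuses the df DataFrame we prepared earlier; since the split is random, the exact counts you get will differ:

from sklearn.model_selection import train_test_split

# One random split; re-running this will give different counts
df_train, df_test = train_test_split(df, test_size=0.3)
print(df_test['target'].value_counts())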

What can we do now to get a more reliable score?

A statistician would say: let's run the whole process of data splitting, training, and predicting more than once, and look at the distribution of the accuracy scores we get each time. The following code does exactly that for 100 iterations:

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# A list to store the score from each iteration
accuracy_scores = []

After importing the required modules and defining an accuracy_scores list to store the scores we are going to get with each iteration, it is time to write a for loop that freshly splits the data and recalculates the classifier's accuracy at each iteration:

for _ in range(100):

    # At each iteration we freshly split our data
    df_train, df_test = train_test_split(df, test_size=0.3)
    x_train = df_train[iris.feature_names]
    x_test = df_test[iris.feature_names]

    y_train = df_train['target']
    y_test = df_test['target']

    # We then create a new classifier
    clf = DecisionTreeClassifier()

    # And use it for training and prediction
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)

    # Finally, we append the score to our list
    accuracy_scores.append(round(accuracy_score(y_test, y_pred), 3))

# Better to convert accuracy_scores from a list into a series;
# a pandas Series provides statistical methods we can use later
accuracy_scores = pd.Series(accuracy_scores)

The following snippet lets us plot the accuracy's distribution using a box plot:

accuracy_scores.plot(
    title='Distribution of classifier accuracy',
    kind='box',
)

print(
    'Average Score: {:.3} [5th percentile: {:.3} & 95th percentile: {:.3}]'.format(
        accuracy_scores.mean(),
        accuracy_scores.quantile(.05),
        accuracy_scores.quantile(.95),
    )
)

This will give us the following graphical analysis of the accuracy. Your results might vary slightly due to the random split of the training and test sets and the random initial settings of the decision trees. Almost all scikit-learn modules support a pseudo-random number generator that can be initialized via a random_state hyperparameter, which can be used to enforce code reproducibility. Nevertheless, I deliberately ignored it this time to show how a model's results may vary from one run to the next, and to show the importance of estimating the distribution of your model's errors via iterations:

Box plots are good at showing distributions. Rather than having a single number, we now have an estimation of the best- and the worst-case scenarios of our classifier's performance.
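If you do want run-to-run reproducibility, a minimal sketch would be to pass random_state to both the splitter and the classifier; the seed value of 42 here is an arbitrary choice:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Fixing the seeds makes both the split and the tree deterministic;
# 42 is an arbitrary value
df_train, df_test = train_test_split(df, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(random_state=42)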

If, at any point, you do not have access to NumPy, you can still calculate a sample's mean and standard deviation using the mean() and stdev() functions provided by Python's built-in statistics module. It also provides functions for calculating the geometric and harmonic means, as well as the median and quantiles.
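For example, here is a quick sketch applying those functions to a plain Python list of the scores (converting the pandas Series from the previous snippet back into a list first):

import statistics

# Plain list of floats; list() converts the Series from above
scores = list(accuracy_scores)

print('Mean:', statistics.mean(scores))
print('Standard deviation:', statistics.stdev(scores))
print('Median:', statistics.median(scores))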

ShuffleSplit

Repeatedly generating different train and test splits is called cross-validation. It helps us get a more reliable estimate of our model's accuracy. What we did in the previous section is one of many cross-validation strategies, called repeated random sub-sampling validation, or Monte Carlo cross-validation.

In probability theory, the law of large numbers states that if we repeat the same experiment a large number of times, the average of the results obtained should be close to the expected outcome. Monte Carlo methods make use of random sampling to repeat an experiment over and over, reaching better estimates of the results thanks to the law of large numbers. They were only made practical by the advent of computers, and here we use the same idea to repeat the training/test split over and over to reach a better estimate of the model's accuracy.
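As a toy illustration of that law (unrelated to Iris), the average of simulated fair-die rolls gets closer to the expected value of 3.5 as the number of rolls grows:

import random

# The more rolls we average, the closer we get to the expected value of 3.5
for n in (10, 1_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, sum(rolls) / n)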

scikit-learn's ShuffleSplit module provides us with the functionality to perform Monte Carlo cross-validation. Rather than splitting the data ourselves, ShuffleSplit gives us lists of indices to use for splitting our data. In the following code, we are going to use the DataFrame's loc indexer and the indices we get from ShuffleSplit to randomly split the dataset into 100 training and test pairs:

import pandas as pd

from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

accuracy_scores = []

# Create a shuffle split instance
rs = ShuffleSplit(n_splits=100, test_size=0.3)

# We now get 100 pairs of indices
for train_index, test_index in rs.split(df):

    x_train = df.loc[train_index, iris.feature_names]
    x_test = df.loc[test_index, iris.feature_names]

    y_train = df.loc[train_index, 'target']
    y_test = df.loc[test_index, 'target']

    clf = DecisionTreeClassifier()

    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)

    accuracy_scores.append(round(accuracy_score(y_test, y_pred), 3))

# Convert the list of scores into a pandas Series, as before
accuracy_scores = pd.Series(accuracy_scores)

Alternatively, we can simplify the preceding code even further by using scikit-learn's cross_validate functionality. This time, we are not even splitting the data into training and test sets ourselves. We give cross_validate the x and y values for the entire set, along with our ShuffleSplit instance for it to use internally to split the data. We also give it the classifier and specify which scoring metric to use. When done, it gives us back a list with the calculated test set scores:

import pandas as pd

from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

clf = DecisionTreeClassifier()
rs = ShuffleSplit(n_splits=100, test_size=0.3)

x = df[iris.feature_names]
y = df['target']

cv_results = cross_validate(
    clf, x, y, cv=rs, scoring='accuracy'
)

accuracy_scores = pd.Series(cv_results['test_score'])

We can now plot the resulting series of accuracy scores to get the same box plot as earlier. Cross-validation is recommended when dealing with a small dataset, since a collection of accuracy scores gives us a better understanding of the classifier's performance than a single score calculated after a single trial.
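For reference, the same plotting snippet used earlier works unchanged on this new series:

accuracy_scores.plot(
    title='Distribution of classifier accuracy',
    kind='box',
)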
