Tuning the hyperparameters for higher accuracy

Now that we have learned how to evaluate the model's accuracy more reliably using the ShuffleSplit cross-validation method, it is time to test our earlier hypothesis: would a smaller tree be more accurate?

Here is what we are going to do in the following subsections:

  1. Split the data into training and test sets.
  2. Set the test set aside for now.
  3. Limit the tree's growth using different values of max_depth.
  4. For each max_depth setting, we will use the ShuffleSplit cross-validation method on the training set to estimate the classifier's accuracy.
  5. Once we decide which value to use for max_depth, we will train the algorithm one last time on the entire training set and predict on the test set.

Splitting the data

Here is the usual code for splitting the data into training and test sets:

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.25)

x_train = df_train[iris.feature_names]
x_test = df_test[iris.feature_names]

y_train = df_train['target']
y_test = df_test['target']
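
By default, train_test_split shuffles and splits the data differently on every run. If you want a reproducible split that also preserves the class proportions of the Iris dataset, the function accepts random_state and stratify arguments. Here is a minimal variant of the call above (the seed value of 42 is arbitrary):

from sklearn.model_selection import train_test_split

# Reproducible, class-stratified version of the same split
df_train, df_test = train_test_split(
    df, test_size=0.25, random_state=42, stratify=df['target']
)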

Trying different hyperparameter values

If we allowed our earlier tree to grow indefinitely, we would get a tree depth of 4. You can check the depth of a tree by calling clf.get_depth() once it is trained, as in the quick sanity check below. So, it doesn't make sense to try any max_depth values above 4.
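
Here is a minimal sketch of that check, reusing x_train and y_train from the split above:

from sklearn.tree import DecisionTreeClassifier

# With no max_depth limit, the tree keeps splitting until its leaves are pure
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
print(clf.get_depth())  # typically 4 for this training set

Now, let's loop over the maximum depths from 1 to 4 and use ShuffleSplit to get the classifier's accuracy: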

import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

for max_depth in [1, 2, 3, 4]:

    # We initialize a new classifier each iteration with different max_depth
    clf = DecisionTreeClassifier(max_depth=max_depth)
    # We also initialize our shuffle splitter
    rs = ShuffleSplit(n_splits=20, test_size=0.25)

    cv_results = cross_validate(
        clf, x_train, y_train, cv=rs, scoring='accuracy'
    )
    accuracy_scores = pd.Series(cv_results['test_score'])

    print(
        '@ max_depth = {}: accuracy_scores: {}~{}'.format(
            max_depth,
            accuracy_scores.quantile(.1).round(3),
            accuracy_scores.quantile(.9).round(3)
        )
    )

We called the cross_validate() function as we did earlier, giving it the classifier instance as well as the ShuffleSplit instance. We also set the evaluation metric to accuracy. Finally, we printed the scores we got for each iteration. We will look more closely at the printed values in the next section.

Comparing the accuracy scores

Since we have a list of scores for each iteration, we can calculate their mean, or, as we will do here, print their 10th and 90th percentiles to get an idea of the accuracy range for each max_depth setting.
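
If the quantile() calls in the preceding loop are unfamiliar, here is what they return on a plain pandas Series (a standalone toy example with made-up scores):

import pandas as pd

# Hypothetical accuracy scores; 10% of the values fall below the 10th
# percentile, and 90% fall below the 90th
scores = pd.Series([0.90, 0.92, 0.95, 0.97, 1.00])
print(scores.quantile(.1), scores.quantile(.9))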

Running the cross-validation loop gave me the following results:

@ max_depth = 1: accuracy_scores: 0.532~0.646
@ max_depth = 2: accuracy_scores: 0.925~1.0
@ max_depth = 3: accuracy_scores: 0.929~1.0
@ max_depth = 4: accuracy_scores: 0.929~1.0

One thing I am sure about now is that a single-level tree (usually called a decision stump) is not as accurate as deeper trees. In other words, having a single decision based on whether the petal width is less than 0.8 is not enough. Allowing the tree to grow further improves the accuracy, but I can't see much difference between trees of depths 2, 3, and 4. I'd conclude that, contrary to my earlier speculation, we shouldn't worry too much about overfitting here.

Here, we tried different values for a single parameter, max_depth. That's why a simple for loop over its different values was feasible. In later chapters, we will see what to do when we need to tune multiple hyperparameters at once to reach a combination that gives the best accuracy.
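
As a taste of what's to come, scikit-learn's GridSearchCV can run this kind of search over several hyperparameters at once. Here is a minimal sketch, not the approach used in this chapter, and the parameter grid is purely illustrative:

from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Try every combination of the listed hyperparameter values,
# scoring each one with the same shuffle-split cross-validation
param_grid = {
    'max_depth': [1, 2, 3, 4],
    'criterion': ['gini', 'entropy'],
}
search = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid,
    cv=ShuffleSplit(n_splits=20, test_size=0.25),
    scoring='accuracy',
)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)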

Finally, you can train your model once more using the entire training set and a max_depth value of, say, 3. Then, use the trained model to predict the classes for the test set in order to evaluate your final model. I won't walk through it line by line this time as you can easily do it yourself, but a minimal sketch follows for reference.
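
Here is that final step, assuming the x_train, y_train, x_test, and y_test variables from the split earlier in this section:

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Train on the full training set with the chosen depth
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(x_train, y_train)

# Evaluate the final model on the held-out test set
y_pred = clf.predict(x_test)
print(round(accuracy_score(y_test, y_pred), 3))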

In addition to printing the classifier's decisions and descriptive statistics about its accuracy, it is useful to see its decision boundaries visually. Mapping those boundaries against the data samples helps us understand why the classifier made certain mistakes. In the next section, we are going to check the decision boundaries we got for the Iris dataset.
