
Visualizing the tree's decision boundaries

To be able to pick the right algorithm for the problem, it is important to have a conceptual understanding of how an algorithm makes its decision. As we already know by now, decision trees pick one feature at a time and try to split the data accordingly. Nevertheless, it is important to be able to visualize those decisions as well. Let me first plot our classes versus our features, then I will explain further:
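In case you want to reproduce that figure yourself, here is a minimal sketch that plots the classes against the sepal features on the left and the petal features on the right. It assumes the df DataFrame with the iris features and a numeric target column loaded earlier in the chapter; the exact panel arrangement is my assumption about what the figure shows:

import matplotlib.pyplot as plt

# Two panels: sepal features on the left, petal features on the right
fig, axs = plt.subplots(1, 2, figsize=(16, 6))

for c in df['target'].value_counts().index.tolist():
    df[df['target'] == c].plot(
        kind='scatter',
        x='sepal length (cm)', y='sepal width (cm)',
        color=['r', 'g', 'b'][c], marker=f'${c}$',
        s=64, alpha=0.5, ax=axs[0],
    )
    df[df['target'] == c].plot(
        kind='scatter',
        x='petal length (cm)', y='petal width (cm)',
        color=['r', 'g', 'b'][c], marker=f'${c}$',
        s=64, alpha=0.5, ax=axs[1],
    )

axs[0].set_title('Classes vs the sepal features')
axs[1].set_title('Classes vs the petal features')
fig.show()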

When the tree made a decision to split the data around a petal width of 0.8, you can think of it as drawing a horizontal line in the right-hand side graph at the value of 0.8. Then, with every later split, the tree splits the space further using combinations of horizontal and vertical lines. By knowing this, you should not expect the algorithm to use curves or 45-degree lines to separate the classes.
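If you want to see the exact thresholds the tree uses, scikit-learn's export_text function prints the splits as plain-text rules. The following sketch assumes clf is the tree that was already fitted on the four iris features earlier in the chapter:

from sklearn.tree import export_text

# Each rule is a single-feature threshold, which is why the resulting
# decision boundaries are axis-aligned (horizontal or vertical) lines
print(export_text(clf, feature_names=iris.feature_names))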

One trick for plotting the decision boundaries of a tree after it has been trained is to use contour plots. For simplicity, let's assume we only have two features—petal length and petal width. We then generate a fine grid of values covering the ranges of those two features and predict the class labels for this new hypothetical data. Finally, we create a contour plot using those predictions to see the boundaries between the classes. The following function, created by Richard Johansson of the University of Gothenburg, does exactly that:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_decision_boundary(clf, x, y):

    feature_names = x.columns
    x, y = x.values, y.values

    # The grid spans the observed ranges of the two features
    x_min, x_max = x[:, 0].min(), x[:, 0].max()
    y_min, y_max = x[:, 1].min(), x[:, 1].max()

    step = 0.02

    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, step),
        np.arange(y_min, y_max, step)
    )

    # Predict a class for every point on the grid, then reshape the
    # predictions back into the grid's shape
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Filled contours show the predicted regions; the scatter plot
    # overlays the actual samples, colored by their true class
    plt.figure(figsize=(12, 8))
    plt.contourf(xx, yy, Z, cmap='Paired_r', alpha=0.25)
    plt.contour(xx, yy, Z, colors='k', linewidths=0.7)
    plt.scatter(x[:, 0], x[:, 1], c=y, edgecolors='k')
    plt.title("Tree's Decision Boundaries")
    plt.xlabel(feature_names[0])
    plt.ylabel(feature_names[1])

This time, we will train our classifier using two features only, and then call the preceding function using the newly trained model:

from sklearn.tree import DecisionTreeClassifier

x = df[['petal width (cm)', 'petal length (cm)']]
y = df['target']

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(x, y)

plot_decision_boundary(clf, x, y)

Richard Johansson's function overlays the contour plot on top of our samples to give us the following graph:

By seeing the decision boundaries as well as the data samples, you can make better decisions on whether one algorithm is good for the problem at hand.

Feature engineering

"Every man takes the limits of his own field of vision for the limits of the world."
Arthur Schopenhauer

On seeing the class distribution versus the petal lengths and widths, you may wonder: what if the decision trees could also draw boundaries that are at 40 degrees? Wouldn't 40-degree boundaries be more apt than those horizontal and vertical jigsaws? Unfortunately, decision trees cannot do that, but let's put the algorithm aside for a moment and think about the data instead. How about creating a new axis where the class boundaries change their orientation?

Let's create two new columns—petal length x width (cm) and sepal length x width (cm)—and see how the class distribution will look:

df['petal length x width (cm)'] = df['petal length (cm)'] * df['petal width (cm)']
df['sepal length x width (cm)'] = df['sepal length (cm)'] * df['sepal width (cm)']
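As an optional sanity check, we can peek at a few rows to confirm the derived columns look as expected:

# Quick look at the newly derived petal column next to its source columns
print(
    df[['petal length (cm)', 'petal width (cm)', 'petal length x width (cm)']].head()
)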

The following code will plot the classes versus the newly derived features:

fig, ax = plt.subplots(1, 1, figsize=(12, 6));

h_label = 'petal length x width (cm)'
v_label = 'sepal length x width (cm)'

for c in df['target'].value_counts().index.tolist():
    df[df['target'] == c].plot(
        title='Class distribution vs the newly derived features',
        kind='scatter',
        x=h_label,
        y=v_label,
        color=['r', 'g', 'b'][c], # Each class gets a different color
        marker=f'${c}$', # Use the class id as the marker
        s=64,
        alpha=0.5,
        ax=ax,
    )

fig.show()

Running this code will produce the following graph:

This new projection looks better; it makes the data more vertically separable. Nevertheless, the proof of the pudding is still in the eating. So, let's train two classifiers—one on the original features and one on the newly derived features—and see how their accuracies compare. The following code goes through 500 iterations, each time splitting the data randomly, then training both models, each with its own set of features, and storing the accuracy we get with each iteration:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

features_orig = iris.feature_names
features_new = ['petal length x width (cm)', 'sepal length x width (cm)']

accuracy_scores_orig = []
accuracy_scores_new = []

for _ in range(500):

    df_train, df_test = train_test_split(df, test_size=0.3)

    x_train_orig = df_train[features_orig]
    x_test_orig = df_test[features_orig]

    x_train_new = df_train[features_new]
    x_test_new = df_test[features_new]

    y_train = df_train['target']
    y_test = df_test['target']

    clf_orig = DecisionTreeClassifier(max_depth=2)
    clf_new = DecisionTreeClassifier(max_depth=2)

    clf_orig.fit(x_train_orig, y_train)
    clf_new.fit(x_train_new, y_train)

    y_pred_orig = clf_orig.predict(x_test_orig)
    y_pred_new = clf_new.predict(x_test_new)

    accuracy_scores_orig.append(round(accuracy_score(y_test, y_pred_orig), 3))
    accuracy_scores_new.append(round(accuracy_score(y_test, y_pred_new), 3))

accuracy_scores_orig = pd.Series(accuracy_scores_orig)
accuracy_scores_new = pd.Series(accuracy_scores_new)

Then, we can use box plots to compare the accuracies of the two classifiers:

fig, axs = plt.subplots(1, 2, figsize=(16, 6), sharey=True);

accuracy_scores_orig.plot(
title='Distribution of classifier accuracy [Original Features]',
kind='box',
grid=True,
ax=axs[0]
)

accuracy_scores_new.plot(
title='Distribution of classifier accuracy [New Features]',
kind='box',
grid=True,
ax=axs[1]
)

fig.show()

Here, we put the two plots side by side to be able to compare them to each other:

Clearly, the derived features helped a bit. The classifier trained on them has a higher accuracy on average (0.96 versus 0.93), and its lower bound is also higher.
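If you prefer exact numbers over reading them off the box plots, the describe method of the two pandas Series gives the means and quartiles. Keep in mind that the averages quoted above will vary slightly between runs because of the random splits:

# Summary statistics behind the box plots; exact values vary per run
print(accuracy_scores_orig.describe())
print(accuracy_scores_new.describe())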
