
Building decision tree regressors

Decision tree regressors work in a similar fashion to their classifier counterparts. The algorithm splits the data recursively using one feature at a time. At the end of the process, we end up with leaf nodes—that is, nodes where there are no further splits. In the case of a classifier, if, at training time, a leaf node has three instances of class A and one instance of class B, then at prediction time, if an instance lands in the same leaf node, the classifier decides that it belongs to the majority class (class A). In the case of a regressor, if, at training time, a leaf node has three instances of values 12, 10, and 8, then, at prediction time, if an instance lands in the same leaf node, the regressor predicts its value to be 10 (the average of the three values at training time).

Actually, picking the average is not always the best choice; it depends on the splitting criterion used. In the next section, we are going to see this in more detail with the help of an example.
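To make this concrete, here is a minimal sketch (not part of the chapter's dataset) showing that, with the default squared-error criterion, a regressor whose three training samples all fall into a single leaf predicts their mean:

# A minimal sketch, not from the chapter's dataset: three training samples with
# identical features end up in one leaf, so the prediction is their mean (10)
from sklearn.tree import DecisionTreeRegressor

x_toy = [[1], [1], [1]]   # identical feature values, so no split is possible
y_toy = [12, 10, 8]       # the leaf targets from the paragraph above

toy_tree = DecisionTreeRegressor().fit(x_toy, y_toy)
print(toy_tree.predict([[1]]))  # [10.]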

Predicting people's heights

Say we have two populations. Population 1 has an average height of 155 cm for females, with a standard deviation of 4, and an average height of 175 cm for males, with a standard deviation of 5. Population 2 has an average height of 165 cm for females, with a standard deviation of 15, and an average height of 185 cm for males, with a standard deviation of 12. We decide to take 200 males and 200 females from each population. To be able to simulate this, we can use a function provided by NumPy that draws random samples from a normal (Gaussian) distribution.

Here is the code for generating random samples:

# It's customary to call numpy np
import numpy as np

# We need 200 samples from each
n = 200

# From each population we get 200 male and 200 female samples
height_pop1_f = np.random.normal(loc=155, scale=4, size=n)
height_pop1_m = np.random.normal(loc=175, scale=5, size=n)
height_pop2_f = np.random.normal(loc=165, scale=15, size=n)
height_pop2_m = np.random.normal(loc=185, scale=12, size=n)

At the moment, we don't actually care about which population each sample comes from. So, we will use concatenate to group all the males and all the females together:

# We group all females together and all males together
height_f = np.concatenate([height_pop1_f, height_pop2_f])
height_m = np.concatenate([height_pop1_m, height_pop2_m])

We then put this data into a DataFrame (df_height) to be able to deal with it easily. There, we also give a label of 1 to females and 2 to males:

# It's customary to call pandas pd
import pandas as pd

df_height = pd.DataFrame(
    {
        'Gender': [1 for i in range(height_f.size)] +
                  [2 for i in range(height_m.size)],
        'Height': np.concatenate((height_f, height_m))
    }
)

Let's plot our fictional data using histograms to see the height distributions among each gender:

# Matplotlib is conventionally imported as plt
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(10, 5))

df_height[df_height['Gender'] == 1]['Height'].plot(
label='Female', kind='hist',
bins=10, alpha=0.7, ax=ax
)
df_height[df_height['Gender'] == 2]['Height'].plot(
label='Male', kind='hist',
bins=10, alpha=0.7, ax=ax
)

ax.legend()

fig.show()

The preceding code gives us the following graph:

As you can see, the resulting distributions are not symmetrical. Although normal distributions are symmetrical, these artificial distributions are each made of two combined sub-distributions. We can use the following line of code to see that their mean and median values are not equal:

df_height.groupby('Gender')[['Height']].agg([np.mean, np.median]).round(1)

Here, we have the mean and median heights for each group:

Now, we want to predict people's heights using one feature—their gender. Therefore, we are going to split our data into training and test sets and create our x and y sets, as follows:

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df_height, test_size=0.3)
x_train, x_test = df_train[['Gender']], df_test[['Gender']]
y_train, y_test = df_train['Height'], df_test['Height']

Remember that in the case of classification, trees use either the gini or entropy criterion to decide the best split at each step of the training process. The goal of these criteria is to find a split where each of the two resulting sub-groups is as pure as possible. In the case of regression, we have a different goal: we want the members of each group to have target values that are as close as possible to the predictions made for them. scikit-learn implements two criteria to achieve this goal:

  • Mean squared error (MSE or L2): Say after the split, we get three samples in one group with targets of 5, 5, and 8. We calculate the mean value of these three numbers (6). Then, we calculate the squared differences between each sample and the calculated mean—1, 1, and 4. We then take the mean of these squared differences, which is 2.
  • Mean absolute error (MAE or L1): Say after the split, we get three samples in one group with targets of 5, 5, and 8. We calculate the median value of these three numbers (5). Then, we calculate the absolute differences between each sample and the calculated median—0, 0, and 3. We then take the mean of these absolute differences, which is 1.
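
To make the arithmetic in these two criteria concrete, here is a small snippet that reproduces the numbers from the preceding bullets:

import numpy as np

targets = np.array([5, 5, 8])

# L2 / MSE: squared differences from the group's mean
mse = ((targets - targets.mean()) ** 2).mean()     # (1 + 1 + 4) / 3 = 2.0

# L1 / MAE: absolute differences from the group's median
mae = np.abs(targets - np.median(targets)).mean()  # (0 + 0 + 3) / 3 = 1.0

print(mse, mae)  # 2.0 1.0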

For each possible split at training time, the tree calculates either L1 or L2 for each of the resulting sub-groups, and the split with the minimum L1 or L2 is chosen at this step. L1 is sometimes preferred due to its robustness to outliers. The other important difference to keep in mind is that L1 uses the median while L2 uses the mean in its calculations.

If, at training time, we see 10 samples with almost identical features but different targets, they may all end up together in one leaf node. Now, if we use L1 as the splitting criterion when building our regressor, then if we get a sample at prediction time with identical features to the 10 training samples, we should expect the prediction to be close to the median value of the targets of the 10 training samples. Likewise, if L2 is used for building the regressor, we should then expect the prediction of the new sample to be close to the mean value of the targets of the 10 training samples.

Let's now compare the effect of the splitting criteria on our height dataset:

from sklearn.tree import export_text
from sklearn.tree import DecisionTreeRegressor

for criterion in ['mse', 'mae']:
    rgrsr = DecisionTreeRegressor(criterion=criterion)
    rgrsr.fit(x_train, y_train)

    print(f'criterion={criterion}:\n')
    print(export_text(rgrsr, feature_names=['Gender'], spacing=3, decimals=1))

We get the following two trees depending on the chosen criterion:

criterion=mse:

|--- Gender <= 1.5
| |--- value: [160.2]
|--- Gender > 1.5
| |--- value: [180.8]

criterion=mae:

|--- Gender <= 1.5
| |--- value: [157.5]
|--- Gender > 1.5
| |--- value: [178.6]

As expected, when MSE was used, the predictions were close to the mean of each gender, while for MAE, the predictions were close to the median.

Of course, we only had one binary feature in our dataset, namely gender. That's why we ended up with a very shallow tree with a single split (a stump). Actually, in this case, we do not even need to train a decision tree; we could have easily calculated the mean heights for males and females and used them as our expected values right away. The decisions made by such a shallow tree are called biased decisions. If we had allowed each individual to be described by more information than just their gender, we would have been able to make more accurate predictions for each individual.
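
As a quick sanity check (this snippet is not in the original example, and the exact numbers depend on the random samples drawn earlier), the per-gender means of the training targets match the leaf values of the MSE tree:

# With a single binary feature, the MSE tree's leaf values are just
# the per-gender means of the training targets
print(df_train.groupby('Gender')['Height'].mean().round(1))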

Finally, just as with classification trees, we have the same knobs, such as max_depth, min_samples_split, and min_samples_leaf, to control the growth of a regression tree.
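
Here is a minimal sketch with arbitrary values, just to show where these knobs go; our single-split gender tree does not actually need any of them:

# Arbitrary values for illustration only
rgrsr = DecisionTreeRegressor(
    criterion='mse',
    max_depth=3,            # limit the number of consecutive splits
    min_samples_split=20,   # a node needs at least 20 samples to be split further
    min_samples_leaf=10,    # every leaf must keep at least 10 samples
)
rgrsr.fit(x_train, y_train)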

Regressor's evaluation

The very same MSE and MAE scores can also be used to evaluate a regressor's accuracy. We use them to compare the regressor's predictions to the actual targets in the test set. Here is the code predicting and evaluating the predictions made:

from sklearn.metrics import mean_squared_error, mean_absolute_error

y_test_pred = rgrsr.predict(x_test)
print('MSE:', mean_squared_error(y_test, y_test_pred))
print('MAE:', mean_absolute_error(y_test, y_test_pred))

Using MSE as a splitting criterion gives us an MSE of 117.2 and an MAE of 8.2, while using MAE as a splitting criterion gives us an MSE of 123.3 and an MAE of 7.8. Clearly, using MAE as the splitting criterion gives a lower MAE at test time, and vice versa. In other words, if your aim is to reduce the error of your predictions based on a certain metric, it is advised to use the same metric when growing your tree at training time.

Setting sample weights

Both the decision tree classifiers and regressors allow us to give more or less emphasis to individual training samples by setting their weights while fitting. This is a common feature in many estimators, and decision trees are no exception here. To see the effect of sample weights, we are going to give 10 times more weight to users above 150 cm compared to the remaining users:

rgrsr = DecisionTreeRegressor(criterion='mse')
sample_weight = y_train.apply(lambda h: 10 if h > 150 else 1)
rgrsr.fit(x_train, y_train, sample_weight=sample_weight)

Conversely, we can also give more weights to users who are 150 cm and below by changing the sample_weight calculations, as follows:

sample_weight = y_train.apply(lambda h: 10 if h <= 150 else 1)

By using the export_text() function, as we did in the previous section, we can display the resulting trees. We can see how sample_weight affected their final structures:
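
Here is a sketch of how the two trees shown next can be produced, following the same pattern as the previous section; the printed leaf values will vary slightly with the random samples:

for description, threshold_check in [('below 150', lambda h: h <= 150),
                                     ('above 150', lambda h: h > 150)]:
    # Give 10 times more weight to the emphasized group
    sample_weight = y_train.apply(lambda h: 10 if threshold_check(h) else 1)
    rgrsr = DecisionTreeRegressor(criterion='mse')
    rgrsr.fit(x_train, y_train, sample_weight=sample_weight)
    print(f'Emphasis on "{description}":\n')
    print(export_text(rgrsr, feature_names=['Gender'], spacing=3, decimals=1))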

Emphasis on "below 150":

|--- Gender <= 1.5
| |--- value: [150.7]
|--- Gender > 1.5
| |--- value: [179.2]

Emphasis on "above 150":

|--- Gender <= 1.5
| |--- value: [162.4]
|--- Gender > 1.5
| |--- value: [180.2]

By default, all samples are given the same weight. Weighting individual samples differently is useful when dealing with imbalanced data or imbalanced business decisions; maybe you can tolerate delaying a shipment for a new customer more than you can for your loyal ones. In Chapter 8, Ensembles – When One Model Is Not Enough, we will also see how sample weights are an integral part of how the AdaBoost algorithm learns.
