- Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
- Tarek Amr
Building decision tree regressors
Decision tree regressors work in a similar fashion to their classifier counterparts. The algorithm splits the data recursively using one feature at a time. At the end of the process, we end up with leaf nodes—that is, nodes where there are no further splits. In the case of a classifier, if, at training time, a leaf node has three instances of class A and one instance of class B, then, at prediction time, any instance that lands in that leaf node is assigned to the majority class (class A). In the case of a regressor, if, at training time, a leaf node has three instances with values of 12, 10, and 8, then, at prediction time, any instance that lands in that leaf node is predicted to have a value of 10 (the average of the three training values).
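A tiny sketch (not from the book) makes this averaging behavior concrete; the single feature and the target values here are made up purely for illustration:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: one binary feature, two groups of three targets each
X = np.array([[0], [0], [0], [1], [1], [1]])
y = np.array([12, 10, 8, 20, 22, 24])

# A depth-1 tree (a single split) puts each group in its own leaf
stump = DecisionTreeRegressor(max_depth=1)
stump.fit(X, y)

# Each prediction is the mean of the training targets in the matching leaf
print(stump.predict([[0], [1]]))  # [10. 22.]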
Predicting people's heights
Say we have two populations. Population 1 has an average height of 155 cm for females, with a standard deviation of 4, and an average height of 175 cm for males, with a standard deviation of 5. Population 2 has an average height of 165 cm for females, with a standard deviation of 15, and an average height of 185 cm for males, with a standard deviation of 12. We decide to take 200 males and 200 females from each population. To be able to simulate this, we can use a function provided by NumPy that draws random samples from a normal (Gaussian) distribution.
Here is the code for generating random samples:
# It's customary to call numpy np
import numpy as np
# We need 200 samples from each
n = 200
# From each population we get 200 male and 200 female samples
height_pop1_f = np.random.normal(loc=155, scale=4, size=n)
height_pop1_m = np.random.normal(loc=175, scale=5, size=n)
height_pop2_f = np.random.normal(loc=165, scale=15, size=n)
height_pop2_m = np.random.normal(loc=185, scale=12, size=n)
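If you want to verify the generated samples (an optional check that is not part of the original listing), the empirical means should land close to the population means we specified:
# The sample means should be close to 155, 175, 165, and 185, respectively
print(height_pop1_f.mean().round(1), height_pop1_m.mean().round(1))
print(height_pop2_f.mean().round(1), height_pop2_m.mean().round(1))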
At the moment, we don't actually care about which population each sample comes from. So, we will use concatenate to group all the males and all the females together:
# We group all females together and all males together
height_f = np.concatenate([height_pop1_f, height_pop2_f])
height_m = np.concatenate([height_pop1_m, height_pop2_m])
We then put this data into a DataFrame (df_height) to be able to deal with it easily. There, we also give a label of 1 to females and 2 to males:
# pandas is needed for the DataFrame; it's customary to call it pd
import pandas as pd

df_height = pd.DataFrame(
    {
        'Gender': [1 for i in range(height_f.size)] +
                  [2 for i in range(height_m.size)],
        'Height': np.concatenate((height_f, height_m))
    }
)
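As a quick sanity check (optional, not part of the original code), each gender label should now appear 400 times, 200 from each population:
# 200 samples per population and per gender gives 400 rows per label
print(df_height['Gender'].value_counts())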
Let's plot our fictional data using histograms to see the height distributions among each gender:
# matplotlib is needed for plotting; it's customary to call pyplot plt
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(10, 5))
df_height[df_height['Gender'] == 1]['Height'].plot(
    label='Female', kind='hist',
    bins=10, alpha=0.7, ax=ax
)
df_height[df_height['Gender'] == 2]['Height'].plot(
    label='Male', kind='hist',
    bins=10, alpha=0.7, ax=ax
)
ax.legend()
fig.show()
The preceding code gives us the following graph:

[Figure: overlapping histograms of the female and male height distributions]
As you can see, the resulting distributions are not symmetrical. Although normal distributions are symmetrical, these artificial distributions are each made of two sub-distributions combined. We can use the following line of code to see that their mean and median values are not equal:
df_height.groupby('Gender')[['Height']].agg([np.mean, np.median]).round(1)
Here, we have the mean and median heights for each group:

[Table: mean and median height per gender]
Now, we want to predict people's heights using one feature—their gender. Therefore, we are going to split our data into training and test sets and create our x and y sets, as follows:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df_height, test_size=0.3)
x_train, x_test = df_train[['Gender']], df_test[['Gender']]
y_train, y_test = df_train['Height'], df_test['Height']
Remember that in the case of classification, the trees use either gini or entropy to decide the best split at each step during the training process. The goal of these criteria is to find a split where each of the two resulting sub-groups is as pure as possible. In the case of regression, we have a different goal: we want the members of each group to have target values that are as close as possible to the prediction made for that group. scikit-learn implements two criteria to achieve this goal:
- Mean squared error (MSE or L2): Say, after the split, we get three samples in one group with targets of 5, 5, and 8. We calculate the mean value of these three numbers (6). Then, we calculate the squared differences between each sample and the calculated mean—1, 1, and 4. We then take the mean of these squared differences, which is 2.
- Mean absolute error (MAE or L1): Say after the split, we get three samples in one group with targets of 5, 5, and 8. We calculate the median value of these three numbers (5). Then, we calculate the absolute differences between each sample and the calculated median—0, 0, and 3. We then take the mean of these absolute differences, which is 1.
For each possible split at training time, the tree calculates either L1 or L2 for each of the expected sub-groups after the split. A split with the minimum L1 or L2 is then chosen at this step. L1 may be preferred sometimes due to its robustness to outliers. The other important difference to keep in mind is that L1 uses median while L2 uses mean in its calculations.
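The following quick sketch (not part of the book's code) reproduces these two toy calculations with NumPy:
import numpy as np

targets = np.array([5, 5, 8])

# L2 criterion: mean of the squared differences from the group's mean
mse = np.mean((targets - targets.mean()) ** 2)         # (1 + 1 + 4) / 3 = 2.0

# L1 criterion: mean of the absolute differences from the group's median
mae = np.mean(np.abs(targets - np.median(targets)))    # (0 + 0 + 3) / 3 = 1.0

print(mse, mae)  # 2.0 1.0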
Let's now compare the effect of the splitting criteria on our height dataset:
from sklearn.tree import export_text
from sklearn.tree import DecisionTreeRegressor
# Note: scikit-learn 1.0+ renamed these criteria to 'squared_error' and 'absolute_error'
for criterion in ['mse', 'mae']:
    rgrsr = DecisionTreeRegressor(criterion=criterion)
    rgrsr.fit(x_train, y_train)
    print(f'criterion={criterion}:\n')
    print(export_text(rgrsr, feature_names=['Gender'], spacing=3, decimals=1))
We get the following two trees depending on the chosen criterion:
criterion=mse:
|--- Gender <= 1.5
| |--- value: [160.2]
|--- Gender > 1.5
| |--- value: [180.8]
criterion=mae:
|--- Gender <= 1.5
| |--- value: [157.5]
|--- Gender > 1.5
| |--- value: [178.6]
As expected, when MSE was used, the predictions were close to the mean of each gender, while for MAE, the predictions were close to the median.
Of course, we only had one binary feature in our dataset—gender. That's why we ended up with a very shallow tree with a single split (a stump). Actually, in this case, we do not even need to train a decision tree; we could have easily calculated the mean heights for males and females and used them as our predictions right away. The decisions made by such a shallow tree are called biased decisions. Had we allowed each individual to be described by more information than just their gender, we would have been able to make more accurate predictions for each individual.
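To see that the shallow tree adds nothing beyond the group averages here, a quick check (a sketch reusing the df_train split from earlier) compares the per-gender training means with the leaf values of the MSE-grown tree shown above; the two should match:
# Per-gender means of the training targets; these match the MSE tree's leaf values
print(df_train.groupby('Gender')['Height'].mean().round(1))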
Finally, just as in the classification trees, we have the same knobs, such as max_depth, min_samples_split, and min_samples_leaf, to control the growth of a regression tree.
Regressor's evaluation
The very same MSE and MAE scores can also be used to evaluate a regressor's accuracy. We use them to compare the regressor's predictions to the actual targets in the test set. Here is the code for making predictions on the test set and evaluating them:
from sklearn.metrics import mean_squared_error, mean_absolute_error
y_test_pred = rgrsr.predict(x_test)
print('MSE:', mean_squared_error(y_test, y_test_pred))
print('MAE:', mean_absolute_error(y_test, y_test_pred))
Using MSE as a splitting criterion gives us an MSE of 117.2 and an MAE of 8.2, while using MAE as a splitting criterion gives us an MSE of 123.3 and an MAE of 7.8. Clearly, using MAE as the splitting criterion gives a lower MAE at test time, and vice versa. In other words, if your aim is to reduce a certain error metric for your predictions, it is advisable to use that same metric as the criterion when growing your tree at training time.
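To reproduce this comparison in one go, a short loop like the following (a sketch reusing the training and test splits from earlier; the exact numbers will differ between runs because the data is randomly generated) fits one tree per criterion and prints both test metrics:
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

for criterion in ['mse', 'mae']:
    rgrsr = DecisionTreeRegressor(criterion=criterion)
    rgrsr.fit(x_train, y_train)
    y_test_pred = rgrsr.predict(x_test)
    print(
        f'criterion={criterion}: '
        f'MSE={mean_squared_error(y_test, y_test_pred):.1f}, '
        f'MAE={mean_absolute_error(y_test, y_test_pred):.1f}'
    )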
Setting sample weights
Both the decision tree classifiers and the regressors allow us to give more or less emphasis to individual training samples by setting their weights during fitting. This is a common feature in many estimators, and decision trees are no exception. To see the effect of sample weights, we are going to give 10 times more weight to users above 150 cm versus the remaining users:
rgrsr = DecisionTreeRegressor(criterion='mse')
sample_weight = y_train.apply(lambda h: 10 if h > 150 else 1)
rgrsr.fit(x_train, y_train, sample_weight=sample_weight)
Conversely, we can also give more weights to users who are 150 cm and below by changing the sample_weight calculations, as follows:
sample_weight = y_train.apply(lambda h: 10 if h <= 150 else 1)
By using the export_text() function, as we did in the previous section, we can display the resulting trees. We can see how sample_weight affected their final structures:
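Putting the two weighting schemes together, a sketch like the following (reusing x_train and y_train from earlier) fits one tree per scheme and prints both, producing the output shown next:
from sklearn.tree import DecisionTreeRegressor, export_text

for label, sample_weight in [
    ('below 150', y_train.apply(lambda h: 10 if h <= 150 else 1)),
    ('above 150', y_train.apply(lambda h: 10 if h > 150 else 1)),
]:
    rgrsr = DecisionTreeRegressor(criterion='mse')
    rgrsr.fit(x_train, y_train, sample_weight=sample_weight)
    print(f'Emphasis on "{label}":')
    print(export_text(rgrsr, feature_names=['Gender'], spacing=3, decimals=1))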
Emphasis on "below 150":
|--- Gender <= 1.5
| |--- value: [150.7]
|--- Gender > 1.5
| |--- value: [179.2]
Emphasis on "above 150":
|--- Gender <= 1.5
| |--- value: [162.4]
|--- Gender > 1.5
| |--- value: [180.2]
By default, all samples are given the same weight. Weighting individual samples differently is useful when dealing with imbalanced data or imbalanced business decisions; maybe you can tolerate delaying a shipment for a new customer more than you can for your loyal ones. In Chapter 8, Ensembles – When One Model Is Not Enough, we will also see how sample weights are an integral part of how the AdaBoost algorithm learns.