
Understanding model capacity trade-offs

Let's train trees of different depths, ranging from a maximum depth of 1 up to 22:

In []: 
depths = range(1, 23)          # maximum tree depths from 1 to 22
train_losses = [] 
test_losses = [] 
for depth in depths: 
    tree_model.max_depth = depth 
    tree_model = tree_model.fit(X_train, y_train) 
    # score() returns accuracy, so (1 - score) is the error (loss)
    train_losses.append(1 - tree_model.score(X_train, y_train)) 
    test_losses.append(1 - tree_model.score(X_test, y_test)) 
figure = plt.figure()  
plt.plot(depths, train_losses, label="training loss", linestyle='--') 
plt.plot(depths, test_losses, label="test loss") 
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3, ncol=2, mode="expand", borderaxespad=0.) 
Out[]: 
Figure 2.8: Training loss versus test loss, depending on the maximum tree depth

On the x axis, we've plotted the tree depth, and on the y axis, the model's error. The phenomenon we're observing here is well familiar to any machine learning practitioner: as the model gets more complex, it becomes more prone to overfitting. At first, as the model's capacity grows, both training and test loss (error) decrease, but then something strange happens: while the error on the training set continues to go down, the test error starts growing. This means the model fits the training examples so well that it can no longer generalize to unseen data. That's why it is so important to have a held-out dataset and perform your model validation on it. From the plot above, we can see that our more-or-less random choice of max_depth=4 was lucky: at this point, the test error is even lower than the training error.
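Rather than eyeballing the plot, we can pick the depth with the lowest held-out error programmatically. A minimal sketch (the test_losses values below are illustrative placeholders, not real output from the loop above; in practice you would use the actual list the loop fills in):

```python
import numpy as np

# Illustrative held-out errors, one entry per depth 1, 2, 3, ...
# Substitute the real test_losses list produced by the training loop.
test_losses = [0.30, 0.22, 0.18, 0.15, 0.16, 0.19, 0.21]

# argmin gives the index of the smallest loss; indices start at 0
# while depths start at 1, hence the +1
best_depth = int(np.argmin(test_losses)) + 1
print(best_depth)
```

For a more robust choice, the same idea extends to cross-validation: average the error over several train/test splits per depth before taking the argmin.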
