
Understanding model capacity trade-offs

Let's train trees with different depths, varying the maximum depth from 1 up to 22:

In []: 
depths = range(1, 23)
train_losses = [] 
test_losses = [] 
for depth in depths: 
    # Retrain the same tree with a different depth limit each time.
    tree_model.max_depth = depth 
    tree_model = tree_model.fit(X_train, y_train) 
    # Error = 1 - accuracy on the corresponding dataset.
    train_losses.append(1 - tree_model.score(X_train, y_train)) 
    test_losses.append(1 - tree_model.score(X_test, y_test)) 
figure = plt.figure()  
plt.plot(depths, train_losses, label="training loss", linestyle='--') 
plt.plot(depths, test_losses, label="test loss") 
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3, ncol=2, mode="expand", borderaxespad=0.) 
Out[]: 
Figure 2.8: Training loss versus test loss, depending on the maximum tree depth

On the x axis, we've plotted the tree depth, and on the y axis, the model's error. The phenomenon we observe here will be familiar to any machine learning practitioner: as the model gets more complex, it becomes more prone to overfitting. At first, as the model's capacity grows, both training and test loss (error) decrease, but then something strange happens: while the error on the training set continues to go down, the test error starts growing. This means that the model fits the training examples so well that it is no longer able to generalize to unseen data. That's why it's so important to have a held-out dataset and to perform your model validation on it. From the above plot, we can see that our more-or-less arbitrary choice of max_depth=4 was lucky: at that point the test error is even lower than the training error.
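Rather than eyeballing the plot, a common way to pick max_depth is a cross-validated grid search on the training data, leaving the test set untouched for the final check. Here is a minimal sketch, under the assumption that the tree used above is scikit-learn's DecisionTreeClassifier and that X_train and y_train are defined as before:

In []: 
from sklearn.model_selection import GridSearchCV 
from sklearn.tree import DecisionTreeClassifier  # assumed to match the tree used above 

# Search the same depth range, scoring each candidate with 5-fold cross-validation. 
param_grid = {'max_depth': list(range(1, 23))} 
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5) 
search = search.fit(X_train, y_train) 
print(search.best_params_)      # depth with the best cross-validated score 
print(1 - search.best_score_)   # corresponding cross-validated error 

The depth this picks may vary from run to run, but the point stands: model selection happens on the training data, and the final quality is reported on the held-out test set only.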
